Re: [perl #109828] PerlIO::scalar does not handle UTF-8

From: David Golden
Date: February 12, 2012 18:30
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Message ID: CAOeq1c9oKVsDpKxSA_6C6-BvoSJ9WE7qihzk7e59bbMp2Ro9xw@mail.gmail.com

On Sun, Feb 12, 2012 at 5:02 PM, Father Chrysostomos via RT
<perlbug-followup@perl.org> wrote:
> On Mon Feb 06 07:19:37 2012, xdaveg@gmail.com wrote:
>> Then when something wants to use that string as a source of bytes,
>> should Perl (a) just dump out whatever bytes it uses internally for
>> its implementation?  Or (b) should it convert the internal
>> representation to some standard representation?  Or (c) should it blow
>> up?
>
> (a) is what Perl currently does, as Leon Timmerman said.
>
> By (b) I presume you mean to treat \xff as \xff regardless of how it is
> stored internally, which makes sense.

Sort of.  What I meant is that (a) is "whatever we do" and (b) is "a
specific encoding".  Those are likely to be similar, but one is vague
and mutable and the other specific and fixed.  A promise like (b) would
persist under the usual back-compatibility rules even if we changed
the internal representation in the future for some reason.  It could
also mean that we could choose to give UTF-8 and not "utf8" (i.e. the
lax, internal encoding) -- and would croak if we can't translate from
the internal representation to UTF-8.
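
To make the UTF-8 versus "utf8" distinction concrete, here is a minimal
sketch using the Encode module.  It assumes a code point above U+10FFFF
as the test case; the variable names are only illustrative, and this is
not a proposal for how PerlIO::scalar itself should be implemented.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# A code point above U+10FFFF: Perl can hold it internally, but it is
# not valid Unicode, so strict UTF-8 has no encoding for it.
my $str = do { no warnings 'utf8'; "ok " . chr(0x110000) };

# Lax "utf8": encodes whatever the internal representation can hold.
my $lax = eval { encode("utf8", $str, FB_CROAK | LEAVE_SRC) };

# Strict "UTF-8": with FB_CROAK it dies instead of emitting invalid output.
my $strict = eval { encode("UTF-8", $str, FB_CROAK | LEAVE_SRC) };
my $strict_error = $@;

print "lax:    ", (defined $lax ? "encoded" : "failed"), "\n";
print "strict: ", (defined $strict ? "encoded" : "croaked: $strict_error"), "\n";
```

LEAVE_SRC is there so that the failed strict attempt does not modify
$str in place.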

For example, for a string with wide characters used as an in-memory
file, we could promise to translate from the internal encoding to
UTF-8 when the handle is read.  That would make it resemble a disk
file encoded in UTF-8, requiring the ":encoding(UTF-8)" layer and so
on.  Thus some function that is passed a handle to read shouldn't know
or care whether it's an in-memory string or an on-disk file -- though
the *programmer* would need to know what encoding they expect to
receive given their particular application.
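
As a rough sketch of that symmetry, the code below does the encoding
step by hand, so it does not depend on how PerlIO::scalar treats wide
characters.  first_line() is a hypothetical reader, and the sample
string and temporary file are assumptions made for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical reader: it only sees a handle that yields characters.
sub first_line {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

my $chars = "caf\x{e9} \x{263a}\n";      # characters, one of them wide
my $bytes = encode("UTF-8", $chars);     # the explicit byte stream

# In-memory "file": the scalar holds UTF-8 bytes, the layer decodes them.
open my $mem, "<:encoding(UTF-8)", \$bytes or die "open in-memory: $!";
my $from_mem = first_line($mem);
close $mem;

# On-disk file holding the same bytes, read through the same layer.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                            # write the bytes untouched
print {$out} $bytes;
close $out or die "close: $!";
open my $disk, "<:encoding(UTF-8)", $path or die "open $path: $!";
my $from_disk = first_line($disk);
close $disk;

print "same characters from both handles\n" if $from_mem eq $from_disk;
```

The reader is identical in both cases; only the caller knows, and has
to state, which encoding the bytes are in.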

> An in-memory scalar could be considered a byte stream.  Or it could just
> be considered a string of characters.

My bias is strongly that it should be a byte stream, which is why I'm
only considering how we choose to take a string of (wide) characters
and make it into a byte stream in some standard way:  (a) "whatever",
(b) "a promise", or (c) "boom!"
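
The three options can be sketched as a single helper.  This is only an
illustration: bytes_for_handle() is a hypothetical function, not a Perl
API; strict UTF-8 is assumed as the promised encoding for (b); and (c)
is read here as "treat characters up to 0xFF as bytes and die on
anything wider", which is just one possible interpretation.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: turn a character string into the bytes a reader
# of an in-memory handle would see, under one of the three policies.
sub bytes_for_handle {
    my ($str, $policy) = @_;
    if ($policy eq 'a') {
        # (a) "whatever": expose the internal buffer, so the result
        # depends on whether the string happens to be upgraded.
        my $copy = $str;
        utf8::encode($copy) if utf8::is_utf8($copy);
        return $copy;
    }
    if ($policy eq 'b') {
        # (b) "a promise": always strict UTF-8, croak if impossible.
        return encode("UTF-8", $str, FB_CROAK | LEAVE_SRC);
    }
    # (c) "boom!": refuse anything that is not already byte-sized.
    die "Wide character in in-memory file\n" if $str =~ /[^\x00-\xff]/;
    my $copy = $str;
    utf8::downgrade($copy);
    return $copy;
}

my $down = "\xff";
my $up   = "\xff";
utf8::upgrade($up);    # same character, different internal storage

for my $policy (qw(a b c)) {
    printf "(%s) downgraded: %d byte(s), upgraded: %d byte(s)\n",
        $policy,
        length(bytes_for_handle($down, $policy)),
        length(bytes_for_handle($up,   $policy));
}
```

Under (a) the two equal strings yield different bytes, while (b) and
(c) each give the same answer regardless of internal storage.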

-- David
