Re: unicode question
From: Eric Brine
April 26, 2012 09:51
Message ID: CALJW-qHJ8MN1kkBKpHY9cQQRsRBXgg61E61GyvJ2fcENkub+jQ@mail.gmail.com
On Thu, Apr 26, 2012 at 4:26 AM, Linda W <email@example.com> wrote:
> If Perl interprets **STDIN** (not an arbitrary file opened with 'open'),
> standard stream'ed input from an all UTF-8 environment, then the
> should be UTF-8 encoding.
No. At best, it's only valid to assume it's UTF-8 if the handle is known to
be text, and Perl has no way of knowing that.
To do otherwise would corrupt data.
>> (3) refers to the fixing (when C<< use feature 'unicode_strings'; >> is
>> used) of most instances of a collection of bugs in Perl known as "The
>> Unicode Bug".
> What I didn't understand was why is it fixed in 5.14 but with a use 5.12
Some instances were fixed in 5.12, some more in 5.14. Some haven't been
use open ':std' => ':locale';
Tells Perl you're expecting text encoded as per your locale. Does what you
want. Read the docs.
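
As a rough illustration of what a decoding layer changes, the sketch below
reads the same three bytes twice: once with no layer, once with an explicit
:encoding(UTF-8) layer. The in-memory handle and the hard-coded UTF-8 layer
are assumptions made so the result doesn't depend on anyone's locale; with
the pragma above, Perl would take the encoding from the locale instead.

```perl
use strict;
use warnings;

# The UTF-8 encoding of U+00E9 (e with acute) plus a newline: three bytes.
my $bytes = "\xC3\xA9\n";

# No layer: the handle returns the bytes exactly as stored.
open my $raw, '<', \$bytes or die "open: $!";
my $raw_line = <$raw>;
close $raw;

# Explicit decoding layer: the handle returns decoded characters.
open my $text, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $text_line = <$text>;
close $text;

printf "raw: %d, decoded: %d\n", length $raw_line, length $text_line;
# prints "raw: 3, decoded: 2"
```

Same data on the other end of the handle both times. The only difference is
whether Perl was told it is text.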
>> The default then and now is that Perl does not mess with your file handles.
> It shouldn't.
> It shouldn't mess with UTF-8 encoded STDIN/STDOUT either.
Exactly. It doesn't. You get exactly what the file contains.
> It shouldn't assume a charset that's about 20 years out of date when most
> systems default to UTF-8 encoding (Windows aside)...
It doesn't. It makes no assumption whatsoever. You get exactly what is on
the other end of the handle. That's the only sane approach. Anything else
would corrupt some data.
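
To make the corruption concrete, here is a minimal sketch of what happens if
binary data is treated as UTF-8 text anyway. The sample bytes (the first eight
bytes of a PNG file) are just an illustrative choice, and Encode's default
behaviour of substituting U+FFFD for malformed input stands in for whatever an
automatic decoding layer would do.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Binary data that is not valid UTF-8: the 8-byte PNG signature.
my $binary = "\x89PNG\r\n\x1A\n";

# Decode it as if it were UTF-8 text. With the default CHECK value,
# the malformed byte 0x89 is replaced by U+FFFD.
my $decoded = decode('UTF-8', $binary);

# Encode it back, as a program copying its input to its output would.
my $round = encode('UTF-8', $decoded);

printf "in: %d bytes, out: %d bytes, %s\n",
    length $binary, length $round,
    $round eq $binary ? "intact" : "corrupted";
# prints "in: 8 bytes, out: 10 bytes, corrupted"
```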
>> Perl returns the bytes it reads from the file handle as is. If your
>> file handles are expected to have text of a certain encoding, it's up to
>> you to decode it or to tell Perl to decode it. Perl has no way of knowing
>> whether a file handle is used to transmit text or not, and it has no way of
>> knowing the encoding of that text.
> If the encoding is NOT UTF-8, yes, but I thought perl was fully UTF-8
> compliant now...?
Perl does indeed support UTF-8 and Unicode. That doesn't mean it'll assume
something is UTF-8 when it has no way to know it is.
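
For the "decode it yourself" route, a minimal sketch using the core Encode
module follows. The helper name decode_utf8_strict is hypothetical, as is the
choice to die on malformed input rather than substitute a replacement
character; the point is only that the program, not Perl, asserts the encoding.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode bytes the caller claims are UTF-8,
# and die if the claim turns out to be false.
sub decode_utf8_strict {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC);
}

my $bytes = "na\xC3\xAFve";               # 6 bytes of valid UTF-8
my $text  = decode_utf8_strict($bytes);   # 5 characters
printf "%d bytes -> %d characters\n", length $bytes, length $text;

# Input that is not UTF-8 is rejected instead of being guessed at.
my $ok = eval { decode_utf8_strict("\xFF\xFE"); 1 };
print "not UTF-8, refusing to guess\n" unless $ok;
```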
>>> IF perl properly pays attention to the environment
>> ...it would corrupt data on many file handles.
> Never mentioned file handles, I'm talking <STDIN> and print
What do you think those are?!?!?!
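
A quick sketch, in case it helps: STDIN is an ordinary file handle with a
file descriptor, and a bare print writes to whichever handle is currently
selected, which is STDOUT unless the program changed it. The variable name
is only illustrative.

```perl
use strict;
use warnings;

# select() with no arguments returns the name of the handle that
# print uses when none is given.
my $default_out = select();
print "print writes to: $default_out\n";

# STDIN has a file descriptor like any other open file handle.
my $fd = fileno(STDIN);
print "STDIN is file descriptor $fd\n" if defined $fd;
```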
> If there is an "asymmetry", in perl it IS messing with the bytes. the
> same bytes should be able to go in as come out...
> asymmetry implies this isn't the case.
The same bytes do "go in as come out". There's no such implication.
The asymmetry mentioned regards Perl's behaviour when given buggy code.
Specifically, when you treat those bytes as unicode code points or vice