Front page | perl.perl5.porters |
Postings from April 2012
Re: unicode question
From: Linda W
April 26, 2012 01:26
Message ID: 4F990699.email@example.com
Eric Brine wrote:
> On Wed, Apr 25, 2012 at 9:12 PM, Linda W <firstname.lastname@example.org
> <mailto:email@example.com>> wrote:
> I read this in my 5.14 documentation (in man page, for
> perlunicode, my #) added).
> 1) "use encoding" needed to upgrade non-Latin-1 byte strings
>    By default, there is a fundamental asymmetry in Perl's Unicode
>    model: implicit upgrading from byte strings to Unicode strings
>    assumes that they were encoded in ISO 8859-1 (Latin-1), but
>    Unicode strings are downgraded with UTF-8 encoding. This happens
>    because the first 256 codepoints in Unicode happens to agree with
>    Latin-1.
> (1) refers to how Perl behaves in response to bugs in user code.
??? Bugs in user code? The first 256 code points don't agree! The
first 128 code points agree. But at 0x80 you have to go to a 2-byte
encoding to save everything. I don't understand when you say
'downgraded', as downgrading implies a loss of information. Whereas
UTF-8 can hold all of Unicode, ISO-8859-1 only holds 256 characters,
the latter half of which are not Unicode compatible because they have
the high bit set.
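
To make the code point vs. encoding distinction concrete, here is a minimal sketch using the core Encode module. The helper hex_bytes() is a made-up name for illustration only. It encodes one character from the 0x80-0xFF range both ways and prints the resulting bytes:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $octets);
}

# One character, code point U+00E9 (e with acute accent).
my $char = "\x{E9}";

# The same code point serialized under two encodings.
my $latin1 = encode('ISO-8859-1', $char);
my $utf8   = encode('UTF-8', $char);

printf "code point:    U+%04X\n", ord($char);
printf "Latin-1 bytes: %s\n", hex_bytes($latin1);   # E9
printf "UTF-8 bytes:   %s\n", hex_bytes($utf8);     # C3 A9
```

The code point number is the same in both cases; only the byte sequence differs once you go past 0x7F.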
If Perl interprets **STDIN** (not an arbitrary file opened with 'open',
but standard streamed input) from an all-UTF-8 environment, then the
assumption should be UTF-8 encoding.
To do otherwise is going to cause problems.
> (3) refers to the fixing (when C<< use feature 'unicode_strings'; >>
> is used) of most instances of a collection of bugs in Perl known as
> "The Unicode Bug".
> They are not related.
What I didn't understand was why it is fixed in 5.14 but with a use 5.12.
I.e., wasn't it fixed in 5.12? If it wasn't fixed until 5.14, then why isn't
it a use 5.14 that triggers the new behavior?
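
For reference, a small sketch of what the feature changes, assuming a perl of 5.12 or later. As far as I can tell, the 'unicode_strings' feature name has existed since 5.12 (where it covered the case-changing functions), and later releases extended what it covers, which may be why the switch is tied to 5.12. The variable names here are just for illustration:

```perl
use strict;
use warnings;

# A one-character byte string: 0xE0 is "a with grave" in Latin-1.
# It is not stored as UTF-8 internally, so "The Unicode Bug" applies.
my $byte_string = "\xE0";

# Without the feature, uc() leaves characters above 0x7F alone.
my $without = do { uc $byte_string };

# With the feature (lexically scoped), uc() uses Unicode rules.
my $with = do { use feature 'unicode_strings'; uc $byte_string };

printf "without feature: %02X\n", ord($without);   # E0
printf "with feature:    %02X\n", ord($with);      # C0
```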
> I.e. If I'm on a UTF-8 system, and my env is set for UTF-8:
> Your locale only affects Perl when C<< use locale >> is in effect, and
> even then, it doesn't affect file handles. Additionally, there is
> use open ':std' => ':locale';
> The default then and now is that Perl does not mess with your file handles.
It shouldn't mess with UTF-8 encoded STDIN/STDOUT either.
It shouldn't assume a charset that's about 20 years out of date when
most systems default to UTF-8 encoding (Windows aside)...
> Perl returns the bytes it reads from the file handle as is. If your
> file handles are expected to have text of a certain encoding, it's up
> to you to decode it or to tell Perl to decode it. Perl has no way of
> knowing whether a file handle is used to transmit text or not, and it
> has no way of knowing the encoding of that text.
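
The two options described above (decode it yourself, or tell Perl to decode it) can be sketched as follows. An in-memory file handle stands in for real input so the example is self-contained; for the standard streams the analogous declaration would be something like binmode(STDIN, ':encoding(UTF-8)'). Variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "cafe" with an accented e, plus newline, as UTF-8 bytes.
my $bytes = "caf\xC3\xA9\n";

# Option 1: read the raw bytes, then decode them explicitly.
open(my $raw, '<', \$bytes) or die "open: $!";
my $line = <$raw>;
close($raw);
my $decoded = decode('UTF-8', $line);

# Option 2: attach a decoding layer so Perl decodes while reading.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open: $!";
my $text = <$in>;
close($in);

# Raw read counts bytes; both decoded forms count characters.
printf "raw: %d, decoded: %d, via layer: %d\n",
    length($line), length($decoded), length($text);   # 6, 5, 5
```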
If the encoding is NOT UTF-8, yes, but I thought perl was fully UTF-8
> IF perl properly pays attention to the environment
> ...it would corrupt data on many file handles.
Never mentioned file handles, I'm talking about <STDIN> and print.
If there is an "asymmetry" in perl, it IS messing with the bytes. The
same bytes should be able to go in as come out... asymmetry implies
But I see there is confusion about this with others as well...
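
Since the "asymmetry" keeps coming up, here is a minimal sketch of the situation the perlunicode passage seems to describe. It assumes undecoded UTF-8 input gets joined with a string that needs the wider representation; hex_bytes() is again a made-up helper:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $octets);
}

# UTF-8 bytes for "e with acute", read but never decoded: two "characters".
my $undecoded = "\xC3\xA9";

# A character above 0xFF forces the joined string to be upgraded.
my $wide = "\x{263A}";

# On upgrade, each undecoded byte is taken as a Latin-1 character.
my $joined = $undecoded . $wide;

# Encoding for output now encodes those two stray characters again.
my $out = encode('UTF-8', $joined);

printf "characters: %d\n", length($joined);   # 3
printf "output:     %s\n", hex_bytes($out);   # C3 83 C2 A9 E2 98 BA
```

The original two bytes come out as four, which is the double-encoding that shows up when input is never decoded.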