
Re: unicode question

From: Linda W
Date: April 26, 2012 01:26
Subject: Re: unicode question
Message ID: 4F990699.5000407@tlinx.org

Eric Brine wrote:
> On Wed, Apr 25, 2012 at 9:12 PM, Linda W <perl-diddler@tlinx.org> wrote:
>
>     I read this in my 5.14 documentation (in man page, for
>     perlunicode, my p#\) added).
>
>     1) "use encoding" needed to upgrade non-Latin-1 byte strings
>        By default, there is a fundamental asymmetry in Perl's Unicode
>        model: implicit upgrading from byte strings to Unicode strings
>        assumes that they were encoded in ISO 8859-1 (Latin-1), but
>        Unicode strings are downgraded with UTF-8 encoding. This happens
>        because the first 256 codepoints in Unicode happens to agree with
>        Latin-1
>
>
> (1) refers to how Perl behaves in response to bugs in user code.
---
    ???  Bugs in user code?  The first 256 code points don't agree!  The
first 128 code points agree.  But at encoding 0x80, you have to go to a 2-byte
encoding to save everything.  I don't understand when you say
'downgraded', as downgrading implies a loss of information, whereas
UTF-8 can hold all of
Unicode.  ISO-8859-1 only holds 256 bytes, the latter half of which are not
Unicode compatible because they have the high bit set.
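
For reference, a minimal sketch of the distinction between a code point and its encoded bytes, using the core Encode module. The character U+00E9 and the variable names are chosen purely for illustration; the last part shows what the quoted documentation calls implicit upgrading, where undecoded bytes are treated as Latin-1 characters.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Helper (hypothetical name): render a byte string as hex pairs.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

my $char = chr(0xE9);                              # U+00E9, e with acute

my $latin1_bytes = encode('ISO-8859-1', $char);    # one byte
my $utf8_bytes   = encode('UTF-8',      $char);    # two bytes

print "Latin-1 bytes: ", hex_bytes($latin1_bytes), "\n";
print "UTF-8 bytes:   ", hex_bytes($utf8_bytes),   "\n";

# Implicit upgrade: joining undecoded UTF-8 bytes with a wide character
# treats each byte as one Latin-1 character, so the result has 3 characters.
my $implicit = $utf8_bytes . "\x{263A}";

# Explicit decode first: the two bytes become one character, so 2 in total.
my $explicit = decode('UTF-8', $utf8_bytes) . "\x{263A}";

print "implicit length: ", length($implicit), "\n";
print "explicit length: ", length($explicit), "\n";
```

The byte 0xE9 and the code point U+00E9 share a number, but only the Latin-1 encoding stores that code point as that single byte.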

If Perl interprets **STDIN** (not an arbitrary file opened with 'open', but
standard streamed input) from an all-UTF-8 environment, then the assumption
should be UTF-8 encoding.

To do otherwise is going to cause problems.


>
> (3) refers to the fixing (when C<< use feature 'unicode_strings'; >> 
> is used) of most instances of a collection of bugs in Perl known as 
> "The Unicode Bug".
>
> They are not related.

What I didn't understand was why it is fixed in 5.14 but with a 'use 5.12'
statement.

I.e. wasn't it fixed in 5.12?  If it wasn't fixed until 5.14, then why isn't
it a use 5.14 that triggers the new behavior?
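
For context, a minimal sketch of what the 'unicode_strings' feature changes, as far as I understand it: the feature name first appeared in 5.12 (and 'use 5.012' enables it as part of that version's feature bundle), where it covered case-changing functions such as uc, while 5.14 extended it to more operations. The variable names are illustrative only.

```perl
use strict;
use warnings;

# A one-character string held as a single byte: 0xE0 is "a with grave"
# when read as Latin-1 / Unicode code point U+00E0.
my $byte_string = "\xE0";

# Without the feature, uc() applies byte (ASCII-only) rules to this string
# and leaves 0xE0 alone.
my $without = uc $byte_string;

# With the feature (lexically scoped to this block), uc() applies Unicode
# rules and maps U+00E0 to U+00C0.
my $with = do {
    use feature 'unicode_strings';
    uc $byte_string;
};

printf "without feature: %02X\n", ord $without;
printf "with feature:    %02X\n", ord $with;
```

The feature is lexical, which is why the two calls can sit in the same file and behave differently.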



>
>
>     I.e. If I'm on a UTF-8 system, and my env is set for UTF-8:
>
>
> Your locale only affects Perl when C<< use locale >> is in effect, and 
> even then, it doesn't affect file handles. Additionally, there is
>
> use open ':std' => ':locale';
?!!?!?


>
> The default then and now is that Perl does not mess with your file handles.
It shouldn't.

It shouldn't mess with UTF-8 encoded STDIN/STDOUT either.

It shouldn't assume a charset that's about 20 years out of date when 
most systems default to UTF-8 encoding (Windows aside)...


> Perl returns the bytes it reads from the file handle as is. If your 
> file handles are expected to have text of a certain encoding, it's up 
> to you to decode it or to tell Perl to decode it. Perl has no way of 
> knowing whether a file handle is used to transmit text or not, and it 
> has no way of knowing the encoding of that text.
----
   If the encoding is NOT UTF-8, yes, but I thought perl was fully UTF-8
compliant now...?
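
A minimal sketch of the "tell Perl to decode it" option described in the quoted text, using an in-memory file handle as a stand-in for STDIN so that it runs on its own. For the real standard streams, the equivalent would be something like binmode(STDIN, ':encoding(UTF-8)') or the 'use open' line quoted earlier. The sample data and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Raw input as it would arrive on a UTF-8 terminal: "cafe" with an acute
# accent on the e, which UTF-8 encodes as the two bytes C3 A9.
my $raw = "caf\xC3\xA9\n";

# No layer: Perl hands back the bytes unchanged, so the line is 5 "characters".
open(my $in_bytes, '<', \$raw) or die "open failed: $!";
my $as_bytes = <$in_bytes>;
chomp $as_bytes;
close $in_bytes;

# With an explicit decoding layer: the two bytes become one character.
open(my $in_text, '<:encoding(UTF-8)', \$raw) or die "open failed: $!";
my $as_text = <$in_text>;
chomp $as_text;
close $in_text;

print "undecoded length: ", length($as_bytes), "\n";
print "decoded length:   ", length($as_text),  "\n";
```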


>
>     IF perl properly pays attention to the environment
>
>
> ...it would corrupt data on many file handles.
---
    Never mentioned file handles, I'm talking about <[STDIN]> and print 
[STDOUT/STDERR].

If there is an "asymmetry" in perl, it IS messing with the bytes.  The
same bytes should be able to come out as go in... asymmetry implies
this isn't the case.


But I see there is confusion about this with others as well...







