Re: unicode question
From: Brian Fraser
April 25, 2012 20:51
Message ID: CA+nL+nYo1zo8vHVOj0YSw8ygquCmE7rih67UHxK9jBO2zOo2nw@mail.gmail.com
On Wed, Apr 25, 2012 at 10:12 PM, Linda W <firstname.lastname@example.org> wrote:
> I read this in my 5.14 documentation (in the man page for perlunicode;
> paragraph numbers are mine).
> 1) "use encoding" needed to upgrade non-Latin-1 byte strings
> By default, there is a fundamental asymmetry in Perl's Unicode
> model: implicit upgrading from byte strings to Unicode strings
> assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode
> strings are downgraded with UTF-8 encoding. This happens because
> the first 256 codepoints in Unicode happens to agree with Latin-1
Side comment, but eep, are the docs still suggesting that people do 'use
encoding'? I don't think they should.
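To make the asymmetry in the quoted paragraph concrete, here is a minimal sketch (the variable names are mine, not from perlunicode). It assumes the byte string holds UTF-8 data that was never decoded. When that string is joined to a character string, Perl upgrades it as if it were Latin-1, so each byte becomes its own character. Decoding explicitly with Encode avoids that:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";     # UTF-8 encoding of U+00E9, still raw bytes
my $chars = "\x{263A}";     # a character string (code point above 255)

# Implicit upgrade: each byte of $bytes is treated as a Latin-1 character.
my $joined       = $bytes . $chars;
my $len_implicit = length $joined;    # U+00C3, U+00A9, U+263A

# Explicit decode: the two bytes become the single character U+00E9.
my $decoded     = decode('UTF-8', $bytes) . $chars;
my $len_decoded = length $decoded;    # U+00E9, U+263A

print "implicit: $len_implicit, decoded: $len_decoded\n";
```

The implicit version ends up one character longer, which is the usual source of doubled-up mojibake.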
> 1.1) See "Byte and Character Semantics" for more details.
> 2) Byte and Character Semantics
> Beginning with version 5.6, Perl uses logically-wide characters to
> represent strings internally.
> 3) Starting in Perl 5.14, Perl-level operations work with characters
> rather than bytes within the scope of a "use feature 'unicode_strings'"
> (or equivalently "use 5.012" or higher). (This is not true if bytes
> have been explicitly requested by "use bytes", nor necessarily true
> for interactions with the platform's operating system.)
> Ok. If I understand the above correctly, then starting in 5.14 -- but
> triggered by 5.012? 5.14.0 (or 5.014?)... and NOT in 5.12, (?!?! what
> happens there, if not the same, why is the trigger 5.12?)
unicode_strings did not apply to all operations in 5.12. If I recall
correctly only the regex engine was affected in 5.12? Anyway, use 5.012;
(or use v5.12;) or later (use 5.014; use 5.016; yadda) will implicitly
enable unicode_strings, as if you had explicitly done 'use feature
':5.12';', or 'use feature qw( say switch state unicode_strings );'
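A rough sketch of what the feature changes, assuming a 5.14 or later perl and a string that has not been upgraded internally (the variable names are just for illustration). Without unicode_strings, uc() leaves the non-ASCII byte alone; inside its scope the same byte is treated as the character U+00E9 and gets uppercased:

```perl
use strict;
use warnings;

my $byte_str = "\xE9";    # e with acute accent, stored as one native byte

my ($without, $with);
{
    no feature 'unicode_strings';
    $without = uc $byte_str;    # byte semantics: nothing above ASCII changes
}
{
    use feature 'unicode_strings';
    $with = uc $byte_str;       # character semantics: U+00E9 becomes U+00C9
}

printf "without: %02X, with: %02X\n", ord $without, ord $with;
```

Swapping the second block's pragma for 'use 5.012;' should behave the same way, since that enables the feature implicitly.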
> Then the statement in paragraph 1 about perl having fundamental asymmetric
> problems, no longer applies?
In 5.14+ under unicode_strings, that's mostly right. It's all mostly
treated as UTF-8, syscalls aside. See "The Unicode Bug" and "When Unicode
Does Not Happen" in perlunicode.
> I.e. If I'm on a UTF-8 system, and my env is set for UTF-8:
> > locale
> Then the default will NOW be UTF-8??? (and it wasn't before?!?!?!)
Nope. locales have nothing to do with it. Your locales will only come into
effect if you do 'use locale', or slightly more sanely in 5.16 if you do
'use locale ":not_characters"'. Moreover, after you set all those vars, what
did you expect to be affected? The regex engine? Operations on strings?
IO? How @ARGV is decoded? How the source code is parsed? All/some?
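Since none of that follows from the environment, each piece has to be asked for explicitly. A minimal sketch, assuming the incoming arguments are UTF-8 bytes; decode_utf8_args is a hypothetical helper, and the @raw list stands in for what would arrive in @ARGV:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a list of raw byte strings as UTF-8.
sub decode_utf8_args {
    return map { decode('UTF-8', $_) } @_;
}

my @raw   = ("na\xC3\xAFve");          # bytes, as they would arrive in @ARGV
my ($arg) = decode_utf8_args(@raw);

my $raw_len = length $raw[0];          # counts bytes
my $arg_len = length $arg;             # counts characters

# Output is encoded explicitly too; the locale is not consulted.
binmode(STDOUT, ':encoding(UTF-8)');
print "$arg ($raw_len bytes, $arg_len characters)\n";
```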
> IF perl properly pays attention to the environment, as I thought it was
> documented to do in 5.8, great.
I have never really used 5.8, but it was my impression that the Unicode
model in 5.8.0 was different from the model of later versions. So when
someone tells you something about 5.8, you ask what sub-version, and by
which vendor :D
> If not... er..*OUCH* (that hurts) -- I've thought perl was fully unicode
> in a UTF-8 env since 5.8... when I was told it was... (me<-gullible).
> So was I an idiot for drinking the koolaid instead of the fine print (a bit
> dry to quench thirst)? Is it fixed now?
We are all idiots drinking the koolaid when it comes to Unicode. In that
regard, http://stackoverflow.com/a/6163129 as well as the perlunicook
manpage (new in 5.16!) are _very_ nice resources.