Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
From: Karl Williamson
Date: March 28, 2014 05:08
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Message ID: 53350414.9000006@khwilliamson.com
In this post, I will just give some new insights I had today.
There are real bugs (even if the others previously mentioned aren't
regarded as such) when "$!" isn't returned with the UTF-8 flag on, and
when $! is stringified to its locale string outside of "use locale" scope.
Consider this one liner:
LC_ALL=zh_CN.utf8 ./perl -Ilib -le 'use utf8; $!=1; die "致命錯誤: $!"'
In blead, it prints, as it should,
Wide character in die at -e line 1
致命錯誤: 不允许的操作 at -e line 1
In 5.18.2 it prints this garbage instead:
Wide character in die at -e line 1
致命錯誤: ä¸å
许çæä½ at -e line 1
The reason is that the program is encoded in utf8, and $! has returned
utf8 (only in the 5.18 case) without setting the utf8 flag, and so Perl
takes the bytes that form $! and upgrades those bytes into utf8 (again).
In other words, it's encoded twice.
(I chose Chinese because its script could not be confused with Western
European characters, and I used Google translate, so the constant
portion of the text may not make sense; I apologize to the Chinese
speakers reading this.)
"use utf8" is not necessary for this. It could be 'die "$prefix: $!"'
where $prefix has its utf8 flag on.
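The double encoding can be reproduced without any locale setup. The
following is only a sketch: $errno_bytes is a stand-in (an assumption,
not the real $!) for what 5.18 returned, namely UTF-8 encoded bytes
with the utf8 flag off, and $prefix plays the role of the flagged
prefix string.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Stand-in for the 5.18 behaviour of $!: UTF-8 encoded bytes, flag off.
my $message_chars = "\x{4E0D}\x{5141}\x{8BB8}\x{7684}\x{64CD}\x{4F5C}";
my $errno_bytes   = encode('UTF-8', $message_chars);

# A prefix that holds characters (utf8 flag on), as "use utf8" would give.
my $prefix = "\x{81F4}\x{547D}\x{932F}\x{8AA4}: ";

# Concatenation treats every byte of $errno_bytes as its own character,
# so the message is encoded a second time on output.
my $broken = $prefix . $errno_bytes;

# Decoding the bytes first keeps the message as six characters.
my $fixed = $prefix . decode('UTF-8', $errno_bytes);

printf "broken: %d characters\n", length $broken;   # 6 + 18 = 24
printf "fixed:  %d characters\n", length $fixed;    # 6 + 6  = 12
```

The 18 bytes of the message count as 18 separate characters in
$broken, which is exactly the garbage that 5.18.2 prints.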
These examples show, once again, the perils of having a scalar that's in
UTF-8, but pretending it's not, even if it's just in a die(). I claim
they conclusively show the brokenness of the 5.18 code.
Another problem with all existing versions is if the $prefix is written
in Latin1. Recall that the default character sets of Perl are ASCII,
Latin1, and full Unicode, each a superset of the previous. So someone
writing in Hungarian might write
./perl -Ilib -le '$!=1; die "fatális hibát: $!"'
(apologies to the Hungarian speakers)
However, suppose this is run in a non-Latin1 locale, such as
LC_ALL=el_GR.iso88597 ./perl -Ilib -le '$!=1; die "fatális hibát: $!"'
The first part of the string is in Latin1, and the 2nd part is in
Latin7. These are not compatible (except for their common ASCII range
and a few punctuation characters). If the terminal is set to display
Latin1, the first part looks ok and the second is garbage, and vice
versa if it is set to display Latin7 (except that the common characters
will look ok in both).
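The mismatch can be made visible by interpreting the same byte string
under both encodings. This is a sketch only: the Greek bytes are a
made-up placeholder word, not an actual system error message, and the
variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as they would reach the terminal: a Latin1 prefix followed by
# a message in ISO 8859-7 (placeholder word, not real strerror output).
my $prefix_bytes  = "fat\xE1lis hib\xE1t: ";      # Latin1 bytes
my $message_bytes = "\xCB\xDC\xE8\xEF\xF2";       # ISO 8859-7 bytes

my $bytes = $prefix_bytes . $message_bytes;

# The terminal must pick one interpretation for the whole string.
my $as_latin1 = decode('ISO-8859-1', $bytes);
my $as_latin7 = decode('ISO-8859-7', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "read as Latin1: $as_latin1\n";
print "read as Latin7: $as_latin7\n";
```

Read as Latin1, the prefix is right and the Greek word is garbage; read
as Latin7, the Greek word is right but each byte 0xE1 in the prefix
shows up as a Greek alpha.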
There is no current way for an application to guard against this; it is
a sitting duck. $! always comes out in the underlying locale. (The
reason this doesn't show up more often is apparently that people write
their prefix messages in English, hence ASCII, and all the locales,
like ISO 8859-7, are supersets of ASCII.)
I claim this shows the perils of having stuff appear in the underlying
locale outside the scope of 'use locale'. An unsuspecting application
that doesn't even know that locales exist can be hit by the user's
environment passing in a locale, or by any module somewhere in the tool
chain doing a setlocale().
I believe the solution is to make $! return the C locale messages
outside the scope of 'use locale', just like the other categories. By
being in such scope, the caller is indicating its willingness to handle
and be smart about locale issues. Otherwise it shouldn't have to be
exposed to them.
My recent proposal also works. That is to use the $! locale value
provided it is all ASCII. That means that a fair number of system
messages in various European languages will come out natively, but not
those that might adversely affect things like ack. The problem with
this is that the application still doesn't have control.
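The ASCII-only rule amounts to a single check. The sketch below is a
hypothetical helper, not code from perl itself; message_if_ascii and
its arguments are assumed names, and the sample strings are
illustrative rather than verified system messages.

```perl
use strict;
use warnings;

# Hypothetical helper: use the locale's message only when it is pure
# ASCII, otherwise fall back to the C-locale text.
sub message_if_ascii {
    my ($locale_text, $c_text) = @_;
    return $locale_text =~ /\A[\x00-\x7F]*\z/ ? $locale_text : $c_text;
}

# An all-ASCII native message passes through unchanged.
print message_if_ascii("Operation nicht erlaubt",
                       "Operation not permitted"), "\n";

# A message containing a non-ASCII byte falls back to the C text.
print message_if_ascii("Op\xE9ration non permise",
                       "Operation not permitted"), "\n";
```

The decision is made from the message text alone, so the calling
application has no say in which form it receives.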
Note that in the messages above, Perl itself outputs its warnings and
messages like "at -e line 1" in English. Nobody has any control over
that, and I can't believe this fact hasn't discouraged some
applications from using Perl in non-English settings.
What part of CPAN is expecting native-language $! ? I don't know, but
given the vagaries, including some things always being in English, and
being at the mercy of the user's locale environment, I suspect not much.