
Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+

From: Karl Williamson
Date: March 28, 2014 05:08
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Message ID: 53350414.9000006@khwilliamson.com

In this post, I will just give some new insights I had today.

There are real bugs (even if the others previously mentioned aren't 
regarded as such) when "$!" isn't returned with the UTF-8 flag on, and 
when $! is stringified to its locale string outside of "use locale" scope.

Consider this one-liner:

LC_ALL=zh_CN.utf8 ./perl -Ilib -le 'use utf8; $!=1; die "致命錯誤: $!"'

In blead, it prints, as it should,
Wide character in die at -e line 1
致命錯誤: 不允许的操作 at -e line 1

In 5.18.2 it prints this garbage instead
Wide character in die at -e line 1
致命錯誤: 不允许的操作 at -e line 1

The reason is that the program is encoded in utf8, and $! has returned 
utf8 (only in the 5.18 case) without setting the utf8 flag, and so Perl 
takes the bytes that form $! and upgrades those bytes into utf8 (again). 
In other words, the text is being encoded twice.
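
The double encoding can be reproduced without a Chinese locale.  What 
follows is only a sketch: it uses the Encode module to build a byte 
string that stands in for what 5.18 returns for $!, and the variable 
names are mine, not anything from the perl source.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string with the utf8 flag on, standing in for the prefix
# (the four characters plus ": " from the one-liner above).
my $prefix = "\x{81F4}\x{547D}\x{932F}\x{8AA4}: ";

# Raw UTF-8 bytes with the flag off, standing in for the 5.18 value of $!
# (the six characters of the zh_CN message above).
my $errno_bytes = encode('UTF-8',
    "\x{4E0D}\x{5141}\x{8BB8}\x{7684}\x{64CD}\x{4F5C}");

# Concatenation upgrades the byte string as if every byte were a
# Latin1 character, so each byte becomes a separate character.
my $broken = $prefix . $errno_bytes;

# Decoding the bytes first yields a proper character string.
my $fixed = $prefix . decode('UTF-8', $errno_bytes);

printf "broken: %d characters\n", length $broken;
printf "fixed:  %d characters\n", length $fixed;
```

The broken string is longer because each of the 18 bytes of the message 
has become a character in its own right; writing it to a UTF-8 handle 
then encodes those characters a second time.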

(I chose Chinese because its script could not be confused with Western 
European characters, and I used Google translate, so the constant 
portion of the text may not make sense; I apologize to the Chinese 
speakers reading this.)

"use utf8" is not necessary for this.  It could be 'die "$prefix: $!"' 
where $prefix has its utf8 flag on.

These examples show, once again, the perils of having a scalar that's in 
UTF-8, but pretending it's not, even if it's just in a die().  I claim 
they conclusively show the brokenness of the 5.18 code.

Another problem, in all existing versions, arises when the $prefix is 
written in Latin1.  Recall that the default character sets of Perl are 
ASCII, Latin1, and full Unicode, each a superset of the previous.  So 
someone writing in Hungarian might write

./perl -Ilib -le '$!=1; die "fatális hibát: $!"'

(apologies to the Hungarian speakers)

If, however, this is run in a non-Latin1 locale, such as

LC_ALL=el_GR.iso88597 ./perl -Ilib -le '$!=1; die "fatális hibát: $!"'

the first part of the string is in Latin1, and the 2nd part is in 
ISO 8859-7 (Greek).  These are not compatible (except for their common 
ASCII range and a few punctuation characters).  If the terminal is set 
to display Latin1, the first part looks ok and the second is garbage, 
and vice versa (except the common characters will look ok in both).
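
To make the incompatibility concrete, here is a small sketch using the 
Encode module.  It decodes the same single byte under ISO 8859-1 and 
under ISO 8859-7 (the encoding of the el_GR.iso88597 locale above); the 
choice of byte 0xE1, the accented "a" in "fatális", is mine.

```perl
use strict;
use warnings;
use Encode qw(decode);

# 0xE1 is "a with acute" in ISO 8859-1, as in "fatális".
my $byte = "\xE1";

my $as_latin1 = decode('iso-8859-1', $byte);
my $as_greek  = decode('iso-8859-7', $byte);

printf "as ISO 8859-1: U+%04X\n", ord $as_latin1;  # LATIN SMALL LETTER A WITH ACUTE
printf "as ISO 8859-7: U+%04X\n", ord $as_greek;   # GREEK SMALL LETTER ALPHA

# An ASCII byte means the same thing in both, which is why English
# prefixes do not show the problem.
my $ascii_same = decode('iso-8859-1', "a") eq decode('iso-8859-7', "a");
print "ASCII byte agrees: ", ($ascii_same ? "yes" : "no"), "\n";
```

A terminal can only pick one of the two interpretations for the whole 
line, so one half of the message is bound to be wrong.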

There is no current way for an application to guard against this; it is 
a sitting duck.  $! always comes out in the underlying locale.  (The 
reason this doesn't show up more often is apparently that people write 
their prefix messages in English, hence ASCII, and all the locales, like 
88597, are supersets of ASCII.)

I claim this shows the perils of having stuff appear in the underlying 
locale outside the scope of 'use locale'.  An unsuspecting application 
that doesn't even know that locales exist can be hit by the user's 
environment passing in a locale, or by any module somewhere in the tool 
chain doing a setlocale().

I believe the solution is to make $! return the C locale messages 
outside the scope of 'use locale', just like the other categories.  By 
being in such scope, the caller is indicating its willingness to handle 
and be smart about locale issues.  Otherwise it shouldn't have to be 
exposed to them.

My recent proposal also works.  That is to use the $! locale value 
provided it is all ASCII.  That means that a fair number of system 
messages in various European languages will come out natively, but not 
those that might adversely affect things like ack.  The problem with 
this is that the application still doesn't have control.
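
The all-ASCII test in that proposal amounts to something like the 
following sketch.  choose_errno_text() is a hypothetical helper written 
for illustration, not anything in perl, and the sample strings are made 
up rather than taken from a real message catalog.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
# $locale_msg: the message bytes as the underlying locale supplied them.
# $c_msg:      the message the C locale would give.
sub choose_errno_text {
    my ($locale_msg, $c_msg) = @_;

    # Use the native text only if every byte is in the ASCII range.
    return $locale_msg =~ /\A[\x00-\x7F]*\z/ ? $locale_msg : $c_msg;
}

# All ASCII, so the native text is passed through.
print choose_errno_text("Operatie niet toegestaan",
                        "Operation not permitted"), "\n";

# Contains the non-ASCII byte 0xE9, so the C locale text is used.
print choose_errno_text("Op\xE9ration non permise",
                        "Operation not permitted"), "\n";
```

Either way the choice is made for the application rather than by it, 
which is the drawback noted above.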

Note that in the messages above, Perl itself outputs its warnings, and 
text like "at -e line 1", in English.  Nobody has any control over 
that, and I can't believe this fact hasn't discouraged some applications 
from using Perl in non-English settings.

What part of CPAN is expecting native-language $!?  I don't know, but 
given the vagaries, including some things always being in English, and 
being at the mercy of the user's locale environment, I suspect not much.

