Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+

Karl Williamson
October 15, 2013 21:59
On 10/10/2013 06:12 AM, Victor Efimov via RT wrote:
> And how one should fix code below (both examples example1 and example2)
> to work same way in 5.18 and 5.20 ?
> =====
> use strict;
> use warnings;
> use Encode;
> my %Config = ( default_locale_encoding => 'UTF-8' ); # user supplied
> my $locale_encoding = eval {
> 	require I18N::Langinfo;
> 	my $enc = I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
> 	defined (find_encoding($enc)) ? $enc : undef;
> };
> $locale_encoding ||= $Config{default_locale_encoding};
> binmode STDERR, ":encoding($locale_encoding)";
> open (my $f, "<", "not_a_file") or do {
> 	die decode($locale_encoding, "$!", Encode::DIE_ON_ERR|Encode::LEAVE_SRC);
> }
> =====
> $ perl
> No such file or directory at line 15.
> $ LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl
> Нет такого файла или каталога at line 15.
> =====
> use strict;
> use warnings;
> open (my $f, "<", "not_a_file") or do {
> 	die "$!";
> }
> =====
> $ perl
> No such file or directory at line 4.
> $ LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl
> Нет такого файла или каталога at line 4.

What you want is for $! to work like it's in 'use bytes'.  I can change 
the patch so that it checks for 'use bytes' and if within that scope 
returns without the utf8 flag set.  You would then just need to add a 
'use bytes' to get it to work the same way it always has.
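
Separately from that patch, here is a minimal sketch of a way to get the
same decoded text on both 5.18 and 5.19.2+.  It assumes the only
difference between the two is whether $! already carries the UTF-8 flag.
errno_text() is a hypothetical helper name, not something Perl or the
patch provides, and the encoding is assumed to be computed as in your
example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Perl or of the patch): return an error
# message as a character string whether or not perl already decoded it.
sub errno_text {
    my ($msg, $encoding) = @_;

    # Already flagged as UTF-8 (as 5.19.2+ does under a UTF-8 locale):
    # decoding a second time would be wrong, so return it untouched.
    return $msg if utf8::is_utf8($msg);

    # Otherwise treat it as raw bytes in the given encoding.
    return Encode::decode($encoding, $msg,
                          Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Usage, with $locale_encoding computed as in the quoted example:
#   open(my $f, "<", "not_a_file")
#       or die errno_text("$!", $locale_encoding);
```

A pure-ASCII message is never flagged, so it simply passes through the
decode step unchanged.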

There are people who would disapprove of ever using bytes, which means 
they think the behavior you want is wrong.  I'm not one of them.  I 
think that 'use bytes' should be rare, mostly used in testing, but it 
sometimes is the easiest, clearest way of getting at the bytes that 
comprise a UTF-8-encoded character.  utf8::encode() can be used for
that, but it overwrites its argument in place, and I think its name is
much less clear than 'use bytes'.
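
To make the difference between the two concrete, here is a small sketch.
The variable names are only for illustration.

```perl
use strict;
use warnings;

my $char = "\x{41d}";    # CYRILLIC CAPITAL LETTER EN: one character

# 'use bytes' is lexically scoped, so it only affects this block.
# length() here counts the bytes of the internal UTF-8 representation.
my $byte_len = do { use bytes; length $char };

# utf8::encode() works in place, so operate on a copy to keep $char intact.
my $octets = $char;
utf8::encode($octets);   # $octets now holds the bytes "\xD0\x9D"

printf "characters: %d, bytes: %d, encoded length: %d\n",
    length($char), $byte_len, length($octets);
# prints "characters: 1, bytes: 2, encoded length: 2"
```

With 'use bytes' the original scalar is left alone, whereas
utf8::encode() needs the explicit copy.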

I have tested doing this, and it works.
> On Wed Oct 09 18:46:47 2013, wrote:
>> tl;dr
>> 0) A brief overview of how locales work with Perl is presented
>> 1) $! used to work as if it always was in the scope of both 'use locale'
>> and 'use bytes'
>> 2) The blamed commit removed the 'use bytes' component, breaking code
>> that relied on that; fixing some code that didn't.
>> 3) Many people think that 'use bytes' should be outlawed.  Thus we
>> should take a good hard look before reverting the commit and restoring
>> 'use bytes' behavior.
>> 4) $! now acts (with regard to encoding) as any other scalar does within
>> the scope of 'use locale'.  My proposal is to leave it that way when in
>> that scope.  Thus, it doesn't become an outlier that has to be treated
>> specially.
>> 5) Outside such scope: on systems that have nl_langinfo(), $! would
>> automatically be decoded to UTF-8; otherwise to English (C locale),
>> which the end user could google translate if necessary.
>> 6) An objection has been raised that this creates problems when
>> references to $! are passed, and in XS code where it gets its caller's
>> scope.  But this is no different than any variable that deals with
>> locales.
>> 7) An alternative is to revert this commit (bringing back 'use bytes'
>> behavior), and to create a new variable that always fully decodes.  But
>> that doesn't help code that is in 'use locale'.  There would be no
>> variable that gives correct behavior for that situation (The behavior of
>> the current commit is that correct behavior).  Perhaps another new
>> variable would be created that does what the current commit does,
>> regardless of scope, making 3 variables.  Also, $^E also has this
>> problem, and should have the same solution applied to it as we do to $!.
>>    That would mean 4 new variables would have to be created, making 6
>> variables.  That seems overly ugly, and confusing.
>> ===================================
>> I'd like to start with a brief refresher on Perl and locales.  Every C
>> program always is running in a particular locale.  Absent a setlocale()
>> to the contrary, that locale is the "C" locale, which gives the behavior
>> described in K&R.  But a setlocale() call to something else will cause
>> many libc functions to behave differently.  Under those, theoretically:
>> 1) any particular byte in a string could mean nearly any character (or
>> portion of a character);
>> 2) the language for the text of $! could be anything;
>> 3) etc.
>> There can be single-byte locales, wide character (U16 or U32 usually)
>> locales, and varying character length locales (which UTF-8 is).  Perl
>> has never officially supported anything other than single byte locales.
>>    In practice, almost all published locales have every ASCII-range code
>> point mean the corresponding ASCII character, hence differing only in
>> non-ASCII bytes.  Perl avoids assuming this ASCII correspondence pretty
>> much as best it can.
>> One of the first things that Perl does when it starts up (with a minor
>> exception for embedded Perl, added in the 5.19 series) is to call
>> setlocale(), thus causing the libc functions to change behavior.  The
>> locale that is set is determined from the caller's environment,
>> typically using the LANG or other environment variables. Increasingly,
>> on Linux systems anyway, this is some UTF-8 locale.
>> But Perl isn't supposed to expose the underlying locale outside the
>> scope of 'use locale'.  Various patches in the 5.19 series have fixed
>> all known such leaks except for various POSIX:: functions where it
>> doesn't make sense to hide, and $!.  The rationale for the latter is
>> that $! is for the user of the program, not the programmer, and so
>> should be output in the user's language, as gleaned from his/her locale.
>> What happens if a string scalar is in some locale, and a code point that
>> requires UTF-8 is added to it?  The answer is that this is generally not
>> a good idea to do, but Perl copes by converting the scalar to UTF-8,
>> with the code points below 256 assumed to be what they mean in the
>> (single-byte) locale, even if they require 2 UTF-8 bytes to represent.
>> This means that operations that cross the 255/256 boundary in a UTF-8
>> locale are undefined.  For example, the uppercase of \xFF is \x{178}
>> normally (as in Unicode they are the SMALL and UPPER y with diaeresis
>> respectively), but within the scope of 'use locale' uc("\xFF") remains
>> \xFF, because we don't know what character \xFF really represents in
>> that locale.  In just the ISO-8859 series of locales, it can be U+FF, or
>> U+040F, U+0138, U+2019, or unassigned.  (Note that if we knew that a
>> locale is UTF-8, we would know what \xFF really is, and so could treat
>> things just like non-locale Perl does).
>> That the meaning of characters is context dependent means that when
>> using locale, it generally is not a good idea to pass references to
>> variables.  Correct me if I'm wrong, but I believe this means that XS
>> code gets its caller's lexical scope with regard to this.
>> Until the commit that generated this ticket, $! returned the bytes that
>> comprise the message regardless of whether the message was in UTF-8 or
>> not.  Thus it behaved as if it were in the scope of both 'use locale'
>> and 'use bytes'.  What the commit effectively did was to remove the 'use
>> bytes' behavior, causing $! to behave as any other string scalar does
>> under 'use locale'.  Many people on this list think that we should get
>> rid of 'use bytes'; that its behavior is never desired.  (I'm not one of
>> them BTW, but I think it should be used only very rarely.)  Thus, on the
>> face of it, it is suspect that $! should behave as if it is in 'use
>> bytes', and I'm having a hard time grokking the argument that we should
>> revert back to that.
>> To clarify my proposal (since Victor misunderstood it),  I propose,
>> within 'use locale' scope, leaving the behavior as the commit changed it
>> to.  $! now behaves as other variables in such scope behave; it no
>> longer is an outlier that has to be treated specially.  Outside that
>> scope, I propose to fully decode $! into Perl's internal coding
>> (essentially UTF-8).  The latter would automatically load the needed
>> modules.  If the system did not have nl_langinfo(), I now think that the
>> best thing to do is to output the message in the C locale, yielding it
>> in English, which the user could machine translate.  We are not going to
>> return undef, as Victor suggested, as that would be throwing away
>> potentially crucial information.
>> As I mentioned above, it's not a good idea to pass references to
>> locale-encoded variables.  I don't see how $! is different from other
>> locale variables in its orneriness.  It just comes with the territory.
>> The idea of reverting this commit and having another global variable
>> that does the full decoding harms code within 'use locale' scope.
>> Instead of this variable being a typical scalar there, it becomes an
>> outlier, which has to have special treatment.  We could add a third
>> variable which behaves as the current commit now does to accommodate
>> such code.  This is getting unwieldy.  Whatever behavior we decide to do
>> has to also be applied to $^E.  Now we would then have 6 variables
>> instead of 2.
>> I think my proposal is the least bad of those presented so far.
> ---
> via perlbug:  queue: perl5 status: open
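
For the uc("\xFF") example in the overview quoted above, here is a rough
sketch of the two behaviors.  It assumes the "C" locale for the locale
half, purely so the run does not depend on the caller's environment; the
variable names are illustrative.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Unicode rules: upgrade the string so that it is treated as characters.
my $char = "\xFF";
utf8::upgrade($char);
my $unicode_upper = uc $char;
printf "Unicode rules: U+%04X\n", ord $unicode_upper;   # prints U+0178

# Locale rules: inside 'use locale', case mapping of code points below 256
# is left to the current locale.  "C" is an assumption made for this sketch.
setlocale(LC_CTYPE, "C");
my $byte = "\xFF";
my $locale_upper = do { use locale; uc $byte };
printf "Locale rules:  0x%02X\n", ord $locale_upper;
```

The second result depends on whichever locale is in effect, which is the
point of the quoted paragraph.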
