
Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+

From: Karl Williamson
Date: October 10, 2013 01:46
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Message ID: 525606DA.7000809@khwilliamson.com

On 09/20/2013 09:11 PM, Father Chrysostomos via RT wrote:
> On Mon Sep 16 09:05:17 2013, public@khwilliamson.com wrote:
>> On 09/09/2013 07:06 PM, Karl Williamson wrote:
>>> On 09/02/2013 05:10 PM, Victor Efimov wrote:
>>>>
>>>> 2013/9/1 Father Chrysostomos via RT <perlbug-followup@perl.org
>>>> <mailto:perlbug-followup@perl.org>>
>>>>
>>>>      A new global variable is another option.
>>>>
>>>> perhaps ${^DECODED_ERROR} ?
>>>
>>>
>>> I have come to believe that this is probably the best way forward.  That
>>> is, revert the $! change, and tell people who need it to use the new
>>> global variable which will decode as best it can on the given platform
>>> based on the locale in effect.
>>>
>>
>> In looking at this, I thought of something else.  I do believe that the
>> current behavior is correct for such a variable within the lexical scope
>> of "use locale".  But outside such scope the behavior would be to decode
>> fully, as best as practicable on the platform being run on.
>>
>> Then it occurred to me: would merely changing $! (and $^E) to behave this
>> way address your issues?  It is a change in behavior from the way things
>> have always been, but outside "use locale", it would fully decode, which
>> someone in the thread said was the issue with the current fix.
>
> I was the one who implied that.  What I meant was that, if decoding
> happens unconditionally, at least one can check the Perl version to
> determine how to handle $!.  It is still backward-incompatible.  I was
> then going to suggest lexically scoping the new behaviour, but Zefram
> has already pointed out why that is not a good idea.  A new global
> variable is the best choice at this point.
>
>

tl;dr

0) A brief overview of how locales work with Perl is presented
1) $! used to work as if it always was in the scope of both 'use locale' 
and 'use bytes'
2) The blamed commit removed the 'use bytes' component, breaking code 
that relied on it and fixing some code that didn't.
3) Many people think that 'use bytes' should be outlawed.  Thus we 
should take a good hard look before reverting the commit and restoring 
'use bytes' behavior.
4) $! now acts (with regard to encoding) as any other scalar does within 
the scope of 'use locale'.  My proposal is to leave it that way when in 
that scope.  Thus, it doesn't become an outlier that has to be treated 
specially.
5) Outside such scope: on systems that have nl_langinfo(), $! would 
automatically be decoded to UTF-8; otherwise it would be returned in 
English (C locale), which the end user could machine-translate if 
necessary.
6) An objection has been raised that this creates problems when 
references to $! are passed, and in XS code where it gets its caller's 
scope.  But this is no different than any variable that deals with locales.
7) An alternative is to revert this commit (bringing back 'use bytes' 
behavior), and to create a new variable that always fully decodes.  But 
that doesn't help code that is in 'use locale'.  There would be no 
variable that gives correct behavior for that situation (The behavior of 
the current commit is that correct behavior).  Perhaps another new 
variable would be created that does what the current commit does, 
regardless of scope, making 3 variables.  $^E has the same problem, and 
should get the same solution as $!.  That would mean 4 new variables 
would have to be created, making 6 variables in all.  That seems overly 
ugly and confusing.

===================================

I'd like to start with a brief refresher on Perl and locales.  Every C 
program always is running in a particular locale.  Absent a setlocale() 
to the contrary, that locale is the "C" locale, which gives the behavior 
described in K&R.  But a setlocale() call to something else will cause 
many libc functions to behave differently.  Under those, theoretically:
1) any particular byte in a string could mean nearly any character (or 
portion of a character);
2) the language for the text of $! could be anything;
3) etc.
There can be single-byte locales, wide character (U16 or U32 usually) 
locales, and variable-length character locales (which UTF-8 is).  Perl 
has never officially supported anything other than single-byte locales. 
In practice, almost all published locales have every ASCII-range code 
point mean the corresponding ASCII character, hence differing only in 
non-ASCII bytes.  Perl avoids assuming this ASCII correspondence as far 
as it practically can.

One of the first things that Perl does when it starts up (with a minor 
exception for embedded Perl, added in the 5.19 series) is to call 
setlocale(), thus causing the libc functions to change behavior.  The 
locale that is set is determined from the caller's environment, 
typically using the LANG or other environment variables. Increasingly, 
on Linux systems anyway, this is some UTF-8 locale.
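
As a rough illustration of the startup behavior just described, the
following sketch queries the LC_CTYPE locale the process ended up with,
alongside the environment variables that conventionally select it.  The
choice of variables to print is an assumption for illustration; the
result naturally depends on the environment the script is run in.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# With a single argument, setlocale() only queries the current setting;
# it does not change the locale.
my $ctype = setlocale(LC_CTYPE);

# Environment variables that conventionally select the locale.
for my $name (qw(LC_ALL LC_CTYPE LANG)) {
    printf "%-8s = %s\n", $name,
        defined $ENV{$name} ? $ENV{$name} : '(unset)';
}

printf "LC_CTYPE in effect: %s\n",
    defined $ctype ? $ctype : '(unknown)';
```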

But Perl isn't supposed to expose the underlying locale outside the 
scope of 'use locale'.  Various patches in the 5.19 series have fixed 
all known such leaks except for various POSIX:: functions where it 
doesn't make sense to hide, and $!.  The rationale for the latter is 
that $! is for the user of the program, not the programmer, and so 
should be output in the user's language, as gleaned from his/her locale.

What happens if a string scalar is in some locale, and a code point that 
requires UTF-8 is added to it?  The answer is that this is generally not 
a good idea to do, but Perl copes by converting the scalar to UTF-8, 
with the code points below 256 assumed to be what they mean in the 
(single-byte) locale, even if they require 2 UTF-8 bytes to represent. 
This means that operations that cross the 255/256 boundary in a UTF-8 
locale are undefined.  For example, the uppercase of \xFF is normally 
\x{178} (in Unicode they are the small and capital y with diaeresis, 
respectively), but within the scope of 'use locale' uc("\xFF") remains 
\xFF, because we don't know what character \xFF really represents in 
that locale.  In just the ISO-8859 series of locales, it can be U+00FF, 
U+045F, U+0138, U+2019, or unassigned.  (Note that if we knew that a 
locale is UTF-8, we would know what \xFF really is, and so could treat 
things just like non-locale Perl does.)
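
The ambiguity of a single byte can be sketched without touching the
system locale at all.  The example below (an illustration, not part of
the original message) uses the core Encode module to decode the byte
0xFF under three of the ISO-8859 charsets, then shows the Unicode-rules
uppercase of \xFF.  The 'use locale' case is deliberately left out,
since its result depends on which locales are installed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same byte, interpreted under three single-byte charsets.
my %meaning;
for my $charset (qw(iso-8859-1 iso-8859-10 iso-8859-13)) {
    my $char = decode($charset, "\xFF");
    $meaning{$charset} = sprintf "U+%04X", ord $char;
    printf "%-12s 0xFF => %s\n", $charset, $meaning{$charset};
}
# iso-8859-1   0xFF => U+00FF
# iso-8859-10  0xFF => U+0138
# iso-8859-13  0xFF => U+2019

# Outside 'use locale', a scalar stored as UTF-8 gets Unicode case
# rules, so uc() crosses the 255/256 boundary.
my $lower = "\xFF";
utf8::upgrade($lower);
my $upper = uc $lower;
printf "uc(\"\\xFF\") => U+%04X\n", ord $upper;   # U+0178
```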

That the meaning of characters is context dependent means that when 
using locale, it generally is not a good idea to pass references to 
variables.  Correct me if I'm wrong, but I believe this means that XS 
code gets its caller's lexical scope with regard to this.

Until the commit that generated this ticket, $! returned the bytes that 
comprise the message regardless of whether the message was in UTF-8 or 
not.  Thus it behaved as if it were in the scope of both 'use locale' 
and 'use bytes'.  What the commit effectively did was to remove the 'use 
bytes' behavior, causing $! to behave as any other string scalar does 
under 'use locale'.  Many people on this list think that we should get 
rid of 'use bytes'; that its behavior is never desired.  (I'm not one of 
them BTW, but I think it should be used only very rarely.)  Thus, on the 
face of it, it is suspect that $! should behave as if it is in 'use 
bytes', and I'm having a hard time grokking the argument that we should 
revert to that.
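
To make the 'use bytes' comparison concrete, here is a minimal sketch
(not from the original message) of how the same scalar looks with and
without it.  The sample string merely stands in for a localized error
message; it is an assumption, not real $! output.

```perl
use strict;
use warnings;

# Sample text standing in for a localized message (an assumption).
my $msg = "ung\x{FC}ltig";      # 8 characters, one of them non-ASCII
utf8::upgrade($msg);            # store it internally as UTF-8

# Character semantics: what ordinary string scalars give you.
my $chars = length $msg;

# 'use bytes' exposes the internal UTF-8 bytes instead.
my $bytes = do { use bytes; length $msg };

print "characters: $chars\n";   # characters: 8
print "bytes:      $bytes\n";   # bytes:      9
```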

To clarify my proposal (since Victor misunderstood it): within 'use 
locale' scope, I propose leaving the behavior as the commit changed it. 
$! now behaves as other variables in such scope behave; it is no longer 
an outlier that has to be treated specially.  Outside that scope, I 
propose to fully decode $! into Perl's internal encoding (essentially 
UTF-8).  The latter would automatically load the needed 
modules.  If the system did not have nl_langinfo(), I now think that the 
best thing to do is to output the message in the C locale, yielding it 
in English, which the user could machine translate.  We are not going to 
return undef, as Victor suggested, as that would be throwing away 
potentially crucial information.
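
A rough sketch of what "fully decode" could look like from pure Perl
follows.  Everything here is an assumption for illustration: the helper
names locale_codeset() and decode_message() are hypothetical, the sample
bytes stand in for a real message so that the result does not depend on
the Perl version or installed locales, and the fallback simply leaves
the bytes alone rather than modeling the C-locale fallback proposed
above.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: ask libc for the current locale's codeset.
# Returns undef where nl_langinfo() (I18N::Langinfo) is unavailable.
sub locale_codeset {
    my $codeset = eval {
        require I18N::Langinfo;
        I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
    };
    return $codeset;
}

# Hypothetical helper: turn message bytes into a character string.
# If the codeset is missing or unknown to Encode, return the bytes as is.
sub decode_message {
    my ($bytes, $codeset) = @_;
    my $enc = (defined $codeset && length($codeset))
        ? find_encoding($codeset)
        : undef;
    return defined $enc ? $enc->decode($bytes) : $bytes;
}

# Sample bytes standing in for a message from a UTF-8 locale.
my $raw  = "ung\xC3\xBCltig";
my $text = decode_message($raw, 'UTF-8');

printf "raw:     %d bytes\n", length $raw;
printf "decoded: %d characters, UTF-8 flag %s\n",
    length $text, utf8::is_utf8($text) ? 'on' : 'off';

my $codeset = locale_codeset();
printf "codeset reported for this process: %s\n",
    defined $codeset ? $codeset : '(nl_langinfo unavailable)';
```

In real use one would pass locale_codeset() as the second argument to
decode_message(); the fixed 'UTF-8' above only keeps the demonstration
reproducible.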

As I mentioned above, it's not a good idea to pass references to 
locale-encoded variables.  I don't see how $! is different from other 
locale variables in its orneriness.  It just comes with the territory.

The idea of reverting this commit and having another global variable 
that does the full decoding harms code within 'use locale' scope. 
Instead of $! being a typical scalar there, it becomes an outlier that 
needs special treatment.  We could add a third variable that behaves as 
the current commit does to accommodate such code, but this is getting 
unwieldy.  Whatever behavior we decide on also has to be applied to $^E, 
so we would then have 6 variables instead of 2.

I think my proposal is the least bad of those presented so far.


