Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
From: Karl Williamson
Date: October 10, 2013 01:46
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Message ID: 525606DA.7000809@khwilliamson.com
On 09/20/2013 09:11 PM, Father Chrysostomos via RT wrote:
> On Mon Sep 16 09:05:17 2013, public@khwilliamson.com wrote:
>> On 09/09/2013 07:06 PM, Karl Williamson wrote:
>>> On 09/02/2013 05:10 PM, Victor Efimov wrote:
>>>>
>>>> 2013/9/1 Father Chrysostomos via RT <perlbug-followup@perl.org
>>>> <mailto:perlbug-followup@perl.org>>
>>>>
>>>> A new global variable is another option.
>>>>
>>>> perhaps ${^DECODED_ERROR} ?
>>>
>>>
>>> I have come to believe that this is probably the best way forward. That
>>> is, revert the $! change, and tell people who need it to use the new
>>> global variable which will decode as best it can on the given platform
>>> based on the locale in effect.
>>>
>>
>> In looking at this, I thought of something else. I do believe that the
>> current behavior is correct for such a variable within the lexical scope
>> of "use locale". But outside such scope the behavior would be to decode
>> fully, as best as practicable on the platform being run on.
>>
>> Then it occurred to me: would merely changing $! (and $^E) to behave this
>> way address your issues? It is a change in behavior from the way things
>> have always been, but outside "use locale", it would fully decode, which
>> someone in the thread said was the issue with the current fix.
>
> I was the one who implied that. What I meant was that, if decoding
> happens unconditionally, at least one can check the Perl version to
> determine how to handle $!. It is still backward-incompatible. I was
> then going to suggest lexically scoping the new behaviour, but Zefram
> has already pointed out why that is not a good idea. A new global
> variable is the best choice at this point.
>
>
tl;dr
0) A brief overview of how locales work with Perl is presented
1) $! used to work as if it always was in the scope of both 'use locale'
and 'use bytes'
2) The blamed commit removed the 'use bytes' component, breaking code
that relied on that; fixing some code that didn't.
3) Many people think that 'use bytes' should be outlawed. Thus we
should take a good hard look before reverting the commit and restoring
'use bytes' behavior.
4) $! now acts (with regard to encoding) as any other scalar does within
the scope of 'use locale'. My proposal is to leave it that way when in
that scope. Thus, it doesn't become an outlier that has to be treated
specially.
5) Outside such scope: on systems that have nl_langinfo(), $! would
automatically be decoded to UTF-8; otherwise to English (C locale),
which the end user could google translate if necessary.
6) An objection has been raised that this creates problems when
references to $! are passed, and in XS code where it gets its caller's
scope. But this is no different than any variable that deals with locales.
7) An alternative is to revert this commit (bringing back 'use bytes'
behavior), and to create a new variable that always fully decodes. But
that doesn't help code that is in 'use locale': no variable would give
the correct behavior for that situation (which is what the current
commit provides). Perhaps another new variable would be created that
does what the current commit does, regardless of scope, making 3
variables. $^E has the same problem, and should get the same solution
as $!. That would mean 4 new variables would have to be created, making
6 variables in all. That seems overly ugly and confusing.
===================================
I'd like to start with a brief refresher on Perl and locales. Every C
program is always running in some particular locale. Absent a setlocale()
call to the contrary, that locale is the "C" locale, which gives the
behavior described in K&R. But a setlocale() call to something else will
cause many libc functions to behave differently. Under such a locale,
theoretically:
1) any particular byte in a string could mean nearly any character (or
portion of a character);
2) the language for the text of $! could be anything;
3) etc.
There can be single-byte locales, wide character (U16 or U32 usually)
locales, and varying character length locales (which UTF-8 is). Perl
has never officially supported anything other than single byte locales.
In practice, almost all published locales have every ASCII-range code
point mean the corresponding ASCII character, hence differing only in
non-ASCII bytes. Even so, Perl tries as best it can to avoid assuming
this ASCII correspondence.
One of the first things that Perl does when it starts up (with a minor
exception for embedded Perl, added in the 5.19 series) is to call
setlocale(), thus causing the libc functions to change behavior. The
locale that is set is determined from the caller's environment,
typically using the LANG or other environment variables. Increasingly,
on Linux systems anyway, this is some UTF-8 locale.
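
At the libc level, the startup sequence described above can be sketched in C roughly as follows. This is only an illustration of the two setlocale() calls involved, not Perl's actual startup code, and both helper names are made up for the example.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: report the current locale name without changing it.
 * Passing NULL as the locale argument turns setlocale() into a pure query. */
const char *current_locale_name(void)
{
    return setlocale(LC_ALL, NULL);
}

/* Hypothetical helper: adopt whatever locale the environment requests
 * (LC_ALL, the individual LC_* variables, LANG).  Returns the new locale
 * name, or NULL if the request could not be honored, in which case the
 * previous locale stays in effect. */
const char *adopt_environment_locale(void)
{
    return setlocale(LC_ALL, "");
}
```

A program that never calls adopt_environment_locale() keeps getting "C" back from current_locale_name(); after the call, the answer depends entirely on the caller's environment.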
But Perl isn't supposed to expose the underlying locale outside the
scope of 'use locale'. Various patches in the 5.19 series have fixed
all known such leaks except for various POSIX:: functions where it
doesn't make sense to hide, and $!. The rationale for the latter is
that $! is for the user of the program, not the programmer, and so
should be output in the user's language, as gleaned from his/her locale.
What happens if a string scalar is in some locale, and a code point that
requires UTF-8 is added to it? The answer is that this is generally not
a good idea to do, but Perl copes by converting the scalar to UTF-8,
with the code points below 256 assumed to be what they mean in the
(single-byte) locale, even if they require 2 UTF-8 bytes to represent.
This means that operations that cross the 255/256 boundary in a UTF-8
locale are undefined. For example, the uppercase of \xFF is normally
\x{178} (in Unicode they are LATIN SMALL LETTER Y WITH DIAERESIS and
LATIN CAPITAL LETTER Y WITH DIAERESIS respectively), but within the scope
of 'use locale' uc("\xFF") remains \xFF, because we don't know what
character \xFF really represents in that locale. In just the ISO-8859
series of locales, it can be U+00FF, U+045F, U+0138, U+2019, or
unassigned. (Note that if we knew that a locale is UTF-8, we would know
what \xFF really is, and so could treat things just like non-locale Perl
does.)
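
The same ambiguity about byte 0xFF can be seen directly in libc. The following is a minimal C sketch, not anything taken from the Perl source; upcase_byte_in_locale() is a hypothetical helper that asks toupper() what a byte maps to under a named locale.

```c
#include <ctype.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: uppercase one byte according to the LC_CTYPE rules
 * of the named locale, then restore the previous LC_CTYPE locale.
 * Returns -1 if the current locale name cannot be saved or the requested
 * locale cannot be set. */
int upcase_byte_in_locale(unsigned char byte, const char *locale_name)
{
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);
    int result;

    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);      /* setlocale() may reuse its return buffer */

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    result = toupper(byte);      /* argument is an unsigned char value */
    setlocale(LC_CTYPE, saved);
    return result;
}
```

In the "C" locale, upcase_byte_in_locale(0xFF, "C") hands back 0xFF unchanged, since only a-z have uppercase forms there. In an ISO-8859-5 locale, if one is installed (the name varies by system), the same byte is a Cyrillic small letter and would be expected to map to 0xAF instead. Note also that toupper() maps a byte to a byte, so it can never produce 0x178.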
That the meaning of characters is context dependent means that when
using locale, it generally is not a good idea to pass references to
variables. Correct me if I'm wrong, but I believe this means that XS
code gets its caller's lexical scope with regard to this.
Until the commit that generated this ticket, $! returned the bytes that
comprise the message regardless of whether the message was in UTF-8 or
not. Thus it behaved as if it were in the scope of both 'use locale'
and 'use bytes'. What the commit effectively did was to remove the 'use
bytes' behavior, causing $! to behave as any other string scalar does
under 'use locale'. Many people on this list think that we should get
rid of 'use bytes'; that its behavior is never desired. (I'm not one of
them BTW, but I think it should be used only very rarely.) Thus, on the
face of it, it is suspect that $! should behave as if it is in 'use
bytes', and I'm having a hard time grokking the argument that we should
revert to that.
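
The practical difference between byte semantics and character semantics can be seen by measuring a UTF-8 encoded message both ways. This is a small C sketch under the assumption that the input is well-formed UTF-8; utf8_code_point_count() is a name made up for the example.

```c
#include <stddef.h>
#include <string.h>   /* strlen(), used for the byte count in the note below */

/* Count the code points in a string assumed to be well-formed UTF-8, by
 * counting every byte that is not a continuation byte (bit pattern 10xxxxxx). */
size_t utf8_code_point_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the two-byte sequence "\xC3\xBF" (U+00FF encoded as UTF-8), strlen() reports 2 while utf8_code_point_count() reports 1. Code that was written against the byte view sees the first number; code that treats the message as characters sees the second.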
To clarify my proposal (since Victor misunderstood it): within 'use
locale' scope, I propose leaving the behavior as the commit made it.
$! now behaves as other variables in such scope behave; it no
longer is an outlier that has to be treated specially. Outside that
scope, I propose to fully decode $! into Perl's internal coding
(essentially UTF-8). The latter would automatically load the needed
modules. If the system did not have nl_langinfo(), I now think that the
best thing to do is to output the message in the C locale, yielding it
in English, which the user could machine translate. We are not going to
return undef, as Victor suggested, as that would be throwing away
potentially crucial information.
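
The shape of the fallback outside 'use locale' can be sketched in C as below. This is a simplification under stated assumptions, not the proposed implementation: all three function names are hypothetical, the sketch assumes nl_langinfo() exists, and it treats "codeset is not UTF-8" the way the proposal treats "no nl_langinfo()", where a fuller version would transcode other codesets instead of giving up on them.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: nonzero if the current LC_CTYPE codeset is UTF-8.
 * The spelling of the codeset name varies between platforms. */
int codeset_is_utf8(void)
{
    const char *codeset = nl_langinfo(CODESET);

    return codeset != NULL
        && (strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0);
}

/* Hypothetical helper: copy the message for errnum, as rendered in the "C"
 * locale, into buf.  Returns 0 on success, -1 on failure.
 * Not thread-safe: it uses strerror() and switches the process locale. */
int message_in_c_locale(int errnum, char *buf, size_t size)
{
    char saved[256];
    const char *current = setlocale(LC_MESSAGES, NULL);

    if (size == 0 || current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);      /* setlocale() may reuse its return buffer */

    if (setlocale(LC_MESSAGES, "C") == NULL)
        return -1;
    snprintf(buf, size, "%s", strerror(errnum));
    setlocale(LC_MESSAGES, saved);
    return 0;
}

/* Returns 1 if buf holds the locale's own message (taken to be UTF-8),
 * 0 if it holds the "C" locale fallback, -1 on failure. */
int error_text(int errnum, char *buf, size_t size)
{
    if (size == 0)
        return -1;
    if (codeset_is_utf8()) {
        snprintf(buf, size, "%s", strerror(errnum));
        return 1;
    }
    return message_in_c_locale(errnum, buf, size);
}
```

The point of the fallback branch is that the caller always gets some usable text back, rather than nothing at all.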
As I mentioned above, it's not a good idea to pass references to
locale-encoded variables. I don't see how $! is different from other
locale variables in its orneriness. It just comes with the territory.
The idea of reverting this commit and having another global variable
that does the full decoding harms code within 'use locale' scope.
Instead of this variable being a typical scalar there, it becomes an
outlier, which has to have special treatment. We could add a third
variable which behaves as the current commit now does to accommodate
such code. This is getting unwieldy. Whatever behavior we decide on
also has to be applied to $^E. We would then have 6 variables
instead of 2.
I think my proposal is the least bad of those presented so far.