Front page | perl.perl5.porters |
Postings from October 2013
[perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
From:
Victor Efimov via RT
Date:
October 10, 2013 12:12
Subject:
[perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Message ID:
rt-3.6.HEAD-26210-1381407144-729.119499-15-0@perl.org
And how should one fix the code below (both example1 and example2)
so that it works the same way in 5.18 and 5.20?
===== example1.pl
use strict;
use warnings;
use Encode;
my %Config = ( default_locale_encoding => 'UTF-8' ); # user supplied
my $locale_encoding = eval {
require I18N::Langinfo;
my $enc = I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
defined (find_encoding($enc)) ? $enc : undef;
};
$locale_encoding ||= $Config{default_locale_encoding};
binmode STDERR, ":encoding($locale_encoding)";
open (my $f, "<", "not_a_file") or do {
die decode($locale_encoding, "$!", Encode::DIE_ON_ERR|Encode::LEAVE_SRC);
}
=====
$ perl example1.pl
No such file or directory at example1.pl line 15.
$ LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl example1.pl
Нет такого файла или каталога at example1.pl line 15.
===== example2.pl
use strict;
use warnings;
open (my $f, "<", "not_a_file") or do {
die "$!";
}
=====
$ perl example2.pl
No such file or directory at example2.pl line 4.
$ LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl example2.pl
Нет такого файла или каталога at example2.pl line 4.
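
For reference, one possible way to make example1 behave the same on both versions is to decode only when the message has not already been decoded. This is only a sketch: error_text is a hypothetical helper name, not an existing API, and it assumes that a message carrying the UTF-8 flag is already character data while an unflagged one is still in the locale encoding.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper (not part of Perl or Encode): return the text of an
# error message as Perl characters, whether or not perl already decoded it.
sub error_text {
    my ($msg, $encoding) = @_;

    # Assumption: a string with the UTF-8 flag set is already decoded
    # (what 5.19.2+ returns for $! under a UTF-8 locale).
    return $msg if utf8::is_utf8($msg);

    # Otherwise treat it as bytes in the locale encoding, as 5.18 returns it.
    return decode($encoding, $msg, Encode::DIE_ON_ERR() | Encode::LEAVE_SRC());
}
```

In example1 the die line would then read: die error_text("$!", $locale_encoding); This does not help example2, which never decodes at all.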
On Wed Oct 09 18:46:47 2013, public@khwilliamson.com wrote:
>
> tl;dr
>
> 0) A brief overview of how locales work with Perl is presented
> 1) $! used to work as if it always was in the scope of both 'use locale'
> and 'use bytes'
> 2) The blamed commit removed the 'use bytes' component, breaking code
> that relied on that; fixing some code that didn't.
> 3) Many people think that 'use bytes' should be outlawed. Thus we
> should take a good hard look before reverting the commit and restoring
> 'use bytes' behavior.
> 4) $! now acts (with regard to encoding) as any other scalar does within
> the scope of 'use locale'. My proposal is to leave it that way when in
> that scope. Thus, it doesn't become an outlier that has to be treated
> specially.
> 5) Outside such scope: on systems that have nl_langinfo(), $! would
> automatically be decoded to UTF-8; otherwise to English (C locale),
> which the end user could google translate if necessary.
> 6) An objection has been raised that this creates problems when
> references to $! are passed, and in XS code where it gets its caller's
> scope. But this is no different than any variable that deals with
> locales.
> 7) An alternative is to revert this commit (bringing back 'use bytes'
> behavior), and to create a new variable that always fully decodes. But
> that doesn't help code that is in 'use locale'. There would be no
> variable that gives correct behavior for that situation (The behavior of
> the current commit is that correct behavior). Perhaps another new
> variable would be created that does what the current commit does,
> regardless of scope, making 3 variables. Also, $^E also has this
> problem, and should have the same solution applied to it as we do to $!.
> That would mean 4 new variables would have to be created, making 6
> variables. That seems overly ugly, and confusing.
>
> ===================================
>
> I'd like to start with a brief refresher on Perl and locales. Every C
> program always is running in a particular locale. Absent a setlocale()
> to the contrary, that locale is the "C" locale, which gives the behavior
> described in K&R. But a setlocale() call to something else will cause
> many libc functions to behave differently. Under those, theoretically:
> 1) any particular byte in a string could mean nearly any character (or
> portion of a character);
> 2) the language for the text of $! could be anything;
> 3) etc.
> There can be single-byte locales, wide character (U16 or U32 usually)
> locales, and varying character length locales (which UTF-8 is). Perl
> has never officially supported anything other than single byte locales.
> In practice, almost all published locales have every ASCII-range code
> point mean the corresponding ASCII character, hence differing only in
> non-ASCII bytes. Perl avoids assuming this ASCII correspondence pretty
> much as best it can.
>
> One of the first things that Perl does when it starts up (with a minor
> exception for embedded Perl, added in the 5.19 series) is to call
> setlocale(), thus causing the libc functions to change behavior. The
> locale that is set is determined from the caller's environment,
> typically using the LANG or other environment variables. Increasingly,
> on Linux systems anyway, this is some UTF-8 locale.
>
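
To see which locale perl actually picked up at startup, something like the following may help. It is a minimal sketch: current_ctype_locale and current_codeset are hypothetical helper names, and the codeset query assumes the system provides nl_langinfo() (it returns undef otherwise).

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: with a single argument, POSIX::setlocale only
# queries the current setting of that category; it changes nothing.
sub current_ctype_locale {
    return POSIX::setlocale(POSIX::LC_CTYPE());
}

# Hypothetical helper: ask nl_langinfo() for the codeset of the current
# locale. The eval guards against systems where I18N::Langinfo is unusable.
sub current_codeset {
    my $codeset = eval {
        require I18N::Langinfo;
        I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
    };
    return $codeset;
}

printf "LC_CTYPE locale: %s\n", current_ctype_locale() // '(unknown)';
printf "codeset: %s\n",         current_codeset()      // '(unavailable)';
```

Running it with and without LC_ALL=ru_RU.utf8 shows how the environment drives the result.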
> But Perl isn't supposed to expose the underlying locale outside the
> scope of 'use locale'. Various patches in the 5.19 series have fixed
> all known such leaks except for various POSIX:: functions where it
> doesn't make sense to hide, and $!. The rationale for the latter is
> that $! is for the user of the program, not the programmer, and so
> should be output in the user's language, as gleaned from his/her locale.
>
> What happens if a string scalar is in some locale, and a code point that
> requires UTF-8 is added to it? The answer is that this is generally not
> a good idea to do, but Perl copes by converting the scalar to UTF-8,
> with the code points below 256 assumed to be what they mean in the
> (single-byte) locale, even if they require 2 UTF-8 bytes to represent.
> This means that operations that cross the 255/256 boundary in a UTF-8
> locale are undefined. For example, the uppercase of \xFF is \x{178}
> normally (in Unicode they are LATIN SMALL LETTER Y WITH DIAERESIS and
> LATIN CAPITAL LETTER Y WITH DIAERESIS respectively), but within the
> scope of 'use locale' uc("\xFF") remains
> \xFF, because we don't know what character \xFF really represents in
> that locale. In just the ISO-8859 series of locales, it can be U+00FF, or
> U+045F, U+0138, U+2019, or unassigned. (Note that if we knew that a
> locale is UTF-8, we would know what \xFF really is, and so could treat
> things just like non-locale Perl does).
>
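
The \xFF case can be tried with a small script. This is a sketch under one assumption that the paragraph above leaves implicit: Unicode rules for a native 8-bit string are requested here with the 'unicode_strings' feature. uc_unicode and uc_locale are hypothetical names. No result is claimed for the 'use locale' variant, since it depends on the locale in effect and on the perl version.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

# Hypothetical helper: uppercase using Unicode rules.
sub uc_unicode {
    my ($s) = @_;
    return uc $s;
}

# Hypothetical helper: uppercase inside the scope of 'use locale', where
# code points below 256 follow the rules of the current locale.
sub uc_locale {
    my ($s) = @_;
    use locale;
    return uc $s;
}

# U+00FF uppercases to U+0178 under Unicode rules, crossing the 255/256
# boundary.
printf "U+%04X\n", ord uc_unicode("\xFF");
```

Calling uc_locale("\xFF") under different LC_ALL settings shows the locale-dependent side.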
> That the meaning of characters is context dependent means that when
> using locale, it generally is not a good idea to pass references to
> variables. Correct me if I'm wrong, but I believe this means that XS
> code gets its caller's lexical scope with regard to this.
>
> Until the commit that generated this ticket, $! returned the bytes that
> comprise the message regardless of whether the message was in UTF-8 or
> not. Thus it behaved as if it were in the scope of both 'use locale'
> and 'use bytes'. What the commit effectively did was to remove the 'use
> bytes' behavior, causing $! to behave as any other string scalar does
> under 'use locale'. Many people on this list think that we should get
> rid of 'use bytes'; that its behavior is never desired. (I'm not one of
> them BTW, but I think it should be used only very rarely.) Thus, on the
> face of it, it is suspect that $! should behave as if it is in 'use
> bytes', and I'm having a hard time grokking the argument that we should
> revert back to that.
>
> To clarify my proposal (since Victor misunderstood it), I propose,
> within 'use locale' scope, leaving the behavior as the commit changed it
> to. $! now behaves as other variables in such scope behave; it no
> longer is an outlier that has to be treated specially. Outside that
> scope, I propose to fully decode $! into Perl's internal coding
> (essentially UTF-8). The latter would automatically load the needed
> modules. If the system did not have nl_langinfo(), I now think that the
> best thing to do is to output the message in the C locale, yielding it
> in English, which the user could machine translate. We are not going to
> return undef, as Victor suggested, as that would be throwing away
> potentially crucial information.
>
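
The C locale fallback mentioned above can be approximated from Perl space today. This is a rough sketch, not the proposed core implementation: errno_message_in_c_locale is a hypothetical helper, and it assumes the platform has LC_MESSAGES and that $! follows it.

```perl
use strict;
use warnings;
use POSIX ();
use Errno qw(ENOENT);

# Hypothetical helper: return the message for an errno value as the
# C locale spells it, then restore the previous LC_MESSAGES setting.
sub errno_message_in_c_locale {
    my ($errno) = @_;

    my $old = POSIX::setlocale(POSIX::LC_MESSAGES());    # query only
    POSIX::setlocale(POSIX::LC_MESSAGES(), "C");

    local $! = $errno;    # restored automatically when the sub returns
    my $msg = "$!";       # stringify while the C locale is in effect

    POSIX::setlocale(POSIX::LC_MESSAGES(), $old) if defined $old;
    return $msg;
}

print errno_message_in_c_locale(ENOENT), "\n";
```

The message comes back as plain ASCII, so it needs no decoding on either 5.18 or 5.20, at the cost of the user's own language.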
> As I mentioned above, it's not a good idea to pass references to
> locale-encoded variables. I don't see how $! is different from other
> locale variables in its orneriness. It just comes with the territory.
>
> The idea of reverting this commit and having another global variable
> that does the full decoding harms code within 'use locale' scope.
> Instead of this variable being a typical scalar there, it becomes an
> outlier, which has to have special treatment. We could add a third
> variable which behaves as the current commit now does to accommodate
> such code. This is getting unwieldy. Whatever behavior we decide to do
> has to also be applied to $^E. Now we would then have 6 variables
> instead of 2.
>
> I think my proposal is the least bad of those presented so far.
>
>
---
via perlbug: queue: perl5 status: open
https://rt.perl.org:443/rt3/Ticket/Display.html?id=119499