
[perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+

Victor Efimov via RT
October 10, 2013 12:12
And how should one fix the code below (both example1 and example2)
so that it works the same way in 5.18 and 5.20?

use strict;
use warnings;
use Encode;
my %Config = ( default_locale_encoding => 'UTF-8' ); # user supplied
my $locale_encoding = eval {
	require I18N::Langinfo;
	my $enc = I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
	defined (find_encoding($enc)) ? $enc : undef;
};

$locale_encoding ||= $Config{default_locale_encoding};
binmode STDERR, ":encoding($locale_encoding)";

open (my $f, "<", "not_a_file") or do {
	die decode($locale_encoding, "$!", Encode::DIE_ON_ERR|Encode::LEAVE_SRC);
};

$ perl 
No such file or directory at line 15.

$ LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl
Нет такого файла или каталога at line 15.

use strict;
use warnings;
open (my $f, "<", "not_a_file") or do {
	die "$!";
};

$ perl
No such file or directory at line 4.

$ LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl
Нет такого файла или каталога at line 4.
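
One possible way to make example1 behave the same on both versions is to decode only when the message is still a byte string. This is a minimal sketch, assuming that 5.19.2+ returns $! already decoded (UTF-8 flag on) under a UTF-8 locale while older perls return raw locale bytes. The name errno_text is hypothetical and not part of any module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return an error message as a Perl character string.
# Assumption: a message with the UTF-8 flag on has already been decoded by
# perl itself; one without the flag still holds raw bytes in $encoding.
sub errno_text {
    my ($msg, $encoding) = @_;
    return $msg if utf8::is_utf8($msg);
    return Encode::decode($encoding, $msg,
        Encode::FB_CROAK | Encode::LEAVE_SRC);
}
```

In example1 the die line would then read: die errno_text("$!", $locale_encoding);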

On Wed Oct 09 18:46:47 2013, wrote:
> tl;dr
> 0) A brief overview of how locales work with Perl is presented
> 1) $! used to work as if it always was in the scope of both 'use locale' 
> and 'use bytes'
> 2) The blamed commit removed the 'use bytes' component, breaking code 
> that relied on that; fixing some code that didn't.
> 3) Many people think that 'use bytes' should be outlawed.  Thus we 
> should take a good hard look before reverting the commit and restoring 
> 'use bytes' behavior.
> 4) $! now acts (with regard to encoding) as any other scalar does within 
> the scope of 'use locale'.  My proposal is to leave it that way when in 
> that scope.  Thus, it doesn't become an outlier that has to be treated 
> specially.
> 5) Outside such scope: on systems that have nl_langinfo(), $! would 
> automatically be decoded to UTF-8; otherwise to English (C locale), 
> which the end user could google translate if necessary.
> 6) An objection has been raised that this creates problems when 
> references to $! are passed, and in XS code where it gets its caller's 
> scope.  But this is no different than any variable that deals with locale.
> 7) An alternative is to revert this commit (bringing back 'use bytes' 
> behavior), and to create a new variable that always fully decodes.  But 
> that doesn't help code that is in 'use locale'.  There would be no 
> variable that gives correct behavior for that situation (The behavior of 
> the current commit is that correct behavior).  Perhaps another new 
> variable would be created that does what the current commit does, 
> regardless of scope, making 3 variables.  Also, $^E also has this 
> problem, and should have the same solution applied to it as we do to $!. 
>   That would mean 4 new variables would have to be created, making 6 
> variables.  That seems overly ugly, and confusing.
> ===================================
> I'd like to start with a brief refresher on Perl and locales.  Every C 
> program always is running in a particular locale.  Absent a setlocale() 
> to the contrary, that locale is the "C" locale, which gives the behavior 
> described in K&R.  But a setlocale() call to something else will cause 
> many libc functions to behave differently.  Under those, theoretically:
> 1) any particular byte in a string could mean nearly any character (or 
> portion of a character);
> 2) the language for the text of $! could be anything;
> 3) etc.
> There can be single-byte locales, wide character (U16 or U32 usually) 
> locales, and varying character length locales (which UTF-8 is).  Perl 
> has never officially supported anything other than single byte locales. 
>   In practice, almost all published locales have every ASCII-range code 
> point mean the corresponding ASCII character, hence differing only in 
> non-ASCII bytes.  Perl avoids assuming this ASCII correspondence pretty 
> much as best it can.
> One of the first things that Perl does when it starts up (with a minor 
> exception for embedded Perl, added in the 5.19 series) is to call 
> setlocale(), thus causing the libc functions to change behavior.  The 
> locale that is set is determined from the caller's environment, 
> typically using the LANG or other environment variables. Increasingly, 
> on Linux systems anyway, this is some UTF-8 locale.
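
To see which locale a running perl ended up with after that startup setlocale() call, one can query it from Perl. This is a small sketch; current_locale_info is a hypothetical helper name, and the output depends entirely on the environment (LANG, LC_ALL and so on) and on whether the platform provides nl_langinfo().

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: report the LC_CTYPE locale currently in effect and,
# where nl_langinfo() is available, the codeset it implies.
sub current_locale_info {
    # Calling setlocale() without a locale argument only queries the setting.
    my $ctype = POSIX::setlocale(POSIX::LC_CTYPE());
    my $codeset = eval {
        require I18N::Langinfo;
        I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
    };
    return ($ctype, $codeset);
}

my ($ctype, $codeset) = current_locale_info();
printf "LC_CTYPE=%s codeset=%s\n",
    $ctype   // '(unknown)',
    $codeset // '(unavailable)';
```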
> But Perl isn't supposed to expose the underlying locale outside the 
> scope of 'use locale'.  Various patches in the 5.19 series have fixed 
> all known such leaks except for various POSIX:: functions where it 
> doesn't make sense to hide, and $!.  The rationale for the latter is 
> that $! is for the user of the program, not the programmer, and so 
> should be output in the user's language, as gleaned from his/her locale.
> What happens if a string scalar is in some locale, and a code point that 
> requires UTF-8 is added to it?  The answer is that this is generally not 
> a good idea to do, but Perl copes by converting the scalar to UTF-8, 
> with the code points below 256 assumed to be what they mean in the 
> (single-byte) locale, even if they require 2 UTF-8 bytes to represent. 
> This means that operations that cross the 255/256 boundary in a UTF-8 
> locale are undefined.  For example, the uppercase of \xFF is \x{178} 
> normally (as in Unicode they are the SMALL and CAPITAL y with diaeresis 
> respectively), but within the scope of 'use locale' uc("\xFF") remains 
> \xFF, because we don't know what character \xFF really represents in 
> that locale.  In just the ISO-8859 series of locales, it can be U+00FF, or 
> U+045F, U+0138, U+2019, or unassigned.  (Note that if we knew that a 
> locale is UTF-8, we would know what \xFF really is, and so could treat 
> things just like non-locale Perl does).
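
The uc("\xFF") case can be tried out directly. The sketch below forces the "C" locale so that the result does not depend on the caller's environment (an assumption made for reproducibility; newer perls may apply Unicode rules under 'use locale' when the locale is UTF-8). The helper names uc_unicode and uc_locale are hypothetical.

```perl
use strict;
use warnings;
use POSIX ();

# Uppercase using Unicode rules for all code points, including 128..255.
sub uc_unicode {
    use feature 'unicode_strings';
    return uc $_[0];
}

# Uppercase using the rules of the current locale.
sub uc_locale {
    use locale;
    return uc $_[0];
}

# Assumption: pin the locale to "C" so the locale result is predictable.
POSIX::setlocale(POSIX::LC_ALL(), "C");

printf "Unicode rules: U+%04X\n", ord uc_unicode("\xFF");
printf "Locale rules:  U+%04X\n", ord uc_locale("\xFF");
```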
> That the meaning of characters is context dependent means that when 
> using locale, it generally is not a good idea to pass references to 
> variables.  Correct me if I'm wrong, but I believe this means that XS 
> code gets its caller's lexical scope with regard to this.
> Until the commit that generated this ticket, $! returned the bytes that 
> comprise the message regardless of whether the message was in UTF-8 or 
> not.  Thus it behaved as if it were in the scope of both 'use locale' 
> and 'use bytes'.  What the commit effectively did was to remove the 'use 
> bytes' behavior, causing $! to behave as any other string scalar does 
> under 'use locale'.  Many people on this list think that we should get 
> rid of 'use bytes'; that its behavior is never desired.  (I'm not one of 
> them BTW, but I think it should be used only very rarely.)  Thus, on the 
> face of it, it is suspect that $! should behave as if it is in 'use 
> bytes', and I'm having a hard time grokking the argument that we should 
> revert back to that.
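
The difference between the two behaviors is the difference between a byte string and a character string. As an illustration only, the sketch below uses a hard-coded UTF-8 byte sequence (the first word of the Russian message from the examples) as a stand-in for what $! used to return, and its decoded form as a stand-in for what it returns after the commit.

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for the old behavior: the raw UTF-8 bytes of the message text.
my $bytes = "\xD0\x9D\xD0\xB5\xD1\x82";

# Stand-in for the new behavior: the same text as a character string.
my $chars = Encode::decode('UTF-8', $bytes);

printf "bytes: length=%d utf8_flag=%d\n",
    length($bytes), utf8::is_utf8($bytes) ? 1 : 0;
printf "chars: length=%d utf8_flag=%d\n",
    length($chars), utf8::is_utf8($chars) ? 1 : 0;

{
    # Under 'use bytes', length() counts the internal bytes again.
    use bytes;
    printf "chars under 'use bytes': length=%d\n", length($chars);
}
```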
> To clarify my proposal (since Victor misunderstood it),  I propose, 
> within 'use locale' scope, leaving the behavior as the commit changed it 
> to.  $! now behaves as other variables in such scope behave; it no 
> longer is an outlier that has to be treated specially.  Outside that 
> scope, I propose to fully decode $! into Perl's internal coding 
> (essentially UTF-8).  The latter would automatically load the needed 
> modules.  If the system did not have nl_langinfo(), I now think that the 
> best thing to do is to output the message in the C locale, yielding it 
> in English, which the user could machine translate.  We are not going to 
> return undef, as Victor suggested, as that would be throwing away 
> potentially crucial information.
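
The behavior proposed for code outside 'use locale' can be approximated in pure Perl. This is only a sketch of the described logic, not how the core implements it. The name decoded_errno_message is hypothetical, and the sketch assumes that POSIX::strerror() returns the message in the current locale's encoding unless it is already flagged as UTF-8.

```perl
use strict;
use warnings;
use POSIX ();
use Encode ();

# Hypothetical helper: return the message for $errno as a character string
# when the locale's codeset is known, otherwise the C-locale (English) text.
sub decoded_errno_message {
    my ($errno) = @_;

    # nl_langinfo(CODESET) is only available on some systems.
    my $codeset = eval {
        require I18N::Langinfo;
        I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
    };

    if (defined $codeset && length $codeset
        && Encode::find_encoding($codeset)) {
        my $msg = POSIX::strerror($errno);
        return $msg if utf8::is_utf8($msg);   # already decoded by perl
        return Encode::decode($codeset, $msg);
    }

    # Fallback: temporarily switch messages to the C locale.
    my $old = POSIX::setlocale(POSIX::LC_MESSAGES());
    POSIX::setlocale(POSIX::LC_MESSAGES(), "C");
    my $msg = POSIX::strerror($errno);
    POSIX::setlocale(POSIX::LC_MESSAGES(), $old) if defined $old;
    return $msg;
}

print decoded_errno_message(POSIX::ENOENT()), "\n";
```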
> As I mentioned above, it's not a good idea to pass references to 
> locale-encoded variables.  I don't see how $! is different from other 
> locale variables in its orneriness.  It just comes with the territory.
> The idea of reverting this commit and having another global variable 
> that does the full decoding harms code within 'use locale' scope. 
> Instead of this variable being a typical scalar there, it becomes an 
> outlier, which has to have special treatment.  We could add a third 
> variable which behaves as the current commit now does to accommodate 
> such code.  This is getting unwieldy.  Whatever behavior we decide to do 
> has to also be applied to $^E.  Now we would then have 6 variables 
> instead of 2.
> I think my proposal is the least bad of those presented so far.

via perlbug:  queue: perl5 status: open
