
Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+

Karl Williamson
August 29, 2013 20:04
On 08/29/2013 02:15 AM, Victor Efimov via RT wrote:
> On Wed Aug 28 23:40:08 2013, sprout wrote:
>> So now, $! may or may not be encoded, and you have no way of telling
>> reliably without doing the same environment checks that perl itself did
>> internally before deciding to decode $! itself.

I don't follow these arguments.  What that commit did is only to look at 
the string returned by the operating system, and if it is encoded in 
UTF-8, to set that flag in the scalar.  That's it (*).  If the OS didn't 
return UTF-8, it leaves the flag alone.  I find it hard to comprehend 
that this isn't the right thing to do.  For the first time, $! in string 
context is no different than any other string scalar in Perl.  They have 
a utf-8 bit set which means that the encoding is in UTF-8, or they don't 
have it set, which means that the encoding is unknown to Perl.  This 
commit did not change the latter part one iota.

We have conventions as to what the bytes in that scalar mean depending 
on the context in which it is used, the pragmas that are in effect in those 
contexts, and the operations that are being performed on it.  But they 
are just conventions.  This commit did not change that.
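
To illustrate what the flag does and does not say, here is a minimal sketch. It is not code from the commit, and the helper name flag_and_length is made up for this example. The same six octets are held once as an undecoded byte string and once as a decoded character string; only the second carries the flag.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: report the UTF-8 flag (1 or 0) and the length of a scalar.
sub flag_and_length {
    my ($str) = @_;
    return (utf8::is_utf8($str) ? 1 : 0, length $str);
}

# The Russian word for "No", UTF-8 encoded, as raw octets.  Perl has not
# been told the encoding, so the flag is off and length counts bytes.
my $octets = "\xD0\x9D\xD0\xB5\xD1\x82";

# The same data after an explicit decode: the flag is on and length
# counts characters.
my $chars = decode('UTF-8', $octets);

printf "octets: flag=%d length=%d\n", flag_and_length($octets);
printf "chars:  flag=%d length=%d\n", flag_and_length($chars);
```

With the flag off, Perl makes no claim about the encoding; what the bytes mean is left to the conventions mentioned above.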

What is different about $! is that we have made the decision to respect 
locale when accessing it even when not in the scope of 'use locale'.  In 
light of these issues, perhaps this should be discussed again.  I'll let 
the people who argued for that decision argue for it again.

The change fixed two bug reports for the common case where the locales 
for messages and the I/O matched and where people had not taken pains to 
deal with locale.  I think that should trump the less frequent cases, 
given the conflicts.

If code wants $! to be expressed in a certain language, it should set 
the locale to that language while accessing $! and then restore the old 
locale.
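
A minimal sketch of that set-and-restore pattern follows. The helper name errno_text_in_locale is hypothetical, and the sketch assumes a platform where POSIX exports LC_MESSAGES. The 'use locale' line is also an assumption, included in case the perl in use only honors the message locale for $! inside that scope.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_MESSAGES);
use Errno qw(ENOENT);

# Hypothetical helper (not part of perl or POSIX): return the text of
# errno $errno as rendered under the message locale $locale.
sub errno_text_in_locale {
    my ($errno, $locale) = @_;
    use locale;    # assumption: needed on perls that localize $! only in this scope

    # The one-argument form queries the current message locale.
    my $old = setlocale(LC_MESSAGES);

    # If $locale is not installed, setlocale returns undef and the
    # current locale simply stays in effect.
    setlocale(LC_MESSAGES, $locale);

    local $! = $errno;    # assigning a number sets errno
    my $text = "$!";      # string context yields the message

    # Put the previous locale back before returning.
    setlocale(LC_MESSAGES, $old) if defined $old;
    return $text;
}

print errno_text_in_locale(ENOENT, 'C'), "\n";
```

Passing a locale name such as 'ru_RU.UTF-8' instead of 'C' would request Russian messages, provided that locale is installed on the system.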

> Small corrections:
> a) Actually there is a way: check the is_utf8($!) flag (which is not good
> because is_utf8 is marked as dangerous, and it's documented that you can't
> distinguish characters from bytes with this flag)

I don't see that danger marked currently in the pod for is_utf8.  Where 
do you see that?

> b) Current fix does not do environment checks, it just tries to do UTF-8
> validity check

(*)  To be precise

1) if the string returned by the OS is entirely ASCII, it does not set 
the UTF-8 flag.  This is because ASCII text consists of the same bytes 
whether or not it is encoded in UTF-8, so the flag is irrelevant.  And yes, 
this is buggy if 
operating under a non-ASCII 7-bit locale, as in ISO 646.  These locales 
have all been superseded so should be rare today, but a bug report could 
be written on this.

2) As Victor notes, the commit does a UTF-8 validity check, so it is 
possible that that could give false positives.  But as Wikipedia says, 
"One of the few cases where charset detection works reliably is 
detecting UTF-8. This is due to the large percentage of invalid byte 
sequences in UTF-8, so that text in any other encoding that uses bytes 
with the high bit set is extremely unlikely to pass a UTF-8 validity 
test."  (The original emphasized "extremely".)  I checked this out with 
the CP1251 character set, and the only modern Russian character that 
could be a continuation byte is ё.  All other vowels and consonants must 
be start bytes.  That means that to generate a false positive, an OS 
message in CP1251 must only contain words whose 2nd, 4th, ... bytes are 
that vowel.  That just isn't going to happen, though the common Russian 
word Её (her, hers, ...) could be confusable if there were no other 
words in the message.
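
The two points above can be modelled in a few lines. This is a sketch of the decision as described here, not the C code in the commit. The function name would_set_utf8_flag is made up, and utf8::decode stands in for the core's validity check, which may differ in detail.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical model: would an OS message consisting of these octets end
# up with the UTF-8 flag set?
sub would_set_utf8_flag {
    my ($octets) = @_;

    # Point 1: an all-ASCII message never gets the flag.
    return 0 unless $octets =~ /[^\x00-\x7F]/;

    # Point 2: otherwise the flag is set if the octets are well-formed
    # UTF-8.  utf8::decode works in place, so test a copy.
    my $copy = $octets;
    return utf8::decode($copy) ? 1 : 0;
}

# Cyrillic text given by code point so the source file stays ASCII.
my $her  = "\x{415}\x{451}";                        # the two-letter word for "her"
my $text = "\x{41D}\x{435}\x{442} \x{442}\x{430}\x{43A}\x{43E}\x{433}\x{43E} "
         . "\x{444}\x{430}\x{439}\x{43B}\x{430}";   # "No such file"

my %case = (
    'ASCII message'         => "No such file or directory",
    'UTF-8 message'         => encode('UTF-8',  $text),
    'CP1251 message'        => encode('cp1251', $text),
    'CP1251 two-letter word' => encode('cp1251', $her),
);

printf "%-24s %d\n", $_, would_set_utf8_flag($case{$_}) for sort keys %case;
```

In CP1251 the two-letter word encodes as the bytes C5 B8, which also form a well-formed two-byte UTF-8 sequence, so it is the false positive described above. The longer CP1251 message is rejected at its first letter, because the byte after the start byte CD is not a continuation byte.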
