
Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+

Karl Williamson
March 26, 2014 23:11
On 03/26/2014 04:06 PM, Victor Efimov wrote:
> 2014-03-27 1:41 GMT+04:00 Karl Williamson via RT <>:
>> I looked at
>> which shows that ack is broken by the 5.19.2 change.
>> If you look at that link, you'll see that the Russian comes out fine, but with a warning that didn't use to be there; the French is broken.
>> What is happening is that ack treats everything as bytes, and so everything just worked.  STDERR is opened as a byte-oriented file, and if $! actually did contain UTF-8, it wasn't marked as such, and its component bytes were output as-is, so that if in fact the terminal is expecting UTF-8, they come out looking like UTF-8 to it, and everything held together.  (Garbage would ensue if the terminal wasn't expecting the encoding that $! is in; I haven't checked, but my guess is that the grep output is also output as-is, so if the file encodings differ from the terminal expectation, that garbage could be printed; but in practice I doubt that this is a problem.)
>> What the 5.19 change did effectively is to make the stringification of "$!" obey "use bytes".  Most code isn't in bytes' scope, so the UTF-8 flag gets turned on if appropriate.
>> Perl's do_print() function checks if the stream is listed as UTF-8 or not.  The string being output is converted to the stream's encoding if necessary and possible.  If not possible, things are just output as-is, possibly with warnings.  In ack's case the stream never is (AFAIK) UTF-8.  Starting in 5.19.2+, the message can be marked as UTF-8, and so tries to get converted to the non-UTF-8 stream.  This is impossible in Russian, so the bytes are output as-is, with a warning.  Since the terminal really is UTF-8, they display correctly.  But it is possible to convert the French text, as all the characters in the message in the bug report are Latin1.  So do_print() does this, but since the terminal's encoding doesn't match what ack thinks it is, the non-ascii characters come out as garbage.
> yes agree. anyway warnings are bad. and broken latin1 bad too.

It's arguable that the warnings should have been output all along,
since really it is UTF-8 being output to a terminal that perl thinks
can't handle it.
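
For anyone who wants to see the do_print() behavior described in the quoted text, here is a minimal sketch. The two strings are hypothetical stand-ins for localized $! text (they are not read from a real locale), and bytes_written() is a helper invented for the illustration. It prints to a byte-oriented in-memory handle, which I am assuming behaves like ack's STDERR for this purpose.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for localized $! text.  Both are character
# strings with the UTF-8 flag on, as 5.19.2+ may return them.
my $french  = "Op\x{e9}ration non permise";   # every character <= 0xFF
my $russian = "\x{41d}\x{435}\x{442}";        # Cyrillic, all > 0xFF
utf8::upgrade($french);

# Print $text to a byte-oriented in-memory handle (no :utf8 layer) and
# return the bytes written plus the number of warnings raised.
sub bytes_written {
    my ($text) = @_;
    my $buffer   = '';
    my $warnings = 0;
    local $SIG{__WARN__} = sub { $warnings++ };
    open(my $fh, '>', \$buffer) or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh);
    return ($buffer, $warnings);
}

my ($fr_bytes, $fr_warn) = bytes_written($french);
my ($ru_bytes, $ru_warn) = bytes_written($russian);

printf "French:  %d bytes, %d warning(s)\n", length($fr_bytes), $fr_warn;
printf "Russian: %d bytes, %d warning(s)\n", length($ru_bytes), $ru_warn;
```

The French string is silently downgraded to Latin-1 bytes, which a UTF-8 terminal cannot display correctly. The Russian string cannot be downgraded, so its UTF-8 bytes go out as-is along with a "Wide character" warning.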
>> Note that ack has some of its messages hard-coded in English.  For example, it does a -e on the file name, and outputs English-only if it doesn't exist.  rjbs has pointed out to me privately that typical uses of $! are of the form
>>   die "my message in English: $!"
> Right, usually "my message in English" indeed is in English because
> authors don't bother with full localization and translations to all
> languages, but for consistency it's better to see $! in locale's
> language. Other programs usually show it in user language.
>> I am not an ack user, but it appears to me that ack is like a filter which doesn't care about encodings.  It is byte rather than character oriented.  This seems to me to be an appropriate use of 'use bytes', and if ack did this, this bug would not arise.
> I would disagree; they are trying to migrate to Unicode.
> ack searches _text_ using _perl regexps_ in text files. It even
> ignores files detected as binary (by default, at least, in my
> installation)

I stand corrected.

>> My proposal to only use ASCII characters in error messages unless within 'use locale' would also fix this problem.  All messages that print in Russian and some messages in French, would now appear in English, adding to the several that already print in English no matter what.
> I am writing programs with correct use of modern Perl unicode now, but
> never used 'use locale', seems it adds additional side effect to code?
> Can there be special option for 'use locale' to not change anything at
> all, except $! behaviour (in lexical scope) ?

locale works a lot better (I anticipate) in 5.20 than before.  I think 
it should finally be possible to 'use locale' as a matter of habit.

I was already thinking that 'use locale' in 5.22 should have the ability 
to select LC_CTYPE and LC_COLLATE individually.  It seems logical to 
make this general, so you could say

use locale ':messages, numeric';

to get just the effects you want.  Some of this could conceivably be 
added in 5.20 if it helps to resolve this blocker.

> also, can code without 'use locale' behave like 5.18 (i.e. not always
> in English; bytes)

The problem is that the commit fixed real bugs in code that didn't "use
locale".  Thus the quandary.  If we go back to 5.18 behavior, those bugs
come back.  I believe that my proposal that only ASCII messages get 
displayed outside of 'use locale' is the only "sure" method that doesn't 
display garbage to someone.  (Note that ASCII doesn't mean necessarily 
English.  Many error messages in Western European languages consist only 
of ASCII characters.  I realize that doesn't help Russian or Chinese, etc.)
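
To make the proposal concrete, here is a rough sketch of the selection rule only. message_outside_use_locale() is a hypothetical helper, not anything in perl, and the message texts are passed in as plain strings rather than taken from a real locale.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the proposed rule: outside 'use locale',
# keep the localized text only when it is pure ASCII; otherwise fall back
# to the C-locale (English) text.
sub message_outside_use_locale {
    my ($localized, $c_locale_text) = @_;
    return $localized =~ /[^\x00-\x7F]/ ? $c_locale_text : $localized;
}

# Illustrative message texts, assumed for the example.
my $german = message_outside_use_locale(
    "Datei oder Verzeichnis nicht gefunden", "No such file or directory");
my $french = message_outside_use_locale(
    "Op\x{e9}ration non permise", "Operation not permitted");

print "$german\n";
print "$french\n";
```

The German text is all ASCII and so survives; the French text contains a non-ASCII character and so is replaced by the English one.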

Also, I hadn't realized this before, but sometimes the message's 
characters aren't just garbage that someone with the motivation and 
skill could figure out, but the UNICODE REPLACEMENT CHARACTER can be 
displayed instead, so information is lost and can't be recovered.
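
A small sketch of that loss, assuming the consumer decodes the output as UTF-8 the way a UTF-8 terminal would. The byte string is a made-up example of Latin-1 output, and Encode's decode() stands in for the terminal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Latin-1 bytes, as written to a byte-oriented stream after a downgrade.
my $latin1_bytes = "Op\xe9ration non permise";

# Decode them as UTF-8, as a UTF-8 terminal would.  With no CHECK
# argument, Encode substitutes U+FFFD for malformed input.
my $seen = decode('UTF-8', $latin1_bytes);

printf "contains U+FFFD: %s\n", ($seen =~ /\x{FFFD}/   ? 'yes' : 'no');
printf "still has e-acute: %s\n", ($seen =~ /\x{e9}/ ? 'yes' : 'no');
```

Once the replacement character has been substituted, the original byte cannot be recovered from what is displayed.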

> ? and with 'use locale :errno_only' change $! to
> return unicode character string.

I don't see how this differs from your suggestion above for an option to
'use locale' to just affect $! (which is BTW LC_MESSAGES).

And that reminds me, MS Windows doesn't have LC_MESSAGES, AFAIK.  Can
someone explain which languages error messages are displayed in there
under various locales?
