develooper Front page | perl.perl5.porters | Postings from March 2014

[perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+

Karl Williamson via RT
March 26, 2014 21:41
I looked at
which shows that ack is broken by the 5.19.2 change.

If you look at that link, you'll see that the Russian comes out fine, but with a warning that didn't use to be there; the French is broken.

What is happening is that ack treats everything as bytes, and so until now everything just worked.  STDERR is opened as a byte-oriented file, and if $! actually did contain UTF-8, it wasn't marked as such, so its component bytes were output as-is.  If the terminal is in fact expecting UTF-8, those bytes look like UTF-8 to it, and everything held together.  (Garbage would ensue if the terminal wasn't expecting the encoding that $! is in.  I haven't checked, but my guess is that the grep output is also output as-is, so if the file encodings differ from what the terminal expects, that garbage could be printed too; in practice I doubt that this is a problem.)

What the 5.19.2 change effectively did is make the stringification of "$!" obey "use bytes".  Most code isn't in the scope of "use bytes", so the UTF-8 flag gets turned on if appropriate.

Perl's do_print() function checks whether or not the stream is marked as UTF-8.  The string being output is converted to the stream's encoding if necessary and possible.  If that is not possible, things are just output as-is, possibly with warnings.  In ack's case the stream is (AFAIK) never UTF-8.  Starting in 5.19.2, the message can be marked as UTF-8, and so Perl tries to convert it for the non-UTF-8 stream.  This is impossible for the Russian text, so the bytes are output as-is, with a warning.  Since the terminal really is UTF-8, they display correctly.  But it is possible to convert the French text, as all the characters in the message in the bug report are Latin-1.  So do_print() does this, but since the terminal's encoding doesn't match what ack thinks it is, the non-ASCII characters come out as garbage.
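The two cases can be reproduced without any locale setup.  The following is only a sketch: the helper name bytes_written() and the sample words are made up for illustration (they are not the messages from the bug report), and an in-memory handle stands in for ack's byte-oriented STDERR.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $text to a byte-oriented (non-UTF-8)
# in-memory handle; return the raw bytes that reached the stream and
# the number of warnings raised while printing.
sub bytes_written {
    my ($text) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return ($buffer, scalar @warnings);
}

# Illustrative strings, written as UTF-8 bytes and then decoded so
# that they are character strings with the UTF-8 flag on.
my $french_utf8  = "accord\xC3\xA9e";            # e-acute, U+00E9
my $russian_utf8 = "\xD0\x9D\xD0\xB5\xD1\x82";   # three Cyrillic letters
my $french  = decode('UTF-8', $french_utf8);
my $russian = decode('UTF-8', $russian_utf8);

for my $case ([French => $french], [Russian => $russian]) {
    my ($name, $text) = @$case;
    my ($bytes, $warning_count) = bytes_written($text);
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    print "$name: $hex (warnings: $warning_count)\n";
}
```

The French string fits in Latin-1, so it is silently converted and the e-acute reaches the stream as the single byte E9, which a UTF-8 terminal cannot display.  The Russian string cannot be converted, so its UTF-8 bytes go out unchanged along with a "Wide character" warning.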

Note that ack has some of its messages hard-coded in English.  For example, it does a -e on the file name, and outputs an English-only message if the file doesn't exist.  rjbs has pointed out to me privately that typical uses of $! are of the form

 die "my message in English: $!"

I am not an ack user, but it appears to me that ack is like a filter which doesn't care about encodings.  It is byte rather than character oriented.  This seems to me to be an appropriate use of 'use bytes', and if ack did this, this bug would not arise.
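As a rough sketch of what that could look like (this is not ack's code; the helper name errno_text_as_bytes() is hypothetical), the stringification of $! can be done inside a small lexical scope of 'use bytes', so that the resulting message never carries the UTF-8 flag:

```perl
use strict;
use warnings;
use Errno qw(ENOENT);

# Hypothetical helper: return the text for an errno value as a plain
# byte string.  'use bytes' is lexically scoped, so it only affects
# the body of this sub.
sub errno_text_as_bytes {
    my ($code) = @_;
    use bytes;
    local $! = $code;      # set errno for this scope only
    my $text = "$!";       # stringify while 'use bytes' is in effect
    return $text;
}

my $message = errno_text_as_bytes(ENOENT);
printf "%s (UTF-8 flag: %s)\n",
    $message, utf8::is_utf8($message) ? "on" : "off";
```

The wording of the message depends on the locale and the Perl version, but the flag should be off in every case, so do_print() has nothing to convert and the bytes go to the stream untouched.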

My proposal to use only ASCII characters in error messages unless within the scope of 'use locale' would also fix this problem.  All messages that print in Russian, and some messages in French, would then appear in English, adding to the several that already print in English no matter what.

via perlbug:  queue: perl5 status: open
