Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+

Karl Williamson
March 27, 2014 02:07
On 03/26/2014 05:12 PM, Karl Williamson wrote:
> On 03/26/2014 04:06 PM, Victor Efimov wrote:
>> 2014-03-27 1:41 GMT+04:00 Karl Williamson via RT:
>>> I looked at
>>> which shows that ack is broken by the 5.19.2 change.
>>> If you look at that link, you'll see that the Russian comes out fine,
>>> but with a warning that didn't use to be there; the French is broken.
>>> What is happening is that ack treats everything as bytes, and so
>>> everything just worked.  STDERR is opened as a byte-oriented file,
>>> and if $! actually did contain UTF-8, it wasn't marked as such, and
>>> its component bytes were output as-is, so that if in fact the
>>> terminal is expecting UTF-8, they come out looking like UTF-8 to it,
>>> and everything held together.  (Garbage would ensue if the terminal
>>> wasn't expecting the encoding that $! is in; I haven't checked, but
>>> my guess is that the grep output is also output as-is, so if the file
>>> encodings differ from the terminal expectation, that garbage could be
>>> printed; but in practice I doubt that this is a problem.)
>>> What the 5.19 change did effectively is to make the stringification
>>> of "$!" obey "use bytes".  Most code isn't in bytes' scope, so the
>>> UTF-8 flag gets turned on if appropriate.
>>> Perl's do_print() function checks if the stream is listed as UTF-8 or
>>> not.  The string being output is converted to the stream's encoding
>>> if necessary and possible.  If not possible, things are just output
>>> as-is, possibly with warnings.  In ack's case the stream never is
>>> (AFAIK) UTF-8.  Starting in 5.19.2, the message can be marked as
>>> UTF-8, and so perl tries to convert it for the non-UTF-8 stream.  This
>>> is impossible in Russian, so the bytes are output as-is, with a
>>> warning.  Since the terminal really is UTF-8, they display
>>> correctly.  But it is possible to convert the French text, as all the
>>> characters in the message in the bug report are Latin1.  So
>>> do_print() does this, but since the terminal's encoding doesn't match
>>> what ack thinks it is, the non-ascii characters come out as garbage.
>> Yes, agreed. Anyway, warnings are bad, and broken Latin1 is bad too.
> It's arguable that the warnings should have been output all along, since
> really it is UTF-8 being output to a terminal that perl thinks can't
> handle it.
>>> Note that ack has some of its messages hard-coded in English.  For
>>> example, it does a -e on the file name, and outputs English-only if
>>> it doesn't exist.  rjbs has pointed out to me privately that typical
>>> uses of $! are of the form
>>>   die "my message in English: $!"
>> Right, usually "my message in English" indeed is in English because
>> authors don't bother with full localization and translations to all
>> languages, but for consistency it's better to see $! in locale's
>> language. Other programs usually show it in user language.
>>> I am not an ack user, but it appears to me that ack is like a filter
>>> which doesn't care about encodings.  It is byte rather than character
>>> oriented.  This seems to me to be an appropriate use of 'use bytes',
>>> and if ack did this, this bug would not arise.
>> I would disagree; they are trying to migrate to Unicode.
>> ack searches _text_ using _perl regexps_ in text files.  It even
>> ignores files detected as binary (by default, at least, in my
>> installation).
> I stand corrected.
>>> My proposal to only use ASCII characters in error messages unless
>>> within 'use locale' would also fix this problem.  All messages that
>>> print in Russian and some messages in French, would now appear in
>>> English, adding to the several that already print in English no
>>> matter what.
>> I am writing programs with correct use of modern Perl Unicode now, but
>> have never used 'use locale'; it seems to add side effects to code?
>> Could there be a special option for 'use locale' that changes nothing
>> at all except $! behaviour (in lexical scope)?
> locale works a lot better (I anticipate) in 5.20 than before.  I think
> it should finally be possible to 'use locale' as a matter of habit.
> I was already thinking that 'use locale' in 5.22 should have the ability
> to select LC_CTYPE and LC_COLLATE individually.  It seems logical to
> make this general, so you could say
>   use locale ':messages, numeric';
> to get just the effects you want.  Some of this could conceivably be
> added in 5.20 if it helps to resolve this blocker.
>> also, can code without 'use locale' behave like 5.18 (i.e. not always
>> in English; bytes)
> The problem is that the commit fixed real bugs in code that didn't "use
> locale".  Thus the quandary.  If we go back to 5.18 behavior, those bugs
> come back.  I believe that my proposal that only ASCII messages get
> displayed outside of 'use locale' is the only "sure" method that doesn't
> display garbage to someone.  (Note that ASCII doesn't mean necessarily
> English.  Many error messages in Western European languages consist only
> of ASCII characters.  I realize that doesn't help Russian or Chinese, etc.)
> Also, I hadn't realized this before, but sometimes the message's
> characters aren't just garbage that someone with the motivation and
> skill could figure out, but the UNICODE REPLACEMENT CHARACTER can be
> displayed instead, so information is lost and can't be recovered.
>> ? and with 'use locale :errno_only' change $! to
>> return a Unicode character string.
> I don't see how this differs from your suggestion above for an option to
> 'use locale' to just affect $! (which is BTW LC_MESSAGES).
> And that reminds me, MS Windows doesn't have LC_MESSAGES, AFAIK.  Can
> someone explain what languages error messages are displayed in under
> varied locales?

Another possibility to get programs like ack to work unchanged is to add 
a non-printing above-Latin1 character to the stringification of $! when 
it is UTF-8 and there are only Latin1 characters in it.  A possibility 
is a ZERO WIDTH SPACE.  Then do_print() wouldn't try to downgrade.  The 
drawback is that code that analyzes $! could be thrown off.  But code 
generally should be analyzing the numeric value anyway, and not the 
string representation.
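
To make the idea concrete, here is a minimal sketch, not a patch.  It
assumes that utf8::downgrade() with a true FAIL_OK argument is a fair
stand-in for the downgrade attempt do_print() makes, and it uses a
hard-coded French string as a hypothetical stand-in for a localized $!
message.  It shows that a Latin1-only string can be downgraded, that the
same string with a ZERO WIDTH SPACE appended cannot, and that a numeric
comparison of $! does not depend on the string form at all.

```perl
use strict;
use warnings;
use Errno qw(ENOENT);

# Hypothetical stand-in for a localized "$!" string: only Latin1
# characters, but stored with the UTF-8 flag on.
my $latin1_msg = "Aucun fichier ou r\x{e9}pertoire de ce type";
utf8::upgrade($latin1_msg);

# The same text with a ZERO WIDTH SPACE (U+200B) appended.
my $zwsp_msg = $latin1_msg . "\x{200B}";

# utf8::downgrade($str, 1) returns false rather than dying when the
# string holds a character above 0xFF, so no downgrade is possible.
my $plain_copy = $latin1_msg;
my $zwsp_copy  = $zwsp_msg;
my $plain_downgrades = utf8::downgrade($plain_copy, 1) ? 1 : 0;
my $zwsp_downgrades  = utf8::downgrade($zwsp_copy, 1)  ? 1 : 0;

# Code that inspects the error should compare numerically, which is
# unaffected by whatever is appended to the string form.
$! = ENOENT;    # simulate a failed open()
my $is_enoent = ($! == ENOENT) ? 1 : 0;

print "plain downgrades: $plain_downgrades\n";
print "zwsp downgrades:  $zwsp_downgrades\n";
print "is ENOENT:        $is_enoent\n";
```

Code that matches on the text of $! (a regex against the message, say)
would see the extra character; the numeric comparison would not.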
