Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
From:
Karl Williamson
Date:
March 27, 2014 02:07
Subject:
Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Message ID:
53338823.2070801@khwilliamson.com
On 03/26/2014 05:12 PM, Karl Williamson wrote:
> On 03/26/2014 04:06 PM, Victor Efimov wrote:
>> 2014-03-27 1:41 GMT+04:00 Karl Williamson via RT
>> <perlbug-followup@perl.org>:
>>> I looked at https://github.com/petdance/ack2/issues/367
>>> which shows that ack is broken by the 5.19.2 change.
>>>
>>> If you look at that link, you'll see that the Russian comes out fine,
>>> but with a warning that didn't use to be there; the French is broken.
>>>
>>> What is happening is that ack treats everything as bytes, and so
>>> everything just worked. STDERR is opened as a byte-oriented file,
>>> and if $! actually did contain UTF-8, it wasn't marked as such, and
>>> its component bytes were output as-is, so that if in fact the
>>> terminal is expecting UTF-8, they come out looking like UTF-8 to it,
>>> and everything held together. (Garbage would ensue if the terminal
>>> wasn't expecting the encoding that $! is in; I haven't checked, but
>>> my guess is that the grep output is also output as-is, so if the file
>>> encodings differ from the terminal expectation, that garbage could be
>>> printed; but in practice I doubt that this is a problem.)
>>>
>>> What the 5.19 change did effectively is to make the stringification
>>> of "$!" obey "use bytes". Most code isn't in bytes' scope, so the
>>> UTF-8 flag gets turned on if appropriate.
>>>
>>> Perl's do_print() function checks if the stream is listed as UTF-8 or
>>> not. The string being output is converted to the stream's encoding
>>> if necessary and possible. If not possible, things are just output
>>> as-is, possibly with warnings. In ack's case the stream never is
>>> (AFAIK) UTF-8. Starting in 5.19.2+, the message can be marked as
>>> UTF-8, and so tries to get converted to the non-UTF-8 stream. This
>>> is impossible in Russian, so the bytes are output as-is, with a
>>> warning. Since the terminal really is UTF-8, they display
>>> correctly. But it is possible to convert the French text, as all the
>>> characters in the message in the bug report are Latin1. So
>>> do_print() does this, but since the terminal's encoding doesn't match
>>> what ack thinks it is, the non-ascii characters come out as garbage.
>>
>> yes agree. anyway warnings are bad. and broken latin1 bad too.
>
> It's arguable that the warnings should have been output all along, since
> really it is UTF-8 being output to a terminal that perl thinks can't
> handle it.
>>
>>>
>>> Note that ack has some of its messages hard-coded in English. For
>>> example, it does a -e on the file name, and outputs English-only if
>>> it doesn't exist. rjbs has pointed out to me privately that typical
>>> uses of $! are of the form
>>>
>>> die "my message in English: $!"
>>
>> Right, usually "my message in English" indeed is in English because
>> authors don't bother with full localization and translations to all
>> languages, but for consistency it's better to see $! in locale's
>> language. Other programs usually show it in user language.
>>
>>>
>>> I am not an ack user, but it appears to me that ack is like a filter
>>> which doesn't care about encodings. It is byte rather than character
>>> oriented. This seems to me to be an appropriate use of 'use bytes',
>>> and if ack did this, this bug would not arise.
>>
>> I would disagree, they try to migrate to unicode
>>
>> https://github.com/petdance/ack2/issues/120
>> https://github.com/petdance/ack2/issues/344
>> https://github.com/petdance/ack2/issues/350
>> https://github.com/petdance/ack2/issues/355
>>
>> ack is searching _text_ using _perl regexps_ in text files. it even
>> ignores files detected as binary (by default, at least, in my
>> installation)
>
> I stand corrected.
>
>>
>>>
>>> My proposal to only use ASCII characters in error messages unless
>>> within 'use locale' would also fix this problem. All messages that
>>> print in Russian and some messages in French, would now appear in
>>> English, adding to the several that already print in English no
>>> matter what.
>>>
>>
>> I am writing programs with correct use of modern Perl unicode now, but
>> never used 'use locale', seems it adds additional side effect to code?
>> Can there be special option for 'use locale' to not change anything at
>> all, except $! behaviour (in lexical scope) ?
>
> locale works a lot better (I anticipate) in 5.20 than before. I think
> it should finally be possible to 'use locale' as a matter of habit.
>
> I was already thinking that 'use locale' in 5.22 should have the ability
> to select LC_CTYPE and LC_COLLATE individually. It seems logical to
> make this general, so you could say
>
> use locale ':messages, numeric';
>
> to get just the effects you want. Some of this could conceivably be
> added in 5.20 if it helps to resolve this blocker.
>
>>
>> also, can code without 'use locale' behave like 5.18 (i.e. not always
>> in English; bytes)
>
>
> The problem is that the commit fixed real bugs in code that didn't "use
> locale". Thus the quandary. If we go back to 5.18 behavior, those bugs
> come back. I believe that my proposal that only ASCII messages get
> displayed outside of 'use locale' is the only "sure" method that doesn't
> display garbage to someone. (Note that ASCII doesn't mean necessarily
> English. Many error messages in Western European languages consist only
> of ASCII characters. I realize that doesn't help Russian or Chinese, etc.)
>
> Also, I hadn't realized this before, but sometimes the message's
> characters aren't just garbage that someone with the motivation and
> skill could figure out, but the UNICODE REPLACEMENT CHARACTER can be
> displayed instead, so information is lost and can't be recovered.
>
> > ? and with 'use locale :errno_only' change $! to
> > return unicode character string.
>
> I don't see how this differs from your suggestion above for an option to
> 'use locale' to just affect $! (which is BTW LC_MESSAGES).
>
> And that reminds me, MS Windows doesn't have LC_MESSAGES, AFAIK. Can
> someone explain what languages error messages are displayed in under
> varied locales?
>
Another possibility to get programs like ack to work unchanged is to add
a non-printing above-Latin1 character to the stringification of $! when
it is UTF-8 and there are only Latin1 characters in it. A possibility
is a ZERO WIDTH SPACE. Then do_print() wouldn't try to downgrade. The
drawback is that code that analyzes $! could be thrown off. But code
generally should be analyzing the numeric value anyway, and not the
string representation.
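
To make the downgrade behavior concrete, here is a minimal sketch, not
taken from the perl source or from ack. It assumes an in-memory handle
with no :utf8 layer is a fair stand-in for ack's STDERR, and it uses a
made-up string in place of a real French "$!". The helper names
bytes_written and hex_dump are hypothetical.

```perl
use strict;
use warnings;
use Errno qw(ENOENT);

# Return the bytes that a handle with no :utf8 layer receives from print().
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";
    {
        no warnings 'utf8';    # suppress "Wide character in print"
        print {$fh} $string;
    }
    close $fh or die "cannot close in-memory handle: $!";
    return $buffer;
}

# Render a byte string as space-separated hex for inspection.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Hypothetical stand-in for a French "$!": only Latin-1 range characters.
my $french = "accord\x{E9}e";
utf8::upgrade($french);    # UTF-8 flag on, as 5.19.2+ may return it

# The same text with a ZERO WIDTH SPACE appended (the idea floated above).
my $padded = $french . "\x{200B}";

print 'plain:  ', hex_dump(bytes_written($french)), "\n";
print 'padded: ', hex_dump(bytes_written($padded)), "\n";

# Testing the numeric side of $! is unaffected by any of this.
$! = ENOENT;
print "no such file\n" if $! == ENOENT;
```

In the plain case the e-acute goes out as the single Latin-1 byte E9,
which a UTF-8 terminal shows as garbage. In the padded case the string
cannot be downgraded, so the e-acute goes out as C3 A9 followed by the
three bytes of the ZERO WIDTH SPACE, and perl would normally also emit
the "Wide character" warning that is silenced here.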