Re: my_strerror() as API function
From: Karl Williamson
Date: August 15, 2017 01:46
Subject: Re: my_strerror() as API function
Message ID: fb49ce1a-748d-3824-3301-3314ca4d4dab@khwilliamson.com

On 08/12/2017 05:21 PM, Zefram wrote:
> Karl Williamson wrote:
>> The heuristic you say is dodgy has been used traditionally in perl,
>
> I don't recall ever encountering it before. Though looking now, I see
> some other locale-related uses and, scarily, some in the tokeniser.
>
>> For those of you who aren't familiar with it, it leaves
>> the UTF-8 flag off on strings that have the same representation in UTF-8 as
>> not. For those, the flag's state is immaterial.
>
> This is presupposing that the only thing to decide is whether to turn
> on SvUTF8. A more accurate statement of this part of the heuristic
> would be that it interprets any byte sequence that could be valid ASCII
> as ASCII. This gives the correct result if the actual encoding of the
> input is ASCII-compatible, which one would hope would always be the case
> for locale encodings on an ASCII-based platform. (I'm ignoring EBCDIC.)
>
>> For other strings, it turns
>> on the flag if and only if it is syntactically legal UTF-8.
>
> So the effect is to decode as UTF-8 if it looks like UTF-8. This will
> correctly decode strings for any UTF-8 locale. But you ignored what
> happens in the other case: in your terminology it "leaves the flag off";
> the effect is that it decodes as ISO-8859-1. As you say, it will usually
> avoid decoding as UTF-8 if the encoding was actually ISO-8859-1, so it'll
> usually get a correct decoding for an ISO-8859-1 locale. (Usually is
> not always: I wouldn't want to rely on this for semantic purposes,
> but if only message legibility is at stake then it might be acceptable.)
The point of this is message legibility, so that "$!" doesn't create
mojibake. We have had no complaints since it got fixed to work this
way. It was never intended to do what you want to extend it to.

Please don't use the terms decode and encode. They are ambiguous.
What it appears you want to do is to translate the text from the user's
locale into Perl's underlying encoding, which is ASCII/ISO-8859-1, or
UTF-8. That may be a worthwhile enhancement, but for current purposes,
as I said, that hasn't been necessary. We don't analyze the error
message; it's just displayed, and as long as it comes out in the
encoding the user expects, it all works.
>
> But since UTF-8 and ISO-8859-1 are the only decoding options (because
> it's only willing to decide the SvUTF8 flag state), it's *guaranteed*
> to decode incorrectly for anything that's neither of these encodings.
> Cyrillic in ISO-8859-5? Guaranteed to get that wrong. And the layout of
> ISO-8859-5 is very different from ISO-8859-1, having many more letters,
> such that a natural string is considerably more likely to accidentally
> look like UTF-8. So no guarantee of which kind of mojibake you'll get.
As I said, we are not currently trying to determine which encoding the
message text is in, just to prevent mojibake. For that, all that is
needed is to determine whether something is UTF-8 or not, since UTF-8 is
the only multi-byte encoding that Perl supports.
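
As a rough illustration of the check described above, here is a minimal Python sketch (it is not the code in perl's C sources). The function name `guess_utf8_flag` is hypothetical, and Python's strict UTF-8 decoder stands in for perl's own validity test, which may differ in edge cases.

```python
def guess_utf8_flag(raw: bytes) -> bool:
    """Return True if the heuristic would treat these bytes as UTF-8.

    Hypothetical helper for illustration only:
      * pure-ASCII input has the same bytes either way, so the flag stays off;
      * otherwise the flag goes on if and only if the bytes are valid UTF-8.
    """
    if raw.isascii():
        return False
    try:
        raw.decode("utf-8")  # strict decoding; raises on malformed input
    except UnicodeDecodeError:
        return False
    return True


# Example inputs: an ASCII message, a UTF-8 message, and a Latin-1 message.
ascii_msg = b"No such file or directory"
utf8_msg = "Fichier non trouv\u00e9".encode("utf-8")
latin1_msg = "Fichier non trouv\u00e9".encode("latin-1")
```

With these inputs, only `utf8_msg` would get the flag; the other two are left as plain bytes.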
I had never really thought about this before, but I was wrong that the
result depended on the particular way word characters vs punctuation
were positioned in 8859-1. I read that somewhere sometime, and just
assumed it was true. In fact, the range of code points 80 - 9F is not
allocated to graphic characters in any ISO 8859 encoding. This range is
entirely controls, hardly used anywhere anymore, and certainly not in
the middle of text. However, characters from this range are used in
every non-ASCII UTF-8 sequence as continuation bytes. This means that
the heuristic is 100% accurate in distinguishing UTF-8 from any of the
8859 encodings, contrary to what you said about 8859-5.

I concede that there are encodings that do use the 80-9F range, and
these could be wrongly guessed. The most likely one still in common use
is CP 1252. I did try once to create a string that made sense in both
encodings, and I did succeed, but it was quite hard for me to do, and
was very short; much shorter than an error message.
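
To make the CP 1252 ambiguity concrete, the following Python sketch shows how one byte string can be read under both encodings. The helper names `_try_decode` and `readings` are hypothetical, and Python's `cp1252` codec is assumed to be a fair stand-in for the Windows code page (it leaves a few bytes such as 0x81 undefined).

```python
def _try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def readings(raw: bytes):
    """Return (text_if_utf8, text_if_cp1252); either element may be None."""
    return (_try_decode(raw, "utf-8"), _try_decode(raw, "cp1252"))


# Two bytes that are well-formed under both encodings, with different meanings.
ambiguous = readings(b"\xc3\xa9")

# A CP 1252 word that is not well-formed UTF-8, so the guess cannot go wrong.
unambiguous = readings("caf\u00e9".encode("cp1252"))
```

A string is at risk of being misjudged only when both elements of the pair are non-`None`, and whether the CP 1252 reading looks like sensible text is a separate question.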
>
> $! used to be consistently mojibaked in a locale-to-Latin-1 manner.
> That sucked. Now, outside the scope of "use locale" it's consistently
> English, which is better. But if one wants localised messages and so
> uses "use locale", now $! isn't consistently anything. It's worse than
> when it was wrong in a consistent way.
That statement doesn't make sense to me.
>
>> There is no way of being able to determine with total reliability the locale
>> that something is encoded in
>
> Wrong question. We're not given an arbitrary string and made to guess
> its locale. We *know* the locale, because it's the LC_MESSAGES setting
> under which we just called strerror(). The tricky bit is to determine
> the character encoding that the locale uses.
>
>> across all systems that Perl can run on.
>
> True, but we can do a lot better than we do now. nl_langinfo(CODESET)
> yields a string naming the encoding, on a lot of systems. We can feed
> that encoding name into Encode.
>
> In fact, we've already got code using nl_langinfo() in the core, in
> locale.c, to try to determine whether a locale uses the UTF-8 encoding.
> Apparently to control the behaviour of -CL. We could do a lot more
> with this.
>
Actually, this is used for various reasons. Again, perl internally
currently and in the past only cares whether something is UTF-8 or not.
That's been sufficient for our purposes.

If you look carefully, you will see that it doesn't trust the output of
nl_langinfo, but checks that a claimed UTF-8 codeset has expected behavior.

I do not know if the codesets returned by nl_langinfo match Encode's
names in all cases, or even if the names are standardized across
platforms, and, to repeat, nl_langinfo is not available in some modern
systems, such as win32, which doesn't even have LC_MESSAGES.

Note that your translation will always end up being in UTF-8 for any
8-bit encoding that is not ISO-8859-1.
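
The reason is that characters from other 8-bit encodings map to code points above 0xFF, which cannot be stored one byte per character. A small Python sketch of that arithmetic follows; Python strings are not perl scalars, so this only illustrates the code point ranges, and `fits_in_latin1` is a hypothetical helper.

```python
def fits_in_latin1(text: str) -> bool:
    """True if every character has a code point that fits in a single byte."""
    return all(ord(ch) <= 0xFF for ch in text)


# Three Cyrillic letters as they would appear in an ISO-8859-5 locale.
cyrillic_bytes = bytes([0xBD, 0xD5, 0xE2])
cyrillic_text = cyrillic_bytes.decode("iso8859_5")

# One accented letter as it would appear in an ISO-8859-1 locale.
latin1_text = bytes([0xE9]).decode("latin-1")
```

Here `cyrillic_text` consists of code points U+041D, U+0435 and U+0442, so it does not fit, while `latin1_text` does.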
And finally, I want to reiterate that what you are proposing is
not how perl has ever operated on locale data. We do not care what the
encoding is, except for UTF-8. For all others, it's just a series of
bytes that should make sense to the user.

Also, what you are proposing should be trivially achievable in pure Perl
using POSIX::nl_langinfo and Encode. If you were to prototype it that
way you could find out if there are glitches between the names each
understands.
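
The shape of such a prototype can be sketched as follows. This is written in Python rather than Perl, so `locale.nl_langinfo` and `codecs.lookup` stand in for POSIX::nl_langinfo and Encode; the function names are hypothetical, and the name-matching results will not necessarily carry over to Encode.

```python
import codecs
import locale


def current_codeset():
    """Return the codeset name reported for the current locale, or None.

    nl_langinfo is missing on some platforms (Windows, for instance), so
    its absence is treated as "unknown" rather than as an error.
    """
    if not hasattr(locale, "nl_langinfo"):
        return None
    return locale.nl_langinfo(locale.CODESET)


def lookup_codec(name: str):
    """Return the canonical codec name for `name`, or None if unrecognised."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def check_current_locale():
    """Report (codeset, codec) so that naming mismatches become visible."""
    codeset = current_codeset()
    codec = lookup_codec(codeset) if codeset else None
    return codeset, codec
```

Running `check_current_locale()` under a range of locales and platforms would show which reported codeset names the conversion library fails to recognise.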