Re: my_strerror() as API function

From: Zefram
Date: August 12, 2017 23:22
Subject: Re: my_strerror() as API function
Message ID: 20170812232155.GV9383@fysh.org
Karl Williamson wrote:
>The heuristic you say is dodgy has been used traditionally in perl,

I don't recall ever encountering it before.  Though looking now, I see
some other locale-related uses and, scarily, some in the tokeniser.

>                      For those of you who aren't familiar with it, it leaves
>the UTF-8 flag off on strings that have the same representation in UTF-8 as
>not.  For those, the flag's state is immaterial.

This is presupposing that the only thing to decide is whether to turn
on SvUTF8.  A more accurate statement of this part of the heuristic
would be that it interprets any byte sequence that could be valid ASCII
as ASCII.  This gives the correct result if the actual encoding of the
input is ASCII-compatible, which one would hope would always be the case
for locale encodings on an ASCII-based platform.  (I'm ignoring EBCDIC.)

>                                                  For other strings, it turns
>on the flag if and only if it is syntactically legal UTF-8.

So the effect is to decode as UTF-8 if it looks like UTF-8.  This will
correctly decode strings for any UTF-8 locale.  But you ignored what
happens in the other case: in your terminology it "leaves the flag off";
the effect is that it decodes as ISO-8859-1.  As you say, it will usually
avoid decoding as UTF-8 if the encoding was actually ISO-8859-1, so it'll
usually get a correct decoding for an ISO-8859-1 locale.  (Usually is
not always: I wouldn't want to rely on this for semantic purposes,
but if only message legibility is at stake then it might be acceptable.)
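
Spelled out as a rough Perl-level sketch (the real logic is C code in
the core; this subroutine name is made up), the heuristic amounts to:

    sub heuristic_decode {
        my ($bytes) = @_;             # raw bytes as returned by strerror()
        if ($bytes !~ /[^\x00-\x7F]/) {
            return $bytes;            # pure ASCII: flag state is immaterial
        }
        my $copy = $bytes;
        if (utf8::decode($copy)) {    # syntactically legal UTF-8?
            return $copy;             # decoded as UTF-8, SvUTF8 turned on
        }
        return $bytes;                # anything else is left as raw bytes,
                                      # i.e. effectively decoded as ISO-8859-1
    }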

But since UTF-8 and ISO-8859-1 are the only decoding options (because
it's only willing to decide the SvUTF8 flag state), it's *guaranteed*
to decode incorrectly for anything that's neither of these encodings.
Cyrillic in ISO-8859-5?  Guaranteed to get that wrong.  And the layout of
ISO-8859-5 is very different from ISO-8859-1, having many more letters,
such that a natural string is considerably more likely to accidentally
look like UTF-8.  So no guarantee of which kind of mojibake you'll get.
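
For example (purely illustrative; the Russian word is just a convenient
way to show the failure mode):

    use Encode qw(encode decode);

    my $msg   = "\x{41D}\x{435}\x{442}";        # "Нет", Cyrillic
    my $bytes = encode('ISO-8859-5', $msg);     # bytes 0xBD 0xD5 0xE2
    my $wrong = decode('ISO-8859-1', $bytes);   # yields the garbage "½Õâ"
    my $right = decode('ISO-8859-5', $bytes);   # yields "Нет" again

Those bytes aren't legal UTF-8 (0xBD can't start a sequence), so the
heuristic falls through to its Latin-1 interpretation and produces
mojibake.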

$! used to be consistently mojibaked in a locale-to-Latin-1 manner.
That sucked.  Now, outside the scope of "use locale" it's consistently
English, which is better.  But if one wants localised messages and so
uses "use locale", now $! isn't consistently anything.  It's worse than
when it was wrong in a consistent way.
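
Concretely, something like this (assuming a ru_RU.UTF-8 locale is
installed; the filename is just one that doesn't exist):

    use POSIX qw(setlocale LC_ALL);
    setlocale(LC_ALL, 'ru_RU.UTF-8') or die "locale not installed";

    open my $fh, '<', '/no/such/file';
    my $english   = "$!";                     # consistently English
    my $localised = do { use locale; "$!" };  # Russian message text

gives an English message in the first string, while the second string's
contents and SvUTF8 state are whatever the heuristic happened to decide.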

>There is no way of being able to determine with total reliability the locale
>that something is encoded in

Wrong question.  We're not given an arbitrary string and made to guess
its locale.  We *know* the locale, because it's the LC_MESSAGES setting
under which we just called strerror().  The tricky bit is to determine
the character encoding that the locale uses.

>                             across all systems that Perl can run on.

True, but we can do a lot better than we do now.  nl_langinfo(CODESET)
yields a string naming the encoding, on a lot of systems.  We can feed
that encoding name into Encode.
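
At the Perl level the idea would look something like this (only a
sketch: the core would do the equivalent in C, and in recent perls
POSIX::strerror's result may itself already have been run through the
current heuristic):

    use POSIX          qw(setlocale LC_ALL strerror);
    use I18N::Langinfo qw(langinfo CODESET);
    use Encode         qw(decode);

    setlocale(LC_ALL, '') or die "cannot set locale from environment";
    my $codeset = langinfo(CODESET);         # e.g. "UTF-8", "ISO-8859-5"
    my $bytes   = strerror(2);               # ENOENT, in the locale's encoding
    my $text    = decode($codeset, $bytes);  # decoded via the named encoding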

In fact, we've already got code using nl_langinfo() in the core, in
locale.c, to try to determine whether a locale uses the UTF-8 encoding,
apparently to control the behaviour of -CL.  We could do a lot more
with this.

-zefram
