
Re: my_strerror() as API function

From: Zefram
Date: August 15, 2017 04:38
Subject: Re: my_strerror() as API function
Message ID: 20170815043803.GX9383@fysh.org
Karl Williamson wrote:
>fact, the range of code points 80 - 9F are not allocated in any ISO 8859
>encoding.  This range is entirely controls, hardly used anywhere anymore, and
>certainly not in the middle of text. However, characters from this range are
>used in every non-ASCII UTF-8 sequence as continuation bytes.  This means
>that the heuristic is 100% accurate in distinguishing UTF-8 from any of the
>8859 encodings, contrary to what you said about 8859-5.

No, that's not correct.  The C1 controls are indeed there, in all the ISO
8859 encodings, but they only cover half the range of UTF-8 continuation
bytes.  0xa0 to 0xbf are also continuation bytes.  So many, though not
all, multibyte UTF-8 character representations consist entirely of byte
values that represent printable characters in ISO-8859-*.  The point about the
distribution of letters and symbols comes from the fact that none of 0xa0
to 0xbf represent letters in ISO-8859-1.  But most of them are letters
in ISO-8859-5.  (Luckily they're capital letters, which provides some
lesser degree of safety against accidentally forming UTF-8 sequences.)
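
(A quick illustration of that last point, mine rather than anything from
the thread, using Encode's iso-8859-5 table: the continuation-byte range
0xa0-0xbf is almost entirely capital Cyrillic letters in ISO-8859-5, so a
UTF-8 sequence whose continuation bytes all land in that range reads as
printable capitals when misinterpreted as 8859-5.)

    use Encode qw(encode decode);
    binmode STDOUT, ':encoding(UTF-8)';

    # Which of 0xA0..0xBF are letters in ISO-8859-5?
    for my $b (0xA0 .. 0xBF) {
        my $c = decode('iso-8859-5', chr $b);
        printf "0x%02X => %s\n", $b,
            $c =~ /\p{Lu}/ ? "capital letter $c" : 'not a letter';
    }

    # "vaza" (Russian for "vase"): every continuation byte falls in
    # 0xA0..0xBF, so all of its octets are printable as ISO-8859-5.
    my $octets = encode('UTF-8', "\x{0432}\x{0430}\x{0437}\x{0430}");
    print decode('iso-8859-5', $octets), "\n";    # prints "РВРАРЗРА"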

>And finally, I want to reiterate again that what you are proposing is not how
>perl has ever operated on locale data.

True, but how it's operating now is crap.  It was somewhat crap
when it didn't decode locale strings at all, and just trusted that
the bytes should make sense to the user.  It was an oversight that
when Unicode was embraced this wasn't changed to decode to the native
Unicode representation.  But at least it was consistent in providing
a locale-encoding byte string.  Now it's inconsistent: $! may provide
either the locale-encoded byte string or the character string that the
byte string probably represents.  Consistently decoding it to a character
string would certainly be novel, but it's the only behaviour that makes
sense in a Unicode environment.
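
(One can observe which form came back without any C-level poking; this is
just a probe of mine, and whether the message is localized at all here
depends on the perl version and on the locale being installed.)

    use POSIX qw(setlocale LC_ALL);
    use locale;
    setlocale(LC_ALL, 'ru_RU.UTF-8') or die "locale not installed\n";
    $! = 2;                                   # ENOENT on most systems
    my $msg = "$!";
    # With SvUTF8 set the ordinals are code points; with it clear they
    # are the raw locale-encoded bytes.
    printf "SvUTF8 %s; ordinals: %s\n",
        utf8::is_utf8($msg) ? 'set' : 'clear',
        join ' ', map { sprintf 'U+%04X', ord } split //, $msg;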

>Also, what you are proposing should be trivially achievable in pure Perl
>using POSIX::nl_langinfo and Encode.

It's not trivial to apply this to $!, because of the aforementioned
inconsistency.  It's *possible* with some mucking about with SvUTF8,
but we'd never say that that kind of treatment of $! was a supported
interface.
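
(For what it's worth, the mucking about looks roughly like the sketch
below.  It's only mine, it uses I18N::Langinfo's langinfo() in place of
POSIX::nl_langinfo, and it relies on exactly the SvUTF8 inspection that
we wouldn't call a supported interface.)

    use I18N::Langinfo qw(langinfo CODESET);
    use Encode qw(decode);

    # Return $! as a character string, whichever form perl handed us.
    sub errstr_chars {
        my $msg = "$!";
        return $msg if utf8::is_utf8($msg);    # already a character string
        my $codeset = langinfo(CODESET)
            || 'ISO-8859-1';                   # fallback is my guess
        return decode($codeset, $msg);
    }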

>                                      If you were to prototype it that way
>you could find out if there are glitches between the names each understands.

Yes, ish.  The basic decoding can certainly be prototyped this way, and so
can the additional logic for places where nl_langinfo() is unavailable or
where we can detect that it gives bad data.  But this doesn't sound all
that useful as an investigatory tool.  The way to find out how useful
this logic is would be to gather strerror()/nl_langinfo() pairs from a wide
range of OSes.  In any case, as a porting task it's not something that
one person can do alone.
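
(The gathering could be as simple as this sketch of mine: dump the codeset
and the ordinals of a handful of strerror() strings for whatever locale is
configured, and collect the output from as many systems as people will run
it on.)

    use POSIX qw(setlocale LC_ALL strerror);
    use I18N::Langinfo qw(langinfo CODESET);

    setlocale(LC_ALL, '') or warn "setlocale failed\n";
    print "codeset: ", langinfo(CODESET), "\n";
    for my $errno (1 .. 20) {
        # Ordinals of each element (bytes, unless perl has already decoded
        # the message), so the encoding can be judged afterwards.
        printf "%3d: %s\n", $errno,
            join ' ', map { sprintf '%02X', ord } split //, strerror($errno);
    }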

-zefram


