From: Karl Williamson
Date: August 16, 2017 04:11
Subject: Re: my_strerror() as API function
Message ID: 4d46201d-7fa3-80f6-def0-0fd43978f2d9@khwilliamson.com
On 08/14/2017 10:38 PM, Zefram wrote:
> Karl Williamson wrote:
>> fact, the range of code points 80 - 9F are not allocated in any ISO 8859
>> encoding. This range is entirely controls, hardly used anywhere anymore, and
>> certainly not in the middle of text. However, characters from this range are
>> used in every non-ASCII UTF-8 sequence as continuation bytes. This means
>> that the heuristic is 100% accurate in distinguishing UTF-8 from any of the
>> 8859 encodings, contrary to what you said about 8859-5.
>
> No, that's not correct. The C1 controls are indeed there, in all the ISO
> 8859 encodings, but they only cover half the range of UTF-8 continuation
> bytes. 0xa0 to 0xbf are also continuation bytes. So many, not all,
> multibyte UTF-8 character representations consist entirely of byte values
> that represent printable characters in ISO-8859-*. The thing about the
> distribution of letters and symbols comes from the fact that none of 0xa0
> to 0xbf represent letters in ISO-8859-1. But most of them are letters
> in ISO-8859-5. (Luckily they're capital letters, which provides some
> lesser degree of safety against accidentally forming UTF-8 sequences.)
I'm sorry; I got confused, and additionally misstated things I wasn't
confused about.  I do sometimes space out that continuation bytes go up
through BF.  And my point about the C1 controls was not that they are
unusable in 8859 texts, but that they are separate from 8859, unlike
Windows CP1252, which does use most of the C1-defined code points to
represent graphic characters.
But my point remains: you are just not going to see C1 controls in text.
That leaves the range A0-BF, which are legal continuation bytes but
mostly symbols in 8859-1, making it hard to confuse that encoding with
UTF-8.  That's why the layout of 8859-1 does make a difference, though
the chances of confusion are still above 0%.
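
To make the heuristic concrete, here is a rough sketch in pure Perl (my
illustration, not the actual check in perl's source): a byte string is
structurally plausible as UTF-8 only if every non-ASCII byte is part of
a lead-byte-plus-continuation-bytes sequence, with the continuation
bytes all in 80-BF:

    use strict;
    use warnings;

    # Sketch only: accepts a byte string iff it is structurally valid
    # UTF-8 (ignoring finer points like overlong forms and surrogates).
    # Every continuation byte falls in 80-BF, which is why the 8859
    # layouts of A0-BF matter so much.
    sub looks_like_utf8 {
        my ($bytes) = @_;
        return $bytes =~ / \A (?: [\x00-\x7F]                   # ASCII
                                | [\xC2-\xDF] [\x80-\xBF]       # 2-byte character
                                | [\xE0-\xEF] [\x80-\xBF]{2}    # 3-byte character
                                | [\xF0-\xF4] [\x80-\xBF]{3}    # 4-byte character
                              )* \z /x;
    }
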
The range A0-BF in 8859-5 is almost entirely letters.  Modern Russian is
represented by the 4 rows B0-EF, plus A1 and F1 (though these last two
are often transliterated to other letters these days).  The other word
characters in 8859-5 are used in other Cyrillic languages.  Text in
those languages will use a mixture of Russian characters plus characters
from the A0-AF row and the F0-FF row.
The capital letters are those up through CF; anything above is lowercase.
For a byte sequence to be confusable with a UTF-8-encoded character, it
must begin with a byte C0 or greater, followed by one or more bytes
below C0.
The range C0-CF is essentially the last half of the capital letters in
the modern Russian alphabet, including half the vowels.
Let's take that case first.  To be valid UTF-8, the next byte must be
below C0, and hence must also be uppercase.  For the text to stay
confusable with UTF-8, the byte after that must again be C0 or above,
and so on.  One could construct a confusable sequence of uppercase
letters as long as every other one comes from the last half of the
Russian alphabet, and the rest come from the first half or from
Macedonian, Ukrainian, and the like.
I took Russian in college; the capitalization rules are similar to
English. You just don't see strings of all caps. So yes, this is
confusable for short strings of all caps, provided the other conditions
are met. Something like the Cyrillic equivalent of EINVAL might be
confusable.
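
Here is one concrete instance (my own example, not one from this
thread): the capital letters ТА are the bytes C2 B0 in 8859-5, and those
same two bytes are also a perfectly valid UTF-8 character:

    use strict;
    use warnings;
    use Encode qw(decode);

    # "\xC2\xB0" is capital Te followed by capital A in ISO-8859-5, and
    # is simultaneously a valid two-byte UTF-8 sequence, so a heuristic
    # can only guess which was meant.
    my $bytes       = "\xC2\xB0";
    my $as_cyrillic = decode("ISO-8859-5", $bytes); # "\x{0422}\x{0410}"
    my $as_utf8     = decode("UTF-8",      $bytes); # "\x{00B0}", the degree sign
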
Now let's look at the other case, where the first byte is D0 or above.
This is a lowercase letter, and it must be followed by one or more bytes
that are all uppercase. Again, you won't see things like aB, bAR, eINV
in text.
I looked at the remaining 8859 code pages:

-2   only one vowel below C0
-3   only one vowel below C0
-4   only two vowels below C0
-6   no letters below C0
-7   7 letters below C0, all polytonic Greek, and I'm not qualified
     to analyze this
-8   only punctuation below E0
-9   only punctuation below C0
-10  almost all characters C0 and above are vowels
-11  I'm not qualified to analyze Thai, but I notice that of the code
     points C0 and above, more than half are: 1) unassigned; 2) digits;
     or 3) must immediately follow another byte; whereas in UTF-8 they
     are start bytes
-12  this code page was never finished
-13  only three letters (2 of them vowels) below C0
-14  almost all the letters C0 and above are vowels, so the text would
     have to be mostly of the form vc vcc vccc; that's quite unlikely
     for more than a couple of words in a row
-15  only two vowels below C0
-16  only three vowels below C0
It looks to me like this heuristic can fail on strings of a few bytes,
but for real text does a pretty good job.
>
>> And finally, I want to reiterate again that what you are proposing is not how
>> perl has ever operated on locale data.
>
> True, but how it's operating now is crap. It was somewhat crap
> when it didn't decode locale strings at all, and just trusted that
> the bytes should make sense to the user. It was an oversight that
> when Unicode was embraced this wasn't changed to decode to the native
> Unicode representation. But at least it was consistent in providing
> a locale-encoding byte string. Now it's inconsistent: $! may provide
> either the locale-encoded byte string or the character string that the
> byte string probably represents. Consistently decoding it to a character
> string would certainly be novel, but it's the only behaviour that makes
> sense in a Unicode environment.
I don't believe most of this. Perhaps some of that is because you used
the word 'decode' again in a way that obscures your meaning.
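
For anyone wanting to check what you describe, the flag in question can
be probed from pure Perl (a quick sketch of mine, not code from this
thread): utf8::is_utf8() reports whether the string $! returned carries
the internal UTF-8 flag (the SvUTF8 you mention):

    use strict;
    use warnings;

    $! = 1;    # set errno to something arbitrary, e.g. EPERM
    my $msg = "$!";
    printf "message: %s; flagged as characters: %s\n",
           $msg, utf8::is_utf8($msg) ? "yes" : "no";
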
>
>> Also, what you are proposing should be trivially achievable in pure Perl
>> using POSIX::nl_langinfo and Encode.
>
> It's not trivial to apply this to $!, because of the aforementioned
> inconsistency. It's *possible* with some mucking about with SvUTF8,
> but we'd never say that that kind of treatment of $! was a supported
> interface.
Since I don't understand, and don't believe, the above stuff, I don't
see that writing this in C gives you any more tools than pure perl does.
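
For the record, a sketch of what that pure-perl prototype might look
like (my illustration; nl_langinfo() is reachable from core through
I18N::Langinfo):

    use strict;
    use warnings;
    use POSIX qw(setlocale LC_ALL);
    use I18N::Langinfo qw(langinfo CODESET);
    use Encode qw(decode);

    # Adopt the user's locale, ask it what encoding its messages are
    # in, then decode the strerror() bytes into a character string.
    setlocale(LC_ALL, "") or die "setlocale failed";
    $! = 22;                               # e.g. EINVAL
    my $codeset = langinfo(CODESET);       # e.g. "UTF-8" or "ISO-8859-5"
    my $message = decode($codeset, "$!");

Though this assumes "$!" hands back raw locale bytes; given the
inconsistency you describe, the prototype would first have to know
whether $! had already been upgraded.
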
>
>> If you were to prototype it that way
>> you could find out if there are glitches between the names each understands.
>
> Yes, ish. The basic decoding can certainly be prototyped this way, and so
> can the additional logic for places where nl_langinfo() is unavailable or
> where we can detect that it gives bad data. But this doesn't sound all
> that useful as an investigatory tool. The way to find out how useful
> this logic is is to gather strerror()/nl_langinfo() pairs from a wide
> range of OSes. In any case, as a porting task it's not something that
> one person can do alone.
>
> -zefram
>