develooper Front page | perl.perl5.porters | Postings from August 2017

Re: my_strerror() as API function

Karl Williamson
August 16, 2017 04:11
Re: my_strerror() as API function
On 08/14/2017 10:38 PM, Zefram wrote:
> Karl Williamson wrote:
>> fact, the range of code points 80 - 9F are not allocated in any ISO 8859
>> encoding.  This range is entirely controls, hardly used anywhere anymore, and
>> certainly not in the middle of text. However, characters from this range are
>> used in every non-ASCII UTF-8 sequence as continuation bytes.  This means
>> that the heuristic is 100% accurate in distinguishing UTF-8 from any of the
>> 8859 encodings, contrary to what you said about 8859-5.
> No, that's not correct.  The C1 controls are indeed there, in all the ISO
> 8859 encodings, but they only cover half the range of UTF-8 continuation
> bytes.  0xa0 to 0xbf are also continuation bytes.  So many, not all,
> multibyte UTF-8 character representations consist entirely of byte values
> that represent printable characters in ISO-8859-*.  The thing about the
> distribution of letters and symbols comes from the fact that none of 0xa0
> to 0xbf represent letters in ISO-8859-1.  But most of them are letters
> in ISO-8859-5.  (Luckily they're capital letters, which provides some
> lesser degree of safety against accidentally forming UTF-8 sequences.)

I'm sorry; I got confused, and additionally misstated things I wasn't 
confused about.  I do sometimes forget that continuation bytes go up 
through BF.  And my point about the C1 controls was not that they are 
unusable in 8859 texts, but that they are separate from 8859, unlike 
Windows CP1252, which does use most of the C1-defined code points to 
represent graphic characters.
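That difference is easy to demonstrate concretely.  A sketch (Python used
purely for illustration; its codec tables follow the standard mappings):
byte 0x93 is an unprintable C1 control under ISO-8859-1, but a curly
quotation mark under CP1252.

```python
# Byte 0x93: a C1 control in ISO-8859-1, but a graphic character in CP1252.
b = bytes([0x93])

latin1 = b.decode("latin-1")   # U+0093, an unprintable C1 control
cp1252 = b.decode("cp1252")    # U+201C, LEFT DOUBLE QUOTATION MARK

print(hex(ord(latin1)))  # 0x93
print(hex(ord(cp1252)))  # 0x201c
```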

But my point remains: you are just not going to see C1 controls in text.

That leaves the range A0-BF, which are legal continuation bytes but are 
mostly symbols in 8859-1, and so it is hard to confuse that encoding 
with UTF-8.  That's why the layout of 8859-1 does make a difference, 
though the chances of confusion are still above 0%.
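The heuristic itself can be sketched in a few lines (Python for
illustration; perl's actual check lives in its C internals): accept the
string as UTF-8 only if it is structurally valid UTF-8, which non-ASCII
8859-1 text almost never is.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic sketch: accept the string as UTF-8 only if it is
    structurally valid UTF-8.  Non-ASCII ISO-8859-1 text almost
    always fails, because every non-ASCII UTF-8 character needs a
    lead byte (C0 and up) followed by continuation bytes (80-BF)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("naïve".encode("utf-8")))    # True
print(looks_like_utf8("naïve".encode("latin-1")))  # False: ï is 0xEF,
                                                   # a lead byte followed by ASCII 'v'
```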

The range A0-BF in 8859-5 is almost entirely letters.  Modern Russian is 
represented by the 4 rows B0-EF, plus A1 and F1 (though these last two 
are often transliterated to other letters these days).  The other word 
characters in 8859-5 are used in other Cyrillic languages.  Text in 
those languages will use a mixture of the Russian characters plus 
characters from the A0-AF row and the F0-FF row.

The capital letters are those up through CF; anything above is lowercase.

For a byte sequence to be confusable with a UTF-8-encoded character, it 
must begin with a byte C0 or greater, followed by one or more bytes 
below C0.
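In byte terms, that structure can be written as a small classifier (a
sketch; the ranges are per the UTF-8 definition):

```python
def utf8_role(b: int) -> str:
    """Classify a byte's possible role in a UTF-8 stream (sketch).
    Strictly, C0-C1 and F5-FF can never appear in valid UTF-8 at all,
    but the coarse split below is what matters for confusability."""
    if b <= 0x7F:
        return "ascii"
    if b <= 0xBF:
        return "continuation"   # 80-BF: the C1 controls plus A0-BF
    return "lead"               # C0-FF: starts a multibyte sequence

# A confusable sequence is one "lead" byte followed by "continuation" bytes.
print(utf8_role(0xC6), utf8_role(0xB0))  # lead continuation
```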

The range C0-CF is essentially the last half of the capital letters in 
the modern Russian alphabet, including half the vowels.

Let's take that case first.  To be valid UTF-8, the next byte must be 
below C0, and hence must also be uppercase.  If this is to represent a 
word in Cyrillic, the byte after that must again be C0 or above, and so 
on.  One could construct a confusable sequence of uppercase letters as 
long as every other one comes from the last half of the Russian 
alphabet, and the rest come from the first half, or from Macedonian, 
Ukrainian, and the like.
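This construction can be demonstrated concretely (a Python sketch; the
byte values are per the ISO-8859-5 table).  The all-caps pair ЦА encodes
to C6 B0, which happens to be a valid two-byte UTF-8 sequence, while the
natural mixed-case Ца is not:

```python
# Ц (U+0426) is 0xC6 in ISO-8859-5: last half of the capitals, a UTF-8 lead byte.
# А (U+0410) is 0xB0: first half of the capitals, a UTF-8 continuation byte.
allcaps = "ЦА".encode("iso8859-5")          # b'\xc6\xb0'
print(allcaps.decode("utf-8"))              # valid UTF-8: decodes to 'ư' (U+01B0)

# Lowercase а (U+0430) is 0xD0, itself a lead byte, so mixed case breaks the pattern.
mixed = "Ца".encode("iso8859-5")            # b'\xc6\xd0'
try:
    mixed.decode("utf-8")
except UnicodeDecodeError:
    print("ordinary mixed-case Cyrillic is not valid UTF-8")
```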

I took Russian in college; the capitalization rules are similar to 
English.  You just don't see strings of all caps.  So yes, this is 
confusable for short strings of all caps, provided the other conditions 
are met.  Something like the Cyrillic equivalent of EINVAL might be one 
such string.

Now let's look at the other case, where the first byte is D0 or above. 
This is a lowercase letter, and it must be followed by one or more bytes 
that are all uppercase.  Again, you won't see things like aB, bAR, eINV 
in text.

I looked at the remaining 8859 code pages:
-2  only one vowel below C0
-3  only one vowel below C0
-4  only two vowels below C0
-6  no letters below C0
-7  7 letters below C0, all polytonic Greek, and I'm not qualified to 
analyze this.
-8 only punctuation below E0
-9 only punctuation below C0
-10 almost all characters C0 and above are vowels
-11 I'm not qualified to analyze Thai, but I notice that of the code 
points C0 and above, more than half are: 1) unassigned; 2) digits; 3) 
must immediately follow another byte; whereas in UTF-8 they are start bytes.
-12 this code page was never finished
-13 only three letters (2 of them vowels) below C0
-14 almost all the letters C0 and above are vowels, so the text would 
have to consist mostly of vc, vcc, vccc patterns.  That's quite unlikely 
for more than a couple of words in a row.
-15 only two vowels below C0
-16 only three vowels below C0
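The survey above can be spot-checked mechanically.  A sketch (it leans on
Python's codec tables and Unicode categories, and counts all letters in
the A0-BF continuation range rather than just vowels):

```python
import unicodedata

def letters_in_a0_bf(codec: str) -> int:
    """Count the bytes A0-BF (the UTF-8 continuation range above the
    C1 controls) that map to letters under the given 8-bit codec."""
    count = 0
    for b in range(0xA0, 0xC0):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte unassigned in this code page
        if unicodedata.category(ch).startswith("L"):
            count += 1
    return count

for codec in ("iso8859-1", "iso8859-5"):
    print(codec, letters_in_a0_bf(codec))
# 8859-1 has only a handful (ª, µ, º); 8859-5 has a full alphabet's worth.
```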

It looks to me like this heuristic can fail on strings of a few bytes, 
but for real text does a pretty good job.
>> And finally, I want to reiterate again that what you are proposing is not how
>> perl has ever operated on locale data.
> True, but how it's operating now is crap.  It was somewhat crap
> when it didn't decode locale strings at all, and just trusted that
> the bytes should make sense to the user.  It was an oversight that
> when Unicode was embraced this wasn't changed to decode to the native
> Unicode representation.  But at least it was consistent in providing
> a locale-encoding byte string.  Now it's inconsistent: $! may provide
> either the locale-encoded byte string or the character string that the
> byte string probably represents.  Consistently decoding it to a character
> string would certainly be novel, but it's the only behaviour that makes
> sense in a Unicode environment.

I don't believe most of this.  Perhaps some of that is because you used 
the word 'decode' again in a way that obscures your meaning.
>> Also, what you are proposing should be trivially achievable in pure Perl
>> using POSIX::nl_langinfo and Encode.
> It's not trivial to apply this to $!, because of the aforementioned
> inconsistency.  It's *possible* with some mucking about with SvUTF8,
> but we'd never say that that kind of treatment of $! was a supported
> interface.

Since I don't understand and don't believe the above stuff, I don't see 
that writing in C gives you any more tools than pure perl.
>>                                       If you were to prototype it that way
>> you could find out if there are glitches between the names each understands.
> Yes, ish.  The basic decoding can certainly be prototyped this way, and so
> can the additional logic for places where nl_langinfo() is unavailable or
> where we can detect that it gives bad data.  But this doesn't sound all
> that useful as an investigatory tool.  The way to find out how useful
> this logic is is to gather strerror()/nl_langinfo() pairs from a wide
> range of OSes.  In any case, as a porting task it's not something that
> one person can do alone.
> -zefram
