Karl Williamson wrote:
>fact, the range of code points 80 - 9F are not allocated in any ISO 8859
>encoding. This range is entirely controls, hardly used anywhere anymore, and
>certainly not in the middle of text. However, characters from this range are
>used in every non-ASCII UTF-8 sequence as continuation bytes. This means
>that the heuristic is 100% accurate in distinguishing UTF-8 from any of the
>8859 encodings, contrary to what you said about 8859-5.

No, that's not correct. The C1 controls are indeed there, in all the
ISO 8859 encodings, but they cover only half the range of UTF-8
continuation bytes: 0xa0 to 0xbf are also continuation bytes. So many,
though not all, multibyte UTF-8 character representations consist
entirely of byte values that represent printable characters in
ISO-8859-*. The point about the distribution of letters and symbols
comes from the fact that none of 0xa0 to 0xbf represent letters in
ISO-8859-1, but most of them are letters in ISO-8859-5. (Luckily
they're capital letters, which provides some lesser degree of safety
against accidentally forming UTF-8 sequences.)

>And finally, I want to reiterate again that what you are proposing is not how
>perl has ever operated on locale data.

True, but how it's operating now is crap. It was somewhat crap when it
didn't decode locale strings at all, and just trusted that the bytes
would make sense to the user. It was an oversight that, when Unicode
was embraced, this wasn't changed to decode to the native Unicode
representation. But at least it was consistent in providing a
locale-encoded byte string. Now it's inconsistent: $! may provide
either the locale-encoded byte string or the character string that the
byte string probably represents. Consistently decoding it to a
character string would certainly be novel, but it's the only behaviour
that makes sense in a Unicode environment.

>Also, what you are proposing should be trivially achievable in pure Perl
>using POSIX::nl_langinfo and Encode.
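(A quick byte-level check of the continuation-byte overlap described
above. Python rather than Perl, purely for brevity; the byte values
themselves are language-neutral, and the Cyrillic letter is an
arbitrary example:)

```python
# "е" (U+0435 CYRILLIC SMALL LETTER IE) - an arbitrary example letter.
b = "е".encode("utf-8")
print(b)                        # b'\xd0\xb5'

# Both bytes of that UTF-8 sequence are printable letters under
# ISO-8859-5, so nothing control-like appears in the byte stream:
print(b.decode("iso-8859-5"))   # 'аЕ'

# Only half of the continuation-byte range 0x80-0xBF falls in the
# C1 control range 0x80-0x9F:
c1 = [x for x in range(0x80, 0xC0) if x <= 0x9F]
print(len(c1))                  # 32 of 64
```

So a heuristic based on C1 controls being absent from text only fires
when a sequence happens to contain a continuation byte below 0xa0.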
It's not trivial to apply this to $!, because of the aforementioned
inconsistency. It's *possible* with some mucking about with SvUTF8,
but we'd never say that that kind of treatment of $! was a supported
interface.

> If you were to prototype it that way
>you could find out if there are glitches between the names each understands.

Yes, ish. The basic decoding can certainly be prototyped this way, and
so can the additional logic for places where nl_langinfo() is
unavailable or where we can detect that it gives bad data. But this
doesn't sound all that useful as an investigatory tool. The way to
find out how useful this logic is is to gather strerror()/nl_langinfo()
pairs from a wide range of OSes. In any case, as a porting task it's
not something that one person can do alone.

-zefram
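P.S. For anyone wanting to prototype the basic decoding outside perl
first, the shape of it is roughly this. Python rather than Perl only
so the sketch is self-contained; the function name and the Latin-1
fallback are my invention for illustration, not a proposed interface.
On a real system the codeset would come from nl_langinfo(CODESET)
(locale.nl_langinfo(locale.CODESET) in Python), which is where the
glitches between codeset names would surface:

```python
import codecs

def decode_locale_bytes(raw: bytes, codeset: str) -> str:
    # codeset would normally come from nl_langinfo(CODESET).
    # A LookupError here is exactly a "glitch between the names
    # each understands"; fall back to Latin-1, which maps every
    # byte value, so the decode can never fail or lose data.
    try:
        return raw.decode(codecs.lookup(codeset).name)
    except LookupError:
        return raw.decode("iso-8859-1")

# The same strerror()-style byte strings under different codesets
# (the byte values are illustrative, not from any real libc):
print(decode_locale_bytes(b"\xd0\x9e", "UTF-8"))         # О
print(decode_locale_bytes(b"\xe9", "ISO-8859-1"))        # é
print(decode_locale_bytes(b"\xe9", "bogus-codeset"))     # é, via fallback
```

Gathering real strerror()/nl_langinfo() pairs across OSes, as
suggested above, is the part this can't prototype.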