
Re: my_strerror() as API function

From: Karl Williamson
Date: August 16, 2017 05:36
Subject: Re: my_strerror() as API function
Message ID: 53b2e236-6139-9e03-81d3-f8acdf063453@khwilliamson.com
On 08/15/2017 10:11 PM, Karl Williamson wrote:
> I looked at the remaining 8859 code pages
> -2  only one vowel below C0
> -3  only one vowel below C0
> -4  only two vowels below C0
> -6  no letters below C0
> -7  7 letters below C0, all polytonic Greek, and I'm not qualified to 
> analyze this.
> -8 only punctuation below E0
> -9 only punctuation below C0
> -10 almost all characters C0 and above are vowels
> -11 I'm not qualified to analyze Thai, but I notice that of the code 
> points C0 and above, more than half are: 1) unassigned; 2) digits; 3) 
> must immediately follow another byte; whereas in UTF-8 they are start 
> bytes.
> -12 this code page was never finished
> -13 only three letters (2 of them vowels) below C0
> -14 almost all the letters C0 and above are vowels, so the text would 
> have to mostly be vc vcc vccc.  That's quite unlikely for more than a 
> couple of words in a row
> -15 only two vowels below C0
> -16 only three vowels below C0


I realized that my analysis is flawed for the code pages that are some 
variant of Latin.  With the Cyrillic script, you aren't going to be 
using any of the ASCII letters to fill out words, because they aren't in 
the same script.  But the Latin variants can have ASCII letters 
intermixed to make words, so the constraints aren't as severe as I 
indicated.  Take, for example, 8859-2, which has only one non-ASCII 
vowel below C0.  One could imagine a word that starts with a consonant 
at C0 or above, followed by that vowel, with the rest ASCII.  Such a 
word would be confusable with UTF-8, and if it were the only word 
containing non-ASCII in the text, the guess would be wrong.
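
To make that concrete, here's a quick sketch.  The two-letter "word" 
Ćą (bytes 0xC6 0xB1 in Latin-2) is a contrived example of that shape, 
not something from a real dictionary:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Encode qw(decode);

  binmode STDOUT, ':encoding(UTF-8)';

  # "Ćą" in ISO 8859-2: a consonant at 0xC6 followed by the one
  # non-ASCII vowel below C0, at 0xB1.
  my $word = "\xC6\xB1";

  printf "As Latin-2: %s\n", decode('iso-8859-2', $word);

  # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
  # $word untouched.  Surviving the eval means these legacy bytes
  # are also well-formed UTF-8.
  my $u = eval {
      decode('UTF-8', $word, Encode::FB_CROAK | Encode::LEAVE_SRC)
  };
  print defined $u
      ? sprintf("Also valid UTF-8, decoding to U+%04X\n", ord $u)
      : "Not valid UTF-8\n";

Those two Latin-2 bytes decode cleanly as the single character U+01B1, 
so a heuristic that asks only "is it legal UTF-8?" would guess wrong 
on a text whose sole non-ASCII word has that shape.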

An exercise one could do is take dictionaries in various languages in 
the appropriate code pages, filter out all the words that are just 
ASCII, and then check each remaining word to see if it is legal UTF-8.  
That would quantify how good the heuristic (which I suspect has been 
around since UTF-8 was added to perl) is.  This would be pretty easy 
to mostly automate if there were a source of dictionaries in UTF-8.  
It could be brute-forced by trying every encoding Encode knows about 
on every dictionary, and ignoring a case whenever Encode says it can't 
output that dictionary in that encoding.
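
Here's a rough sketch of how that could be automated.  The dicts/*.txt 
path and the one-word-per-line format are assumptions for illustration, 
and I've narrowed it to the 8859 pages; Encode->encodings(':all') would 
be the real brute force:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Encode qw(encode decode);

  # Word lists assumed to be in UTF-8, one word per line.
  my @dicts = glob 'dicts/*.txt';

  # There is no 8859-12, in Encode or anywhere else.
  my @encodings = map { "iso-8859-$_" } 1 .. 11, 13 .. 16;

  for my $dict (@dicts) {
      open my $fh, '<:encoding(UTF-8)', $dict or die "$dict: $!";
      my @words = grep { /\P{ASCII}/ } map { chomp; $_ } <$fh>;
      close $fh;

      for my $enc (@encodings) {
          my ($total, $confusable) = (0, 0);
          for my $word (@words) {
              # Ignore this case if Encode can't represent the
              # word in this encoding.
              my $octets = eval {
                  encode($enc, $word,
                         Encode::FB_CROAK | Encode::LEAVE_SRC)
              };
              next unless defined $octets;
              $total++;
              # Count the word if its legacy bytes also happen to
              # be well-formed UTF-8.
              $confusable++ if eval {
                  decode('UTF-8', $octets,
                         Encode::FB_CROAK | Encode::LEAVE_SRC)
              };
          }
          printf "%s / %s: %d of %d non-ASCII words look like UTF-8\n",
                 $dict, $enc, $confusable, $total
              if $total;
      }
  }

The ASCII-only words get filtered out up front because all-ASCII bytes 
are trivially valid UTF-8 and would swamp the counts without telling 
us anything about the heuristic.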
