On 08/15/2017 10:11 PM, Karl Williamson wrote:
> I looked at the remaining 8859 code pages
> -2 only one vowel below C0
> -3 only one vowel below C0
> -4 only two vowels below C0
> -6 no letters below C0
> -7 7 letters below C0, all polytonic Greek, and I'm not qualified to
> analyze this.
> -8 only punctuation below E0
> -9 only punctuation below C0
> -10 almost all characters C0 and above are vowels
> -11 I'm not qualified to analyze Thai, but I notice that of the code
> points C0 and above, more than half are: 1) unassigned; 2) digits; 3)
> must immediately follow another byte; whereas in UTF-8 they are start
> bytes.
> -12 this code page was never finished
> -13 only three letters (2 of them vowels) below C0
> -14 almost all the letters C0 and above are vowels, so the text would
> have to mostly be vc vcc vccc. That's quite unlikely for more than a
> couple of words in a row
> -15 only two vowels below C0
> -16 only three vowels below C0

I realized that my analysis is flawed for the code pages that are some variant of Latin. With the Cyrillic script, you aren't going to be using any of the ASCII letters to fill out words, because they aren't in the same script. But the Latin variants can have ASCII letters intermixed to make words, so the constraints aren't as severe as I indicated.

Take for example 8859-2, which has only one non-ASCII vowel below C0. One could imagine a word that starts with a consonant at C0 or above, then has that vowel, and the rest are ASCII. That would be confusable, and if it were the only word with non-ASCII in the text, the guess would be wrong.

An exercise one could do is take dictionaries in various languages in the appropriate code pages, filter out all the words that are just ASCII, and then check each remaining word to see if it is legal UTF-8. That would quantify how good the heuristic (which I suspect has been around since UTF-8 was added to Perl) is.
This would be pretty easy to mostly automate if there were a source of dictionaries in UTF-8. It could be brute forced, just trying every encoding Encode knows about on every dictionary, and then ignoring that case if Encode says it can't output that dictionary in that encoding.
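
To make the exercise concrete, here is a rough sketch in Python rather than Perl's Encode, purely for illustration. The function names (`is_valid_utf8`, `confusable_words`, `survey`) and the tiny word list are made up for this example. It also simplifies the proposal in one respect: a word that cannot be represented in a given code page is skipped individually, instead of discarding the whole dictionary for that encoding. Python's strict UTF-8 decoder stands in for "legal UTF-8" here and may not match what Perl's heuristic accepts byte for byte.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8 (Python's strict decoder)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def confusable_words(words, encoding):
    """Encode each non-ASCII word in `encoding` and test the bytes as UTF-8.

    Returns (tested, confusable): the words that contain non-ASCII
    characters and are representable in `encoding`, and the subset of
    those whose encoded bytes also happen to be well-formed UTF-8.
    """
    tested = []
    confusable = []
    for word in words:
        if word.isascii():
            continue  # pure-ASCII words say nothing about the heuristic
        try:
            raw = word.encode(encoding)
        except UnicodeEncodeError:
            continue  # simplification: skip words this code page cannot hold
        tested.append(word)
        if is_valid_utf8(raw):
            confusable.append(word)
    return tested, confusable


def survey(words, encodings):
    """Brute force: map each encoding to (confusable count, tested count)."""
    results = {}
    for enc in encodings:
        tested, confusable = confusable_words(words, enc)
        if tested:  # ignore encodings that could not represent any word
            results[enc] = (len(confusable), len(tested))
    return results


# Hypothetical mini "dictionary"; "Čąd" is a synthetic string, not a real word.
# In 8859-2 it is C8 B1 64: a byte at C0 or above, then one below C0, then ASCII.
sample = ["hello", "Łódź", "Čąd", "мир"]
print(survey(sample, ["iso8859_2", "iso8859_5"]))
```

With real word lists in place of `sample`, the ratio of confusable to tested words per encoding would give a first estimate of how often the "is it legal UTF-8?" guess goes wrong for single words.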