On Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:
>
> I don't see that danger marked currently in the pod for utf8.pm. Where
> do you see that?

http://perldoc.perl.org/perlunifaq.html#What-is-%22the-UTF8-flag%22?

======
Please, unless you're hacking the internals, or debugging weirdness,
don't think about the UTF8 flag at all. That means that you very
probably shouldn't use is_utf8, _utf8_on or _utf8_off at all.
======

> 2) As Victor notes, the commit does a UTF-8 validity check, so it is
> possible that that could give false positives. But as Wikipedia says,
> "One of the few cases where charset detection works reliably is
> detecting UTF-8. This is due to the large percentage of invalid byte
> sequences in UTF-8, so that text in any other encoding that uses bytes
> with the high bit set is extremely unlikely to pass a UTF-8 validity
> test." (The original emphasized "extremely".) I checked this out with
> the CP1251 character set, and the only modern Russian character that
> could be a continuation byte is ё. All other vowels and consonants must
> be start bytes. That means that to generate a false positive, an OS
> message in CP1251 must only contain words whose 2nd, 4th, ... bytes are
> that vowel. That just isn't going to happen, though the common Russian
> word Её (her, hers, ...) could be confusable if there were no other
> words in the message.
>

I agree that it's pretty reliable. However, different languages and
different encodings can show different misdetection rates. For example,
the rate for CP866 (probably an ancient encoding by now) is higher than
for CP1251. Also, the Russian alphabet does not contain the characters
A-Z, unlike German or French, so a French error message may contain just
a couple of non-ASCII characters, whereas a Russian one consists almost
entirely of them. I would not be surprised if this detection does not
introduce a single bug for any combination of encoding and language.
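
To make the CP866 versus CP1251 comparison concrete, here is a minimal
sketch in Python (chosen only for brevity). It assumes that a strict UTF-8
decode is a fair stand-in for the validity check in the commit, which may
differ in detail. The names looks_like_utf8, false_positives and WORDS are
made up for this illustration, and the word list is a small hand-picked
sample, not a measurement of real OS messages.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Strict UTF-8 validity test (assumed stand-in for the commit's check)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def false_positives(words, encoding):
    """Return the words whose bytes in `encoding` also pass as valid UTF-8."""
    return [w for w in words if looks_like_utf8(w.encode(encoding))]


# Hypothetical sample of short Russian words, chosen by hand.
WORDS = ["Её", "так", "там", "тип", "что", "это", "файл", "ошибка"]

cp1251_hits = false_positives(WORDS, "cp1251")
cp866_hits = false_positives(WORDS, "cp866")

print("CP1251:", cp1251_hits)
print("CP866: ", cp866_hits)
```

With this sample, only Её is misdetected in CP1251 (bytes C5 B8), while in
CP866 the words так, там and тип all pass, because CP866 puts р-я in the
range of 3-byte lead bytes (E0-EF) and а-п in the range of continuation
bytes (A0-AF). ASCII spaces and punctuation never affect validity, so a
message made only of such words would pass as a whole.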
However, I would not be surprised either if this detection turns out to be
broken for some language-encoding pair (perhaps for non-Western,
non-Cyrillic languages).

---
via perlbug: queue: perl5 status: open
https://rt.perl.org:443/rt3/Ticket/Display.html?id=119499