
[perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+

Victor Efimov via RT
August 29, 2013 21:07
On Thu Aug 29 13:05:00 2013, wrote:
> I don't see that danger marked currently in the pod for
> Where do you see that?
Please, unless you're hacking the internals, or debugging weirdness,
don't think about the UTF8 flag at all. That means that you very
probably shouldn't use is_utf8, _utf8_on or _utf8_off at all.

> 2) As Victor notes, the commit does a UTF-8 validity check, so it is
> possible that that could give false positives.  But as Wikipedia says,
> "One of the few cases where charset detection works reliably is
> detecting UTF-8. This is due to the large percentage of invalid byte
> sequences in UTF-8, so that text in any other encoding that uses bytes
> with the high bit set is extremely unlikely to pass a UTF-8 validity
> test."  (The original emphasized "extremely".)  I checked this out with
> the CP1251 character set, and the only modern Russian character that
> could be a continuation byte is ё.  All other vowels and consonants must
> be start bytes.  That means that to generate a false positive, an OS
> message in CP1251 must only contain words whose 2nd, 4th, ... bytes are
> that vowel.  That just isn't going to happen, though the common Russian
> word Её (her, hers, ...) could be confusable if there were no other
> words in the message.
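
The quoted claim can be checked at the byte level. The following is only a sketch in Python, not the code from the commit under discussion: `passes_utf8_check` is a hypothetical helper that stands in for a strict UTF-8 validity test, and the sample error message is an assumed example rather than one taken from this thread.

```python
def passes_utf8_check(raw: bytes) -> bool:
    """Return True if the bytes form well-formed UTF-8 (strict decoding)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The lone word from the quoted text, encoded as CP1251: bytes C5 B8.
her = "Её".encode("cp1251")

# An assumed example of a longer OS-style message in CP1251.
message = "Нет такого файла или каталога".encode("cp1251")

print(her.hex(), passes_utf8_check(her))   # the two bytes form a valid UTF-8 pair
print(passes_utf8_check(message))          # a lead byte is followed by another lead byte
```

Under these assumptions the single word is misdetected as UTF-8, while the longer message is rejected as soon as one CP1251 letter that acts as a UTF-8 start byte is followed by another one.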

I agree that it's pretty reliable. However, different languages and
different encodings can show different misdetection rates. For example,
the rate for CP866 (probably an ancient encoding by now) is higher than
for CP1251. Also, the Russian alphabet does not contain the A-Z
characters, unlike German or French, so a French error message may
contain just a couple of non-ASCII characters, unlike a Russian one.
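
To make the CP866 versus CP1251 comparison concrete, here is a rough sketch in Python (chosen only for illustration). It counts how many Russian letters are encoded as a byte in the UTF-8 continuation range 0x80-0xBF in each code page, then tests one short word. The helper names and the sample word are hypothetical and are not taken from the Perl source.

```python
# The 64 letters А..я plus Ё and ё.
RUSSIAN = "".join(chr(c) for c in range(0x410, 0x450)) + "Ёё"

def continuation_letters(encoding: str) -> str:
    """Letters whose single-byte code falls in the UTF-8 continuation range."""
    return "".join(
        ch for ch in RUSSIAN if 0x80 <= ch.encode(encoding)[0] <= 0xBF
    )

def passes_utf8_check(raw: bytes) -> bool:
    """Return True if the bytes form well-formed UTF-8 (strict decoding)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

cp1251_count = len(continuation_letters("cp1251"))
cp866_count = len(continuation_letters("cp866"))
print(cp1251_count, cp866_count)

word = "тип"  # assumed sample word: E2 A8 AF in CP866
print(passes_utf8_check(word.encode("cp866")))
print(passes_utf8_check(word.encode("cp1251")))
```

In CP866 all capital letters and the lowercase letters а-п sit in the continuation range, while р-я sit in the range of three-byte start bytes, so ordinary three-letter words can form a well-formed UTF-8 sequence. In CP1251 only Ё and ё can act as continuation bytes, which is why the misdetection rate should be lower there.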

I would not be surprised if this detection does *not* introduce a single
bug for any combination of encoding and language.

However, I would not be surprised either if this detection is broken for
some language-encoding pair (perhaps for non-Western, non-Cyrillic
languages).

via perlbug:  queue: perl5 status: open
