On 08/29/2013 02:15 AM, Victor Efimov via RT wrote:
>
> On Wed Aug 28 23:40:08 2013, sprout wrote:
>>
>> So now, $! may or may not be encoded, and you have no way of telling
>> reliably without doing the same environment checks that perl itself
>> did internally before deciding to decode $! itself.

I don't follow these arguments. What that commit did is only to look at
the string returned by the operating system, and if it is encoded in
UTF-8, to set that flag in the scalar. That's it (*). If the OS didn't
return UTF-8, it leaves the flag alone.

I find it hard to comprehend that this isn't the right thing to do. For
the first time, $! in string context is no different from any other
string scalar in Perl. Either the UTF-8 flag is set, which means the
encoding is UTF-8, or it isn't, which means the encoding is unknown to
Perl. This commit did not change the latter part one iota. We have
conventions as to what the bytes in that scalar mean depending on the
context in which it is used, the pragmas that are in effect in those
contexts, and the operations that are being performed on it. But they
are just conventions. This commit did not change that.

What is different about $! is that we have made the decision to respect
locale when accessing it, even when not in the scope of 'use locale'.
In light of these issues, perhaps this should be discussed again. I'll
let the people who argued for that decision argue for it again.

The change fixed two bug reports for the common case where the locales
for messages and I/O matched and where people had not taken pains to
deal with locale. I think that should trump the less frequent cases,
given the conflicts. If code wants $! to be expressed in a certain
language, it should set the locale to that language while accessing $!
and then restore the old locale (a sketch of this is at the end of this
message).

>
> Small corrections:
>
> a) Actually there is a way: check the is_utf8($!) flag (which is not
> good because is_utf8 is marked as dangerous, and it's documented that
> you can't distinguish characters from bytes with this flag)

I don't see that danger marked currently in the pod for utf8.pm. Where
do you see that?

>
> b) The current fix does not do environment checks; it just does a
> UTF-8 validity check
> http://perl5.git.perl.org/perl.git/commitdiff/1500bd919ffeae0f3252f8d1bb28b03b043d328e

(*) To be precise:

1) If the string returned by the OS is entirely ASCII, it does not set
the UTF-8 flag. This is because ASCII is encoded identically whether or
not the UTF-8 flag is on, so the flag is irrelevant. And yes, this is
buggy if operating under a non-ASCII 7-bit locale, as in ISO 646. These
locales have all been superseded, so should be rare today, but a bug
report could be written on this.

2) As Victor notes, the commit does a UTF-8 validity check, so it is
possible that this could give false positives. But as Wikipedia says,
"One of the few cases where charset detection works reliably is
detecting UTF-8. This is due to the large percentage of invalid byte
sequences in UTF-8, so that text in any other encoding that uses bytes
with the high bit set is extremely unlikely to pass a UTF-8 validity
test." (The original emphasized "extremely".) I checked this out with
the CP1251 character set, and the only modern Russian character that
could be a continuation byte is ё. All other vowels and consonants must
be start bytes. That means that to generate a false positive, an OS
message in CP1251 must contain only words whose 2nd, 4th, ... bytes are
that vowel. That just isn't going to happen, though the common Russian
word Её (her, hers, ...) could be confusable if there were no other
words in the message.
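To make the footnote concrete, here is a small self-contained sketch
(not anything the core itself does) that runs a UTF-8 validity check
against CP1251 bytes and also shows the is_utf8() flag check from
Victor's point (a). It uses only the core Encode module and
utf8::is_utf8(); the Russian phrase and the /no/such/file path are just
illustrative examples, and the result of the flag check will depend on
the platform's locale settings.

    use strict;
    use warnings;
    use Encode qw(encode decode);

    # An illustrative Russian phrase, "Нет такого файла", written with
    # \x{} escapes and then re-encoded into CP1251 bytes as a stand-in
    # for an OS message in a legacy locale.
    my $phrase = "\x{41d}\x{435}\x{442} "                        # Нет
               . "\x{442}\x{430}\x{43a}\x{43e}\x{433}\x{43e} "   # такого
               . "\x{444}\x{430}\x{439}\x{43b}\x{430}";          # файла
    my $cp1251 = encode('cp1251', $phrase);

    # decode() with FB_CROAK dies on malformed input, so surviving the
    # eval would mean the CP1251 bytes also happen to be well-formed
    # UTF-8 -- the false-positive case discussed above.
    my $passes = eval { decode('UTF-8', $cp1251, Encode::FB_CROAK); 1 };
    print $passes
        ? "CP1251 bytes passed the UTF-8 validity test (false positive)\n"
        : "CP1251 bytes are not valid UTF-8, as expected\n";

    # Victor's point (a): utf8::is_utf8() reports only whether the
    # internal UTF-8 flag is set on a scalar; here we stringify $! right
    # after a failing system call and look at the flag on the copy.
    open my $fh, '<', '/no/such/file';
    my $err = "$!";
    print utf8::is_utf8($err)
        ? "the \$! string has the UTF-8 flag set\n"
        : "the \$! string does not have the UTF-8 flag set\n";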
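And here is the sketch promised above for the "set the locale, read $!,
restore the old locale" suggestion. It assumes POSIX::setlocale() with
LC_MESSAGES works on the platform and that the requested locale is
actually installed (ru_RU.UTF-8 is just an example name); it is an
outline of the idea, not a description of what the core does.

    use strict;
    use warnings;
    use POSIX qw(setlocale LC_MESSAGES);

    # Return the text of the current errno in the requested locale,
    # leaving both errno and the program's locale as they were found.
    sub strerror_in {
        my ($locale) = @_;
        my $errnum = 0 + $!;                  # save errno numerically
        my $old = setlocale(LC_MESSAGES);     # remember the current locale
        setlocale(LC_MESSAGES, $locale)
            or warn "locale '$locale' is not available\n";
        local $! = $errnum;                   # re-set errno, then stringify
        my $msg = "$!";
        setlocale(LC_MESSAGES, $old);         # restore the old locale
        return $msg;
    }

    open my $fh, '<', '/no/such/file';        # a failing call that sets $!
    print strerror_in('ru_RU.UTF-8'), "\n";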