develooper Front page | perl.perl5.porters | Postings from August 2013

[perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+

Father Chrysostomos via RT
August 31, 2013 13:27
[perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
On Thu Aug 29 13:05:00 2013, wrote:
> On 08/29/2013 02:15 AM, Victor Efimov via RT wrote:
> >
> > On Wed Aug 28 23:40:08 2013, sprout wrote:
> >>
> >> So now, $! may or may not be encoded, and you have no way of
> >> telling reliably without doing the same environment checks that
> >> perl itself did internally before deciding to decode $! itself.
> I don't follow these arguments.  What that commit did is only to look
> at the string returned by the operating system, and if it is encoded
> in UTF-8, to set that flag in the scalar.  That's it (*).  If the OS
> didn't return UTF-8, it leaves the flag alone.  I find it hard to
> comprehend that this isn't the right thing to do.  For the first
> time, $! in string context is no different than any other string
> scalar in Perl.  They have a utf-8 bit set which means that the
> encoding is in UTF-8,

You are still describing this from the point of view of the internals.

From the user's point of view, the utf8 flag does not mean it is encoded
in utf8.  It means it is *de*coded; just a sequence of Unicode characters.

> or they don't have it set, which means that the encoding is unknown
> to Perl.

I.e., still encoded.

> This commit did not change the latter part one iota.

The former is the problem, not the latter.  If a program can find out
what encoding the OS is using for errno messages, it should be able to
apply that encoding to $! via decode($os_encoding, $!,
Encode::FB_CROAK).  But that fails now when perl thought it saw utf8.

> We have conventions as to what the bytes in that scalar mean depending
> on the context it is used, the pragmas that are in effect in those
> contexts, and the operations that are being performed on it.  But they
> are just conventions.  This commit did not change that.

I don’t follow.  The bytes inside the scalar are not visible to the Perl
program without resorting to introspection that should never be used for
this purpose.

Your commit changed the content of the scalar as returned by ord and
substr, but only sometimes.  It’s the ‘only sometimes’ that is problematic.

> What is different about $! is that we have made the decision to
> respect locale when accessing it even when not in the scope of
> 'use locale'.

The problem here is that the locale is only sometimes being respected.

> In light of these issues, perhaps this should be discussed again.
> I'll let the people who argued for that decision again argue for it.
>
> The change fixed two bug reports for the common case where the
> locales for messages and the I/O matched and where people had not
> taken pains to deal with locale.  I think that should trump the less
> frequent cases, given the conflicts.

But the less frequent cases now require one to introspect internal
scalar flags that should make no difference.

Also, is that really more frequent?  What about scripts that pass $!
straight to STDOUT without layers, knowing that $! is already in the
character set the terminal expects?

> If code wants $! to be expressed in a certain language, it should set
> the locale to that language while accessing $! and then restore the
> old
> locale.

Are you suggesting that perl itself start defaulting to the C locale for $!?

> > Small corrections:
> >
> > a) Actually there is a way: check the is_utf8($!) flag (which is
> > not good, because is_utf8 is marked as dangerous, and it's
> > documented that you can't distinguish characters from bytes with
> > this flag)
> I don't see that danger marked currently in the pod for
> Where do you see that?

    (Since Perl 5.8.1)  Test whether I<$string> is marked internally as
    encoded in UTF-8.  Functionally the same as Encode::is_utf8().

I think he is referring to ‘internally’ here, which indicates that you
shouldn’t rely on it.

> > b) Current fix does not do environment checks, it just tries to do
> > a UTF-8 validity check
>
> (*)  To be precise:
> 1) if the string returned by the OS is entirely ASCII, it does not
> set the UTF-8 flag.  This is because ASCII UTF-8 and non-UTF-8 are
> identical, so the flag is irrelevant.  And yes, this is buggy if
> operating under a non-ASCII 7-bit locale, as in ISO 646.  These
> locales have all been superseded, so should be rare today, but a bug
> report could be written on this.
> 2) As Victor notes, the commit does a UTF-8 validity check, so it is
> possible that that could give false positives.  But as Wikipedia
> says, "One of the few cases where charset detection works reliably
> is detecting UTF-8.  This is due to the large percentage of invalid
> byte sequences in UTF-8, so that text in any other encoding that
> uses bytes with the high bit set is extremely unlikely to pass a
> UTF-8 validity test."  (The original emphasized "extremely".)  I
> checked this out with the CP1251 character set, and the only modern
> Russian character that could be a continuation byte is ё.  All other
> vowels and consonants must be start bytes.  That means that to
> generate a false positive, an OS message in CP1251 must contain only
> words whose 2nd, 4th, ... bytes are that vowel.  That just isn't
> going to happen, though the common Russian word Её (her, hers, ...)
> could be confusable if there were no other words in the message.

That is all very nice, but how would you rewrite this code to work in
5.19.2 and up?

if (!open fh, $filename) {
   # add_to_log expects a string of characters, so decode it
   add_to_log($filename, 0+$!, Encode::decode($os_encoding, "$!", Encode::FB_CROAK));
}


Father Chrysostomos

via perlbug:  queue: perl5 status: open
