perl.perl5.porters | Postings from March 2007

the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From: Darren Duncan
Date: March 27, 2007 17:06
Subject: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID: p06240800c22f4661f58c@[192.168.1.100]

John Berthels said:
>  The documentation for the 'decode' function in Encode.pm states:
>
>           ...the utf8 flag for $string is on unless $octets entirely
>           consists of ASCII data...
>
>  but it appears that decode turns on the flag even if the input string is
>  plain ASCII. A test case demonstrating this is appended below.
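
John's appended test case is not reproduced in this message. A minimal sketch of such a check, which is an assumption about its general shape and not his original code, might look like this. The variable names are illustrative, and what it prints depends on the installed Encode version:

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Plain ASCII octets: a literal like this carries no utf8 flag.
my $octets = 'hello';

# Decode the octets as UTF-8 into a character string.
my $string = decode( 'UTF-8', $octets );

# Report the flag on both strings; no particular result is assumed here.
printf "before decode: %s\n", is_utf8($octets) ? 'flag on' : 'flag off';
printf "after decode:  %s\n", is_utf8($string) ? 'flag on' : 'flag off';
```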

I like to write modern programs that use Unicode end-to-end as far as
possible, or at least internally, which keeps everything simple and
compatible. It would therefore be easier for me if the documented
meaning of the utf8 flag were updated to officially match the new
behaviour.

I believe that a true utf8 flag should mean that the string contains 
data that is valid utf8, not just that it has utf8 characters outside 
the ASCII range.

As far as I know, the conceptual purpose of the utf8 flag is to 
indicate whether Perl considers a string to be unambiguous character 
data or binary data which could be ambiguous character data, and thus 
how Perl will treat it by default.

Suppose I have a library that wants to work internally with
unambiguous character data and, to keep things simple, requires user
code to remove any ambiguity by doing all decoding itself and passing
the library the result. It would then be simpler if the library's
input-checking code could just do this:

     use Carp qw(confess);
     use Encode ();

     sub expects_text {
         my ($v) = @_;
         confess q{Bad arg; it is undefined.}
             if !defined $v;
         # Accept only strings that carry the utf8 flag.
         confess q{Bad arg; Perl 5 does not consider it to be a char str.}
             if !Encode::is_utf8( $v );
         # $v is okay, so do whatever ...
     }

Instead, the older documented utf8 flag behaviour would require this 
unnecessary extra work in order to accept all valid input:

     use Carp qw(confess);
     use Encode ();

     sub expects_text {
         my ($v) = @_;
         confess q{Bad arg; it is undefined.}
             if !defined $v;
         # Reject only unflagged strings that contain non-ASCII bytes.
         confess q{Bad arg; Perl 5 does not consider it to be a char str.}
             if !Encode::is_utf8( $v ) and $v =~ m/[^\x00-\x7F]/xs;
         # $v is okay, so do whatever ...
     }
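
To make the difference between the two checks concrete, here is a small sketch that applies both to four kinds of input. The predicate names and sample strings are hypothetical and exist only for illustration; utf8::upgrade is used to force the flag on without changing the characters:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical predicate: the simple check, flag only.
sub accepts_flag_only {
    my ($v) = @_;
    return 0 if !defined $v;
    return Encode::is_utf8($v) ? 1 : 0;
}

# Hypothetical predicate: flag set, or nothing outside the ASCII range.
sub accepts_flag_or_ascii {
    my ($v) = @_;
    return 0 if !defined $v;
    return ( Encode::is_utf8($v) || $v !~ m/[^\x00-\x7F]/ ) ? 1 : 0;
}

my $ascii_off = 'hello';      # ASCII, flag off
my $ascii_on  = 'hello';
utf8::upgrade($ascii_on);     # same characters, flag forced on
my $bytes     = "caf\xE9";    # high byte, flag off: ambiguous
my $chars     = "caf\xE9";
utf8::upgrade($chars);        # same characters, flag on

for my $case ( [ ascii_off => $ascii_off ], [ ascii_on => $ascii_on ],
               [ bytes     => $bytes ],     [ chars    => $chars ] ) {
    my ( $name, $v ) = @$case;
    printf "%-10s flag-only=%d flag-or-ascii=%d\n",
        $name, accepts_flag_only($v), accepts_flag_or_ascii($v);
}
```

Only the unflagged ASCII string is treated differently by the two checks: the flag-only version rejects it, while the longer version accepts it.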

I would expect the regular expression, which would run for every
all-ASCII string, to be considerably slower than just checking the
flag, especially since decode() must already have verified that the
data is valid Unicode before it could set the flag in the first
place.
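
That expectation could be measured with the core Benchmark module. This is only a sketch: the 1,000-character test string and the labels are arbitrary choices, and the relative timings will vary with string length and Perl version, so no result is claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Encode ();

# Unflagged ASCII string: the case in which the regex actually runs.
my $ascii = 'a' x 1_000;

# Run each variant for at least one CPU second and print a comparison.
cmpthese( -1, {
    flag_only  => sub {
        my $ok = Encode::is_utf8($ascii);
    },
    flag_regex => sub {
        my $ok = Encode::is_utf8($ascii) || $ascii !~ m/[^\x00-\x7F]/;
    },
} );
```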

Now, if there is some concern that character-oriented regexes and
such are considerably slower for ASCII data than the alternatives, and
this is a problem that can't otherwise be dealt with, we could
perhaps have an additional flag with the meaning that I ascribed
to utf8, e.g. is_chars() or is_text(). But in my mind it
would be simpler to just leave the meaning of is_utf8 adjusted to
mean "is unambiguous character data".

Thank you. -- Darren Duncan

P.S.  On a tangent, it would be nice if there were a simple test to
see whether an SV currently considers its numeric, integer, or string
etc. component to be the authoritative one, so that e.g. I could just
check that rather than using looks_like_number or some such more
complicated solution.  Though maybe there already is one, perhaps in a
bundled debugging or some such module, and I haven't found it yet?
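
One possible route, offered as a sketch and not a definitive answer, is the core B module, which exposes an SV's flags. The helper name sv_slots is hypothetical. Note that the flags show which slots currently hold a valid value, which is not quite the same as "authoritative": using a string in numeric context can cache a numeric value alongside the string.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: list which value slots of a scalar are valid.
sub sv_slots {
    # Use the alias in @_ so the caller's own SV is inspected, not a copy.
    my $flags = B::svref_2object( \$_[0] )->FLAGS;
    my @slots;
    push @slots, 'integer' if $flags & B::SVf_IOK();
    push @slots, 'number'  if $flags & B::SVf_NOK();
    push @slots, 'string'  if $flags & B::SVf_POK();
    return join ',', @slots;
}

my $n = 42;      # assigned from an integer literal
my $f = 1.5;     # assigned from a floating-point literal
my $s = '42';    # assigned from a string literal

print "$_\n" for sv_slots($n), sv_slots($f), sv_slots($s);
```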
