John Berthels said:
> The documentation for the 'decode' function in Encode.pm states:
>
>   ...the utf8 flag for $string is on unless $octets entirely
>   consists of ASCII data...
>
> but it appears that decode turns on the flag even if the input string is
> plain ASCII. A test case demonstrating this is appended below.

I like to write modern programs that use Unicode end-to-end as far as
possible, or at least internally, which keeps everything simple and
compatible. It would therefore be easier for me if the documented meaning
of the utf8 flag were updated to officially match the new behaviour.

I believe that a true utf8 flag should mean that the string contains data
that is valid utf8, not just that it has utf8 characters outside the ASCII
range.

As far as I know, the conceptual purpose of the utf8 flag is to indicate
whether Perl considers a string to be unambiguous character data, or binary
data which could be ambiguous character data, and thus how Perl will treat
it by default.

Suppose I have a library that wants to work internally with unambiguous
character data and, to keep things simple, requires the user code to remove
any ambiguity by doing any decoding itself and passing the library the
result. Then it would be simpler if the input checking code of the library
could just do this:

    use Carp qw(confess);
    use Encode ();

    sub expects_text {
        my ($v) = @_;

        confess q{Bad arg; it is undefined.}
            if !defined $v;
        confess q{Bad arg; Perl 5 does not consider it to be a char str.}
            if !Encode::is_utf8( $v );

        # $v is okay, so do whatever
        ...
    }

Instead, the older documented utf8 flag behaviour would require this
unnecessary extra work in order to accept all valid input:

    sub expects_text {
        my ($v) = @_;

        confess q{Bad arg; it is undefined.}
            if !defined $v;
        confess q{Bad arg; Perl 5 does not consider it to be a char str.}
            if !Encode::is_utf8( $v ) and $v =~ m/[^\x00-\x7F]/xs;

        # $v is okay, so do whatever
        ...
    }

I would expect the regular expression, which would be run for any ASCII
data, to be considerably slower than just checking the flag, especially
since the data must already be known to be valid Unicode characters for
decode() to have set the flag in the first place.

Now, if there is some concern that character-oriented regexes and such are
considerably slower for ASCII data than the alternatives, and this is a
problem that can't otherwise be dealt with, we could perhaps have an
additional flag with the meaning that I ascribed to utf8, e.g. is_chars()
or is_text(). But in my mind it would be simpler to just leave the meaning
of is_utf8 adjusted to mean "is unambiguous character data".

Thank you.

-- Darren Duncan

P.S. On a tangent, it would be nice if there were a simple test to see
whether an SV currently considers its numeric, integer, or string component
to be the authoritative one, so that I could just check that rather than
using looks_like_number or some other more complicated solution. Though
maybe there already is one, perhaps in a bundled debugging module or
similar, and I haven't found it yet?
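
For reference, the two input checks compared in the body of this message can be tried side by side with a minimal sketch like the one below. The helper names `is_text_by_flag` and `is_text_by_flag_or_ascii` are hypothetical, and they return a boolean rather than calling confess so that the difference is easy to observe. The flag is set here with utf8::upgrade rather than with decode(), so the sketch makes no assumption about what any particular Encode version does with ASCII input.

```perl
use strict;
use warnings;

# Hypothetical helper: trust the utf8 flag alone
# (the simpler check the post would like to be sufficient).
sub is_text_by_flag {
    my ($v) = @_;
    return ( defined($v) && utf8::is_utf8($v) ) ? 1 : 0;
}

# Hypothetical helper: accept flagged strings, and also unflagged
# strings that contain only ASCII (the check the older documented
# behaviour would require).
sub is_text_by_flag_or_ascii {
    my ($v) = @_;
    return 0 if !defined $v;
    return 1 if utf8::is_utf8($v);
    return ( $v !~ m/[^\x00-\x7F]/ ) ? 1 : 0;
}

my $flagged_ascii = 'abc';
utf8::upgrade($flagged_ascii);    # ASCII content, utf8 flag on

my $plain_ascii = 'abc';          # ASCII content, utf8 flag off
my $high_bytes  = "\xC3\xA9";     # two non-ASCII bytes, utf8 flag off

for my $case (
    [ 'flagged ASCII'   => $flagged_ascii ],
    [ 'unflagged ASCII' => $plain_ascii ],
    [ 'high bytes'      => $high_bytes ],
) {
    my ( $label, $v ) = @$case;
    printf "%-16s flag-only=%d flag-or-ascii=%d\n",
        $label, is_text_by_flag($v), is_text_by_flag_or_ascii($v);
}
```

The two checks disagree only on unflagged ASCII strings, which is exactly the case where it matters whether decode() turns the flag on for plain ASCII input.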