Darren Duncan wrote 2007-03-27 15:52 (-0700):
> I believe that a true utf8 flag should mean that the string contains
> data that is valid utf8, not just that it has utf8 characters outside
> the ASCII range.

How often should Perl check for this? Directly after decoding only, or also after mutating operations like substr or s///?

> As far as I know, the conceptual purpose of the utf8 flag is to
> indicate whether Perl considers a string to be unambiguous character
> data or binary data which could be ambiguous character data, and thus
> how Perl will treat it by default.

The UTF8 flag has no *conceptual* purpose. Conceptually, every string can be a unicode string, and you're not supposed to look at, know, or set the UTF8 flag yourself. It's an internal bit, like IOK and NOK. [1]

> confess q{Bad arg; Perl 5 does not consider it to be a char str.}
>     if !Encode::is_utf8( $v );

As said, this is not the purpose of the flag, and you're not supposed to use is_utf8 for this. It is documented with the "[INTERNAL]" flag, for a good reason.

Perl conceptually has a single numeric type, and a single string type. The distinction between integer and float, and between iso-8859-1 and utf-8, is internal. This could be changed, but that would introduce incompatibilities and a severe loss of performance for strings that fit in iso-8859-1.

What I want (and I think you want too) is a real type system, with two distinct types: byte strings and character strings. It would be bad to use a flag called "UTF8" for this, because a byte string can also be UTF8 encoded. Perl already suffers from this problem, but because the UTF8 flag is *INTERNAL*, it's not a big deal. It would be if it surfaced and was used by Perl coders.

A whole type system is a bit too much to implement in Perl 5, I think. Our current unicode string semantics are a great way to deal with not having types, in my opinion.
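To illustrate the point that the UTF8 flag is an internal detail, here is a minimal sketch. It assumes a stock Perl 5.8 or later with no "use bytes" or locale pragmas in effect; the variable names $narrow and $wide are made up for the example. It stores the same one-character string in both internal representations and shows that ordinary string operations cannot tell them apart.

```perl
use strict;
use warnings;

# One character, U+00E9 (e with acute), built with chr() so that the
# example does not depend on the encoding of this source file.
my $narrow = chr(0xE9);   # normally stored internally as one octet
my $wide   = chr(0xE9);
utf8::upgrade($wide);     # same text, internal storage switched to UTF-8

# The internal flag differs between the two copies ...
printf "narrow flag: %d, wide flag: %d\n",
    utf8::is_utf8($narrow) ? 1 : 0,
    utf8::is_utf8($wide)   ? 1 : 0;

# ... but as text they are the same string.
print "equal\n"       if $narrow eq $wide;
print "same length\n" if length($narrow) == length($wide);
print "same ord\n"    if ord($narrow) == ord($wide);
```

Peeking at the flag with utf8::is_utf8 is done here only for demonstration; code that treats its argument as text should behave identically for both variables.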
> Instead, the older documented utf8 flag behaviour would require this
> unnecessary extra work in order to accept all valid input:

No. If your subroutine expects text, it can only assume that it gets text, and it should not (must not?) make any distinction based on the internal encoding. The string it gets is a Unicode string. Not a UTF8 string, not a latin1 string.

> if !Encode::is_utf8( $v ) and $v =~ m/[^\x00-\x7F]/xs;

This check is wrong. If the flag is not set, that means only that the internal encoding is iso-8859-1 if the string is a text string, not that the string is a byte string.

The reverse is true, however: if the flag is set, the string will not be a byte string. But lack of the UTF8 flag is no indication of byte versus character.

> I would expect the use of the regular expression, which would be
> called for any ASCII data

Note that Perl internally uses iso-8859-1 (8 bit) and utf-8 (variable whole-octet), not ascii (7 bit). The character é (eacute) may be stored internally as the single octet 233 (decimal) and does not by itself cause an internal upgrade to UTF-8.

[1] Some parts of Perl break this concept. The regex engine is one of them, and has different semantics depending on the presence of the flag. This is a bug, but any fix would be incompatible.

--
Kind regards,

juerd waalboer: perl hacker <juerd@juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <sales@convolution.nl>

I don't trust voting computers. See <http://www.wijvertrouwenstemcomputersniet.nl/>.