Juerd Waalboer
March 28, 2007 02:13
Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Darren Duncan wrote 2007-03-27 15:52 (-0700):
> I believe that a true utf8 flag should mean that the string contains 
> data that is valid utf8, not just that it has utf8 characters outside 
> the ASCII range.

How often should Perl check for this? Directly after decoding only, or
also after mutating operations like substr, or s///?

> As far as I know, the conceptual purpose of the utf8 flag is to 
> indicate whether Perl considers a string to be unambiguous character 
> data or binary data which could be ambiguous character data, and thus 
> how Perl will treat it by default.

The UTF8 flag has no *conceptual* purpose. Conceptually,
every string can be a unicode string, and you're not supposed to look
at, know, or set the UTF8 flag yourself. It's an internal bit, like IOK
and NOK. [1]

>         confess q{Bad arg; Perl 5 does not consider it to be a char str.}
>             if !Encode::is_utf8( $v );

As said, this is not the purpose of the flag, and you're not supposed to
use is_utf8 for this. It is documented with the "[INTERNAL]" flag, for a
good reason.

Perl conceptually has a single numeric type, and a single string type.
The distinction between integer and float, and between iso-8859-1 and
utf-8, is internal.
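
A minimal sketch of what "internal" means here, assuming a stock perl
and nothing beyond the built-in utf8:: functions. The variable names are
mine, for illustration only. The same text is stored in both internal
encodings, and Perl-level operations cannot tell the two apart:

```perl
use strict;
use warnings;

# Same text, two internal representations.
my $native   = "caf\xE9";     # stored as iso-8859-1 octets, flag off
my $upgraded = $native;
utf8::upgrade($upgraded);     # same characters, now stored as UTF-8, flag on

my $same_text   = $native eq $upgraded ? 1 : 0;
my $same_length = length($native) == length($upgraded) ? 1 : 0;
my @flags       = map { utf8::is_utf8($_) ? 1 : 0 } $native, $upgraded;

print "equal: $same_text, same length: $same_length, flags: @flags\n";
# prints "equal: 1, same length: 1, flags: 0 1"
```

Only the peek at the flag differs, which is the one thing you are not
supposed to look at.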

This could be changed, but that would introduce incompatibilities and a
severe loss of performance for strings that fit in iso-8859-1.

What I want (and I think you want too) is a real type system, with two
distinct types: byte strings and character strings. It
would be bad to use a flag called "UTF8" for this, because a byte string
can also be UTF8 encoded. Perl already suffers from this problem, but
because the UTF8 flag is *INTERNAL*, it's not a big deal. It would be if
it surfaced and was used by Perl coders.
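
To illustrate why "UTF8" would be a bad name for such a type, here is a
small sketch using Encode::encode (the variable names are only for
illustration). The result is a byte string whose contents are UTF-8
encoded, yet it does not carry the UTF8 flag:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text  = "caf\xE9";                # 4 characters of text
my $bytes = encode('UTF-8', $text);   # 5 octets: the UTF-8 encoding of that text

my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;

printf "text length %d, byte length %d, flag on bytes: %d\n",
    length($text), length($bytes), $bytes_flag;
# prints "text length 4, byte length 5, flag on bytes: 0"
```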

A whole type system is a bit too much to implement in Perl 5, I think.
Our current unicode string semantics are a great way to deal with not
having types, in my opinion.

> Instead, the older documented utf8 flag behaviour would require this 
> unnecessary extra work in order to accept all valid input:


If your subroutine expects text, it can only assume that it gets text,
and it should not (must not?) make any distinction based on the internal
flag.

The string it gets is a Unicode string. Not a UTF8 string, not a latin1
string.

>             if !Encode::is_utf8( $v ) and $v =~ m/[^\x00-\x7F]/xs;

This check is wrong. If the flag is not set, that means only that the
internal encoding is iso-8859-1 if the string is a text string, not that
the string is a byte string.
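
A sketch of how that goes wrong in practice. The helper name
rejected_by_check is hypothetical; it only wraps the quoted condition so
that it can be applied to two copies of the same text:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper reproducing the quoted condition.
sub rejected_by_check {
    my ($v) = @_;
    return ( !Encode::is_utf8($v) && $v =~ m/[^\x00-\x7F]/xs ) ? 1 : 0;
}

my $text = "caf\xE9";     # valid text, iso-8859-1 internally
my $same = $text;
utf8::upgrade($same);     # the same text, UTF-8 internally

my @verdicts = ( rejected_by_check($text), rejected_by_check($same) );
print "verdicts: @verdicts\n";
# prints "verdicts: 1 0"
```

Two strings that compare equal get different verdicts, purely because
of how they happen to be stored.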

The reverse is true, however: if the flag is set, the string will not be
a byte string. But lack of UTF8 flag is no indication of byte versus
character string.

> I would expect the use of the regular expression, which would be 
> called for any ASCII data

Note that Perl internally uses iso-8859-1 (8 bit) and utf-8 (variable
whole-octet), not ascii (7 bit).

The character é (eacute) may be stored internally as the single octet
233 (decimal) and does not by itself cause an internal upgrade to UTF-8.
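
This can be observed with the bytes module, loaded here without enabling
the pragma so that only bytes::length is used. It is a sketch for
inspection only, not something to put in production code:

```perl
use strict;
use warnings;
use bytes ();    # makes bytes::length available without enabling the pragma

my $e = "\xE9";  # é
my @before = ( ord($e), bytes::length($e), utf8::is_utf8($e) ? 1 : 0 );

my $mixed = $e . "\x{263A}";    # append a character above 255
my @after = ( length($mixed), utf8::is_utf8($mixed) ? 1 : 0 );

print "before: @before\n";    # prints "before: 233 1 0"
print "after: @after\n";      # prints "after: 2 1"
```

Only once a character that does not fit in eight bits joins the string
does Perl switch the internal encoding to UTF-8.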

[1] Some parts of Perl break this concept. The regex engine is one of
them, and has different semantics depending on the presence of the flag.
This is a bug, but any fix would be incompatible.
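
A small sketch of that regex behaviour, assuming perl's default matching
rules (no locale, and none of the opt-in Unicode string semantics that
later perls offer):

```perl
use strict;
use warnings;

my $native   = "\xE9";        # é, iso-8859-1 internally
my $upgraded = $native;
utf8::upgrade($upgraded);     # é, UTF-8 internally

my $equal      = $native eq $upgraded ? 1 : 0;
my $w_native   = $native   =~ /\w/ ? 1 : 0;
my $w_upgraded = $upgraded =~ /\w/ ? 1 : 0;

print "equal: $equal, \\w native: $w_native, \\w upgraded: $w_upgraded\n";
```

Under those assumptions I would expect the two strings to compare equal
while only the upgraded one matches \w.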

Kind regards,

  juerd waalboer:  perl hacker
  convolution:     ict solutions and consultancy


