perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

Juerd Waalboer
March 30, 2007 12:54
Marc Lehmann wrote 2007-03-30 14:02 (+0200):
> > The *conceptual* purpose of the UTF8 flag isn't there. Conceptually,
> > every string can be a unicode string, and you're not supposed to look
> > at, know, or set the UTF8 flag yourself. It's an internal bit, like IOK
> > and NOK. [1]
> That's not how current perl works.

We must have differing definitions, somewhere.

> > Perl conceptually has a single numeric type, and a single string type.
> > The distinction between integer and float, and between iso-8859-1 and
> > utf-8, is internal.
> I would love if that were the case, but the powers that be decided that every
> perl programmer has to know those internals, and needs to be able to deal
> with them.

The best approach to programming with unicode in mind, in Perl, is to
(pretend to) be completely ignorant about Perl's internals with regards
to encoding and the UTF8 flag.

The only exception is the regex engine, which has a big bug. This can be
worked around, again without any knowledge of the internals, by
utf8::upgrade'ing both sides of the regex before trying the match.
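
A minimal sketch of that workaround, assuming a perl where \w treats a
non-upgraded "\xE9" differently from an upgraded one. The variable names
are made up for illustration; utf8::upgrade itself is a core function.

```perl
use strict;
use warnings;

# A text string that happens to be stored as 8-bit internally.
my $text    = "caf\xE9";    # "café", no character above 0xFF
my $pattern = '\w+';

# Upgrade both sides so the regex engine applies Unicode semantics
# regardless of how the strings were stored before.
utf8::upgrade($text);
utf8::upgrade($pattern);

my $matched = $text =~ /\A$pattern\z/ ? 1 : 0;
print "matched: $matched\n";    # prints "matched: 1"
```

utf8::upgrade does not change which characters the string contains, only
how they are stored, so the rest of the program needs no knowledge of it.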

Your powers-that-be might be different. Also, don't confuse "you can
know what Perl does internally" with "you have to know what Perl does
internally".
Just being able to access internal metadata doesn't mean you should
actually do so on a daily basis. 

It's entirely possible to make undef writable, and have it equal 42.
No-one is complaining about that, and only very few people ever get the
idea of changing the value of undef.

It's also entirely possible to set the internal flag "UTF8" on an
existing string. But for some reason a lot of people are complaining
about that, and even more people have actually set UTF8 flags.

> > Note that Perl internally uses iso-8859-1 (8 bit) and utf-8 (variable
> > whole-octet), not ascii (7 bit).
> No, Perl exposes this. For example, see the recent example of Compress::Zlib:
>         unpack ('CCCCVCC', $$string);
> that code is broken because the powers that be decided that "C" exposes the
> internal encoding, while "V" doesn't.

Yes, any byte-specific operation on a text string (which I keep separate
from byte strings) will use the internal encoding. It has to use
/some/ encoding, because it cannot see whether the string was meant as a
byte string or a text string. Perl does not have strong typing.

Personally, I think that unpack with a byte-specific signature should
die, or at least warn, when its operand has the UTF8 flag set. That'll
catch at least some of the cases, because the UTF8 flag always
positively indicates that the string is a text string. (The reverse,
however, is not true: a string without the UTF8 flag might be either a
text string or a byte string.)
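
Until unpack does something like that itself, the check can be sketched
in user code. unpack_bytes below is a hypothetical helper, not an
existing function; utf8::is_utf8 and Carp's croak are real. It only
catches the cases where the flag is set, for the reason given above.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical wrapper: refuse byte-specific unpacking of a string that
# is positively known to be a text string.
sub unpack_bytes {
    my ($template, $data) = @_;
    croak "unpack_bytes: operand has the UTF8 flag set, so it is a text string"
        if utf8::is_utf8($data);
    return unpack $template, $data;
}

my @fields = unpack_bytes('CC', "\x01\xFF");    # byte string: fine
print "@fields\n";                              # prints "1 255"

my $ok = eval { unpack_bytes('C', "\x{672c}"); 1 };
print "text string rejected\n" unless $ok;
```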

> That requires every perl programmer who decodes file headers etc.
> using unpack to know about those internals.

No, it requires every Perl programmer to keep track of the function of
every string.

Byte strings and text strings must never be combined, and text strings
must never undergo byte-specific operations.

This again requires no knowledge of the actual encoding that Perl uses
internally, whatsoever.
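
A sketch of what keeping track looks like in practice, assuming the
input arrives as UTF-8 encoded bytes (the sample data and variable names
are invented for illustration). Bytes are decoded once on the way in,
all text operations happen on the text string, and the result is encoded
once on the way out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $input_bytes = "na\xC3\xAFve";                  # byte string: 6 octets
my $text        = decode('UTF-8', $input_bytes);   # text string: 5 characters

my $upper = uc $text;                              # character operation on text

my $output_bytes = encode('UTF-8', $upper);        # byte string again

printf "%d octets in, %d characters, %d octets out\n",
    length $input_bytes, length $text, length $output_bytes;
```

Nothing here inspects the UTF8 flag or depends on how Perl stores either
string.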

> The same is true for many XS modules: in older versions of perl, SvPV gave
> you the 8-bit version of a scalar, but in current versions, it randomly
> gives you either 8-bit or utf-8 encoded. SvPV was renamed to SvPVbyte.

Unfortunately, I lack knowledge of these internals, so I cannot comment
about this (yet).

Note that XS writers must have knowledge of Perl's internals. This has
always been true, and is not specific to this fancy new Unicode thing.

> And the problem is that those bugs are not considered bugs but features.

Some bugs are acknowledged as bugs, but won't be fixed anyway, because
there is already a lot of code in the wild that depends on the bugs.

> > [1] Some parts of Perl break this concept. The regex engine is one of
> > them, and has different semantics depending on the presence of the flag.
> > This is a bug, but any fix would be incompatible.
> In fact, some parts of perl break this concept and make perfectly working
> code (in 5.005) not working anymore, or working randomly, and that's not
> considered a bug.

Personally I'm only interested in 5.8.2 and later, but I still would
like to learn about this history.

>    unpack "C", $s;

The C template for unpack is specifically documented as byte-specific.
It should never be used on text strings. If you properly keep text and
byte strings separate, that means that your byte string was never
upgraded, and that unpacking with "C" is reliable and predictable.

If upgrading happened even though the string was not mixed with text
strings or used with unicode semantics, that is a bug. I'm very
interested in these silent upgrades that you are experiencing.

> If you think it is obvious, how about this:
>    my $s = chr 255; # to me, this is one octet. to perl, it might be one or
>                     # two, or maybe more, who knows.
>    warn unpack "C", $s;
>    "$s\x{672c}";
>    warn unpack "C", $s;
>    $s .= "\x{672c}"; substr $s, 1, 1, "";
>    warn unpack "C", $s;
> Can a pure-Perl programmer tell what the output of this program is without
> trying it? 

Not relevant.

> Should he be able to? 

No, because the author of this program made a big mistake in the line

    "$s\x{672c}";

The casual reader can easily figure out that $s was meant as a byte
string: it is used with unpack "C", which is known to be a byte
operation. Because it is a byte string, the chr 255 is just a 0xFF
octet, not a ÿ conceptually.

The casual reader can also easily figure out that \x{672c} is meant as a
text string: any codepoint higher than \x{FF} is always a character,
never a single byte.

Then, the author of this snippet uses both the byte string $s and the
text string "\x{672c}" joined in one string "$s\x{672c}". People not
interested in fixing the code can stop reading there: the code is broken
and its semantics not terribly relevant. People who wish to fix it, will
have to try and figure out what the author really wanted to do here.

Because it's a contrived case, that's very hard to figure out. But I'm
sure that given real world values and variable names, there would be a
clear and logical solution, to be found somewhere along the lines of
encoding and decoding explicitly.
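
For illustration only, here is one possible repair. It assumes the
author wanted a byte string containing the 0xFF octet followed by the
character U+672C encoded as UTF-8; that intent is a guess, and the
variable names other than $s are invented.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s           = chr 255;                 # byte string: one 0xFF octet
my $suffix_text = "\x{672c}";              # text string: one character

# Encode the text explicitly before it touches the byte string.
my $suffix_bytes = encode('UTF-8', $suffix_text);
my $record       = $s . $suffix_bytes;     # bytes joined with bytes

my @octets = unpack 'C*', $record;
print "@octets\n";                         # prints "255 230 156 172"
```

Because $s is only ever joined with other byte strings, unpack "C" on it
stays predictable.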

> That's a broken unicode model

So far, I've only seen a broken understanding of the unicode model, and
a broken regex engine.

Kind regards,

  juerd waalboer:  perl hacker
  convolution:     ict solutions and consultancy

I don't trust voting computers.
