Newsgroup: perl.perl5.porters

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From: Marc Lehmann
Date: March 30, 2007 05:02
Subject: Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID: 20070330120232.GA22799@schmorp.de

On Wed, Mar 28, 2007 at 11:12:15AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:
> > As far as I know, the conceptual purpose of the utf8 flag is to 
> > indicate whether Perl considers a string to be unambiguous character 
> > data or binary data which could be ambiguous character data, and thus 
> > how Perl will treat it by default.
> 
> The *conceptual* purpose of the UTF8 flag isn't there. Conceptually,
> every string can be a unicode string, and you're not supposed to look
> at, know, or set the UTF8 flag yourself. It's an internal bit, like IOK
> and NOK. [1]

That's not how current perl works.

> Perl conceptually has a single numeric type, and a single string type.
> The distinction between integer and float, and between iso-8859-1 and
> utf-8, is internal.

I would love it if that were the case, but the powers that be decided that
every perl programmer has to know those internals, and needs to be able to
deal with them.

> Note that Perl internally uses iso-8859-1 (8 bit) and utf-8 (variable
> whole-octet), not ascii (7 bit).

No, Perl exposes this. For example, see the recent example of Compress::Zlib:

        unpack ('CCCCVCC', $$string);

That code is broken because the powers that be decided that "C" exposes the
internal encoding, while "V" doesn't. That requires every perl programmer
who decodes file headers etc. using unpack to know about those internals.

This is especially bad as not only has the meaning of "C" been shifted from
decoding bytes to something else (instead of using a new modifier), but no
alternative has been provided to get the old meaning of "C", so basically all
code that doesn't utf8::downgrade is broken now by this change in meaning.
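
A minimal sketch of the utf8::downgrade workaround mentioned above, not
anything taken from Compress::Zlib itself. The helper name parse_header, the
field names and the sample header bytes are all hypothetical; only the
'CCCCVCC' template comes from the example.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a 10-byte header with the template quoted above.
sub parse_header {
    my ($buf) = @_;          # work on a copy so the caller's scalar is untouched

    # Force the 8-bit internal representation. This dies if $buf contains
    # characters above 255, i.e. if it cannot be a byte string at all.
    utf8::downgrade($buf);

    my ($id1, $id2, $method, $flags, $mtime, $xfl, $os)
        = unpack 'CCCCVCC', $buf;

    return {
        id1 => $id1, id2   => $id2,   method => $method,
        xfl => $xfl, flags => $flags, mtime  => $mtime, os => $os,
    };
}

# Sample data (assumed, for illustration): four bytes, a 32-bit value, two bytes.
my $header = "\x1f\x8b\x08\x00" . pack('V', 1175256152) . "\x00\x03";
utf8::upgrade($header);      # simulate a scalar that got upgraded elsewhere

my $h = parse_header($header);
printf "%d %d %d\n", $h->{id1}, $h->{id2}, $h->{mtime};
```

With the downgrade in place, the result does not depend on whether the caller's
scalar happened to be upgraded before the call.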

(Worse is the fact that it's wrongly documented to decode an octet even in
the presence of Unicode, but it doesn't decode an octet, unless you define
"octet" in Perl to mean that "\xa0" is either one or two octets.)

The same is true for many XS modules: in older versions of perl, SvPV gave
you the 8-bit version of a scalar, but in current versions, it randomly
gives you either 8-bit or utf-8 encoded data. What SvPV used to do is now
called SvPVbyte.
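
The two buffers an XS function might be handed can be made visible from Perl.
This is only a Perl-level sketch of the idea (it calls no XS code), and the
variable names are made up for illustration:

```perl
use strict;
use warnings;

my $narrow = chr 255;        # stored internally as one byte (0xFF)
my $wide   = chr 255;
utf8::upgrade($wide);        # same character, now stored as two bytes of UTF-8

# At the Perl level the two scalars compare equal.
my $same_string = $narrow eq $wide ? 1 : 0;

my ($narrow_bytes, $wide_bytes);
{
    use bytes;               # length() now counts bytes of the internal buffer
    $narrow_bytes = length $narrow;
    $wide_bytes   = length $wide;
}

print "equal: $same_string, buffer sizes: $narrow_bytes vs $wide_bytes\n";
```

Code that reads the raw buffer without checking the flag sees two different
byte sequences for what Perl considers the same string.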

Both of those gratuitously backwards-incompatible changes break lots of
existing code.

And the problem is that those bugs are not considered bugs but features.

> [1] Some parts of Perl break this concept. The regex engine is one of
> them, and has different semantics depending on the presence of the flag.
> This is a bug, but any fix would be incompatible.

In fact, some parts of perl break this concept and make code that worked
perfectly in 5.005 stop working, or work randomly, and that's not
considered a bug.
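
The flag-dependent regex behaviour from the quoted footnote can be observed
with a short script. This sketch assumes default semantics: no "use locale",
no "use feature 'unicode_strings'" and no /u or /a modifier on the pattern.
The variable names are illustrative only.

```perl
use strict;
use warnings;

my $plain    = "\xe9";       # e with acute accent, utf8 flag off
my $upgraded = "\xe9";
utf8::upgrade($upgraded);    # same character, utf8 flag on

my $equal            = $plain eq $upgraded ? 1 : 0;
my $plain_is_word    = $plain    =~ /\w/   ? 1 : 0;
my $upgraded_is_word = $upgraded =~ /\w/   ? 1 : 0;

print "equal: $equal, \\w on plain: $plain_is_word, ",
      "\\w on upgraded: $upgraded_is_word\n";
```

Under those assumptions the two scalars compare equal, yet only the upgraded
one matches \w.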

I wonder why it is ok to break large amounts of perl and xs code silently,
without even documenting how to fix it[1], while at the same time 5.10
introduced "use feature" to shield against possible breakage with far less of
an impact than the changes above.

[1] If it is documented, then anybody please show me why this:

   utf8::downgrade $s;
   unpack "C", $s;

is documented to have different effects from:

   unpack "C", $s;

i.e., where is it documented that perl doesn't upgrade the scalar in between
those lines? If you think it is obvious, how about this:

   my $s = chr 255; # to me, this is one octet. to perl, it might be one or
                    # two, or maybe more, who knows.
   warn unpack "C", $s;
   "$s\x{672c}";
   warn unpack "C", $s;
   $s .= "\x{672c}"; substr $s, 1, 1, "";
   warn unpack "C", $s;

Can a pure-Perl programmer tell what the output of this program is without
trying it? Should he be able to? I would say the answer is no to both.
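
One way to at least see what happens to the scalar is to trace the flag with
utf8::is_utf8. The sketch below follows the append-and-remove steps of the
program above and records the flag rather than the unpack results, since the
latter depend on the perl version in use. The @trace array is an addition for
illustration.

```perl
use strict;
use warnings;

my $s = chr 255;
my @trace;
push @trace, utf8::is_utf8($s) ? 1 : 0;   # freshly created: flag off

$s .= "\x{672c}";                          # appending a wide character upgrades $s
substr $s, 1, 1, "";                       # remove it again; $s is chr 255 once more
push @trace, utf8::is_utf8($s) ? 1 : 0;   # the flag stays on

my $still_equal = $s eq chr(255) ? 1 : 0;  # same string as far as Perl is concerned

utf8::downgrade($s);                       # explicit return to the 8-bit form
push @trace, utf8::is_utf8($s) ? 1 : 0;   # flag off again

print "flag trace: @trace, equal: $still_equal\n";
```

The string compares equal to chr 255 at every step, but its internal form
differs, and nothing short of the explicit utf8::downgrade brings it back.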

It is beyond me how people can introduce so much breakage to existing code
so lightly, forcing many modules to be changed and forcing pure-Perl
programmers to understand the perl interpreter sources to get their unicode
right.

That's a broken unicode model, and as long as those kinds of bugs are
considered features, perl programmers very much have to care about that
internal utf-x, utf-8, whatever flag.

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcg@goof.com
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
