develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From: Marc Lehmann
Date: March 30, 2007 16:05
Subject: Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID: 20070330230451.GE18872@schmorp.de

On Fri, Mar 30, 2007 at 09:53:52PM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:
> > > at, know, or set the UTF8 flag yourself. It's an internal bit, like IOK
> > > and NOK. [1]
> > That's not how current perl works.
> 
> We must have differing definitions, somewhere.

No. I have explained elsewhere that we quite agree on how it should be. It is
just that you make strange claims:

> The best approach to programming with unicode in mind, in Perl, is to
> (pretend to) be completely ignorant about Perl's internals with regards
> to encoding and the UTF8 flag.

It doesn't work even when not having unicode in mind. See unpack.

> The only exception is the regex engine, which has a big bug.

Uhm, no.

> Your powers-that-be, might be different. Also, don't confuse "you can
> know what Perl does internally" with "you have to know what Perl does
> internally".

In the example I gave, you have to.

> Just being able to access internal metadata doesn't mean you should
> actually do so on a daily basis. 

What's the alternative? Replace all my uses of unpack with explicit calls
to ord? Sorry, but that's completely unrealistic.

> It's also entirely possible to set the internal flag "UTF8" on an
> existing string. But for some reason a lot of people are complaining
> about that, and even more people have actually set UTF8 flags
> themselves...

Yes. Because you have to when interfacing with a gazillion of existing
modules (or at the very least clear or downgrade).

If perl didn't force people to know the internals so often, one could
certainly get away with telling them: do not touch downgrade/upgrade, and
certainly never utf8_on or is_utf8, it is from the devil.

But that's far from reality.

> >         unpack ('CCCCVCC', $$string);
> > that code is broken because the powers that be decided that "C" exposes the
> > internal encoding, while "V" doesn't.
> 
> Yes, any byte-specific operation on a text string (which I keep separate
> from character strings) will use the internal encoding. It has to use
> /some/ encoding, because it cannot see whether the string was meant as a
> byte string or a text string. Perl does not have strong typing.

That's wrong. There is a perfectly good definition for character and byte:
the one from C. It is a single element of a string. The same thing was true
in perl: one byte is one character, and it should be true under the new
model.

Nothing in pack or unpack requires a specific encoding, just as nothing in
perl should require me to know the specific encoding of "chr 200". It is a
single byte/character, regardless of how perl stores it internally.

> Personally, I think that unpack with a byte-specific signature should
> die, or at least warn, when its operand has the UTF8 flag set.

That's pure insanity. Then people would again be forced to know the internal
encoding. How can you tell people to not worry about internal encoding and in
the next paragraph force them to know because suddenly they are not allowed
to call unpack unless some _internal_ flag has some specific value?

I severely doubt you understood Perl's unicode model: It works by abstracting
away the internal flag completely, not forcing the user to deal with it.
Forcing her to deal with it is *wrong*.

> catch at least some of the cases, because the UTF8 flag always
> positively indicates that the string is a text string.

No, absolutely not. You are confused. The UTF-X flag only marks a specific
encoding used by perl internally. It says nothing about text or not text. You
can store binary just fine in a UTF-X marked string.
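
A minimal sketch of that point, assuming only the core utf8:: utility
functions. The byte values are a hypothetical stand-in for real binary
data; the flag changes how the string is stored, not what it contains:

```perl
use strict;
use warnings;

# Hypothetical stand-in for binary data: the first bytes of a PNG signature.
my $binary = pack "C*", 0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a;

my $flagged = $binary;
utf8::upgrade($flagged);      # changes the internal storage only

printf "flag set:       %s\n", utf8::is_utf8($flagged) ? "yes" : "no";
printf "same length:    %s\n", length($flagged) == length($binary) ? "yes" : "no";
printf "compares equal: %s\n", $flagged eq $binary ? "yes" : "no";

my $restored = $flagged;
utf8::downgrade($restored);   # back to one byte per character
printf "round trip ok:  %s\n", $restored eq $binary ? "yes" : "no";
```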

> (The reverse,
> however, is not true: a string without the UTF8 flag might be either a
> text string or a byte string.)

As might a string with the UTF-X flag set. Perl is typeless, it doesn't know
anything about text vs. binary.

> > That requires every perl programmer who decodes file headers etc.
> > using unpack to know about those internals.
> 
> No, it requires every Perl programmer to keep track of the function of
> every string.

No. A binary string is a binary string because it contains no characters
higher than 255. It is that simple.
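
That definition can be checked without looking at any flag. A rough
sketch, where is_binary_string is a hypothetical helper name and not
something perl provides:

```perl
use strict;
use warnings;

# Hypothetical helper: true when every character fits in one octet (0..255).
# It inspects character values only, never the internal UTF-X flag.
sub is_binary_string {
    my ($str) = @_;
    return $str !~ /[^\x00-\xff]/ ? 1 : 0;
}

my $octets  = pack "C*", 0, 127, 128, 255;
my $flagged = $octets;
utf8::upgrade($flagged);            # same characters, different storage
my $wide    = "abc\x{672c}";        # contains a character above 255

printf "plain octets:   %d\n", is_binary_string($octets);
printf "upgraded copy:  %d\n", is_binary_string($flagged);
printf "wide character: %d\n", is_binary_string($wide);
```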

> Byte strings and text strings must never be combined, and text strings
> must never undergo byte-specific operations.

That is certainly wrong.

> This again requires no knowledge of the actual encoding that Perl uses
> internally, whatsoever.

It does, for unpack, both in current perl as well as in your proposed change.

> Note that XS writers must have knowledge of Perl's internals. This has
> always been true, and is not specific to this fancy new Unicode thing.

Right. But why gratuitously break old code? In perl, it is broken by at least
unpack, in XS, it is broken by changing the meaning of SvPV.

> > And the problem is that those bugs are not considered bugs but features.
> 
> Some bugs are acknowledged as bugs, but won't be fixed anyway, because
> there is already a lot of code in the wild that depends on the bugs.

Again, I know a lot of code that is currently broken because of that
bug. I asked, but nobody found code "in the wild" that relies on that
specific bug.

> >    unpack "C", $s;
> 
> The C template for unpack is specifically documented as byte-specific.

No, it is specifically documented as being character-specific. Read your
manpage carefully:

                 c   A signed char value.
                 C   An unsigned C char (octet) even under Unicode.

(Note that byte and character are the same thing in C). That leaves us with
"octet". An octet is a number between 0 and 255 (you can give alternative
definitions that are equivalent to mine, though).

In perl this is an octet:

   $x = chr 200;

Yet unpack under some circumstances returns two values for this single
octet, and sometimes not. And the only way to know is to inspect the
internal UTF-X flag.
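
A small sketch to try this out, assuming only core perl. No expected
output is given for the upgraded copy, because what unpack "C" returns
there depends on the perl version; on the perl discussed in this thread
it exposes the internal bytes:

```perl
use strict;
use warnings;

my $x = chr 200;              # one octet
my $upgraded = $x;
utf8::upgrade($upgraded);     # same octet, stored differently

my @plain_values    = unpack "C*", $x;
my @upgraded_values = unpack "C*", $upgraded;

print "plain:    @plain_values\n";
print "upgraded: @upgraded_values\n";   # version-dependent, see above
print "strings compare equal: ", ($x eq $upgraded ? "yes" : "no"), "\n";
```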

> It should never be used on text strings. 

Perl is typeless. There is no such thing as a text string in Perl. The
problem, however, is not that it doesn't work on "text strings", whatever
that might be, the problem is that unpack doesn't work on binary strings,
or at least not all the time.

> If you properly keep text and byte strings separate, that means that
> your byte string was never upgraded, and that unpacking with "C" is
> reliable and predictable.

Uhhh, who guarantees that? JSON::XS does no such thing, and cannot
guarantee that, because Perl has no type for "text string" vs. "binary
string". So how do you suggest JSON::XS keeps text and byte strings
separate, if there is no way to detect the type of a string or make a
useful difference between those two?

> If upgrading happened even though the string was not mixed with text
> strings or used with unicode semantics, that is a bug. I'm very
> interested in these silent upgrades that you are experiencing.

Concatenating strings might upgrade them (e.g. in debugging output). More
so, JSON::XS currently can return either UTF-X encoded strings or non
UTF-X-encoded strings.
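
A sketch of such a silent upgrade, using the values from the quoted
snippet further down and assuming only the core utf8::is_utf8 function.
The variable names are made up for illustration:

```perl
use strict;
use warnings;

my $s = chr 255;
my $flag = sub { utf8::is_utf8($_[0]) ? "on" : "off" };

print "fresh:               ", $flag->($s), "\n";

my $debug = "$s\x{672c}";     # builds a new string; $s itself is not touched
print "after interpolation: ", $flag->($s), " (copy: ", $flag->($debug), ")\n";

$s .= "\x{672c}";             # appending a wide character upgrades $s
substr $s, 1, 1, "";          # remove it again; the flag stays set
print "after append+remove: ", $flag->($s), "\n";
printf "still one character with value %d\n", ord($s);
```

The string ends up holding exactly what it held at the start, yet its
internal representation has changed without any explicit request.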

You call that buggy. So please tell me how to fix that bug. How do I, when
decoding a JSON string, know whether it is one of your text or byte strings?
What's the difference, if neither JSON nor Perl makes one?

> > If you think it is obvious, how about this:
> > 
> >    my $s = chr 255; # to me, this is one octet. to perl, it might be one or
> >                     # two, or maybe more, who knows.
> >    warn unpack "C", $s;
> >    "$s\x{672c}";
> >    warn unpack "C", $s;
> >    $s .= "\x{672c}"; substr $s, 1, 1, "";
> >    warn unpack "C", $s;
> > Can a pure-Perl programmer tell what the output of this program is without
> > trying it? 
> 
> Not relevant.

Very relevant.

> > Should he be able to? 
> 
> No, because the author of this program made a big mistake in the line
> "$s\x{672c}".

Are you sure that upgraded? And why is it a mistake? I very much differ on
that.

> The casual reader can easily figure out that $s was meant as a byte
> string

I cannot, from that short fragment. Neither can Perl.

> it is used with unpack "C", which is known to be a byte
> operation. Because it is a byte string, the chr 255 is just a 0xFF
> octet, not a ÿ (&yuml;) conceptually.

Exactly. But unpack does not return 255 for that byte string.

> The casual reader can also easily figure out that \x{672c} is meant as a
> text string: any codepoint higher than \x{FF} is always a character,
> never a single byte.

Why? Lots of people use those higher codepoints. Perl certainly does
not mandate anything like that, so why do you try to enforce it? People
routinely do stuff like join "\x{100}", @png_images to separate them, and
it works fine.
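
A sketch of that idiom, with two short hypothetical byte strings standing
in for real PNG images. The separator works because no octet can ever be
equal to character 0x100:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for binary blobs such as PNG images.
my @blobs = (
    pack("C*", 0x89, 0x50, 0x4e, 0x47, 0xff),
    pack("C*", 0x00, 0xfe, 0x0a, 0x80),
);

my $joined = join "\x{100}", @blobs;      # one string, separator above 255
my @parts  = split /\x{100}/, $joined;    # take it apart again

for my $i (0 .. $#blobs) {
    printf "blob %d intact: %s\n", $i, $parts[$i] eq $blobs[$i] ? "yes" : "no";
}
```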

Perl's unicode model does not enforce a meaning of the codepoints used in
strings. It simply allows me to use more character indices than in 5.005.

> Then, the author of this snippet uses both the byte string $s and the
> text string "\x{672c}" joined in one string "$s\x{672c}". People not
> interested in fixing the code can stop reading there: the code is broken
> and its semantics not terribly relevant.

Thanks for gratuitously calling my code broken. In any case, explain to me
how to fix it in general, I only gave an example of silent upgrades.

   use JSON::XS;

   my $x = (from_json to_json [$y])[0];

is another silent upgrade users need to know about.

> People who wish to fix it, will
> have to try and figure out what the author really wanted to do here.

Exactly that.

> Because it's a contrived case, that's very hard to figure out.

Not at all. You are just guessing, and getting it wrong.

> sure that given real world values and variable names, there would be a
> clear and logical solution, to be found somewhere along the lines of
> encoding and decoding explicitly.

See above, figure it out in the real world then.

> > That's a broken unicode model
> 
> So far, I've only seen a broken understanding of the unicode model, and
> a broken regex engine.

Same here. Your model requires people knowing about the UTF-X flag (at
least in unpack). Mine doesn't, and I think mine is much closer to what
you want to achieve: not having to tell people about it. In your model
you would have to tell people to downgrade before unpacking a string, or
alternatively, you rule out a lot of perfectly fine Perl code on the
assumption that it is easy to figure out that it is broken. Sorry, but I
differ very much.
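
For illustration, this is roughly what "downgrade before unpacking" means
in practice. A sketch only: unpack_octets is a hypothetical wrapper name,
and the header bytes are made up; the template is the one from the
example quoted above:

```perl
use strict;
use warnings;

# Hypothetical wrapper: force a one-byte-per-character representation
# before unpacking. Works on a copy, so the caller's string is untouched.
# utf8::downgrade dies if the string has a character above 255.
sub unpack_octets {
    my ($template, $data) = @_;
    utf8::downgrade($data);
    return unpack $template, $data;
}

# Made-up file header matching the 'CCCCVCC' template.
my $header   = pack "CCCCVCC", 1, 2, 3, 4, 0x12345678, 200, 255;
my $upgraded = $header;
utf8::upgrade($upgraded);     # e.g. after a silent upgrade elsewhere

my @from_plain    = unpack_octets("CCCCVCC", $header);
my @from_upgraded = unpack_octets("CCCCVCC", $upgraded);

print "plain:    @from_plain\n";
print "upgraded: @from_upgraded\n";
```

Every caller of unpack has to remember to go through such a wrapper,
which is exactly the kind of knowledge about internals at issue here.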

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcg@goof.com
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
