develooper Front page | perl.perl5.porters | Postings from January 2005

Re: [perl #33734] unpack fails on utf-8 strings

From: pcg
Date: January 14, 2005 09:04
Subject: Re: [perl #33734] unpack fails on utf-8 strings
Message ID: 20050114161832.GB667@schmorp.de

On Thu, Jan 13, 2005 at 09:24:16PM -0000, Nicholas Clark via RT <perlbug-followup@perl.org> wrote:
> On Thu, Jan 13, 2005 at 01:56:33PM +0000, Nicholas Clark wrote:
> 
> > I didn't know, but looking at the pack implementation, it's 'U', and only 'U':
> 
> Seems to be 'C' and 'U'

But "C" is documented as:

                   An unsigned char value.  Only does bytes.  See U for
                   Unicode.

I assume that "byte" == "octet", which is not generally true in perl, but
is common language.

While "U" is documented as:

                   A Unicode character number.  Encodes to UTF-8
                   internally (or UTF-EBCDIC in EBCDIC platforms).

If that indeed encodes to a unicode string, not to UTF-8 (as the confusing
"internally" seems to imply), it is badly worded. In any case, this is the
only flag, and I do remember the discussion on perl5-porters that it is
broken and not easily supportable, as Encode provides the correct way to
do that, and much more.
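
To make the wording problem concrete, here is a minimal sketch (an illustration, not taken from the thread) that compares what pack "U" returns with what Encode::encode returns for the same code point. The variable names are mine, and I am assuming the Encode module that ships with perl 5.8 and later:

```perl
use strict;
use warnings;
use Encode ();

# pack "U" yields a character string holding the code point itself.
my $chars  = pack "U", 0x100;

# Encode::encode yields the UTF-8 octets for that character string.
my $octets = Encode::encode("UTF-8", $chars);

printf "pack U : %d character(s)\n", length $chars;
printf "encode : %d octet(s): %s\n", length $octets,
    join " ", map { sprintf "%02x", ord } split //, $octets;
```

At the perl level the pack "U" result is one character, while the Encode result is the two octets c4 80, which is why "encodes to UTF-8 internally" reads as misleading.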

> After changing t/op/join.t to avoid using H* to probe the innards of UTF8
> scalars, the appended diff does make all tests pass. However, I'm not
> convinced that it's the way to go.

Well, then how do you propose to fix the situation? The current behaviour
is completely erratic, as the same scalar is interpreted differently by
unpack, depending on the perl version and its usage history.

The question, if this is not the right fix, is what the semantics of
unicode strings as arguments to unpack are?

As of now, I don't see how I can control this on the perl level, as perl
can essentially upgrade my scalar anytime. The data in the scalar will
stay the same, by definition, except for unpack, which acts in a rather
undefined way.
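
A small sketch of what "the data stays the same" means, using the core utf8::upgrade and utf8::is_utf8 functions (my choice for the illustration; the variable names are hypothetical). The upgrade here is explicit, standing in for whatever operation causes perl to upgrade the scalar on its own:

```perl
use strict;
use warnings;

my $x = "\xff\xff";      # two characters, both below 256
my $y = $x;
utf8::upgrade($y);       # changes only the internal representation

# At the perl level the two scalars are indistinguishable ...
my $same_string = ($x eq $y) ? 1 : 0;
my $same_length = (length($x) == length($y)) ? 1 : 0;

# ... but the internal UTF8 flag differs.
my $x_flag = utf8::is_utf8($x) ? 1 : 0;
my $y_flag = utf8::is_utf8($y) ? 1 : 0;

print "eq: $same_string, same length: $same_length\n";
print "UTF8 flag: x=$x_flag y=$y_flag\n";
```

Both scalars compare equal and have the same length, yet only $y carries the flag, and that flag is the hidden state that decides what unpack sees.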

In any case, the current state of undocumented random breakage is a
bug. It either needs to get defined semantics, or be made compatible with
the rest of perl.

Try this:

   $x = "\xff\xff";
   $x =~ /\x{100}/;

   die unpack "n", $x;

Can you guess the output of this program without running it? For every
current and future perl version? What about other operations? What about
this:

   $x = "\xff\xff";
   $y = "\x{100}";
   substr $y, 0, 1, "";

   die unpack "n", "$x$y";

So concatenation with an empty string suddenly changes how the first two
bytes are being interpreted?

It gets worse:

   use utf8;
   $x = "\xff\xff echt übel";
   die unpack "n", $x;

Now the output suddenly depends on the normalization form used by the text
editor that saved the source, or on whether perl does normalization or not.

Does I/O to a utf-8-encoded file change the unpack outcome? Does
interpolation into a utf-8-encoded string change the unpack outcome? Has
this been the case in earlier versions? Will this stay the same in future
versions?

Why do all of the above code snippets have defined behaviour, whatever the
answers to those questions, when I match, substr or concatenate the
scalars, but not when unpack is involved?

Upgrading behaviour can change with every perl version, and often enough
has.

If that is the intended behaviour, then I would be happy if its semantics
were clarified somewhere. Saying "perl might change your scalar anytime
behind your back and thus change unpack results" sounds a bit like "your
memory has corrupted bits".

Forcing developers to paste a call to Encode between every scalar and
unpack to avoid circumstances, module usage or future versions changing
its unpack value is not what I would call deterministic behaviour.
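
For illustration, this is roughly what such a guard would look like. It is a sketch under my own assumptions: unpack_octets is a hypothetical helper, and it uses the core utf8::downgrade function rather than Encode to force a byte representation before calling unpack:

```perl
use strict;
use warnings;

# Hypothetical helper: unpack from a scalar whose characters are treated
# as octets, independent of its internal representation.
sub unpack_octets {
    my ($template, $data) = @_;    # $data is a copy, the caller's scalar is untouched
    # With a true second argument utf8::downgrade returns false
    # instead of dying when a character does not fit into one octet.
    utf8::downgrade($data, 1)
        or die "unpack_octets: string contains characters above 0xFF\n";
    return unpack $template, $data;
}

my $plain    = "\xff\xff";
my $upgraded = "\xff\xff";
utf8::upgrade($upgraded);          # same string, different internal form

my $from_plain    = unpack_octets("n", $plain);
my $from_upgraded = unpack_octets("n", $upgraded);
print "$from_plain $from_upgraded\n";
```

Both calls give 65535 whether or not the scalar was upgraded, but only because every unpack call now goes through the wrapper.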

In any case, if the old behaviour is to stay it simply needs to be defined
behaviour. If there is no way to make it behave deterministically on the perl
level, it isn't defined behaviour in my eyes.

This is just a time bomb (and it exploded in my code, and might do so in
other code).

-- 
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcg@goof.com
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE
