pack/unpack U ?

From: Jarkko Hietaniemi
Date: October 29, 2000 09:57
Subject: pack/unpack U ?
Message ID: 20001029115749.F22940@chaos.wustl.edu

I just closed a couple of bugs that assumed that pack/unpack U would
'magically' work as a wide-character aware "C".  Examples:

	unpack("U*", "\xDD")
	unpack("U*", "\x{c2}")
	unpack("U*", "\x{80}")

With the most current bleedperl these will warn as follows:

	Malformed UTF-8 character (1 byte, need 2)
	Malformed UTF-8 character (1 byte, need 2)
	Malformed UTF-8 character (unexpected continuation byte 0x80)

and return 65533 (0xfffd, the Unicode 'replacement character').
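
To try the three cases on whatever perl is at hand, something like the following sketch may help. The helper name probe_unpack_u is made up for this illustration; it only captures the return values and warnings so they can be compared with the ones quoted above. No expected output is shown, since the result depends on the perl version and may well differ from the bleedperl behaviour described here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perl): run unpack("U*", ...) on one
# input and collect both the returned values and any warnings emitted.
sub probe_unpack_u {
    my ($input) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my @code_points = unpack("U*", $input);
    return (\@code_points, \@warnings);
}

# The three single-byte inputs from the bug reports.
for my $input ("\xDD", "\x{c2}", "\x{80}") {
    my ($code_points, $warnings) = probe_unpack_u($input);
    printf "input 0x%02X -> %s\n", ord($input), join(",", @$code_points);
    print  "  warning: $_" for @$warnings;
}
```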

I closed the bugs because the Camel III says, p. 408, Chapter 15,
Unicode, Effects of Character Semantics:

        ... However, there is a new "U" specifier that will convert
        between UTF-8 characters and integers:

                pack("U*", 1, 20, 300, 4000) eq v1.20.300.4000

        ... In other words, chr and ord are like pack("U") and unpack("U"),
        not like pack("C") and unpack("C").
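
The quoted claims can be checked with a short script along these lines. It is only a sketch assuming a reasonably recent perl; the variable names are mine and not from the book.

```perl
use strict;
use warnings;

# pack "U*" builds a string from integer code points.
my $packed = pack("U*", 1, 20, 300, 4000);

# The Camel's example: the result should equal the v-string literal.
my $same_as_vstring = $packed eq v1.20.300.4000;

# unpack "U*" goes the other way, back to the integers.
my @code_points = unpack("U*", $packed);

# chr and ord compared with pack("U") and unpack("U").
my $chr_matches = chr(300) eq pack("U", 300);
my $ord_matches = ord(pack("U", 4000)) == 4000;

print "v-string match: ", ($same_as_vstring ? "yes" : "no"), "\n";
print "round trip:     @code_points\n";
print "chr like pack:  ", ($chr_matches ? "yes" : "no"), "\n";
print "ord like unpack:", ($ord_matches ? " yes" : " no"), "\n";
```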

Since UTF-8 is explicitly mentioned I think that "U" is not meant to
be 'a wide-character-aware "C"'?

Now, what do you think?  Is the current bleedperl doing the right thing --
or should "U" act like "C" if the string turns out to be invalid as UTF-8?
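
To make the second alternative concrete, here is a rough sketch of what "act like C when the string is not valid UTF-8" could look like in user code. This is not what perl does, and unpack_u_or_c is a hypothetical helper invented for this illustration; it leans on the Encode module to decide validity, which is my assumption rather than anything proposed in this thread.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: treat the input as UTF-8 bytes if it decodes
# cleanly, otherwise fall back to one value per byte, as "C*" would give.
sub unpack_u_or_c {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode with a CHECK flag may modify its argument
    my $chars = eval { Encode::decode("UTF-8", $copy, Encode::FB_CROAK) };
    if (defined $chars) {
        return map { ord($_) } split(//, $chars);
    }
    return unpack("C*", $bytes);
}

print join(",", unpack_u_or_c("\xC4\xAC")), "\n";   # valid UTF-8 for U+012C
print join(",", unpack_u_or_c("\xDD")), "\n";       # lone lead byte, falls back
```

The obvious cost of such a fallback is that the same call silently means two different things depending on the data, which is part of what the question above is about.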

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen


