
pack/unpack U ?

Jarkko Hietaniemi
October 29, 2000 09:57
I just closed a couple of bugs that assumed that pack/unpack U would
'magically' work as a wide-character aware "C".  Examples:

	unpack("U*", "\xDD")
	unpack("U*", "\x{c2}")
	unpack("U*", "\x{80}")

With the most current bleedperl these will warn as follows:

	Malformed UTF-8 character (1 byte, need 2)
	Malformed UTF-8 character (1 byte, need 2)
	Malformed UTF-8 character (unexpected continuation byte 0x80)

and return 65533 (0xfffd, the Unicode 'replacement character').
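For reference, the wording of those warnings follows from the UTF-8 bit patterns: 0xDD and 0xC2 both look like the lead byte of a 2-byte sequence, while 0x80 looks like a continuation byte with nothing before it. A minimal sketch of that classification follows. The helper name utf8_byte_role is made up for illustration, and it looks at bit patterns only (no overlong checks, none of Perl's extended encodings):

```perl
use strict;
use warnings;

# Hypothetical helper: classify one byte value by its UTF-8 bit pattern.
# This only inspects the high bits; it does not validate whole sequences.
sub utf8_byte_role {
    my ($byte) = @_;
    return 'ascii'        if $byte < 0x80;             # 0xxxxxxx
    return 'continuation' if ($byte & 0xC0) == 0x80;   # 10xxxxxx
    return 'lead-2'       if ($byte & 0xE0) == 0xC0;   # 110xxxxx
    return 'lead-3'       if ($byte & 0xF0) == 0xE0;   # 1110xxxx
    return 'lead-4'       if ($byte & 0xF8) == 0xF0;   # 11110xxx
    return 'other';                                    # 0xF8 and above
}

# The three bytes from the examples above.
printf "0x%02X: %s\n", $_, utf8_byte_role($_) for 0xDD, 0xC2, 0x80;
```

So a lone 0xDD or 0xC2 is "1 byte, need 2", and a lone 0x80 is an unexpected continuation byte.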

I closed the bugs because the Camel III says, p. 408, Chapter 15,
Unicode, Effects of Character Semantics:

        ... However, there is a new "U" specifier that will convert
        between UTF-8 characters and integers:

                pack("U*", 1, 20, 300, 4000) eq v1.20.300.4000

        ... In other words, chr and ord are like pack("U") and unpack("U"),
        not like pack("C") and unpack("C").
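The round trip the Camel describes can be checked with a short script. This is only a sketch of the documented behaviour for well-formed input; the variable names are mine, and it says nothing about what should happen with the malformed strings above:

```perl
use strict;
use warnings;

my @code_points = (1, 20, 300, 4000);

# pack "U*" turns a list of code points into a string of characters.
my $packed = pack("U*", @code_points);

# The Camel's own example: the result equals the v-string literal.
my $same_as_vstring = $packed eq v1.20.300.4000;

# unpack "U*" goes the other way, back to the integers.
my @round_trip = unpack("U*", $packed);

# chr and ord line up with pack("U") and unpack("U").
my $chr_like_pack   = chr(300) eq pack("U", 300);
my $ord_like_unpack = ord(chr(300)) == unpack("U", chr(300));

print "pack eq v-string: ", ($same_as_vstring ? "yes" : "no"), "\n";
print "round trip:       @round_trip\n";
print "chr like pack U:  ", ($chr_like_pack   ? "yes" : "no"), "\n";
print "ord like unpack:  ", ($ord_like_unpack ? "yes" : "no"), "\n";
```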

Since UTF-8 is explicitly mentioned I think that "U" is not meant to
be 'a wide-character-aware "C"'?

Now, what do you think?  Is the current bleedperl doing the right thing --
or should "U" act like "C" if the string turns out to be invalid as UTF-8?
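To make the second alternative concrete, here is one possible reading of "act like C on invalid input", written as ordinary Perl on top of the Encode module rather than as a change to unpack itself. The function name code_points_or_bytes is hypothetical, and this is not what bleedperl does: it decodes strictly and, if that fails, falls back to plain byte values.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return code points if $bytes is well-formed UTF-8,
# otherwise return the raw byte values, as unpack("C*") would.
sub code_points_or_bytes {
    my ($bytes) = @_;

    # decode() may modify its argument when a CHECK value is given,
    # so work on a copy.
    my $copy  = $bytes;
    my $chars = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };

    if (defined $chars) {
        return map { ord($_) } split(//, $chars);   # valid: code points
    }
    return unpack("C*", $bytes);                    # invalid: bytes
}

print join(" ", code_points_or_bytes("\xC4\xAC")), "\n";  # UTF-8 for U+012C
print join(" ", code_points_or_bytes("\x80")), "\n";      # stray continuation
```

The obvious cost of such a fallback is that the same call returns code points for some inputs and byte values for others, with nothing to tell the caller which one it got.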

$jhi++; #
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen