pack/unpack U ?
From: Jarkko Hietaniemi
Date: October 29, 2000 09:57
Subject: pack/unpack U ?
Message ID: 20001029115749.F22940@chaos.wustl.edu
I just closed a couple of bugs that assumed that pack/unpack U would
'magically' work as a wide-character aware "C". Examples:
unpack("U*", "\xDD")
unpack("U*", "\x{c2}")
unpack("U*", "\x{80}")
With the most current bleedperl these will warn as follows:
Malformed UTF-8 character (1 byte, need 2)
Malformed UTF-8 character (1 byte, need 2)
Malformed UTF-8 character (unexpected continuation byte 0x80)
and return 65533 (0xfffd, the Unicode 'replacement character').
I closed the bugs because the Camel III says, p. 408, Chapter 15,
Unicode, Effects of Character Semantics:
... However, there is a new "U" specifier that will convert
between UTF-8 characters and integers:
pack("U*", 1, 20, 300, 4000) eq v1.20.300.4000
... In other words, chr and ord are like pack("U") and unpack("U"),
not like pack("C") and unpack("C").
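The relationship the Camel describes can be checked with a short script. This is only a sketch, and it assumes a reasonably modern perl; the bleedperl of October 2000 may not behave identically. The variable names are illustrative.

```perl
use strict;
use warnings;

# pack("U*") builds a character string from a list of code points.
my $packed = pack("U*", 1, 20, 300, 4000);

# unpack("U*") recovers the same code points from that string.
my @back = unpack("U*", $packed);
print "round trip: @back\n";

# The Camel's own example: the packed string equals the v-string literal.
print "matches v-string\n" if $packed eq v1.20.300.4000;

# "chr and ord are like pack('U') and unpack('U')":
print "pack U   agrees with chr\n" if pack("U", 300) eq chr(300);
print "unpack U agrees with ord\n" if unpack("U", chr(300)) == ord(chr(300));
```

All four checks work on character strings that pack("U") or chr produced, which is the case the Camel passage covers. None of them feeds raw, possibly malformed bytes to unpack("U"), which is the case the closed bug reports were about.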
Since UTF-8 is explicitly mentioned I think that "U" is not meant to
be 'a wide-character-aware "C"'?
Now, what do you think? Is the current bleedperl doing the right thing --
or should "U" act like "C" if the string turns out to be invalid as UTF-8?
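To make the second alternative concrete, here is a minimal sketch of what "act like C when the string is not valid UTF-8" could mean. The helper codepoints_utf8_or_bytes is hypothetical and is not part of perl; it also relies on utf8::decode, which exists in later perls and is assumed here purely for illustration. It is not a proposal for how pack.c should be written.

```perl
use strict;
use warnings;

# Hypothetical helper: read the octets as UTF-8 if they are well-formed,
# otherwise fall back to one integer per byte, as unpack("C*") would give.
sub codepoints_utf8_or_bytes {
    my ($octets) = @_;
    my $copy = $octets;              # utf8::decode converts in place
    if (utf8::decode($copy)) {       # true only for well-formed UTF-8
        return map { ord } split //, $copy;
    }
    return unpack("C*", $octets);    # the "act like C" fallback
}

# "\xC4\xAC" is the well-formed UTF-8 encoding of code point 300.
my @valid = codepoints_utf8_or_bytes("\xC4\xAC");

# "\xDD" is a lone lead byte, one of the cases from the bug reports.
my @invalid = codepoints_utf8_or_bytes("\xDD");

print "valid:   @valid\n";     # 300
print "invalid: @invalid\n";   # 221
```

With the strict reading, the second call would instead warn and yield 65533. The fallback silently gives a different answer for the same bytes depending on whether they happen to form valid UTF-8, and that ambiguity is the cost of the "act like C" option.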
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen