On Sat, Mar 31, 2007 at 01:53:48AM +0200, Marc Lehmann wrote:
> In C, a single byte is a character, even if it happens to have a value
> higher than 255 (although very few compilers allow that, usually, a byte
> is an octet, although it is common on DSPs to have 32 bit bytes).
>
> Even if Perl encoded a single character into multiple C bytes/octets, that
> does not mean it's more than a single character.
>
> The documentation is completely contradictory when it comes to "C" and can
> easily be interpreted to mean a single character in the C sense.
>
> Fact is "even under Unicode" it doesn't work as advertised, because Unicode
> can be internally represented in multiple ways in Perl.
>
> > I think that "char value" should be either removed from perlfunc, or
> > explained in more detail. It's NOT OBVIOUS to those who don't know C.
>
> To those who do know C it has perfectly clear meaning, namely a single
> character.

http://www.parashift.com/c++-faq-lite/intrinsic-types.html#faq-26.3

But that is not really relevant to the discussion.

Communication is difficult if you cannot express clearly what you are
trying to say. Terminology is important to get correct, and it is easy to
confuse others or yourself if you are not precise when you need to be.

Unicode does not even HAVE characters, it has codepoints. This did not
happen by accident and is an important distinction to make.

    $x = "ABCD";
    $x = "\x41\x42\x43\x44";
    $x = chr(65) . chr(66) . chr(67) . chr(68);
    $x = pack("C*", 65, 66, 67, 68);

All of these put the same data into $x. [1]

We can reasonably assume that $x contains a sequence of 4 bytes, each 8
bits wide. We do not know anything else about $x: whether it has an
encoding, whether it is actually the output of pack "V", or whether it
came after "HTTP/1.1 GET ". The only reasonable thing to assume is that it
is just a sequence of octets, aka binary data.

Now consider the case of

    $y = chr(1000);

Clearly whatever is in $y cannot be a single octet.
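
For anyone who wants to check the two cases above, here is a minimal
sketch. It assumes an ASCII platform (see [1]) and a perl recent enough to
have the built-in utf8::is_utf8 function; the array @x and the output
format are only for illustration.

```perl
use strict;
use warnings;

# Four spellings of the same four-octet string (ASCII platforms only).
my @x = (
    "ABCD",
    "\x41\x42\x43\x44",
    chr(65) . chr(66) . chr(67) . chr(68),
    pack("C*", 65, 66, 67, 68),
);

for my $x (@x) {
    # Each variant compares equal, has length 4, and has not been upgraded.
    printf "%s eq=%d len=%d upgraded=%d\n",
        $x, ($x eq "ABCD" ? 1 : 0), length($x), (utf8::is_utf8($x) ? 1 : 0);
}

# chr(1000) is a single codepoint, but it cannot fit in a single octet,
# so perl stores it as an upgraded string.
my $y = chr(1000);
printf "len=%d ord=%d upgraded=%d\n",
    length($y), ord($y), (utf8::is_utf8($y) ? 1 : 0);
```

The first four lines should all report eq=1 len=4 upgraded=0, and the last
line len=1 ord=1000 upgraded=1.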

The way Perl currently works (and this is my limited understanding here -
someone with more knowledge can feel free to step in and correct my
errors) is that now $y is considered to be a string of Unicode codepoints.
So $y contains a single codepoint, U+03E8. An internal flag (SvUTF8) is
used to indicate that the internal data pointer points to something that
is a "Unicode codepoint string".

What can we do with such a string? We can try to print it, but if we have
not converted it we get a message like

    Wide character in print at - line 1.

and we get the bytes "cf a8" as output because that is the internal
encoding (UTF-8).

    print unpack("H*", $y);

produces "cfa8" as output, again because we have been given access to the
string as it exists upgraded. On the other hand,

    print unpack("H*", pack("C", 1000));

produces "e8".

So consider again:

    unpack("C*", $y);

This currently produces the list (207, 168) which is again the internal
encoding. What else should it do? If you expect values over 255, then you
should not use "C". If you don't have values over 255, then why is your
string not just a sequence of bytes? Something must have occurred to
upgrade it to "sequence of Unicode codepoints".

Of course if you have values over 255 you have to use "U" in unpack, that
only makes sense! On the other hand, if you are agnostic to your string
and just treat it as "data" then it will never get upgraded. So where is
the issue?

It sounds to me that what you are trying to suggest is something along the
lines of another type of Sv for the case of "unicode codepoint sequence",
so that SvPV implicitly means "This scalar is not upgraded and is just
data" and SvP_UnicodeArrayValue_ would contain the upgraded value. Then
for anything that wanted a SvPV (XS code, unpack "C") the only sensible
thing would be to try to downgrade the string at that point and then emit
a warning in the case of "wide characters" being present.
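
A hedged sketch of the same experiments, written so that it does not
depend on how unpack "C" treats an upgraded string. The results quoted
above are what the perl of this thread prints; as far as I know, from
perl 5.10 on pack and unpack work character by character by default, so
there unpack("C*", $y) no longer exposes the internal octets. Making the
UTF-8 encoding explicit with utf8::encode sidesteps that. The variable
names are only for illustration.

```perl
use strict;
use warnings;

my $y = chr(1000);    # one codepoint, U+03E8

# "U" reads codepoints, so values above 255 survive.
my @codepoints = unpack("U*", $y);

# Encode a copy to UTF-8 octets explicitly, then look at the bytes.
my $octets = $y;
utf8::encode($octets);                  # $octets is now plain bytes
my @bytes = unpack("C*", $octets);
my $hex   = unpack("H*", $octets);

# pack "C" keeps only the low 8 bits of 1000 (and warns if warnings are on).
my $wrapped = do { no warnings; unpack("H*", pack("C", 1000)) };

# A string holding a wide character cannot be downgraded to bytes.
# The second argument asks utf8::downgrade to return false instead of dying.
my $copy = $y;
my $ok   = utf8::downgrade($copy, 1);

print "@codepoints | @bytes | $hex | $wrapped | ", ($ok ? "ok" : "wide"), "\n";
```

This should print "1000 | 207 168 | cfa8 | e8 | wide". The failed downgrade
in the last step is roughly the "downgrade, and complain about wide
characters" behaviour described above, done by hand.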

This is the point at which someone more familiar with internals chimes in
and says "This has problems [backwards compatibility, tuits, other]."

And of course this would preclude being able to inspect Perl's internal
Unicode representation using unpack "C". :)

--
-Ben Carter

Human beings, who are almost unique in having the ability to learn from
the experience of others, are also remarkable for their apparent
disinclination to do so. - Douglas Adams, "Last Chance to See"

[1] I am deliberately ignoring the box in the corner labeled "EBCDIC".