Glenn Linderman wrote:
> Dr.Ruud:
>> Juerd Waalboer:

>>> Perl doesn't have an ascii/utf8
>>> distinction, it has a latin1/utf8 distinction.
>>
>> In Perl 5.8.6, the documentation of the function pack contains:
>>
>>     A   A text (ASCII) string, will be space padded.
>>
>> Maybe that should then read
>>
>>     A   A textual octet string (latin1), will be space padded.
>>
>
> You might think so, but:
>
> 1) Perl doesn't presently have any latin-1 specific semantics, except
> that the conversion of bytes buffers to multi-bytes buffers uses a
> numeric equality conversion that, because latin-1 is a subset of
> Unicode, results in what seems to be a latin-1 to Unicode conversion.
> This is a convenient accident, but bytes buffers are _not_ given
> latin-1 semantics anywhere else in perl (except encode, and then only
> if it is specified).
>
> 2) The pack-A template code does not require the buffer to be ASCII.
> Nor does it require it to be latin-1. Nor does it require it to be a
> bytes buffer, although it treats it as one -- if it is a multi-bytes
> buffer pack-A will happily do a byte-wise read of the buffer, and
> include the appropriate number of bytes (not characters) in the
> packed output.
>
> 3) The only "character set" semantics involved in the A template code
> is the "space padded" part... and for that it uses the binary value
> 32, which happens to work as a space character in ASCII, latin-1,
> Unicode, and nearly every other character set.
>
> Proof: in the below, the 3rd parameter to pack is clearly a multi-bytes
> buffer, as it has a character code > 255.
> However, in the first instance the character code takes 2 bytes in the
> multi-bytes buffer, and in the second instance, it takes 5 bytes in
> the multi-bytes buffer... and is clearly outside the Unicode
> codepoint range (over twice the maximum codepoint value). The buffer
> produced by pack (view the below in a fixed-width font to verify that
> the C characters at the end of the output actually line up, C?
> See?) demonstrates that the multi-bytes buffer is treated as a bytes
> buffer of length 8 in the first case and length 11 in the second case.
> And the lack of "wide-character" warnings proves that the result of
> pack is not a multi-bytes buffer.
>
> d:\>perl -e "print pack('cA20c', 67, qq<bcd\x{0234}efg>, 67 )"
> Cbcd╚┤efg            C
> d:\>perl -e "print pack('cA20c', 67, qq<bcd\x{02346789}efg>, 67 )"
> Cbcd·ìå₧ëefg         C

I don't think that contradicts what I said. As long as the UTF8 flag
is not set, the string returned by pack with the "A" template is a
latin1 buffer (unless a locale is active).

-- 
Affijn, Ruud

"Gewoon is een tijger."
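
For anyone who wants to check the byte counts quoted above on their own perl, here is a minimal sketch. It is not taken from the thread: the variable names are illustrative, and it assumes the string is first turned into UTF-8 octets explicitly with utf8::encode, so that pack is handed a plain bytes buffer (how pack treats a string that still has the UTF8 flag on may differ between perl releases).

```perl
use strict;
use warnings;

# Illustrative names; not from the original post.
my $chars = "bcd\x{0234}efg";    # 7 characters, one of them > 255

# Make an explicit octet copy, so pack sees a bytes buffer.
my $octets = $chars;
utf8::encode($octets);           # in place: characters -> UTF-8 octets

# 'A20' space-pads the octets to 20 bytes (with byte value 32);
# each 'c' adds one byte (67 is "C") before and after.
my $packed = pack('cA20c', 67, $octets, 67);

printf "characters: %d\n", length $chars;
printf "octets:     %d\n", length $octets;
printf "packed:     %d\n", length $packed;
printf "padding:    %d\n", ($packed =~ tr/ //);   # count the space bytes
```

This reports 7 characters, 8 octets, a 22-byte packed result and 12 bytes of padding, which matches the "length 8" figure for the first case in the quoted message.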