On approximately 4/4/2007 3:19 AM, came the following characters from the keyboard of Dr.Ruud:
> Juerd Waalboer wrote:
>
>> Perl doesn't have an ascii/utf8
>> distinction, it has a latin1/utf8 distinction.
>
> In Perl 5.8.6, the documentation of the function pack contains:
>
>     A  A text (ASCII) string, will be space padded.
>
> Maybe that should then read
>
>     A  A textual octet string (latin1), will be space padded.

You might think so, but:

1) Perl doesn't presently have any latin-1 specific semantics, except that the conversion of bytes buffers to multi-bytes buffers uses a numeric equality conversion that, because latin-1 is a subset of Unicode, results in what seems to be a latin-1 to Unicode conversion. This is a convenient accident, but bytes buffers are _not_ given latin-1 semantics anywhere else in perl (except encode, and then only if it is specified).

2) The pack-A template code does not require the buffer to be ASCII. Nor does it require it to be latin-1. Nor does it require it to be a bytes buffer, although it treats it as one -- if it is a multi-bytes buffer, pack-A will happily do a byte-wise read of the buffer, and include the appropriate number of bytes (not characters) in the packed output.

3) The only "character set" semantics involved in the A template code is the "space padded" part... and for that it uses the binary value 32, which happens to work as a space character in ASCII, latin-1, Unicode, and nearly every other character set.

Proof: in the two commands below, the 3rd parameter to pack is clearly a multi-bytes buffer, as it has a character code > 255. However, in the first instance the character code takes 2 bytes in the multi-bytes buffer, and in the second instance it takes 5 bytes in the multi-bytes buffer... and is clearly outside the Unicode codepoint range (over twice the maximum codepoint value).

The buffer produced by pack (view the output below in a fixed-width font to verify that the C characters at the end of the output actually line up, C? See?) demonstrates that the multi-bytes buffer is treated as a bytes buffer of length 8 in the first case and length 11 in the second case. And the lack of "wide-character" warnings proves that the result of pack is not a multi-bytes buffer.

d:\>perl -e "print pack('cA20c', 67, qq<bcd\x{0234}efg>, 67 )"
Cbcd╚┤efg            C
d:\>perl -e "print pack('cA20c', 67, qq<bcd\x{02346789}efg>, 67 )"
Cbcd·ìå₧ëefg         C

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
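
P.S. The "length 8" figure for the first command can also be checked without relying on how pack reads a multi-bytes buffer. What follows is only a minimal sketch, assuming the multi-bytes buffer holds the UTF-8 encoding of the string; the variable names are illustrative. It makes the byte view explicit with utf8::encode and then packs the resulting bytes buffer with the same template, so the padding to 20 bytes with value 32 is visible in the lengths.

```perl
use strict;
use warnings;

# Same test string as in the first command: three ASCII characters,
# U+0234, then three more ASCII characters.
my $chars = "bcd\x{0234}efg";

# Make the byte view explicit. utf8::encode converts the string, in place,
# to its UTF-8 octets (assumed here to match the multi-bytes buffer).
my $bytes = $chars;
utf8::encode($bytes);

# Pack the octets with the same template used in the post.
my $packed = pack('cA20c', 67, $bytes, 67);

# Report the three lengths, then the numeric values of the padding bytes.
printf "chars=%d bytes=%d packed=%d\n",
    length($chars), length($bytes), length($packed);
printf "padding=%s\n",
    join ' ', map { ord } split //, substr($packed, 9, 12);
```

This should report 7 characters, 8 bytes, and a 22-byte packed result (one byte for each c plus the 20 bytes of the A field), with every padding byte equal to 32.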