perl.perl5.porters | Postings from April 2007

Re: perl, the data, and the tf8 flag

Glenn Linderman
April 4, 2007 10:20
On approximately 4/4/2007 3:19 AM, came the following characters from 
the keyboard of Dr.Ruud:
> Juerd Waalboer wrote:
>> Perl doesn't have an ascii/utf8
>> distinction, it has a latin1/utf8 distinction.
> In Perl 5.8.6, the documentation of the function pack contains:
>     A   A text (ASCII) string, will be space padded. 
> Maybe that should then read 
>     A   A textual octet string (latin1), will be space padded. 

You might think so, but:

1) Perl doesn't presently have any latin-1 specific semantics, except 
that the conversion of bytes buffers to multi-bytes buffers uses a 
numeric equality conversion that, because latin-1 is a subset of 
Unicode, results in what seems to be a latin-1 to Unicode conversion. 
This is a convenient accident, but bytes buffers are _not_ given latin-1 
semantics anywhere else in perl (except encode, and then only if it is 

2) The pack-A template code does not require the buffer to be ASCII. Nor 
does it require it to be latin-1. Nor does it require it to be a bytes 
buffer, although it treats it as one -- if it is a multi-bytes buffer 
pack-A will happily do a byte-wise read of the buffer, and include the 
appropriate number of bytes (not characters) in the packed output.

3) The only "character set" semantics involved in the A template code is 
the "space padded" part... and for that it uses the binary value 32, 
which happens to work as a space character in ASCII, latin-1, Unicode, 
and nearly every other character set.
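
Points 1 and 3 can be checked with a small script. This is only a 
sketch: the sample string "caf\xE9" and the variable names are made up 
for illustration, and utf8::upgrade is used here as one way to force 
the bytes-to-multi-bytes conversion by hand.

```perl
use strict;
use warnings;

# A bytes buffer holding the octets 0x63 0x61 0x66 0xE9 (illustrative sample).
my $bytes = "caf\xE9";

# Force the internal multi-bytes (UTF-8) representation on a copy.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# The character codes are carried over number for number.
my @before = map { ord } split //, $bytes;
my @after  = map { ord } split //, $upgraded;
print "before: @before\n";
print "after:  @after\n";

# Only the internal flag differs; the two strings still compare equal.
printf "flag before=%d after=%d equal=%d\n",
    utf8::is_utf8($bytes)    ? 1 : 0,
    utf8::is_utf8($upgraded) ? 1 : 0,
    $bytes eq $upgraded      ? 1 : 0;

# The A template pads with the plain value 32, whatever the character set.
my $packed = pack('A6', 'abc');
print join(' ', map { ord } split //, $packed), "\n";
```

The value 233 surviving the upgrade unchanged is the "numeric equality" 
conversion: it only looks like a latin-1 to Unicode conversion because 
the two agree on the first 256 codes.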

Proof: in the two commands below, the 3rd parameter to pack is clearly a 
multi-bytes buffer, as it contains a character code > 255. 
In the first instance the character code takes 2 bytes in the 
multi-bytes buffer, and in the second instance it takes 5 bytes in the 
multi-bytes buffer... and is clearly outside the Unicode codepoint range 
(more than 30 times the maximum codepoint value of 0x10FFFF). The buffer 
produced by pack (view the output in a fixed-width font to verify that 
the C characters at the end of the output actually line up, C? See?) 
demonstrates that the multi-bytes buffer is treated as a bytes buffer of 
length 8 in the first case and length 11 in the second case. And the 
lack of "wide-character" warnings proves that the result of pack is not 
a multi-bytes buffer.

d:\>perl -e "print pack('cA20c', 67, qq<bcd\x{0234}efg>, 67 )"
Cbcd╚┤efg            C
d:\>perl -e "print pack('cA20c', 67, qq<bcd\x{02346789}efg>, 67 )"
Cbcd·ìå₧ëefg         C
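
The byte counts quoted above (8 and 11) can be checked separately from 
pack. The sketch below assumes an ASCII platform; internal_octets is a 
hypothetical helper written for this note, not a core function. It 
deliberately avoids calling pack, since how pack treats a multi-bytes 
argument depends on the Perl version in use.

```perl
use strict;
use warnings;

# Hypothetical helper: number of octets a string occupies in Perl's
# internal (extended UTF-8) encoding.
sub internal_octets {
    my ($string) = @_;
    my $copy = $string;
    utf8::encode($copy);    # in-place: characters -> octets of that encoding
    return length $copy;
}

my $short = "bcd\x{0234}efg";        # one character code above 255
my $long  = "bcd\x{02346789}efg";    # one code far beyond 0x10FFFF

# Both strings are 7 characters long; only the octet counts differ.
printf "%d characters, %d octets\n", length($short), internal_octets($short);
printf "%d characters, %d octets\n", length($long),  internal_octets($long);
```

With 20 columns reserved by A20, octet counts of 8 and 11 leave 12 and 9 
padding bytes respectively, which is why the trailing C lines up in both 
outputs.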

Glenn --
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
