develooper Front page | perl.perl5.porters | Postings from April 2007

Re: perl, the data, and the tf8 flag

Glenn Linderman
April 4, 2007 11:25
On approximately 4/4/2007 11:01 AM, came the following characters from 
the keyboard of Dr.Ruud:
> Glenn Linderman schreef:
>> Dr.Ruud:
>>> Juerd Waalboer:
>>>> Perl doesn't have an ascii/utf8
>>>> distinction, it has a latin1/utf8 distinction.
>>> In Perl 5.8.6, the documentation of the function pack contains:
>>>     A   A text (ASCII) string, will be space padded.
>>> Maybe that should then read
>>>     A   A textual octet string (latin1), will be space padded.
>> You might think so, but:
>> 1) Perl doesn't presently have any latin-1 specific semantics, except
>> that the conversion of bytes buffers to multi-bytes buffers use a
>> numeric equality conversion that, because latin-1 is a subset of
>> Unicode, results in what seems to be a latin-1 to Unicode conversion.
>> This is a convenient accident, but bytes buffers are _not_ given
>> latin-1 semantics anywhere else in perl (except encode, and then only
>> if it is specified).
>> 2) The pack-A template code does not require the buffer to be ASCII.
>> Nor does it require it to be latin-1. Nor does it require it to be a
>> bytes buffer, although it treats it as one -- if it is a multi-bytes
>> buffer pack-A will happily do a byte-wise read of the buffer, and
>> include the appropriate number of bytes (not characters) in the
>> packed output.
>> 3) The only "character set" semantics involved in the A template code
>> is the "space padded" part... and for that it uses the binary value
>> 32, which happens to work as a space character in ASCII, latin-1,
>> Unicode, and nearly every other character set.
>> Proof: in the below the 3rd parameter to pack is clearly a multi-bytes
>> buffer, as it has a character code > 255.
>> However, in the first instance the character code takes 2 bytes in the
>> multi-bytes buffer, and in the second instance, it takes 5 bytes in
>> the multi-bytes buffer... and is clearly outside the Unicode
>> codepoint range (over twice the maximum codepoint value). The buffer
>> produced by pack (view the below in a fixed-width font to verify that
>> the C characters at the end of the output actually line up, C? See?)
>> demonstrates that the multi-bytes buffer is treated as a bytes buffer
>> of length 8 in the first case and length 11 in the second case. And
>> the lack of "wide-character" warnings prove that the result of pack
>> is not a multi-bytes buffer.
>> d:\>perl -e "print pack('cA20c', 67, qq<bcd\x{0234}efg>, 67 )"
>> Cbcd╚┤efg C
>> d:\>perl -e "print pack('cA20c', 67, qq<bcd\x{02346789}efg>, 67 )"
>> Cbcd·ìå₧ëefg C
> I don't think that contradicts what I said. As long as the UTF8 flag is
> not put on, the result string of calling pack with the "A" template
> returns a latin1 buffer (unless a locale is active).

Perhaps it is not exactly contradictory... but it would be a stray 
reference to latin-1, when the semantics neither support nor require 
latin-1, replacing a reference to ASCII, when the semantics neither 
support nor require ASCII.  It is not clear that such an update would be 
an improvement... especially when the only two character sets that perl 
really supports with semantics are ASCII and Unicode.
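
To illustrate the point about the A template having no ASCII or latin-1 semantics, here is a minimal sketch. It assumes a plain bytes buffer (no character codes above 255), because the handling of wide characters by pack differs between Perl versions. The variable names are only illustrative.

```perl
use strict;
use warnings;

# A bytes buffer holding values outside ASCII (0xE9, 0xFF) and a NUL.
my $bytes  = "\xE9\x00\xFF";

# 'A6' copies the bytes unchanged and pads to six bytes with value 32.
my $packed = pack('A6', $bytes);

# Show each byte of the result as two hex digits.
print join(' ', map { sprintf('%02x', ord($_)) } split(//, $packed)), "\n";
# prints: e9 00 ff 20 20 20
```

The input bytes pass through untouched; the only value pack adds is 0x20 for the padding.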

I'd recommend updating this documentation to say:

    a   A byte-wide string with arbitrary binary data, will be null padded.
    A   A byte-wide string with arbitrary binary data, will be space padded.
    Z   A byte-wide string with arbitrary binary data, will be null padded.
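
The padding behaviour described in those three lines can be checked with a short sketch like the following. It assumes a plain two-byte input and a field width of 5; the helper hex_bytes is a hypothetical name used only for display.

```perl
use strict;
use warnings;

my $data = "ab";

my $lower_a = pack('a5', $data);   # padded with NUL bytes
my $upper_a = pack('A5', $data);   # padded with bytes of value 32
my $z       = pack('Z5', $data);   # padded with NUL bytes

# Hypothetical display helper: render a buffer as space-separated hex bytes.
sub hex_bytes { return join(' ', map { sprintf('%02x', ord($_)) } split(//, $_[0])) }

print "a: ", hex_bytes($lower_a), "\n";   # a: 61 62 00 00 00
print "A: ", hex_bytes($upper_a), "\n";   # A: 61 62 20 20 20
print "Z: ", hex_bytes($z),       "\n";   # Z: 61 62 00 00 00

# On unpack, 'a' returns the padding as-is, while 'A' and 'Z' strip it.
my $back_a       = unpack('a5', $lower_a);
my $back_upper_a = unpack('A5', $upper_a);
my $back_z       = unpack('Z5', $z);

printf "%d %d %d\n", length($back_a), length($back_upper_a), length($back_z);
# prints: 5 2 2
```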

There is a later paragraph that distinguishes between a and Z for unpack.
As an alternative to "byte-wide string", one could say "string of bytes" or
"string treated as bytes" for all three.

Glenn --
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
