
Re: perl, the data, and the tf8 flag

From: Dr.Ruud
Date: April 4, 2007 11:07
Subject: Re: perl, the data, and the tf8 flag
Message ID: 20070404180741.13313.qmail@lists.develooper.com

Glenn Linderman wrote:
> Dr.Ruud:
>> Juerd Waalboer:

>>> Perl doesn't have an ascii/utf8
>>> distinction, it has a latin1/utf8 distinction.
>>
>> In Perl 5.8.6, the documentation of the function pack contains:
>>
>>     A   A text (ASCII) string, will be space padded.
>>
>> Maybe that should then read
>>
>>     A   A textual octet string (latin1), will be space padded.
>>
>
> You might think so, but:
>
> 1) Perl doesn't presently have any latin-1 specific semantics, except
> that the conversion of bytes buffers to multi-bytes buffers uses a
> numeric equality conversion that, because latin-1 is a subset of
> Unicode, results in what seems to be a latin-1 to Unicode conversion.
> This is a convenient accident, but bytes buffers are _not_ given
> latin-1 semantics anywhere else in perl (except encode, and then only
> if it is specified).
>
> 2) The pack-A template code does not require the buffer to be ASCII.
> Nor does it require it to be latin-1. Nor does it require it to be a
> bytes buffer, although it treats it as one -- if it is a multi-bytes
> buffer pack-A will happily do a byte-wise read of the buffer, and
> include the appropriate number of bytes (not characters) in the
> packed output.
>
> 3) The only "character set" semantics involved in the A template code
> is the "space padded" part... and for that it uses the binary value
> 32, which happens to work as a space character in ASCII, latin-1,
> Unicode, and nearly every other character set.
>
> Proof: in the examples below, the 3rd parameter to pack is clearly a
> multi-bytes buffer, as it has a character code > 255.
> However, in the first instance the character code takes 2 bytes in the
> multi-bytes buffer, and in the second instance, it takes 5 bytes in
> the multi-bytes buffer... and is clearly outside the Unicode
> codepoint range (over twice the maximum codepoint value). The buffer
> produced by pack (view the below in a fixed-width font to verify that
> the C characters at the end of the output actually line up, C? See?)
> demonstrates that the multi-bytes buffer is treated as a bytes buffer
> of length 8 in the first case and length 11 in the second case. And
> the lack of "wide-character" warnings proves that the result of pack
> is not a multi-bytes buffer.
>
> d:\>perl -e "print pack('cA20c', 67, qq<bcd\x{0234}efg>, 67 )"
> Cbcd╚┤efg            C
> d:\>perl -e "print pack('cA20c', 67, qq<bcd\x{02346789}efg>, 67 )"
> Cbcd·ìå₧ëefg         C
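
For anyone who wants to check the byte counts in the quoted example, here
is a minimal sketch. It assumes a perl where the utf8:: functions are
built in (5.8.1 or later). The helper name internal_bytes is made up for
this illustration. As far as I know, perl 5.10 and later process pack
templates character by character, so the sketch encodes the strings to
octets first to reproduce the byte-wise view described above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 octets perl stores internally
# for a character string, as a plain bytes buffer (UTF8 flag off).
sub internal_bytes {
    my ($str) = @_;
    my $copy = $str;
    utf8::encode($copy);    # in place: characters become UTF-8 octets
    return $copy;
}

my $short = internal_bytes("bcd\x{0234}efg");       # U+0234 takes 2 bytes
my $long  = internal_bytes("bcd\x{02346789}efg");   # above U+10FFFF, 5 bytes

# Same template as in the quoted one-liners: 1 + 20 + 1 bytes.
my $packed_short = pack('cA20c', 67, $short, 67);
my $packed_long  = pack('cA20c', 67, $long,  67);

printf "%d bytes in, %d bytes out\n", length($short), length($packed_short);
printf "%d bytes in, %d bytes out\n", length($long),  length($packed_long);
# prints:
#   8 bytes in, 22 bytes out
#   11 bytes in, 22 bytes out

# The "A" padding is plain byte 32, repeated up to the field width.
my $pad_short = substr($packed_short, 1 + length($short), 20 - length($short));
print "padding is ", length($pad_short), " x byte ", ord($pad_short), "\n";
```

Both packed results have the same length, which is why the trailing C
characters line up in a fixed-width font.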

I don't think that contradicts what I said. As long as the UTF8 flag is
not set, the string returned by pack with the "A" template is a latin1
buffer (unless a locale is active).
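
A small sketch of what I mean by that, assuming no "use locale" is in
effect and a perl with the built-in utf8:: functions (5.8.1 or later).
The variable names are only for illustration. It packs a bytes buffer
containing octet 0xE9, then upgrades a copy and looks at what the octet
became.

```perl
use strict;
use warnings;

# A bytes buffer containing octet 0xE9; the UTF8 flag is not set.
my $packed = pack('A6', "caf\xE9");    # "caf\xE9" plus two spaces

my $flag_before = utf8::is_utf8($packed) ? 1 : 0;

# Upgrading maps octet N to code point N. Because the first 256 Unicode
# code points coincide with latin1, this amounts to a latin1 reading.
my $upgraded = $packed;
utf8::upgrade($upgraded);

my $flag_after = utf8::is_utf8($upgraded) ? 1 : 0;
my $same       = ($packed eq $upgraded) ? 1 : 0;
my $codepoint  = ord(substr($upgraded, 3, 1));

printf "flag before: %d, after: %d, equal: %d, U+%04X\n",
    $flag_before, $flag_after, $same, $codepoint;
# prints: flag before: 0, after: 1, equal: 1, U+00E9
```

The two strings compare equal, and octet 0xE9 ends up as U+00E9, which is
the character latin1 assigns to that octet.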

-- 
Affijn, Ruud

"Gewoon is een tijger."

