Front page | perl.perl5.porters |
Postings from January 2012
Re: pack and ASCII
From: Nicholas Clark
Date: January 11, 2012 02:35
Subject: Re: pack and ASCII
Message ID: 20120111103511.GC9069@plum.flirble.org
On Tue, Jan 10, 2012 at 06:05:12PM -0500, Ricardo Signes wrote:
>
> So, what does it mean someone is asking for, when he or she writes:
>
> my $string = qq[Queensr\N{LATIN SMALL LETTER Y WITH DIAERESIS}che];
> my $packed = pack "A*", $string;
>
> All the other confusion of this thread aside, I *think* that we all agree that
> the person writing this is making a mistake. Is that true? Do we all agree
> that this should be (or would not be incorrect to be) a warning?
Not convinced for the general case. For the example you give, yes.
But I don't think you can rule out that some legitimate code may well exist
that is using the padding features:
./perl -Ilib -e 'printf ">%s<\n", pack "A5", "\N{LATIN SMALL LETTER Y WITH DIAERESIS}"'
>ÿ    <
I don't think there is much of it, though.
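A minimal sketch of that padding behaviour, assuming perl 5.10 or later. The variable names are illustrative only, and the result is shown as code points so that the output does not depend on the terminal's encoding:

```perl
use strict;
use warnings;

# "\x{FF}" is LATIN SMALL LETTER Y WITH DIAERESIS, the character used above.
my $char = "\x{FF}";

# "A5" pads the single character with four spaces, to a width of five.
my $padded = pack "A5", $char;

# Print the length and the individual code points of the result.
printf "length %d: %s\n", length($padded),
    join " ", map { sprintf "U+%04X", ord($_) } split //, $padded;
```

Here the count in "A5" is satisfied by one character plus four spaces, which is the sort of use a blanket warning would also catch.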
> My hunch is that we will want a warning and an improvement to the
> documentation.
So I'm tempted to agree this far.
However, as best I can tell, the troublesome set are
    a  A string with arbitrary binary data, will be null padded.
    A  A text (ASCII) string, will be space padded.
    Z  A null-terminated (ASCIZ) string, will be null padded.
    U  A Unicode character number.  Encodes to a character in character
       mode and UTF-8 (or UTF-EBCDIC in EBCDIC platforms) in byte mode.
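For reference, a small sketch of how the padding of "a", "A" and "Z" differs on plain ASCII input. The helper visible() and the variable names are hypothetical, introduced only for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: make NUL bytes visible when printing.
sub visible {
    my ($s) = @_;
    $s =~ s/\0/\\0/g;
    return "[$s]";
}

my $text = "hi";

my $a_packed = pack "a5", $text;      # null padded:  "hi\0\0\0"
my $A_packed = pack "A5", $text;      # space padded: "hi   "
my $Z_packed = pack "Z5", $text;      # null padded:  "hi\0\0\0"
my $Z_full   = pack "Z5", "hello";    # last byte is always null: "hell\0"

# Unpacking: "a" returns the data as-is, "A" strips trailing spaces and
# nulls, "Z" stops at the first null.
my ($a_back) = unpack "a5", $a_packed;
my ($A_back) = unpack "A5", $A_packed;
my ($Z_back) = unpack "Z5", $Z_packed;

print "a5 packs to ", visible($a_packed), ", unpacks to ", visible($a_back), "\n";
print "A5 packs to ", visible($A_packed), ", unpacks to ", visible($A_back), "\n";
print "Z5 packs to ", visible($Z_packed), ", unpacks to ", visible($Z_back), "\n";
print "Z5 of 'hello' packs to ", visible($Z_full), "\n";
```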
I can see that there's the possibility of using unpack "a" on Unicode strings
as a sort of "programmable" split:
$ perl -MDevel::Peek -e 'Dump $_ foreach unpack "a2a", "\x{100}\x{101}\x{102}"'
SV = PV(0x100801c68) at 0x100803ea0
  REFCNT = 2
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x100201ce0 "\304\200\304\201"\0 [UTF8 "\x{100}\x{101}"]
  CUR = 4
  LEN = 16
SV = PV(0x100801d58) at 0x100804110
  REFCNT = 2
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x10020a6b0 "\304\202"\0 [UTF8 "\x{102}"]
  CUR = 2
  LEN = 16
and clearly "U" is *designed* (or whatever passed for design at the time*)
to deal with Unicode character data, so this mess is complex.
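A rough sketch of both points, assuming perl 5.10 or later on an ASCII platform. The helper code_points() and the "C0U" template are not from the message above; they are used here only to make the contrast visible:

```perl
use strict;
use warnings;

# Hypothetical helper: render a string as space-separated hex code points.
sub code_points {
    my ($s) = @_;
    return join " ", map { sprintf "%X", ord($_) } split //, $s;
}

my $string = "\x{100}\x{101}\x{102}";

# unpack "a" as a "programmable" split: the counts are in characters,
# so this yields a two-character field and a one-character field.
my @fields = unpack "a2a", $string;
print "field: ", code_points($_), "\n" for @fields;

# A template that begins with "U" produces a character string.
my $char = pack "U", 0x100;

# A template that begins with "C0" makes "U" emit the UTF-8 octets instead.
my $octets = pack "C0U", 0x100;

printf "U:   length %d, code points %s\n", length($char),   code_points($char);
printf "C0U: length %d, code points %s\n", length($octets), code_points($octets);
```

The same "U" item therefore gives either one character or two octets depending on how the template starts, which is part of why a single warning rule is hard to state.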
Nicholas Clark
* I think the design at the time of 5.6.0 was that pack/unpack offered the
route between byte strings and character strings. Encode did not appear
until 5.7.something. Remember, really, maint-5.6 is only
"marketing-compatible" with Unicode.