Re: pack and ASCII

From: Nicholas Clark
Date: January 11, 2012 02:35
Subject: Re: pack and ASCII
Message ID: 20120111103511.GC9069@plum.flirble.org

On Tue, Jan 10, 2012 at 06:05:12PM -0500, Ricardo Signes wrote:
> 
> So, what does it mean someone is asking for, when he or she writes:
> 
>   my $string = qq[Queensr\N{LATIN SMALL LETTER Y WITH DIAERESIS}che];
>   my $packed = pack "A*", $string;
> 
> All the other confusion of this thread aside, I *think* that we all agree that
> the person writing this is making a mistake.  Is that true?  Do we all agree
> that this should be (or would not be incorrect to be) a warning?

Not convinced for the general case. For the example you give, yes.
But I don't think you can rule out that some legitimate code may well exist
that is using the padding features:

./perl -Ilib -e 'printf ">%s<\n", pack "A5", "\N{LATIN SMALL LETTER Y WITH DIAERESIS}"'
>ÿ    <

I don't think there is much of it, though.
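
A minimal sketch of the padding behaviour in question, assuming perl 5.10 or
later (where pack works on characters by default). The variable names are
illustrative only and do not come from the thread:

```perl
use strict;
use warnings;
use charnames ':full';

# One character, U+00FF.
my $y = "\N{LATIN SMALL LETTER Y WITH DIAERESIS}";

# "A5" pads the value with spaces up to a width of five.
my $padded = pack "A5", $y;

# A value longer than the count is truncated to that count.
my $cut = pack "A2", "abcdef";

printf "padded length: %d\n", length($padded);
printf "truncated: >%s<\n", $cut;
```

The padded result is the original character followed by four spaces, so code
that relies on "A" purely for fixed-width padding would also trip any warning
keyed on non-ASCII input.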

> My hunch is that we will want a warning and an improvement to the
> documentation.

So I'm tempted to agree this far.

However, as best I can tell, the troublesome formats are:

    a  A string with arbitrary binary data, will be null padded.
    A  A text (ASCII) string, will be space padded.
    Z  A null-terminated (ASCIZ) string, will be null padded.

    U  A Unicode character number.  Encodes to a character in char-
       acter mode and UTF-8 (or UTF-EBCDIC in EBCDIC platforms) in
       byte mode.
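
As a rough illustration of how these four formats differ when packing, here
is a small sketch. It only exercises the documented padding and truncation
rules with ASCII input, plus "U" in its default character mode; the hash and
key names are made up for the example:

```perl
use strict;
use warnings;

my %packed = (
    a      => pack("a5", "hi"),     # "hi" plus three NUL bytes
    A      => pack("A5", "hi"),     # "hi" plus three spaces
    Z      => pack("Z5", "hi"),     # "hi" plus three NUL bytes
    Z_full => pack("Z5", "hello"),  # truncated to "hell" plus a NUL
);

# "U" takes a code point and yields one character.
my $char = pack "U", 0x100;

for my $name (sort keys %packed) {
    (my $shown = $packed{$name}) =~ s/\0/\\0/g;   # make NULs visible
    printf "%-6s >%s<\n", $name, $shown;
}
printf "U      length %d, code point U+%04X\n", length($char), ord($char);
```

Note that "Z" with an explicit count always reserves the last position for
the terminating NUL, which is why "hello" loses its final letter.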

I can see that there's a possible use of unpack "a" on Unicode strings
as a sort of "programmable" split:

$ perl -MDevel::Peek -e 'Dump $_ foreach unpack "a2a", "\x{100}\x{101}\x{102}"'
SV = PV(0x100801c68) at 0x100803ea0
  REFCNT = 2
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x100201ce0 "\304\200\304\201"\0 [UTF8 "\x{100}\x{101}"]
  CUR = 4
  LEN = 16
SV = PV(0x100801d58) at 0x100804110
  REFCNT = 2
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x10020a6b0 "\304\202"\0 [UTF8 "\x{102}"]
  CUR = 2
  LEN = 16
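
The same split can be checked without Devel::Peek. This is only a sketch of
the idea, again assuming perl 5.10 or later; the "(a2)*" variant is an
addition for illustration and is not something proposed in the thread:

```perl
use strict;
use warnings;

my $string = "\x{100}\x{101}\x{102}";

# Fixed template: first two characters, then one more.
my ($head, $tail) = unpack "a2a", $string;

# Repeating group: chunks of up to two characters until the input runs out.
my @chunks = unpack "(a2)*", $string;

# Print code points rather than raw characters to avoid encoding issues.
for my $piece ($head, $tail) {
    print join(" ", map { sprintf "U+%04X", ord } split //, $piece), "\n";
}
printf "chunks: %d\n", scalar @chunks;
```

Each piece comes back as a character string, not as UTF-8 bytes, which is
what makes this usable as a split on Unicode data.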

and clearly "U" is *designed* (or whatever passed for design at the time*)
to deal with Unicode character data, so this mess is complex.

Nicholas Clark

* I think the design at the time of 5.6.0 was that pack/unpack offered the
  route between byte strings and character strings. Encode did not appear
  until 5.7.something. Remember, really, maint-5.6 is only
  "marketing-compatible" with Unicode.
