perl.perl5.porters | Postings from May 2013

Re: How on earth did we manage to break pack() so badly?

From: demerphq
Date: May 1, 2013 15:21
Subject: Re: How on earth did we manage to break pack() so badly?
Message ID: CANgJU+X1ChCSur_ZEiAqepJ3aDM62a8_b8hTgwJ2TBFMFPzbMg@mail.gmail.com

On 1 May 2013 17:05, David Golden <xdg@xdg.me> wrote:
> On Wed, May 1, 2013 at 10:32 AM, demerphq <demerphq@gmail.com> wrote:
>> perl -le'print unpack "H*", "\x{DF}\x{100}"'
>>
>> Produces completely different results depending on which Perl you are
>> on. On older perls it produces a relatively useful:
>>
>> c39fc480
>>
>> which, as we all know, is the hex output of the raw UTF-8 form of the
>> string. On newer perls it produces the completely useless:
>>
>> df00
>>
>> Which is not correct regardless of how you look at it. The older
>> behavior was at least correct in some regard.
>
> I see some merit in not having pack treat a string as octets if it's
> internally stored in UTF-8.  We have been trying to draw a line and
> say "internal representation is not something users need to know
> about".

Yet that has never really been true, and the people peddling the line
are responsible for much of the mess we are in now.

There has *always* been data that is *not* character oriented which
has been stored in strings, and where you really do have to know about
the internal representation.

IMO pretending otherwise has created far more problems than it solved.
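
A minimal sketch of what "internal representation" means here, assuming
a perl with the core utf8:: functions and the bytes module available.
The variable names are illustrative only. It shows one character,
U+00DF, stored two different ways while still comparing equal as a
string:

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

# The same single character, U+00DF, held in two strings.
my $narrow = "\x{DF}";   # stored internally as one byte
my $wide   = "\x{DF}";
utf8::upgrade($wide);    # same character, now stored internally as UTF-8

# As characters the two strings are indistinguishable.
print "equal as characters: ", ($narrow eq $wide ? "yes" : "no"), "\n";
print "length in characters: ", length($narrow), " and ", length($wide), "\n";

# The internal storage differs: one byte versus two.
print "length in bytes: ", bytes::length($narrow), " and ",
      bytes::length($wide), "\n";
```

Code that hands such a string to something byte-oriented has to decide
which of these two views it means.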

> OTOH, as you point out, "df00" is not useful, either.
>
> My initial instinct is that packing/unpacking a string with characters > 255
> should have a "wide character in pack/unpack" warning, like we do
> for print, unless the template has an explicit rule for handling wide
> characters (like "U").  Then I'm fine with "df00" being the result.
>
> I don't like the "U0" answer.  That's another one of those arcane "you
> have to know what's going on internally to understand why this works"
> tricks.
>
> I think unpack needs a new modifier that means that a character string
> should be unpacked as UTF-8 octets instead of as characters, so that
> one could unpack it as hex or anything else.

Isn't this basically just the same thing as "U0"?
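
For reference, a small sketch comparing the three ways of unpacking
discussed in this thread, assuming a perl new enough to show the "df00"
behaviour described above. The variable names are illustrative only.
The explicit utf8::encode() copy is not something proposed in the
thread; it is included as a way to get the UTF-8 octets without relying
on "U0":

```perl
use strict;
use warnings;

my $str = "\x{DF}\x{100}";   # two characters; U+0100 does not fit in a byte

# Plain "H*": on newer perls each character is reduced to its low
# 8 bits, and perl warns that a character wrapped (silenced here).
my $by_char = do { no warnings; unpack "H*", $str };

# "U0" asks unpack to work on the UTF-8 bytes of the string.
my $by_u0 = unpack "U0H*", $str;

# Explicit alternative: encode a copy to UTF-8 octets, then unpack.
my $octets = $str;
utf8::encode($octets);
my $by_encode = unpack "H*", $octets;

print "H*          : $by_char\n";
print "U0H*        : $by_u0\n";
print "encode + H* : $by_encode\n";
```

On such a perl the first line should show the "df00" form and the other
two the "c39fc480" form quoted earlier.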

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
