
Re: [perl #84670] unpack(C => ...) on string with UTF8 FLAG without <use bytes> may return value more than 255

From: Nicholas Clark
Date: February 22, 2011 06:21
Subject: Re: [perl #84670] unpack(C => ...) on string with UTF8 FLAG without <use bytes> may return value more than 255
Message ID: 20110222142058.GN24189@plum.flirble.org
On Tue, Feb 22, 2011 at 04:46:15AM -0800, mons @ cpan. org wrote:

> [Please describe your issue here]
> 
> perl5.12.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
> # 1090
> perl5.13.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
> # 1090
> perl5.10.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
> # 1090
> 
> while:
> perl5.8.9 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
> # 209
> perl5.6.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
> # 209
> 
> but with use bytes
> 
> perl5.12.0 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
> # 209
> perl5.13.2 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
> # 209
> 
> It's worth either adding a sub unpack to bytes.pm and fixing the documentation, or fixing this issue.

It's a documented behaviour change introduced in 5.10, as described in
perl5100delta.pod:

    =head1 Incompatible Changes
    
    =head2 Packing and UTF-8 strings
    
    The semantics of pack() and unpack() regarding UTF-8-encoded data has been
    changed. Processing is now by default character per character instead of
    byte per byte on the underlying encoding. Notably, code that used things
    like C<pack("a*", $string)> to see through the encoding of string will now
    simply get back the original $string. Packed strings can also get upgraded
    during processing when you store upgraded characters. You can get the old
    behaviour by using C<use bytes>.
    
    To be consistent with pack(), the C<C0> in unpack() templates indicates
    that the data is to be processed in character mode, i.e. character by
    character; on the contrary, C<U0> in unpack() indicates UTF-8 mode, where
    the packed string is processed in its UTF-8-encoded Unicode form on a byte
    by byte basis. This is reversed with regard to perl 5.8.X, but now consistent
    between pack() and unpack().
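
For illustration, here is a minimal sketch of the difference on the string from the report. It assumes perl 5.10 or later; the variable names are made up for this example, and the utf8::encode() step is just one possible alternative to the pragma, not something the delta prescribes.

```perl
use strict;
use warnings;

# U+0442 is stored internally as the UTF-8 bytes 0xD1 0x82.
my $str = "\x{442}";

# 5.10 and later, default: unpack works on characters, so C returns
# the code point of the first character.
my $by_char = unpack("C", $str);

# Old behaviour via the bytes pragma, scoped to this block only.
my $by_byte = do { use bytes; unpack("C", $str) };

# Alternative without the pragma (an assumption of this sketch):
# encode a copy to UTF-8 octets explicitly, then unpack those.
my $octets = $str;
utf8::encode($octets);
my @encoded = unpack("C*", $octets);

print "character mode: $by_char\n";   # 1090 on 5.10+
print "use bytes:      $by_byte\n";   # 209
print "explicit bytes: @encoded\n";   # 209 130
```

Going by the delta text above, a U0 at the start of the unpack() template should likewise walk the UTF-8 bytes, but encoding explicitly keeps the intent visible at the call site.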

Nicholas Clark
