
Re: [perl #84670] unpack(C => ...) on string with UTF8 FLAG without <use bytes> may return value more than 255

From: perl5-porters
Date: February 28, 2011 15:05
Subject: Re: [perl #84670] unpack(C => ...) on string with UTF8 FLAG without <use bytes> may return value more than 255
Message ID: ikh9o8$659$20@post.home.lunix
In article <rt-3.6.HEAD-24085-1298398878-801.84670-15-0@perl.org>,
	"Eric Brine via RT" <perlbug-followup@perl.org> writes:
> On Tue Feb 22 10:13:05 2011, ikegami@adaelis.com wrote:
>> You didn't say what you expect it to do. I suppose it could throw an
>> exception, but the current behaviour is quite reasonable to me.
> 
> $ perl -we'printf "%02X\n", unpack "N", "\0\0\0\x{442}"'
> Character(s) in 'N' format wrapped in unpack at -e line 1.
> 42
> 
> $ perl -wle'printf "%02X\n", unpack "C", "\x{442}"'
> 442
> 
> I suppose the latter could do like the former (warn and "& 0xFF" the
> input), but the latter's behaviour is so much more useful.

Actually, when I made the Unicode pack/unpack patch, the "C" format was
seen as a possible backward incompatibility problem, and on p5p I was
asked to add another character to mean "full single character semantics",
which became the "W" (word) character. But it seems I only did that for pack:

perl -wle 'print ord pack("C", 1000)'
Character in 'C' format wrapped in pack at -e line 1.
232

perl -wle 'print ord pack("W", 1000)'
1000

So the "C" format basically works "modulo 256".

I think it's entirely reasonable to have the same behaviour for unpack, so that

unpack "C", "\x{442}" would give 66 (1090 % 256) together with a format
wrap warning
(notice that it still won't give 209, which is a nonsense answer
corresponding to internal details).
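
To make the two numbers concrete, here is a minimal sketch. It does not call unpack "C" itself, since the behaviour discussed here is only proposed; it just computes the value the proposal would return and shows where 209 comes from, assuming the string is stored internally as UTF-8. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $str = "\x{442}";              # one character, code point 1090 (0x442)

# Value the proposed unpack "C" would return: the code point modulo 256.
my $wrapped = ord($str) % 256;    # 1090 - 4*256 = 66 (0x42)

# Where 209 comes from: the first octet of the UTF-8 encoding (0xD1 0x82).
my $octets = $str;
utf8::encode($octets);            # in place: characters -> UTF-8 octets
my $first_octet = ord(substr($octets, 0, 1));

printf "wrapped=%d first_octet=%d\n", $wrapped, $first_octet;
# prints: wrapped=66 first_octet=209
```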

The admittedly much more sane behaviour of returning 1090 would still be
available with W:

 unpack "W", "\x{442}" would give 1090

This would be completely in line with the documented (in perldoc -f pack):

                   C   An unsigned char (octet) value.
                   W   An unsigned char value (can be greater than 255).

"W" was always meant as the Unicode-sane version of "C".
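
As a quick illustration of the "W" semantics described above, the following sketch round-trips a value above 255 through pack and unpack. It assumes a perl recent enough to support the "W" format; the variable names are only for the example.

```perl
use strict;
use warnings;

# pack "W" stores the full character value, with no wrapping at 256.
my $packed = pack("W", 1000);

# unpack "W" returns the character value unchanged.
my ($round_trip) = unpack("W", $packed);
my ($cyrillic)   = unpack("W", "\x{442}");

print "$round_trip $cyrillic\n";
# prints: 1000 1090
```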

I can make a patch if people agree with this...
