develooper Front page | perl.perl5.porters | Postings from January 2005

Re: [perl #33734] unpack fails on utf-8 strings

From: Nicholas Clark
Date: January 10, 2005 05:41
Subject: Re: [perl #33734] unpack fails on utf-8 strings
Message ID: 20050110134126.GF8659@plum.flirble.org

On Sun, Jan 09, 2005 at 10:19:43PM -0000, Marc Lehmann wrote:

> As the internal encoding (whether latin1 or utf8) does NOT change the
> string on the perl level, unpack must work consistently.

I agree. Well, I thought I did. Then...

> (I found this bug because for some reason perl upgraded my string to
> utf-8 internally, causing very funny effects when I ran various unpacks
> to decode the protocol. As perl can do that in various unexpected ways,
> I chose severity "high" because there is no easy workaround on the perl
> level: feel free to correct this :)
> 
> The solution is to downgrade the string to latin1 before converting it
> within unpack, or to fail if the string cannot be converted.

However, I'm confused. There is this code in pp_pack.c:

#ifdef PACKED_IS_OCTETS
    /* Packed side is assumed to be octets - so force downgrade if it
       has been UTF-8 encoded by accident
     */
    register char *s = SvPVbyte(right, rlen);
#else
    register char *s = SvPV(right, rlen);
#endif

and the default is the #else clause. If I recompile with -DPACKED_IS_OCTETS,
the test suite reports:

  Failed 40 test scripts out of 903, 95.57% okay.

which doesn't look great. It looks like some cases in unpack expect to find
utf8 data in the source string. Great. :-(

I wonder if it's viable to make the integer conversion operators (and the
floating point operators) downgrade just enough characters to be useful?
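
To make that last idea concrete, here is a rough standalone sketch of what
"downgrade just enough characters" could look like for a 32-bit big-endian
integer (the "N" template). It is not code from pp_pack.c and uses no perl
internals: downgrade_n_octets() and unpack_u32_be() are hypothetical names
invented for illustration, and the input is assumed to be a plain buffer that
is already UTF-8 encoded. Each character is converted back to one octet only
as it is needed, and the conversion fails if a character above 0xFF turns up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not in pp_pack.c): pull `want` octets out of a
 * UTF-8 encoded buffer, downgrading one character at a time.
 * Returns the number of source bytes consumed, or -1 if the input is
 * truncated, malformed, or holds a character above 0xFF. */
static long downgrade_n_octets(const unsigned char *src, size_t srclen,
                               unsigned char *dst, size_t want)
{
    size_t i = 0;

    assert(src != NULL && dst != NULL);
    for (size_t got = 0; got < want; got++) {
        if (i >= srclen)
            return -1;                      /* ran out of input */
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[got] = c;                   /* ASCII: stored as itself */
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < srclen
                   && (src[i + 1] & 0xC0) == 0x80) {
            /* two-byte sequence encoding U+0080..U+00FF */
            dst[got] = (unsigned char)(((c & 0x03) << 6)
                                       | (src[i + 1] & 0x3F));
            i += 2;
        } else {
            return -1;                      /* wide character or malformed */
        }
    }
    return (long)i;
}

/* Hypothetical integer conversion: read a big-endian 32-bit value,
 * downgrading only the four characters it actually needs. */
static long unpack_u32_be(const unsigned char *src, size_t srclen,
                          uint32_t *out)
{
    unsigned char raw[4];
    long used = downgrade_n_octets(src, srclen, raw, sizeof raw);

    if (used < 0)
        return -1;
    *out = ((uint32_t)raw[0] << 24) | ((uint32_t)raw[1] << 16)
         | ((uint32_t)raw[2] << 8)  |  (uint32_t)raw[3];
    return used;
}
```

With the upgraded bytes C3 A9 01 C2 80 7F as input, unpack_u32_be() consumes
six source bytes and yields 0xE901807F, the same value the four original
octets E9 01 80 7F would give. Anything after those six bytes is left
untouched, so a later conversion that really does expect utf8 data could
still see it.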

Nicholas Clark
