
Re: [perl #33734] unpack fails on utf-8 strings

Nicholas Clark
April 27, 2006 08:04
On Tue, Jan 11, 2005 at 06:08:26PM +0100, Marc A. Lehmann wrote:
> On Mon, Jan 10, 2005 at 01:42:14PM -0000, Nicholas Clark via RT wrote:

> > I wonder if it's viable to make the integer conversion operators (and the
> > floating point operators) downgrade just enough characters to be useful?
> That would still break "b" and would have questionable semantics on "a"
> for example.

I agree, but I suspect that there are also programs out there which are
buggily relying on using a, b and h to inspect the internal representation
of scalars.

> I frankly cannot see any reason why >255 characters can make any sense as
> argument to unpack, and if the testsuite fails, I guess that is then a bug in
> the testsuite.

I'm never so confident about that. The testsuite may be buggy, but I tend to
assume (rightly or wrongly) that it's representative of Perl code out there.
So if a change to the internals can be made without the testsuite catching
fire, then I assume that it's unlikely to cause problems with code out on

On Fri, Jan 14, 2005 at 05:18:32PM +0100, Marc A. Lehmann wrote:

> Well, then how do you propose to fix the situation? The current behaviour
> is completely erratic, as the same scalar is interpreted differently by
> unpack, depending on the perl version and its usage history.
> The question, if this is not the right fix, is what the semantics of
> unicode strings as arguments to unpack are?

I think Ton's semantics as incorporated into 5.9.x are right (or at least a
heck of a lot less wrong than current) but they change far more than I'm
comfortable with for maint.
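
To make the "usage history" point concrete, here is a minimal sketch (not
taken from the bug report; the variable names are made up for illustration).
It shows one logical string held in two internal representations. The two
scalars compare equal at the Perl level and differ only in the internal
UTF-8 flag, which is the state an unpack that inspects the raw bytes ends up
depending on. It deliberately asserts nothing about what unpack returns,
since that varies with the perl version.

```perl
use strict;
use warnings;

# One logical string, two internal representations.
my $native   = "caf\xE9";   # four characters, stored one byte each
my $upgraded = $native;
utf8::upgrade($upgraded);   # same characters, now stored as UTF-8 internally

# At the Perl level the two scalars are indistinguishable...
my $equal       = ($native eq $upgraded) ? "yes" : "no";
my $same_length = (length($native) == length($upgraded)) ? "yes" : "no";

# ...but the internal flag differs.
my $native_flag   = utf8::is_utf8($native)   ? "yes" : "no";
my $upgraded_flag = utf8::is_utf8($upgraded) ? "yes" : "no";

print "equal: $equal, same length: $same_length\n";
print "UTF-8 flag: native=$native_flag upgraded=$upgraded_flag\n";
```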

> In any case, if the old behaviour is to stay it simply needs to be defined
> behaviour. If there is no way to make it behave deterministically on the perl
> level, it isn't defined behaviour in my eyes.
> This is just a time bomb (and it exploded in my code, and might do so in
> other code).

I agree.

There is already this fleeting reference to the current behaviour in

    =item *
    Most operators that deal with positions or lengths in a string will
    automatically switch to using character positions, including
    C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
    C<sprintf()>, C<write()>, and C<length()>.  Operators that
    specifically do not switch include C<vec()>, C<pack()>, and
    C<unpack()>.  Operators that really don't care include
    operators that treat strings as a bucket of bits such as C<sort()>,
    and operators dealing with filenames.
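
As a small illustration of the "character positions" half of that paragraph
(a sketch with made-up variable names, not part of the quoted documentation):
length(), index() and substr() count characters on a string containing a wide
character, and count bytes only once the string has been explicitly encoded
to UTF-8.

```perl
use strict;
use warnings;

my $chars = "a\x{263A}b";   # three characters; the middle one is U+263A
my $bytes = $chars;
utf8::encode($bytes);       # now the five bytes of the UTF-8 encoding

my $len_chars = length($chars);         # counts characters
my $len_bytes = length($bytes);         # counts bytes
my $idx_chars = index($chars, "b");     # character position of "b"
my $idx_bytes = index($bytes, "b");     # byte position of "b"
my $mid       = substr($chars, 1, 1);   # the single wide character

printf "length: %d chars, %d bytes\n", $len_chars, $len_bytes;
printf "index of 'b': %d (chars), %d (bytes)\n", $idx_chars, $idx_bytes;
printf "middle character: U+%04X\n", ord($mid);
```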

I've merged Ton's code into maint, disabled all the new features, and kept
the current behaviour for all the string, character, bit and hex operators.
However, I am comfortable with changing all the numeric and pointer formatting
operators to deal with converting down from UTF-8, so have enabled that.
I doubt that any sane program is relying on using 'n', 's', 'i' etc to
generate a number based on the UTF-8 representation of some string.
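
A sketch of what that means for a numeric format (illustrative only, with
made-up variable names; it assumes a perl where the numeric formats convert
down from UTF-8 as described above, which as far as I know includes 5.10 and
later). The same two-character string gives the same 'n' result whether or
not it has been upgraded internally.

```perl
use strict;
use warnings;

my $packed   = pack("n", 0x01E9);   # two characters: "\x01\xE9"
my $upgraded = $packed;
utf8::upgrade($upgraded);           # same characters, UTF-8 internally

# With the conversion in place, both calls see the characters 0x01 0xE9
# rather than the bytes of the internal UTF-8 representation.
my $from_native   = unpack("n", $packed);
my $from_upgraded = unpack("n", $upgraded);

printf "native: 0x%04X, upgraded: 0x%04X\n", $from_native, $from_upgraded;
```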

I don't like this compromise, but it seems less likely to trigger new bugs
than either keeping the old behaviour, or entirely adopting Ton's changes.

Nicholas Clark
