
Re: [perl #22375] 'split'/'index' problem for utf8

Nicholas Clark
July 6, 2012 14:51
On Fri, Jul 06, 2012 at 01:54:53PM -0700, Jesse Luehrs via RT wrote:
> Closing this, since we can't reproduce it. If someone is able to
> reproduce it, feel free to reopen this ticket.

The test case in the ticket fails on the revision that Andreas mentions in
a comment in the ticket (bleadperl 18530, i.e. 7e8c5daceba7cb18).

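Adapting the test case to die if the two values are not equal permits
bisecting, which finds that this commit fixed it (the bisect output calls
it the "first bad commit" because the good/bad sense is inverted when
hunting for a fix):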

d69d2d9f0d14e0c849a4b59d442938c401a7f281 is the first bad commit
commit d69d2d9f0d14e0c849a4b59d442938c401a7f281
Author: Jarkko Hietaniemi <>
Date:   Fri May 30 05:47:15 2003 +0000

    Fix for "#22375 'split'/'index' problem for utf8".

    p4raw-id: //depot/perl@19640

:100644 100644 d82e354341db1415bc03834f7cf84763568a16b8 310ba50465ec1a2c866438208805f6bcf626227a M      sv.c
:040000 040000 6b5b18db188316b074b9cafdbe453ee41c29ca94 f76a7e8cef2fa3fbc79f3d2fd79ea65f83977269 M      t
bisect run success
That took 1550 seconds

The actual fix is tiny:

diff --git a/sv.c b/sv.c
index d82e354..310ba50 100644
--- a/sv.c
+++ b/sv.c
@@ -5952,8 +5952,6 @@ Perl_sv_pos_b2u(pTHX_ register SV* sv, I32* offsetp)
                        cache[0] -= ubackw;
-                       return;

With some more bisecting, it turns out that the bug was introduced *at*
the commit that Andreas mentioned in the ticket:

7e8c5daceba7cb185532328a3b67d4ca7ba4811b is the first bad commit
commit 7e8c5daceba7cb185532328a3b67d4ca7ba4811b
Author: Hugo van der Sanden <>
Date:   Tue Jan 21 01:37:03 2003 +0000

    integrate (by hand) #18353 and #18359 from maint-5.8:
    Introduce a cache for UTF-8 data: length and byte<->char offset
    mapping are stored in a new type of magic.  Speeds up length(),
    substr(), index(), rindex(), pos(), and some parts of s///.
    The speedup varies a lot (on the usual suspects: what is the
    access pattern of the data, compiler, CPU), but should be at
    least one order of magnitude, and getting to the same magnitude
    as byte string speeds, and in some cases  (length on unchanged data)
    even reaching the byte string speed.  On the other hand, in some
    cases (index) the byte speed is still faster by a factor of five
    or so, but the bottleneck there does not seem to be any more
    the byte<->char offset mapping (instead, the fbm_instr() speed).
    There is one cache slot for the length, and only two for the
    byte<->char offset mapping (the first one for the start->offset,
    and the second for the offset->offset+length, when talking
    in substr() terms).
    Code this hairy is bound to have hairy trolls hiding under it.
    A small tweak on top of #18353: don't display mg_len bytes of
    mg_ptr for PERL_MAGIC_utf8 because that's not what's there.
    p4raw-id: //depot/perl@18530
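
For readers unfamiliar with the byte<->char offset mapping that the commit
message describes, here is a minimal sketch in C. It is not the code from
sv.c: the names b2u, b2u_cache and count_chars are hypothetical, it keeps a
single cache slot rather than the two the commit message mentions, and it
assumes valid UTF-8 with byte offsets that fall on character boundaries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-slot cache: the last byte offset that was
 * converted, and the character offset it corresponds to. */
typedef struct {
    size_t byte_off;
    size_t char_off;
    int    valid;
} b2u_cache;

/* Count the characters that start in s[from, to).  In UTF-8, every
 * byte that is not a continuation byte (10xxxxxx) starts a character. */
static size_t count_chars(const unsigned char *s, size_t from, size_t to)
{
    size_t n = 0;
    for (size_t i = from; i < to; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}

/* Convert a byte offset into a character offset.  If the cached
 * position is at or before the requested one, only the bytes in
 * between are scanned; otherwise the scan restarts from the
 * beginning of the string. */
size_t b2u(const unsigned char *s, size_t byte_off, b2u_cache *c)
{
    size_t chars;

    assert(c != NULL);

    if (c->valid && c->byte_off <= byte_off)
        chars = c->char_off + count_chars(s, c->byte_off, byte_off);
    else
        chars = count_chars(s, 0, byte_off);

    /* Remember this position for the next lookup. */
    c->byte_off = byte_off;
    c->char_off = chars;
    c->valid    = 1;

    return chars;
}
```

With a cache like this, repeated lookups that move forward through a string
only pay for the bytes between consecutive offsets, which is the kind of
access pattern the commit message says the speedup depends on.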

Nicholas Clark
