develooper Front page | perl.perl5.porters | Postings from February 2007

Re: unicode regex performance

From:
Gerard Goossen
Date:
February 10, 2007 06:22
Subject:
Re: unicode regex performance
Message ID:
20070210142538.GF1868@ostwald
On Thu, Feb 08, 2007 at 09:06:31AM +0100, Dr.Ruud wrote:
> Gerard Goossen schreef:
> 
> > But for example when doing a regex for a
> > fixed string like m/aap/ in UTF-8 you just have to do a memory search
> > for the bytes representing 'aap' in UTF-32 you can do the same, but
> > you have much more memory to search through.
> 
> A string encoded such that it uses the word size of the platform, will
> actually search faster. It is an extra step to get to the bytes, which
> on most hardware takes considerable time.

Nonsence. benchmark** of the most basic search:
NONE*:  1.79s
UTF-8:  1.86s	
UTF-16: 2.10s
UTF-32: 2.90s

Which is worse then lineair in memory usage (probably due to cache
misses).
And this is without optimalizations, like if you are searching for a
longer patterns, you can do match multiple codepoints using one integer
comparison.


Gerard Goossen

*: the NONE is without doing a memory lookup, so it measures the 'for'
and the 'if' (so doing some loop unrolling would greatly improve
performance here).

**Benchmark:
This is the most stupid search I could think of, but it demonstrates 
the impact of the size of the string, and that comparing bytes instead of
ints doesn't matter.
Adjust '#define test_t long' and Encode::decode('UTF-32-LE') to get measure
the other encodings.


XS(XS_search)
{
    dVAR;
    dXSARGS;
        SV * const sv = ST(0);

        STRLEN len;
        #define test_t long
        test_t *s = (test_t*) (SvPV(sv, len));
        test_t char1 = 0x41;

        test_t *send = (test_t*) (((char*)s) + len);
        for (; s < send; s++) {
             if (*s == char1) {
                 XSRETURN(0);
             }
        }
        XSRETURN(0);
}

#!perl
use Benchmark "timethese";
use strict;
use warnings;
use utf8;
use Encode;

my $n = 200_000;
my $count = 200000000/$n;
my $a = ("abcefgh" x $n) . "\x{20ac}\x{41}\x{43}\x{44}";
my $a_utf32 = Encode::encode("UTF-32-LE", $a);

timethese( $count, {
                   utf32 => sub { search( $a_utf32 ); },
} );






nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About