
Re: unicode regex performance

From:
Marvin Humphrey
Date:
February 12, 2007 15:41
Subject:
Re: unicode regex performance
Message ID:
A7227B8C-41D5-4468-BBDA-AFE095D546F1@rectangular.com

On Feb 12, 2007, at 2:28 AM, demerphq wrote:

> What version of perl are these numbers from?

The original numbers were from the stock Perl 5.8.6 (which is  
threaded) on OS X version 10.4.8.  Here are numbers for a slightly  
tweaked test with three versions of Perl.

    ==================================================================
    Mean time to index 1000 ASCII news articles
    ------------------------------------------------------------------
    tokenizer         5.8.6 (thr)     5.8.8 (no thr)    blead (no thr)
    ------------------------------------------------------------------
    UTF-8 regex       4.18 secs       3.72 secs         3.80 secs
    Latin-1 regex     2.84 secs       2.50 secs         2.60 secs
    Purpose-built C   1.82 secs       1.60 secs         1.64 secs

The tokenizer loop is...

    while (/$token_re/g) {
        push @starts, $-[0];
        push @ends,   $+[0];
    }

And the default token regex is... qr/\w+(?:'\w+)*/
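
For anyone who wants to run that loop in isolation, here is a minimal self-contained sketch. The tokenize() helper and the sample string are hypothetical and not part of the library under discussion; only the loop body and the regex come from this message.

```perl
use strict;
use warnings;

# Default token regex quoted in the message.
my $token_re = qr/\w+(?:'\w+)*/;

# Hypothetical helper (not from the original post): returns references to
# the start and end character offsets of every token found in $text.
sub tokenize {
    my ($text) = @_;
    my ( @starts, @ends );
    for ($text) {    # alias $_ so the bare match works as posted
        while (/$token_re/g) {
            push @starts, $-[0];    # offset where the match begins
            push @ends,   $+[0];    # offset just past the end of the match
        }
    }
    return ( \@starts, \@ends );
}

# Made-up sample input.
my $text = "Don't panic, it's 2007";
my ( $starts, $ends ) = tokenize($text);

# Print each token with its offsets.
for my $i ( 0 .. $#$starts ) {
    my $len = $ends->[$i] - $starts->[$i];
    printf "%2d-%2d  %s\n", $starts->[$i], $ends->[$i],
        substr( $text, $starts->[$i], $len );
}
```

Recording offsets from @- and @+ rather than capturing substrings means the caller decides later whether to copy the token text at all.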

The behavior of the purpose-built C tokenizer is slightly different  
but results in approximately the same number of tokens.

Gory details of the methodology are available upon request, but  
probably aren't germane.
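
Short of the full methodology, a rough way to see the same kind of gap is to run the token regex over identical ASCII text with and without the UTF-8 flag set. The sketch below is not the test that produced the numbers above: the sample text, the count_tokens() helper, and the reading of "Latin-1 regex" as "string without the UTF-8 flag" are all assumptions made for illustration, and results will vary with the Perl version and build.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Default token regex quoted in the message.
my $token_re = qr/\w+(?:'\w+)*/;

# Made-up ASCII sample text (assumption); the real test indexed news articles.
my $sample = "The quick brown fox doesn't jump over the lazy dog. " x 2000;

# Same characters, stored without the UTF-8 flag...
my $latin1 = $sample;
utf8::downgrade($latin1);

# ...and with the UTF-8 flag switched on.
my $utf8 = $sample;
utf8::upgrade($utf8);

# Hypothetical helper: counts tokens in the string behind $text_ref.
# Taking a reference avoids copying the text on every call.
sub count_tokens {
    my ($text_ref) = @_;
    my $n = 0;
    $n++ while $$text_ref =~ /$token_re/g;
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(
    -1,
    {
        'UTF-8 flagged' => sub { count_tokens( \$utf8 ) },
        'not flagged'   => sub { count_tokens( \$latin1 ) },
    }
);
```

Because both strings hold the same ASCII characters, the token counts are identical and any timing difference comes from how the regex engine handles the flagged string.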

>> I foresee a time when it will make more sense for me to hack on the
>> regex engine than on my own library.  Since it seems unlikely that
>> someone will address the charclass issue prior to the release of
>> 5.10, would it be possible to, say, break out Perl's regex engine,
>> update it, fix up a binding using the new regex engine hooks, and
>> release a module to CPAN?
>
> Yes, this is already possible with blead and will be possible in 5.10

Fabulous!

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




