perl.perl5.porters | Postings from February 2007

Re: unicode regex performance

Marvin Humphrey
February 12, 2007 15:41

On Feb 12, 2007, at 2:28 AM, demerphq wrote:

> What version of perl are these numbers from?

The original numbers were from the stock Perl 5.8.6 (which is  
threaded) on OS X version 10.4.8.  Here are numbers for a slightly  
tweaked test with three versions of Perl.

    Mean time to index 1000 ASCII news articles
    tokenizer         5.8.6 (thr)     5.8.8 (no thr)    blead (no thr)
    UTF-8 regex       4.18 secs       3.72 secs         3.80 secs
    Latin-1 regex     2.84 secs       2.50 secs         2.60 secs
    Purpose-built C   1.82 secs       1.60 secs         1.64 secs

The tokenizer loop is...

    while (/$token_re/g) {
        push @starts, $-[0];    # offset where the match begins
        push @ends,   $+[0];    # offset just past the end of the match
    }

And the default token regex is... qr/\w+(?:'\w+)*/
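For readers who want to try the loop in isolation, here is a minimal self-contained sketch built around the regex above. The `tokenize` helper, the `@tokens` array, and the sample string are made up for illustration and are not part of the library being benchmarked.

```perl
use strict;
use warnings;

# Default token pattern quoted above: runs of word characters,
# optionally joined by internal apostrophes (as in "don't").
my $token_re = qr/\w+(?:'\w+)*/;

# Hypothetical helper (not from the original code): collects the start
# offset, end offset, and text of every token in $text.
sub tokenize {
    my ($text) = @_;
    my ( @starts, @ends, @tokens );
    while ( $text =~ /$token_re/g ) {
        push @starts, $-[0];    # offset where the match begins
        push @ends,   $+[0];    # offset just past the end of the match
        push @tokens, substr( $text, $-[0], $+[0] - $-[0] );
    }
    return ( \@starts, \@ends, \@tokens );
}

my ( $starts, $ends, $tokens ) = tokenize("It's a dog's life");
print join( '|', @$tokens ), "\n";    # prints It's|a|dog's|life
```

The `@-` and `@+` arrays hold the offsets of the last successful match, so the loop records token boundaries without copying any substrings; the `substr` call is only there to make the output readable.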

The behavior of the purpose-built C tokenizer is slightly different  
but results in approximately the same number of tokens.

Gory details of the methodology are available upon request, but  
probably aren't germane.
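A rough way to see the same kind of gap on another machine is sketched below. It is not the methodology behind the numbers above. It assumes that "UTF-8 regex" and "Latin-1 regex" refer to matching against a string with Perl's internal UTF-8 flag on versus off, and it uses a made-up ASCII sample in place of the news articles. The `count_tokens` helper is hypothetical.

```perl
use strict;
use warnings;
use Time::HiRes ();

my $token_re = qr/\w+(?:'\w+)*/;

# Hypothetical helper: count matches using the same while/g loop shape.
sub count_tokens {
    my ($text) = @_;
    my $count = 0;
    $count++ while $text =~ /$token_re/g;
    return $count;
}

# Made-up ASCII sample text standing in for the news articles.
my $bytes = "The quick brown fox doesn't jump over the lazy dog. " x 2000;

# Same characters, but stored internally as UTF-8 (assumption: this is
# what distinguishes the "UTF-8 regex" row from the "Latin-1 regex" row).
my $chars = $bytes;
utf8::upgrade($chars);

for my $case ( [ 'flag off (Latin-1)', $bytes ], [ 'flag on (UTF-8)', $chars ] ) {
    my ( $label, $text ) = @$case;
    my $t0 = Time::HiRes::time();
    my $n  = 0;
    $n = count_tokens($text) for 1 .. 20;
    printf "%-20s %d tokens, %.3f secs\n", $label, $n,
        Time::HiRes::time() - $t0;
}
```

Both runs see identical characters and should report identical token counts; only the internal string representation differs, so any timing difference comes from the regex engine's handling of that representation.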

>> I foresee a time when it will make more sense for me to hack on the
>> regex engine than on my own library.  Since it seems unlikely that
>> someone will address the charclass issue prior to the release of
>> 5.10, would it be possible to, say, break out Perl's regex engine,
>> update it, fix up a binding using the new regex engine hooks, and
>> release a module to CPAN?
> Yes, this is already possible with blead and will be possible in 5.10


Marvin Humphrey
Rectangular Research
