Re: unicode regex performance (was Re: Future Perl development)

From: Gerard Goossen
Date: February 8, 2007 07:46
Subject: Re: unicode regex performance (was Re: Future Perl development)
Message ID: 20070208154927.GC4898@ostwald
On Thu, Feb 08, 2007 at 02:46:51PM +0100, Juerd Waalboer wrote:
> > But looking for this byte sequence is already what the current regex
> > engine does:
> 
> I don't know how exactly your benchmark turns out those results. Maybe
> because the single match is at position 0, maybe because you "use
> bytes", maybe for some other reason.
> 
> But I'll just show you one of the benchmarks that I did before:
> 
>     juerd@lanova:~$ perl -MBenchmark=cmpthese -MEncode -e'
>         my $unicode = "f\x{20ac}oo"; 
>         Encode::_utf8_off(my $utf8 = $unicode); 
>         my $re_unicode = qr/\x{20ac}/; 
>         my $re_utf8 = qr/\xe2\x82\xac/; 
>         cmpthese -1, { 
>             unicode => sub { (my $dummy = $unicode) =~ s/$unicode_re/E/; }, 
>             utf8    => sub { (my $dummy = $utf8) =~ s/$utf8_re/E/; } 
>         }'
>                 Rate unicode    utf8
>     unicode 314139/s      --    -27%
>     utf8    428740/s     36%      --

#!perl
use Encode;
use Benchmark qw|cmpthese|;
use strict;
use warnings;

        my $unicode = "f\x{20ac}oo";
        Encode::_utf8_off(my $utf8 = $unicode);
        my $re_unicode = qr/\x{20ac}/;
        my $re_utf8 = qr/\xe2\x82\xac/;
        cmpthese -1, {
            unicode => sub { (my $dummy = $unicode) =~ s/$unicode_re/E/; },
            utf8    => sub { (my $dummy = $utf8) =~ s/$utf8_re/E/; }
        }

bleadperl:
Global symbol "$unicode_re" requires explicit package name at t/test2.t line 12.
Global symbol "$utf8_re" requires explicit package name at t/test2.t line 13.
Execution of t/test2.t aborted due to compilation errors.
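
For reference, a minimal sketch of the same script with the variable names made
consistent. It assumes the intent was to use $re_unicode and $re_utf8 as they
are declared; nothing else is changed apart from layout, and no timings are
implied by it.

```perl
#!perl
use strict;
use warnings;
use Encode;
use Benchmark qw(cmpthese);

# Character string containing U+20AC (EURO SIGN).
my $unicode = "f\x{20ac}oo";

# Copy of the same data with the UTF8 flag switched off, i.e. the raw
# UTF-8 encoded bytes.
Encode::_utf8_off(my $utf8 = $unicode);

my $re_unicode = qr/\x{20ac}/;        # matches the character
my $re_utf8    = qr/\xe2\x82\xac/;    # matches its three UTF-8 bytes

# A negative count makes each case run for at least that many CPU seconds.
cmpthese(-1, {
    unicode => sub { (my $dummy = $unicode) =~ s/$re_unicode/E/; },
    utf8    => sub { (my $dummy = $utf8)    =~ s/$re_utf8/E/; },
});
```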


If you fix your script you will see that the unicode matching is a lot slower.
But it is a lot slower not because the matching is in unicode, but because Perl 5
has to do a lot of work to make sure all strings are unicode; for example, E
probably has to be upgraded from latin1. Most optimizations are made for latin1,
and some are just turned off for unicode, like in-place substitution.
If you turn off mixing latin1 and unicode matching, things get a _lot_ simpler and
you can do better optimizations.
In my branch I have solved some of these problems, resulting in much better
matching.
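
What "upgrading" means can be seen from Perl code with the utf8:: helper
functions. The following is only a sketch of the flag behaviour when latin1 and
unicode strings are mixed; it does not show what the regex engine does
internally, and the flag() helper is a name made up for this example.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: report whether a string carries the internal UTF8 flag.
sub flag { return utf8::is_utf8($_[0]) ? "utf8" : "bytes" }

my $latin1  = "E";              # fits in one byte, stored without the flag
my $unicode = "f\x{20ac}oo";    # contains U+20AC, so it is stored as UTF-8

# Mixing the two forces the latin1 side to be upgraded for the result.
my $mixed = $latin1 . $unicode;

# The same upgrade done explicitly on a copy; the string value is unchanged.
my $copy = $latin1;
utf8::upgrade($copy);

printf "%-8s %s\n", "latin1:",  flag($latin1);
printf "%-8s %s\n", "unicode:", flag($unicode);
printf "%-8s %s\n", "mixed:",   flag($mixed);
printf "%-8s %s\n", "copy:",    flag($copy);
```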
Benchmark:

#!perl
use Encode;
use Benchmark qw|cmpthese|;
use strict;
use warnings;
use utf8;

        my $unicode = "f\x{20ac}oo" x 1000;
        Encode::_utf8_off(my $utf8 = $unicode);
        cmpthese -1, {
            unicode => sub { (my $dummy = $unicode) =~ s/\x{20ac}/E/g; },
            utf8    => sub { use bytes; (my $dummy = $utf8) =~ s/\xe2\x82\xac/E/g; }
        }

bleadperl:
          Rate unicode    utf8
unicode 2715/s      --    -21%
utf8    3445/s     27%      --

my branch:
          Rate unicode    utf8
unicode 4148/s      --     -7%
utf8    4483/s      8%      --

perl 5.8.8:
          Rate unicode    utf8
unicode 2955/s      --    -33%
utf8    4403/s     49%      --

I'm not sure why bleadperl is also much slower in the utf8 case; maybe I used different compiler options.

 
> By the way, when you say "current perl", do you refer to stable, blead,
> or your own branch? I'm currently using 5.8.8.

When I refer to current perl I mean stable or blead (I specify which if I think the difference is
relevant, like above).
When referring to my branch, I will say so explicitly (with something like "my branch" or "my patch").


Gerard Goossen



