Re: unicode regex performance (was Re: Future Perl development)

Gerard Goossen
February 8, 2007 05:19
On Thu, Feb 08, 2007 at 01:04:14PM +0100, Juerd Waalboer wrote:
> demerphq skribis 2007-02-08 12:19 (+0100):
> > It has variable length characters (..) Also UTF8 has the property that
> > there is no valid utf8 sequence that is itself a subsequence of a
> > valid utf8 sequence.
> Because of the latter, the former is not a big problem. That is, if your
> application allows you to be naive and just match bytes instead of
> characters.
> In my "plan", you'd consider each byte a character.
> > Also your plan wouldn't work with case insensitive matching would it?
> Correct. But I did specifically mention that it is an incorrect
> solution. Here, correctness is traded in for performance. When trying to
> find all € signs in a 300 MB string, and replacing them with ASCII E's,
> ignoring that you're doing UTF8 helps a lot.
> Instead of looking for \x{20ac}, you'd be looking for \xe2\x82\xac.

Do NOT use \xe2\x82\xac to create bytes. Use pack (or \x[e2]\x[82]\x[ac]) to create bytes.
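
A minimal sketch of building those three octets explicitly, assuming plain Perl 5 with the core Encode module. The helper hex_octets and the variable names are hypothetical, introduced only for illustration. It shows that the single character U+20AC and its UTF-8 encoding are different strings, and that pack yields the same octets as the encoder:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: render a byte string as space-separated hex octets.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

my $euro_chars = "\x{20ac}";                  # one character, U+20AC
my $euro_bytes = encode_utf8($euro_chars);    # its UTF-8 encoding, three octets
my $packed     = pack 'C*', 0xe2, 0x82, 0xac; # the same octets built with pack

print length($euro_chars), "\n";              # character count of the text string
print length($euro_bytes), "\n";              # octet count of the encoded string
print hex_octets($euro_bytes), "\n";          # hex dump of the encoded string
print $euro_bytes eq $packed ? "same\n" : "different\n";
```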

But looking for this byte sequence is already what the current regex
engine does:

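use Benchmark "cmpthese";
use strict;
use warnings;
use utf8;
use Encode;

my $n = 200000;               # repetitions of the filler text
my $count = 200000000/$n;     # benchmark iterations (1000)

# Long run of non-matching text with a single euro sign at the very end.
my $a = ("\x{20ad} abc" x $n) . "\x{20ac}";
my $a_bytes = $a;

cmpthese( $count, {
    # Character-level match against the code point U+20AC.
    utf8  => sub { $a =~ m/\x{20ac}/ or die; },
    # Byte-level match against the UTF-8 encoding of U+20AC.
    bytes => sub { use bytes; $a_bytes =~ m/\xe2\x82\xac/ or die; },
} );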

       Rate bytes  utf8
bytes 337/s    --   -0%
utf8  338/s    0%    --
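
The near-identical rates fit the property quoted above: a valid UTF-8 sequence cannot appear inside another one, so a byte-level search and a character-level search land on the same match. A small sketch of that correspondence, assuming the core Encode module and a shortened version of the test string (three repetitions instead of $n; variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Shortened version of the benchmark string: filler, then one euro sign.
my $chars = ("\x{20ad} abc" x 3) . "\x{20ac}";
my $bytes = encode_utf8($chars);

# Character offset of U+20AC in the text string.
my $char_pos = index($chars, "\x{20ac}");

# Byte offset of the encoded euro sign in the encoded string.
my $byte_pos = index($bytes, encode_utf8("\x{20ac}"));

# Encoded length of everything before the character-level match.
my $prefix_bytes = length(encode_utf8(substr($chars, 0, $char_pos)));

print "char offset: $char_pos\n";
print "byte offset: $byte_pos\n";
print "prefix length in bytes: $prefix_bytes\n";
```

The byte offset equals the encoded length of the text preceding the character-level match, so both searches identify the same occurrence and only the units of the offset differ.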

Gerard Goossen.