
Re: unicode regex performance (was Re: Future Perl development)

From:
Gerard Goossen
Date:
February 8, 2007 05:19
Subject:
Re: unicode regex performance (was Re: Future Perl development)
Message ID:
20070208132243.GB4898@ostwald
On Thu, Feb 08, 2007 at 01:04:14PM +0100, Juerd Waalboer wrote:
> demerphq skribis 2007-02-08 12:19 (+0100):
> > It has variable length characters (..) Also UTF8 has the property that
> > there is no valid utf8 sequence that is itself a subsequence of a
> > valid utf8 sequence.
> 
> Because of the latter, the former is not a big problem. That is, if your
> application allows you to be naive and just match bytes instead of
> characters.
> 
> In my "plan", you'd consider each byte a character.
> 
> > Also your plan wouldn't work with case-insensitive matching, would it?
> 
> Correct. But I did specifically mention that it is an incorrect
> solution. Here, correctness is traded in for performance. When trying to
> find all € signs in a 300 MB string and replace them with ASCII E's,
> ignoring that you're doing UTF8 helps a lot. 
> 
> You'd look for \x{20ac}, you'd be looking for \xe2\x82\xac.

Do NOT use \xe2\x82\xac to create bytes. Use pack (or \x[e2]\x[82]\x[ac]) to create bytes.
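A minimal sketch of the pack approach, for illustration only. The variable names are made up here, and Encode is used purely as a cross-check that the three octets are the UTF-8 encoding of U+20AC:

```perl
use strict;
use warnings;
use Encode ();

# Build the three octets explicitly; "C*" packs each value as one unsigned char.
my $euro_bytes = pack("C*", 0xe2, 0x82, 0xac);

# Cross-check against an explicit UTF-8 encode of the character U+20AC.
my $encoded = Encode::encode("UTF-8", "\x{20ac}");

print "octets: ", length($euro_bytes), "\n";
print $euro_bytes eq $encoded ? "same\n" : "differ\n";
```

Building the octets with pack makes it explicit that the result is meant to be a byte string, not a sequence of characters that happen to have small code points.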

But looking for this byte sequence is already what the current regex
engine does:

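#!perl
use Benchmark "cmpthese";
use strict;
use warnings;
use utf8;
use Encode;
my $n = 200000;
my $count = 200000000/$n;   # 1000 iterations per variant
# Character string: $n copies of "\x{20ad} abc", then one \x{20ac} at the very end.
my $a = ("\x{20ad} abc" x $n) . "\x{20ac}";
# Same octets with the UTF-8 flag turned off, so it is treated as a byte string.
my $a_bytes = $a;
Encode::_utf8_off($a_bytes);

cmpthese( $count, {
                   # match the character U+20AC in the character string
                   utf8 => sub { $a =~ m/\x{20ac}/ or die; },
                   # match its three UTF-8 octets in the byte string
                   bytes => sub { use bytes; $a_bytes =~ m/\xe2\x82\xac/ or die; },
} );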

bleadperl:
       Rate bytes  utf8
bytes 337/s    --   -0%
utf8  338/s    0%    --
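
The benchmark depends on the flag-stripped copy holding the same octets as the UTF-8 encoding of the character string. A small sketch to check that on a shortened stand-in string (three repetitions instead of $n; the variable names are illustrative and not part of the benchmark):

```perl
use strict;
use warnings;
use Encode ();

# Small stand-in for the benchmark string: 3 repetitions, then U+20AC.
my $chars = ("\x{20ad} abc" x 3) . "\x{20ac}";

# Copy the string and drop the UTF-8 flag, keeping the internal octets.
my $octets = $chars;
Encode::_utf8_off($octets);

# For these code points the internal octets match an explicit UTF-8 encode.
my $same = $octets eq Encode::encode("UTF-8", $chars) ? 1 : 0;

# Character offset versus octet offset of the final euro sign.
my $char_pos = index($chars, "\x{20ac}");
my $byte_pos = index($octets, pack("C*", 0xe2, 0x82, 0xac));

print "same=$same char_pos=$char_pos byte_pos=$byte_pos\n";
```

The two offsets differ because each U+20AD takes three octets but counts as one character. Both searches still find the same place in the underlying data.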


Gerard Goossen.



