
Re: unicode regex performance (was Re: Future Perl development)

From: Juerd Waalboer
Date: February 7, 2007 17:53
Subject: Re: unicode regex performance (was Re: Future Perl development)
Message ID: 20070208015302.GL25362@c4.convolution.nl

Gerard Goossen wrote 2007-02-08  2:23 (+0100):
> The current regex engine does not work very fast for unicode

Not very fast?

It's not as fast as pure 8 bit matching, but in my experience, even then
it still outperforms some other popular regex libraries, some of which
don't even support unicode at all!

    juerd@lanova:~$ perl -MBenchmark=cmpthese -Mutf8 -e'
        binmode STDOUT, ":utf8"; 
        cmpthese -1, { 
            "e" => sub { "prijs: e 1,00" =~ /e [,\d]+/ }, 
            "€" => sub { "prijs: € 1,00" =~ /€ [,\d]+/ } 
        }
    '
          Rate    €    e
    € 707950/s   -- -26%
    e 954408/s  35%   --

When correctness is less important than performance, just trade the former for
the latter by encoding both sides to UTF8 (in a very incorrect but performant
way, done by forcefully upgrading and then turning the UTF8 flag off), and
performance will be exactly equal to byte matching, because you then HAVE byte
matching. This will, however, hurt a lot when your pattern happens to be
capable of matching in the middle of a UTF8 sequence.
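
To make the trade-off concrete, here is a minimal sketch. It uses
utf8::encode on copies of both sides rather than literally flipping the
UTF8 flag, which should give the same bytes in a better defined way. The
helper name as_utf8_bytes is made up for this example. The second half
shows the kind of wrong answer you can get: a character class turns into a
class of single bytes, so it can match inside another character's sequence.

```perl
use strict;
use warnings;
use utf8;    # the literals below are written in UTF-8

# Hypothetical helper (not part of any module): return a byte-string
# copy holding the UTF-8 encoding of the given character string.
sub as_utf8_bytes {
    my ($str) = @_;
    utf8::encode($str);    # in place on the copy: characters -> octets
    return $str;
}

# A literal pattern still matches after both sides are encoded,
# but now the engine is doing plain byte matching.
my $subject    = as_utf8_bytes('prijs: € 1,00');
my $pattern    = as_utf8_bytes('€ [,\d]+');
my $byte_match = $subject =~ /$pattern/ ? 1 : 0;

# With character semantics, U+00AC is not in the class [€].
my $chars_hit = '¬' =~ /[€]/ ? 1 : 0;

# After encoding, [€] is a class of the bytes E2, 82 and AC, and
# U+00AC is the two bytes C2 AC, so the class matches its second byte.
my $class     = as_utf8_bytes('[€]');
my $other     = as_utf8_bytes('¬');
my $bytes_hit = $other =~ /$class/ ? 1 : 0;

printf "literal on bytes: %d\n", $byte_match;
printf "class on chars:   %d\n", $chars_hit;
printf "class on bytes:   %d\n", $bytes_hit;
```

Literal strings are safe because UTF8 sequences cannot be confused with
parts of each other; it is classes, "." and quantifiers applied to
multi-byte characters that go wrong.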

Perhaps it would be nice to have a pragma that does just this (upgrade
all regex subjects and patterns, and ignore the UTF8 flag in pattern
matches), for those who want or need extreme performance.
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.


