Re: unicode regex performance (was Re: Future Perl development)

From: demerphq
Date: February 8, 2007 03:20
Subject: Re: unicode regex performance (was Re: Future Perl development)
Message ID: 9b18b3110702080319o671a995te570b2ea7b2ead7f@mail.gmail.com

On 2/8/07, Juerd Waalboer <juerd@convolution.nl> wrote:
> Gerard Goossen skribis 2007-02-08  2:23 (+0100):
> > The current regex engine does not work very fast for unicode
>
> Not very fast?

Yes. Not very fast.

> It's not as fast as pure 8 bit matching, but in my experience, even then
> it still outperforms some other popular regex libraries, some of which
> don't even support unicode at all!

That's just because jhi is a smart man and did a damn good job on our
implementation. But it still doesn't change the fact that the engine
is massively slower under utf8 than it would be under a sane encoding
like utf32.

>     juerd@lanova:~$ perl -MBenchmark=cmpthese -Mutf8 -e'
>         binmode STDOUT, ":utf8";
>         cmpthese -1, {
>             "e" => sub { "prijs: e 1,00" =~ /e [,\d]+/ },
>             "€" => sub { "prijs: € 1,00" =~ /€ [,\d]+/ }
>         }
>     '
>           Rate    €    e
>     € 707950/s   -- -26%
>     e 954408/s  35%   --

This isn't a particularly representative benchmark. Try doing something
more complicated, in particular something that does case insensitive
matching, or that has a useful fixed length string to search for and
needs to do a lot of backtracking.
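
As a rough sketch of the kind of benchmark meant here (the subject string
and the pattern are made up for illustration and are not from this thread),
one could compare the same case insensitive, never-matching pattern against
a byte string and against an upgraded copy of it:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical subject: repeated text with one non-ASCII character (e-acute),
# so the upgraded copy really contains multi-byte UTF-8 sequences.
my $bytes = "Just Another P\xe9rl Hacker, " x 50;
my $chars = $bytes;
utf8::upgrade($chars);    # same characters, stored internally as UTF-8

# Case insensitive, never matches: the engine has to retry the pattern
# at every candidate position before giving up.
my $re = qr/hacker,\s+just\s+another\s+python/i;

cmpthese(-1, {
    bytes => sub { my $m = $bytes =~ $re },
    utf8  => sub { my $m = $chars =~ $re },
});
```

The numbers will vary with the perl version and the machine, so none are
claimed here; the point is only that the workload exercises /i and repeated
failed attempts rather than one short literal match.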

> When correctness is less important than performance, just trade the former for
> the latter by encoding both sides to UTF8 (in a very incorrect but performant
> way, done by forcefully upgrading and then turning the UTF8 flag off), and
> performance will be exactly equal to byte matching, because you then HAVE byte
> matching. This will, however, hurt a lot when your pattern happens to be
> capable of matching in the middle of a UTF8 sequence.

No, UTF8 is not a format particularly suited to regexes.

It has variable length characters, meaning it is difficult or
inefficient to jump forward or backwards N characters (worse for
backwards), something that is kinda important in a regex engine.
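
To make the cost concrete, here is a minimal sketch in plain Perl, assuming
well-formed UTF-8 input. The helper names utf8_char_to_byte_offset and
utf8_prev_char_start are hypothetical and are not how the regex engine
itself does it. With a fixed width encoding such as utf32 both operations
would be a single multiplication or subtraction.

```perl
use strict;
use warnings;

# Byte offset of the character at index $n in a UTF-8 encoded byte string.
# Every lead byte before it has to be inspected; assumes well-formed input.
sub utf8_char_to_byte_offset {
    my ($octets, $n) = @_;
    my $pos = 0;
    my $len = length $octets;
    while ($n-- > 0 && $pos < $len) {
        my $lead = ord substr($octets, $pos, 1);
        $pos += $lead < 0x80 ? 1     # 0xxxxxxx: ASCII
              : $lead < 0xE0 ? 2     # 110xxxxx
              : $lead < 0xF0 ? 3     # 1110xxxx
              :                4;    # 11110xxx
    }
    return $pos;
}

# Step back one character from byte offset $pos: the length is not known
# in advance, so skip continuation bytes (10xxxxxx) until a lead byte.
sub utf8_prev_char_start {
    my ($octets, $pos) = @_;
    return 0 if $pos <= 0;
    $pos--;
    $pos-- while $pos > 0 && (ord(substr($octets, $pos, 1)) & 0xC0) == 0x80;
    return $pos;
}

# "prijs: <euro> 1,00" as UTF-8 bytes; the euro sign is E2 82 AC.
my $octets = "prijs: \xE2\x82\xAC 1,00";

my $after_euro = utf8_char_to_byte_offset($octets, 8);        # char 8 is the space after the euro sign
my $euro_start = utf8_prev_char_start($octets, $after_euro);  # back to the euro sign's lead byte
```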

Also UTF8 has the property that the encoding of one character is never
a subsequence of the encoding of another character.
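
A small sketch of that property, using the core Encode module (the helper
is_char_boundary is a hypothetical name): a needle that is a whole encoded
character can only be found at a character boundary, while a needle that is
just a fragment of one can be found in the middle of a sequence.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $haystack = encode("UTF-8", "prijs: \x{20AC} 1,00");   # euro sign is E2 82 AC
my $euro     = encode("UTF-8", "\x{20AC}");
my $cent     = encode("UTF-8", "\x{A2}");                 # cent sign is C2 A2

# True when the byte at $pos starts a character, i.e. is not a
# continuation byte of the form 10xxxxxx.
sub is_char_boundary {
    my ($octets, $pos) = @_;
    return (ord(substr($octets, $pos, 1)) & 0xC0) != 0x80;
}

my $hit  = index($haystack, $euro);     # whole-character needle
my $miss = index($haystack, $cent);     # a different character: not present
my $bad  = index($haystack, "\x82");    # a lone continuation byte, not a character

printf "euro at byte %d, on a boundary: %s\n",
    $hit, is_char_boundary($haystack, $hit) ? "yes" : "no";
printf "fragment at byte %d, on a boundary: %s\n",
    $bad, is_char_boundary($haystack, $bad) ? "yes" : "no";
```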

Also your plan wouldn't work with case insensitive matching, would it?
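
A minimal sketch of the problem, approximating the "encode both sides" idea
with Encode::encode rather than by flipping the UTF8 flag (that substitution
is an assumption made for the example, as are the sample strings):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $subject = "CAF\x{C9}";    # "CAFE" with a capital E-acute
my $pattern = "caf\x{E9}";    # "cafe" with a small e-acute
utf8::upgrade($_) for $subject, $pattern;    # force character semantics

# Character matching: /i knows that the two accented letters fold together.
my $as_chars = $subject =~ /\Q$pattern\E/i ? 1 : 0;

# Byte matching on the UTF-8 encodings: the letters become C3 89 and C3 A9,
# and the bytes 89 and A9 have no case relationship to each other.
my $subject_bytes = encode("UTF-8", $subject);
my $pattern_bytes = encode("UTF-8", $pattern);
my $as_bytes = $subject_bytes =~ /\Q$pattern_bytes\E/i ? 1 : 0;

print "as characters: $as_chars, as bytes: $as_bytes\n";
```

The ASCII part still folds correctly on the byte side; it is only the
multi-byte characters that silently stop matching.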

I'm sure if I thought about it more I could find other reasons.

> Perhaps it would be nice to have a pragma that does just this (upgrade
> all regex subjects and patterns, and ignore the UTF8 flag in pattern
> matches), for those who want or need extreme performance.

They wouldn't see a performance increase, they would see a noticeable
performance decrease. In some cases a dramatic one.

Cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"