On 2/8/07, Juerd Waalboer <juerd@convolution.nl> wrote:
> Gerard Goossen skribis 2007-02-08 2:23 (+0100):
> > The current regex engine does not work very fast for unicode
>
> Not very fast?

Yes. Not very fast.

> It's not as fast as pure 8 bit matching, but in my experience, even then
> it still outperforms some other popular regex libraries, some of which
> don't even support unicode at all!

That's just because jhi is a smart man and did a damn good job on our
implementation. But it still doesn't change the fact that the engine
would be massively slower under utf8 than under a sane encoding like
utf32.

> juerd@lanova:~$ perl -MBenchmark=cmpthese -Mutf8 -e'
>     binmode STDOUT, ":utf8";
>     cmpthese -1, {
>         "e" => sub { "prijs: e 1,00" =~ /e [,\d]+/ },
>         "€" => sub { "prijs: € 1,00" =~ /€ [,\d]+/ }
>     }
> '
>         Rate     €     e
> €  707950/s    --  -26%
> e  954408/s   35%    --

This isn't a particularly representative benchmark. Try doing something
more complicated, in particular something that does case insensitive
matching, or that has a useful fixed length string to search for and
needs to do a lot of backtracking.

> When correctness is less important than performance, just trade the former for
> the latter by encoding both sides to UTF8 (in a very incorrect but performant
> way, done by forcefully upgrading and then turning the UTF8 flag off), and
> performance will be exactly equal to byte matching, because you then HAVE byte
> matching. This will, however, hurt a lot when your pattern happens to be
> capable of match in the middle of a UTF8 sequence.

No, UTF8 is not a format particularly suitable to regexes. It has
variable length characters, meaning it is difficult or inefficient to
jump forwards or backwards N characters (worse for backwards), something
that is kind of important in a regex engine. Also, UTF8 has the property
that no valid encoding of a character occurs inside the encoding of a
different character. Also, your plan wouldn't work with case insensitive
matching, would it?
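
To make the correctness side of this concrete, here is a rough sketch.
It is not from the thread: the variable names and the choice of test
characters are purely illustrative, and it assumes that "upgrading and
then turning the UTF8 flag off" behaves like matching against the bytes
returned by Encode's encode(). It says nothing about speed; it only
checks what byte matching does to ".", to /i, and to a character class.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character strings, written with escapes so this file stays pure ASCII.
my $euro        = "\x{20AC}";   # EURO SIGN
my $omega_upper = "\x{3A9}";    # GREEK CAPITAL LETTER OMEGA
my $omega_lower = "\x{3C9}";    # GREEK SMALL LETTER OMEGA
my $psi         = "\x{3A8}";    # GREEK CAPITAL LETTER PSI
my $class       = "\x{3A9}\x{3C9}";   # members of a character class

# The same text as raw UTF-8 bytes (assumption: this is what the engine
# would see once the UTF8 flag has been switched off).
my $euro_bytes        = encode('UTF-8', $euro);         # E2 82 AC
my $omega_upper_bytes = encode('UTF-8', $omega_upper);  # CE A9
my $omega_lower_bytes = encode('UTF-8', $omega_lower);  # CF 89
my $psi_bytes         = encode('UTF-8', $psi);          # CE A8
my $class_bytes       = encode('UTF-8', $class);        # CE A9 CF 89

# 1. Variable length: "." means one character, but the euro sign is
#    three bytes, so "exactly one thing" no longer matches.
my $dot_chars = $euro       =~ /\A.\z/s ? 1 : 0;
my $dot_bytes = $euro_bytes =~ /\A.\z/s ? 1 : 0;

# 2. Case insensitive matching: the two omegas fold together as
#    characters, but their byte sequences are unrelated.
my $ci_chars = $omega_upper       =~ /${omega_lower}/i       ? 1 : 0;
my $ci_bytes = $omega_upper_bytes =~ /${omega_lower_bytes}/i ? 1 : 0;

# 3. Character classes: as bytes, the class becomes "any one of CE, A9,
#    CF, 89", so it hits the lead byte of an unrelated character (psi).
my $class_chars_hit = $psi       =~ /[${class}]/       ? 1 : 0;
my $class_bytes_hit = $psi_bytes =~ /[${class_bytes}]/ ? 1 : 0;

printf "%-22s chars=%d bytes=%d\n", 'single "." matches',  $dot_chars, $dot_bytes;
printf "%-22s chars=%d bytes=%d\n", '/i matches',          $ci_chars,  $ci_bytes;
printf "%-22s chars=%d bytes=%d\n", 'class matches psi',   $class_chars_hit, $class_bytes_hit;
```

The first two rows should differ by losing a match on the byte side, and
the third by gaining a bogus one. A timing comparison along the lines
suggested above (case insensitive, lots of backtracking) would need its
own Benchmark run on the perl in question.
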
I'm sure if I thought about it more I could find other reasons.

> Perhaps it would be nice to have a pragma that does just this (upgrade
> all regex subjects and patterns, and ignore the UTF8 flag in pattern
> matches), for those who want or need extreme performance.

They wouldn't see a performance increase, they would see a noticeable
performance decrease. In some cases a dramatic one.

Cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"