Gerard Goossen wrote 2007-02-08 2:23 (+0100):
> The current regex engine does not work very fast for unicode

Not very fast? It's not as fast as pure 8-bit matching, but in my
experience, even then it still outperforms some other popular regex
libraries, some of which don't support Unicode at all!

    juerd@lanova:~$ perl -MBenchmark=cmpthese -Mutf8 -e'
        binmode STDOUT, ":utf8";
        cmpthese -1, {
            "e" => sub { "prijs: e 1,00" =~ /e [,\d]+/ },
            "€" => sub { "prijs: € 1,00" =~ /€ [,\d]+/ }
        }
    '
           Rate    €    e
    € 707950/s   -- -26%
    e 954408/s  35%   --

When correctness is less important than performance, just trade the
former for the latter by encoding both sides to UTF8 (in a very
incorrect but performant way, done by forcefully upgrading and then
turning the UTF8 flag off), and performance will be exactly equal to
byte matching, because you then HAVE byte matching.

This will, however, hurt a lot when your pattern happens to be capable
of matching in the middle of a UTF8 sequence.

Perhaps it would be nice to have a pragma that does just this (upgrade
all regex subjects and patterns, and ignore the UTF8 flag in pattern
matches), for those who want or need extreme performance.
-- 
kind regards,

juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
convolution:     ict solutions and consultancy <sales@convolution.nl>

I don't trust voting computers. See <http://www.wijvertrouwenstemcomputersniet.nl/>.
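
The byte-matching trade-off described above (upgrade both sides, then turn the UTF8 flag off) could look roughly like the sketch below. It is only an illustration under assumptions: as_utf8_bytes is a hypothetical helper name, not part of any existing pragma or module, and Encode::_utf8_off is an internal function of Encode. The last two matches show the pitfall: once the subject is a byte string, "." consumes a single octet rather than a whole character.

```perl
use strict;
use warnings;
use utf8;          # the string literals below are written in UTF-8
use Encode ();

# Hypothetical helper: returns a copy of the string holding its UTF-8
# octets, with the UTF8 flag cleared so Perl treats it as plain bytes.
sub as_utf8_bytes {
    my ($str) = @_;
    utf8::upgrade($str);        # force the internal UTF-8 representation
    Encode::_utf8_off($str);    # drop the flag; contents are now octets
    return $str;
}

my $subject = as_utf8_bytes("prijs: € 1,00");
my $pattern = as_utf8_bytes("€ [,\\d]+");

# Both sides are byte strings, so this is an ordinary byte match.
my $matched = $subject =~ /$pattern/ ? 1 : 0;

# Pitfall: with character semantics "." matches the whole euro sign,
# but on the byte string it matches only one of its three octets.
my $char_dot = "prijs: € 1,00" =~ /: . 1/ ? 1 : 0;
my $byte_dot = $subject        =~ /: . 1/ ? 1 : 0;

print "byte match: $matched\n";
print "dot on characters: $char_dot, dot on bytes: $byte_dot\n";
```

In this sketch the helper works on copies, so the caller's original strings keep their character semantics; a pragma as proposed would presumably have to do the equivalent inside the regex engine instead.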