On Thu, Feb 08, 2007 at 02:46:51PM +0100, Juerd Waalboer wrote:
> > But looking for this byte sequence is already what the current regex
> > engine does:
>
> I don't know how exactly your benchmark turns out those results. Maybe
> because the single match is at position 0, maybe because you "use
> bytes", maybe for some other reason.
>
> But I'll just show you one of the benchmarks that I did before:
>
> juerd@lanova:~$ perl -MBenchmark=cmpthese -MEncode -e'
>     my $unicode = "f\x{20ac}oo";
>     Encode::_utf8_off(my $utf8 = $unicode);
>     my $re_unicode = qr/\x{20ac}/;
>     my $re_utf8 = qr/\xe2\x82\xac/;
>     cmpthese -1, {
>         unicode => sub { (my $dummy = $unicode) =~ s/$unicode_re/E/; },
>         utf8 => sub { (my $dummy = $utf8) =~ s/$utf8_re/E/; }
>     }'
>             Rate unicode    utf8
> unicode 314139/s      --    -27%
> utf8    428740/s     36%      --

#!perl
use Encode;
use Benchmark qw|cmpthese|;
use strict;
use warnings;

my $unicode = "f\x{20ac}oo";
Encode::_utf8_off(my $utf8 = $unicode);
my $re_unicode = qr/\x{20ac}/;
my $re_utf8 = qr/\xe2\x82\xac/;
cmpthese -1, {
    unicode => sub { (my $dummy = $unicode) =~ s/$unicode_re/E/; },
    utf8 => sub { (my $dummy = $utf8) =~ s/$utf8_re/E/; }
}

bleadperl:
Global symbol "$unicode_re" requires explicit package name at t/test2.t line 12.
Global symbol "$utf8_re" requires explicit package name at t/test2.t line 13.
Execution of t/test2.t aborted due to compilation errors.

Your script defines $re_unicode and $re_utf8 but uses $unicode_re and
$utf8_re, so without "use strict" both substitutions run with an empty
pattern. If you fix your script you will see that the unicode matching is
a lot slower. But it is not slower because the matching is done in
unicode. It is slower because Perl 5 has to do a lot of work to make sure
all strings are unicode; for example, E probably has to be upgraded from
latin1. Most optimizations are made for latin1, and some are simply turned
off for unicode, like in-place substitution.

If you turn off mixing latin1 and unicode matching, things get a _lot_
simpler and you can do better optimizations. In my branch I have solved
some of these problems, resulting in much faster matching.
Benchmark:

#!perl
use Encode;
use Benchmark qw|cmpthese|;
use strict;
use warnings;
use utf8;

my $unicode = "f\x{20ac}oo" x 1000;
Encode::_utf8_off(my $utf8 = $unicode);

cmpthese -1, {
    unicode => sub { (my $dummy = $unicode) =~ s/\x{20ac}/E/g; },
    utf8 => sub { use bytes; (my $dummy = $utf8) =~ s/\xe2\x82\xac/E/g; }
}

bleadperl:
          Rate unicode    utf8
unicode 2715/s      --    -21%
utf8    3445/s     27%      --

my branch:
          Rate unicode    utf8
unicode 4148/s      --     -7%
utf8    4483/s      8%      --

perl 5.8.8:
          Rate unicode    utf8
unicode 2955/s      --    -33%
utf8    4403/s     49%      --

I'm not sure why bleadperl is also much slower in the utf8 case; maybe I
used different compiler options.

> By the way, when you say "current perl", do you refer to stable, blead,
> or your own branch? I'm currently using 5.8.8.

When I refer to current perl I mean stable or blead (I specify which if I
think the difference is relevant, like above). When referring to my
branch, I will do so explicitly (by saying something like "my branch" or
"my patch").

Gerard Goossen