On Feb 12, 2007, at 2:28 AM, demerphq wrote:

> What version of perl are these numbers from?

The original numbers were from the stock Perl 5.8.6 (which is threaded)
on OS X version 10.4.8. Here are numbers for a slightly tweaked test
with three versions of Perl.

==================================================================
Mean time to index 1000 ASCII news articles
------------------------------------------------------------------
tokenizer         5.8.6 (thr)     5.8.8 (no thr)  blead (no thr)
------------------------------------------------------------------
UTF-8 regex       4.18 secs       3.72 secs       3.80 secs
Latin-1 regex     2.84 secs       2.50 secs       2.60 secs
Purpose-built C   1.82 secs       1.60 secs       1.64 secs

The tokenizer loop is...

    while (/$token_re/g) {
        push @starts, $-[0];
        push @ends,   $+[0];
    }

And the default token regex is...

    qr/\w+(?:'\w+)*/

The behavior of the purpose-built C tokenizer is slightly different but
results in approximately the same number of tokens. Gory details of the
methodology are available upon request, but probably aren't germane.

>> I foresee a time when it will make more sense for me to hack on the
>> regex engine than on my own library. Since it seems unlikely that
>> someone will address the charclass issue prior to the release of
>> 5.10, would it be possible to, say, break out Perl's regex engine,
>> update it, fix up a binding using the new regex engine hooks, and
>> release a module to CPAN?
>
> Yes, this is already possible with blead and will be possible in 5.10

Fabulous!

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/