develooper Front page | perl.perl6.language | Postings from February 2003

regex matching from a position ?

From: Ph. Marek
Date: February 11, 2003 23:41
Subject: regex matching from a position ?
Message ID: 200302120842.57601.philipp.marek@bmlv.gv.at

Hello everybody,

I sometimes have to analyse a string
starting from a given position, where this position
changes after each iteration (the way the position argument of index() works).
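
To illustrate the kind of iteration meant here, a minimal sketch using index() with a moving start position. The sample string and the variable names are made up for the illustration and are not taken from the attached script.

```perl
use strict;
use warnings;

# Hypothetical sample data, only for illustration.
my $text = "a,b,,c";
my $pos  = 0;
my @hits;

# index() takes the start position as its third argument,
# so each iteration continues right after the previous hit.
while ((my $found = index($text, ',', $pos)) >= 0) {
    push @hits, $found;
    $pos = $found + 1;
}

print "@hits\n";    # prints "1 3 4"
```

The question in this mail is how to get the same "start here" behaviour for a regex match instead of a fixed substring.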


As this is Perl, there are MTOWTDIIP, but I'd like
to know the fastest one.

So I used Benchmark.pm to find that out. (script attached)


Excerpt from script:
  "from_start"  => sub { m/\S*\s+(\S+)/; },
  "re_dyn"  => sub { m/^[\x00-\xff]{$pos}\S*\s+(\S+)/; },
  "re_once" => sub { m/^[\x00-\xff]{$pos}\S*\s+(\S+)/o; },
  "substr" => sub { substr($_,$pos) =~ m/\S*\s+(\S+)/; },
  "substr_set" => sub { $tmp=substr($_,$pos); $tmp =~ m/\S*\s+(\S+)/; },

from_start is for comparison only, as it shows how fast the match should be.
re_once is for comparison too, as with /o the pattern is compiled only once, so the index can't be adjusted.
(And dynamically recompiling via eval() for each new index can't be fast enough.)
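
Since the attached script is not shown here, the following is only a sketch of how the excerpt above might be wired into Benchmark.pm. The sample string, the value of $pos and the %candidates hash are assumptions for the illustration. re_once is left out because with /o it behaves like re_dyn with a frozen $pos.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data; the real input (2505792 bytes) is not shown.
$_ = join ' ', map { "word$_" } 1 .. 1000;
my $pos = 500;    # assumed start offset
my $tmp;

# Same patterns as in the excerpt, collected in a hash (assumed layout).
my %candidates = (
    from_start => sub { m/\S*\s+(\S+)/ },
    re_dyn     => sub { m/^[\x00-\xff]{$pos}\S*\s+(\S+)/ },
    substr     => sub { substr($_, $pos) =~ m/\S*\s+(\S+)/ },
    substr_set => sub { $tmp = substr($_, $pos); $tmp =~ m/\S*\s+(\S+)/ },
);

# Run each candidate for at least one CPU second and print a rate table.
cmpthese(-1, \%candidates);
```

The absolute numbers will differ from the results below, depending on machine, Perl version and input.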


Results:

2505792 bytes to do ...
Benchmark: timing 1000000 iterations of from_start, re_dyn, re_once, substr, substr_set...
from_start:  1 wallclock secs ( 1.26 usr + -0.01 sys =  1.25 CPU) @ 800000.00/s (n=1000000)
    re_dyn:  9 wallclock secs ( 6.52 usr +  0.00 sys =  6.52 CPU) @ 153374.23/s (n=1000000)
   re_once:  1 wallclock secs ( 1.26 usr +  0.01 sys =  1.27 CPU) @ 787401.57/s (n=1000000)
    substr:  4 wallclock secs ( 2.36 usr +  0.02 sys =  2.38 CPU) @ 420168.07/s (n=1000000)
substr_set:  5 wallclock secs ( 3.23 usr +  0.00 sys =  3.23 CPU) @ 309597.52/s (n=1000000)
               Rate     re_dyn substr_set     substr    re_once from_start
re_dyn     153374/s         --       -50%       -63%       -81%       -81%
substr_set 309598/s       102%         --       -26%       -61%       -61%
substr     420168/s       174%        36%         --       -47%       -47%
re_once    787402/s       413%       154%        87%         --        -2%
from_start 800000/s       422%       158%        90%         2%         --


So: every variant that can actually move the start position is *much* slower than necessary!
So I propose (I know that I'm a bit late, but who cares ... :-)
a new modifier for regexes (like each, case-insensitive,
and match-multiple-times) that lets you specify a
position to start matching from. That should add *no* overhead!
E.g.:
	$text.m:from500:i /\s*(\S+)/;


Currently substr() is the fastest available option - unless somebody
has more imagination than me (which I take as given).
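
One further candidate that was not part of the benchmark above would be Perl 5's pos() together with the \G anchor and the /g modifier. The following is only a sketch with a made-up sample string; its speed was not measured here, so whether it beats substr() is an open question.

```perl
use strict;
use warnings;

# Hypothetical sample text and start offset.
my $text = "alpha beta gamma delta";
my $pos  = 6;    # offset of "beta"

# Tell the regex engine where the next /g match on $text should begin.
pos($text) = $pos;

# \G anchors the match at pos($text); a /g match in scalar
# context starts at pos() instead of at the beginning.
my $word;
if ($text =~ m/\G\S*\s+(\S+)/g) {
    $word = $1;
}

print defined $word ? "$word\n" : "no match\n";    # prints "gamma"
```

After a successful match pos($text) points just past the match, so the next iteration can continue from there without copying the string.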

So: is there a faster way, is this a non-issue for Perl 6,
or will something like this be implemented?



Regards,

Phil


