develooper Front page | perl.perl5.porters | Postings from September 2000

Re: regex gripe... (backtracking woes)

Thread Previous | Thread Next
Ben Tilly
September 23, 2000 02:35
Re: regex gripe... (backtracking woes)
Message ID:
Hugo wrote:
>In <>,
>Jeff Pinyan writes:
>:Why does the engine backtrack through things it should KNOW match /[^AB]/
>:for something that can match /.B/?
>Presumably because noone has yet wanted it badly enough to write a
>patch for it. :) I suspect the effort required to do the subexpression
>analysis would actually slow things down a lot in the average case.

Only if it is done wrong I think.

>However I do think we should be looking at the possibility of
>recognising inefficient expressions that we can rewrite to more
>efficient equivalents. Trouble is that we'd more often come across
>(the moral equivalent of) /A([^AB]*.B)*[^AB]*A/, which I think is
>much more difficult to improve on unless we invent something like
>cross-nested parens:

Take another look at

Through "the simple bits" of an RE you can walk through, invent
synthetic states that are ordered lists of states in the original
RE.  This will manage to make the NFA engine match sections of
the RE in basically the same way that a DFA does.  But other than
the fact that your states might wind up hooked up in a way you
cannot easily recognize as an RE, this should not be noticed at
run-time.  You can do as much or little as you want.

That would take care of Jeff's problem.

What would take more work is correctly dealing with populating
$1, $2, etc.  The idea would have to be that you walk through the
string forward, then leaving the optimized section walk back
through the match to resolve how it matched, then forward to
populate the variables as needed before going back to business
as normal.  This triple scan seems like overkill until you notice
that would naturally come up on expressions like /^(\s*yada\s*)*$/
and /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/ both of which
are extremely sore spots in the performance of NFA engines.

So if you could somehow intelligently figure out when it made
sense that would be an improvement.  A good heuristic for a first
pass might be that it makes sense when you can resolve at compile
time how the triple scan will resolve the contents of $1, etc.
If they don't come up, that was Jeff's case.  If they do but you
can show it will always resolve one way, then that takes care of
the wildcard in a wildcard example above.  However the repeated
case really does need a triple-scan to do efficiently.

Get Your Private, Free E-mail from MSN Hotmail at

Share information about yourself, create your own public profile at

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About