Re: slow regex?
May 25, 2004 03:57
Message ID: 200405251059.i4PAxIH27278@zen.crypt.org
Dave Mitchell <email@example.com> wrote:
:This regex takes 2.5 secs on my machine to fail to match (5.8.4 and
: $_ = <<'EOF';
: print "matched\n" if /.*?A.*?B+\s+X/s;
:While I can see that it may need to backtrack 70x70 times, this still
:seems quite slow. (This is a stripped-down version of a regex I wrote
:for production code that basically ground to a halt).
:Is this a bug, or do I have unrealistic expectations?
Testing C<< join '', 'A' x $n, ' ', 'B' x $n, ' ' >> against that
regexp, and grepping the resulting -Dr output for 'Setting an EVAL scope',
the number of scopes set as a function of $n is:
(n^4 + 4n^3 + 5n^2 + 10n) / 4
.. which gives 6,351,800 for n=70.
That's not quite as bad as an exponential search, but it's still pretty
poor. It is avoidable only by optimisation, though, because there really
are 4 degrees of freedom that vary according to n: three of the
quantifiers (/.*?/, /.*?/, /B+/) and the starting point (since the match
is not anchored).
Since the components are all deterministic, in the sense that there are
no references to captures and no evals, it might be possible to invent
an optimisation scheme that works backwards: the first time we try to
match the /X/ at end of string and fail, we can tell the previous node
'no point matching to here', and that information can accumulate and
propagate backwards through the pattern up until the point we tell the
start node 'no point matching anywhere', at which point it can give up.
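To make the idea concrete, here is a toy sketch (hypothetical Python, not Perl's actual engine) of a backtracking matcher for the fixed pattern /.*?A.*?B+\s+X/ with a failure cache keyed on (node, position). Recording "node i cannot match starting at position p" is exactly the accumulated "no point matching to here" information, and it is sound only because the pattern is deterministic in the sense described above (no captures, no evals):

```python
# Toy matcher for the fixed pattern  .*? A .*? B+ \s+ X .
# Each pattern node is a function of the current string position; a
# failure cache remembers (node, pos) pairs that are known dead ends.
def match(s, use_cache=True):
    failed = set()   # (node, pos) pairs known not to lead to a match
    calls = [0]      # count of node attempts, for comparison

    def node(i, pos):
        calls[0] += 1
        if use_cache and (i, pos) in failed:
            return False
        ok = dispatch(i, pos)
        if not ok and use_cache:
            failed.add((i, pos))
        return ok

    def dispatch(i, pos):
        if i == 0:   # .*?  -- lazy, so try the shortest extension first
            return any(node(1, p) for p in range(pos, len(s) + 1))
        if i == 1:   # literal A
            return pos < len(s) and s[pos] == 'A' and node(2, pos + 1)
        if i == 2:   # .*?
            return any(node(3, p) for p in range(pos, len(s) + 1))
        if i == 3:   # B+ -- greedy, at least one B, longest first
            end = pos
            while end < len(s) and s[end] == 'B':
                end += 1
            return any(node(4, p) for p in range(end, pos, -1))
        if i == 4:   # \s+ -- greedy, at least one space
            end = pos
            while end < len(s) and s[end].isspace():
                end += 1
            return any(node(5, p) for p in range(end, pos, -1))
        return pos < len(s) and s[pos] == 'X'   # i == 5: literal X

    return node(0, 0), calls[0]

n = 20
s = 'A' * n + ' ' + 'B' * n + ' '
print(match(s, use_cache=False))   # fails after many node attempts
print(match(s, use_cache=True))    # same answer, far fewer attempts
```

The cached version collapses the polynomial blow-up because each (node, position) pair is fully explored at most once; the uncached version re-explores the same dead ends for every combination of the outer quantifier positions.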
However, it would be quite difficult to store that information: you'd
need to be able to store a bit per length($string) per node in the
pattern (even if you might usually be able to collapse the information
to more compact ranges), and that could get real expensive when matching
a long string against a moderately complex pattern.
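To put an illustrative number on that cost (my arithmetic, with arbitrary example sizes, not figures from any real pattern):

```python
# One bit per (string position, pattern node) pair.
def cache_bytes(string_len, pattern_nodes):
    bits = string_len * pattern_nodes
    return bits // 8

# e.g. a 1 MB string against a 100-node pattern needs ~12.5 MB of bits:
print(cache_bytes(1_000_000, 100))
```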
I think though that this is essentially what the superexponential cache
tries to do, and not so long ago we disabled that for a wide class of
cases because it was giving wrong results in a small subset of cases.
If we could revisit that to either fix the wrong results, or better
characterise the situations it can't guarantee handling correctly, we
might well be able to reduce the cost of such patterns to around O(n^2).