develooper Front page | perl.perl5.porters | Postings from March 2016

Re: [perl #126182] /(.(?2))((?<=(?=(?1)).))/ hangs and eats allavailable RAM

Thread Previous | Thread Next
From:
Zefram
Date:
March 8, 2016 00:36
Subject:
Re: [perl #126182] /(.(?2))((?<=(?=(?1)).))/ hangs and eats allavailable RAM
Message ID:
20160308003623.GA3478@fysh.org
demerphq wrote:
>                           It is no longer illegal to do "infinite
>recursion", instead the match "fails" to match at the given position.

This is broken behaviour, in multiple respects.  Though we already have
some similar to it.

If a recursion occurs without moving through the string, this does
not necessarily mean that the regexp execution would fail to terminate
if left to its own devices.  The arbitrary code that we can embed in a
regexp means that it's easy to have part of a regexp behave differently on
different iterations.  But even excluding that possibility, non-embedding
regexp features are quite sufficient to make a regexp that must iterate
a couple of times without moving in order to match.  For example,

	"ab" =~ /a(?(1)z|(?(2)()|()))*\1b/

should match, with exactly two iterations of the * loop, each consuming
no characters.  The first iteration captures $2 and can't match any
other way, the second captures $1 and can't match any other way, after
that it can't iterate more, and \1 must be matched to complete the match.
In fact, trying this out on perl 5.22.0, perl says that it doesn't match.
This is not due to any error in the structure of the regexp: perl will
happily perform an equivalent match if some character is introduced such
that each iteration of the loop advances through the string:

	"axxb" =~ /a(?:x(?(1)z|(?(2)()|())))*\1b/

So perl has a bug here in failing to perform the earlier match.

Now let's consider cases that actually could recurse infinitely.
The simplest case is something like

	"ab" =~ /a(?:)*b/

Note that in actual *regular* expressions, in the sense of matching a
regular language, this kind of infinite iteration is no problem at all.
If this kind of pattern is compiled down to a DFA, then the DFA just runs
along the string, in constant time per character, and tells you whether
or not it matches.  It doesn't iterate for the iteration that appears in
the expression, and so executes the infinite loop in finite time.  Srsly.
Of course, you can't then ask it how many iterations of the empty loop
were included in its match, because it doesn't generate that sort of
information.

perl does have a problem with infinite iteration, because it uses an NFA
with backtracking.  On the above simple case it does correctly find the
match, but emits a warning while doing so.  And here we *can* ask how
many times it went round the empty loop.  In principle, in this simple
case, any number of times round the loop, zero or more, could produce a
valid match.  The rub is that larger numbers of iterations are preferred,
so with no largest valid number of iterations there is no most-preferred
match.  perl actually took exactly one iteration, so although it was
correct to say that it matched, it was actually incorrect in which way
it matched.  That doesn't make any difference in the simplest case,
but it's detectable with a bit of extra work.

Incidentally, if we change the * to a *?, preferring lower numbers
of iterations, then perl correctly produces a zero-iterations match.
Though it still emits the warning.

We can muck about with the loop structure to get worse misbehaviour
from perl.  For example,

	"ab" =~ /a(?:()|())*\1\2b/

should match, and in fact has infinitely many ways to match.  It requires
at least two iterations of the loop, at least one taking each branch.
But perl declares no match, this time emitting the warning as it
does so.  The way in which this fails is a lot like the finite example
that I started with, but that first one doesn't generate the warning.
This infinite case shares with an earlier case the problem of having
no most-preferred match, but changing to the closest variant that has
a most-preferred match

	"ab" =~ /a(?(2)()|())*?\1\2b/

doesn't make perl recognise the match.

So there are many cases here where the extra logic put in to avoid hanging
is resulting in failing to find a match, or in finding the wrong match.
We do define regexps to try the options in a specific order, and to yield
the most-preferred match, so the latter is not an insignificant matter.

Detecting something that would otherwise hang is nice, but not essential.
We should never compromise correctness to get it.  If the programmer
has instructed us to infinitely recurse, then hanging is a reasonable
response, and so is exhausting memory.  We should not be afraid of doing
what the programmer said.

-zefram

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About