develooper Front page | perl.perl5.porters | Postings from March 2016

Re: [perl #126182] /(.(?2))((?<=(?=(?1)).))/ hangs and eats allavailable RAM

Thread Previous | Thread Next
From:
demerphq
Date:
March 8, 2016 12:18
Subject:
Re: [perl #126182] /(.(?2))((?<=(?=(?1)).))/ hangs and eats allavailable RAM
Message ID:
CANgJU+UkHhQ4J4oiwLsYeezdP5u8yQF4Dz9bsaDjTNQEOUsX9g@mail.gmail.com
On 8 March 2016 at 01:36, Zefram <zefram@fysh.org> wrote:
> demerphq wrote:
>>                           It is no longer illegal to do "infinite
>>recursion", instead the match "fails" to match at the given position.
>
> This is broken behaviour, in multiple respects.  Though we already have
> some similar to it.

I understand your argument, but my reading of the docs says that it is
broken by design.

Specifically:

"To break the loop, the following
match after a zero-length match is prohibited to have a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.
"


>
> If a recursion occurs without moving through the string, this does
> not necessarily mean that the regexp execution would fail to terminate
> if left to its own devices.  The arbitrary code that we can embed in a
> regexp means that it's easy to have part of a regexp behave differently on
> different iterations.  But even excluding that possibility, non-embedding
> regexp features are quite sufficient to make a regexp that must iterate
> a couple of times without moving in order to match.  For example,
>
>         "ab" =~ /a(?(1)z|(?(2)()|()))*\1b/
>
> should match, with exactly two iterations of the * loop, each consuming
> no characters.  The first iteration captures $2 and can't match any
> other way, the second captures $1 and can't match any other way, after
> that it can't iterate more, and \1 must be matched to complete the match.
> In fact, trying this out on perl 5.22.0, perl says that it doesn't match.
> This is not due to any error in the structure of the regexp: perl will
> happily perform an equivalent match if some character is introduced such
> that each iteration of the loop advances through the string:
>
>         "axxb" =~ /a(?:x(?(1)z|(?(2)()|())))*\1b/
>
> So perl has a bug here in failing to perform the earlier match.
>
> Now let's consider cases that actually could recurse infinitely.
> The simplest case is something like
>
>         "ab" =~ /a(?:)*b/
>
> Note that in actual *regular* expressions, in the sense of matching a
> regular language, this kind of infinite iteration is no problem at all.
> If this kind of pattern is compiled down to a DFA, then the DFA just runs
> along the string, in constant time per character, and tells you whether
> or not it matches.  It doesn't iterate for the iteration that appears in
> the expression, and so executes the infinite loop in finite time.  Srsly.
> Of course, you can't then ask it how many iterations of the empty loop
> were included in its match, because it doesn't generate that sort of
> information.
>
> perl does have a problem with infinite iteration, because it uses an NFA
> with backtracking.  On the above simple case it does correctly find the
> match, but emits a warning while doing so.  And here we *can* ask how
> many times it went round the empty loop.  In principle, in this simple
> case, any number of times round the loop, zero or more, could produce a
> valid match.  The rub is that larger numbers of iterations are preferred,
> so with no largest valid number of iterations there is no most-preferred
> match.  perl actually took exactly one iteration, so although it was
> correct to say that it matched, it was actually incorrect in which way
> it matched.  That doesn't make any difference in the simplest case,
> but it's detectable with a bit of extra work.
>
> Incidentally, if we change the * to a *?, preferring lower numbers
> of iterations, then perl correctly produces a zero-iterations match.
> Though it still emits the warning.
>
> We can muck about with the loop structure to get worse misbehaviour
> from perl.  For example,
>
>         "ab" =~ /a(?:()|())*\1\2b/
>
> should match, and in fact has infinitely many ways to match.  It requires
> at least two iterations of the loop, at least one taking each branch.
> But perl declares no match, this time emitting the warning as it
> does so.  The way in which this fails is a lot like the finite example
> that I started with, but that first one doesn't generate the warning.
> This infinite case shares with an earlier case the problem of having
> no most-preferred match, but changing to the closest variant that has
> a most-preferred match
>
>         "ab" =~ /a(?(2)()|())*?\1\2b/
>
> doesn't make perl recognise the match.
>
> So there are many cases here where the extra logic put in to avoid hanging
> is resulting in failing to find a match, or in finding the wrong match.
> We do define regexps to try the options in a specific order, and to yield
> the most-preferred match, so the latter is not an insignificant matter.

Except see the documentation I quote above.

>
> Detecting something that would otherwise hang is nice, but not essential.
> We should never compromise correctness to get it.  If the programmer
> has instructed us to infinitely recurse, then hanging is a reasonable
> response, and so is exhausting memory.  We should not be afraid of doing
> what the programmer said.

I am sympathetic to what  you say here.

But there is a subtle difference, Patterns are often supplied by the
user, and until regex recursion was introduced, it was safe(ish) to
allow them to do so.

So, to me, making every Perl program that allows user supplied regexes
be vulnerable to user supplied infinite loops is not a good thing.

Anyway, much of the dialog that you posted here IMO is unrelated to
regex infinite recursion, it relates to how some parts of the regex
handle matching the empty string.

Can you please reframe your dialog to be purely about regex recursion?
Can you show a pattern where regex recursion does not do the right
thing?

FWIW, I am open to restoring the old behavior, and making it a fatal
exception to enter an infinite loop recursively from the same position
due to left recursion in a pattern.

But I do think that /(?<foo>(?&foo)foo)/ infinite looping and then
eventually exiting due to memory exhaustion is NOT the right thing for
perl to do in this case.

Note that

/(?<foo>a(?&foo)?b)/

works just fine. The only problem are things like /(?<foo>(?&foo))/.

Thanks for your feedback.

I will say that independently of this, we *CAN* implement some form of
"maximum work" mechanism for matching, which would allow people to
address the "insecure pattern" problem in a saner way. Which would to
some degree ameliorate the concern I have here.

cheers,
Yves




-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About