develooper Front page | perl.perl5.porters | Postings from August 2014

Re: [perl #122283] Possible regexp memory explosion in 5.20.0

Thread Previous | Thread Next
From:
demerphq
Date:
August 26, 2014 15:33
Subject:
Re: [perl #122283] Possible regexp memory explosion in 5.20.0
Message ID:
CANgJU+W78LyfOzM5R-TnDSa-t28++3u8QNyBT7E1TSNqFKM5Xw@mail.gmail.com
On 14 July 2014 12:13, Dave Mitchell <davem@iabyn.com> wrote:

> On Mon, Jul 14, 2014 at 12:34:06AM +0100, Aaron Crane wrote:
> > Damian Conway via RT <perlbug-followup@perl.org> wrote:
> > > I have now had the opportunity to investigate the problem, and have
> concluded that this has nothing to do with Regexp::Grammars per se, except
> that R::G is generating the enormous regex that 5.20 is failing to compile.
> >
> > Yes, it looks that way to me too. Thanks for supplying that reduction.
> >
> > The file attached cuts down this regex further still, by removing all
> > the embedded code blocks, and the various DEFINEs whose names begin
> > "______0_88".
> >
> > The symptoms I observed seem to be the same, though I also get a
> > "panic: memory wrap" error (apparently when passing 4 GiB of allocated
> > memory).
> >
> > In blead, it looks like the immediate culprit is study_chunk() — it
> > starts 185 ms after the start of the sizing pass, and I haven't yet
> > had the patience to let it run long enough to finish.
>
> It bisects to the following. Yves...?
>
> commit 099ec7dcf9e085a650e6d9010c12ad9649209bf4
> Author: Yves Orton <demerphq@gmail.com>
> Date:   Fri Nov 22 01:08:39 2013 +0100
>
>     Fix RT #120600: Variable length lookbehind is not variable
>
>     Inside of study_chunk() we have to guard against infinite
>     recursion with recursive subpatterns. The existing logic
>     sort of worked, but didn't address all cases properly.
>
>       qr/
>         (?<W>a)
>         (?<BB>
>           (?=(?&W))(?<=(?&W))
>         )
>         (?&BB)
>       /x;
>
>     The pattern in the test would fail when the optimizer
>     was expanding (&BB). When it recursed, it creates a bitmap
>     for the recursion it performs, it then jumps back to
>     the BB node and then eventually does the first (&W) call.
>     At this point the bit for (&W) would be set in the bitmask.
>     When the recursion for the (&W) exited (fake exit through
>     the study frame logic) the bit was not /unset/. When the parser
>     then entered the (&W) again it was treated as a nested and
>     potentially infinite length pattern.
>
>     The fake-recursion in study-chunk made it little less obvious
>     what was going on in the debug output.
>
>     By reorganizing the code and adding logic to unset the bitmap
>     when exiting this bug was fixed. Unfortunately this also revealed
>     another little issue with patterns like this:
>
>       qr/x|(?0)/
>       qr/(x|(?1))/
>
>     which forced the creation of a new bitmask for each branch.
>     Effectively study_chunk treats each branch as an independent
>     pattern, so when we are expanding (?1) via the 'x' branch
>     we dont want that to prevent us from detecting the infinite recursion
>     in the (?1) branch. If you were to think of trips through study_chunk
>     as paths, and [] as recursive processing you would get something like:
>
>       BRANCH 'x' END
>       BRANCH (?0) [ 'x' END ]
>       BRANCH (?0) [ (?0) [ 'x' END ] ]
>       ...
>
>     When we want something like:
>
>       BRANCH 'x' END
>       BRANCH (?0) [ 'x' END ]
>       BRANCH (?0) [ (?0) INFINITE_RECURSION ]
>
>     So when we deal with a branch we need to make a new recursion bitmask.
>

Some basic details on this issue:

In order to detect infinite recursion, and to expand certain constructs to
make match more efficient, we make study_chunk walk every possible pathway
through the graph formed by the grammar.

So for instance if we have node C which uses D which uses E, and we can
reach C from both A and B then we will walk the full C-D-E twice.

One could probably argue this is wrong and that we should somehow store
that C-D-E is "safe" when first hit it, and then avoid walking it a second
time.

This is coupled with the naive bitmask strategy for marking which nodes we
have seen.

More to come later.

Yves

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About