Front page | perl.perl5.porters |
Postings from August 2014
Re: [perl #122283] Possible regexp memory explosion in 5.20.0
Thread Previous
|
Thread Next
From:
demerphq
Date:
August 26, 2014 15:33
Subject:
Re: [perl #122283] Possible regexp memory explosion in 5.20.0
Message ID:
CANgJU+W78LyfOzM5R-TnDSa-t28++3u8QNyBT7E1TSNqFKM5Xw@mail.gmail.com
On 14 July 2014 12:13, Dave Mitchell <davem@iabyn.com> wrote:
> On Mon, Jul 14, 2014 at 12:34:06AM +0100, Aaron Crane wrote:
> > Damian Conway via RT <perlbug-followup@perl.org> wrote:
> > > I have now had the opportunity to investigate the problem, and have
> concluded that this has nothing to do with Regexp::Grammars per se, except
> that R::G is generating the enormous regex that 5.20 is failing to compile.
> >
> > Yes, it looks that way to me too. Thanks for supplying that reduction.
> >
> > The file attached cuts down this regex further still, by removing all
> > the embedded code blocks, and the various DEFINEs whose names begin
> > "______0_88".
> >
> > The symptoms I observed seem to be the same, though I also get a
> > "panic: memory wrap" error (apparently when passing 4 GiB of allocated
> > memory).
> >
> > In blead, it looks like the immediate culprit is study_chunk() — it
> > starts 185 ms after the start of the sizing pass, and I haven't yet
> > had the patience to let it run long enough to finish.
>
> It bisects to the following. Yves...?
>
> commit 099ec7dcf9e085a650e6d9010c12ad9649209bf4
> Author: Yves Orton <demerphq@gmail.com>
> Date: Fri Nov 22 01:08:39 2013 +0100
>
> Fix RT #120600: Variable length lookbehind is not variable
>
> Inside of study_chunk() we have to guard against infinite
> recursion with recursive subpatterns. The existing logic
> sort of worked, but didn't address all cases properly.
>
> qr/
> (?<W>a)
> (?<BB>
> (?=(?&W))(?<=(?&W))
> )
> (?&BB)
> /x;
>
> The pattern in the test would fail when the optimizer
> was expanding (&BB). When it recursed, it creates a bitmap
> for the recursion it performs, it then jumps back to
> the BB node and then eventually does the first (&W) call.
> At this point the bit for (&W) would be set in the bitmask.
> When the recursion for the (&W) exited (fake exit through
> the study frame logic) the bit was not /unset/. When the parser
> then entered the (&W) again it was treated as a nested and
> potentially infinite length pattern.
>
> The fake-recursion in study-chunk made it little less obvious
> what was going on in the debug output.
>
> By reorganizing the code and adding logic to unset the bitmap
> when exiting this bug was fixed. Unfortunately this also revealed
> another little issue with patterns like this:
>
> qr/x|(?0)/
> qr/(x|(?1))/
>
> which forced the creation of a new bitmask for each branch.
> Effectively study_chunk treats each branch as an independent
> pattern, so when we are expanding (?1) via the 'x' branch
> we dont want that to prevent us from detecting the infinite recursion
> in the (?1) branch. If you were to think of trips through study_chunk
> as paths, and [] as recursive processing you would get something like:
>
> BRANCH 'x' END
> BRANCH (?0) [ 'x' END ]
> BRANCH (?0) [ (?0) [ 'x' END ] ]
> ...
>
> When we want something like:
>
> BRANCH 'x' END
> BRANCH (?0) [ 'x' END ]
> BRANCH (?0) [ (?0) INFINITE_RECURSION ]
>
> So when we deal with a branch we need to make a new recursion bitmask.
>
Some basic details on this issue:
In order to detect infinite recursion, and to expand certain constructs to
make match more efficient, we make study_chunk walk every possible pathway
through the graph formed by the grammar.
So for instance if we have node C which uses D which uses E, and we can
reach C from both A and B then we will walk the full C-D-E twice.
One could probably argue this is wrong and that we should somehow store
that C-D-E is "safe" when first hit it, and then avoid walking it a second
time.
This is coupled with the naive bitmask strategy for marking which nodes we
have seen.
More to come later.
Yves
Thread Previous
|
Thread Next