develooper Front page | perl.perl5.porters | Postings from June 2010

Re: [perl #74484] Regex causing exponential runtime+mem usage

Thread Previous
From:
demerphq
Date:
June 1, 2010 05:19
Subject:
Re: [perl #74484] Regex causing exponential runtime+mem usage
Message ID:
AANLkTiku_buQ2UIGCwC1V9cy2yWIGRWgRN6QLMpVsUbf@mail.gmail.com
On 3 May 2010 15:39, Dave Mitchell <davem@iabyn.com> wrote:
> On Wed, Apr 21, 2010 at 12:24:21PM +0100, Dave Mitchell wrote:
>> On Mon, Apr 19, 2010 at 01:34:02AM -0700, perl@0ne.us wrote:
>> > I've been smoking CPAN modules for CPANTesters and noticed that my bots
>> > got OOMKilled from time to time.
>>
>> Thanks for the report.
>>
>> It appears there are two (mostly) unrelated issues here. First is the
>> performance, which is quadratic in string length.
>>
>> This is due to String::Perl::Warnings creating a regexp that looks
>> essentially like /.*?..../. This is equivalent to /^.*?.*?...../
>> (note the anchor) and on repeated backtracking the two .*? make it go
>> quadratic. The fix for the String::Perl::Warnings module's author is to
>> either remove the initial .*?, or add an initial anchor.
>>
>> The second issue is the memory usage, which again appears to be quadratic
>> on string length. This is due to the trie code leaking SVs on the tmps
>> stack (which then accumulate quadratically with the quadratic backtracking
>> caused by the first issue). I've currently got some ideas on how to rework
>> that bit of code to not need to create temp SVs to store the accept
>> states, but in the meantime I've added a TODO test with commit
>> 8bd05d90e59a554e79d73edf32bf35547dbc2c40
>
> Now fixed in blead by the commit below. Running the OPs code for 400
> iterations on unpatched and patched blead gave the following results:
>
>    unpatched: Loop 400: 2.76304984092712s rss(214076 kB)
>    patched:   Loop 400: 2.16697001457214s rss(6976 kB)
>
> i.e. the memory leak is gone, and it's a bit faster (probably due to not
> having to allocate lots of SVs in this pathological case, rather than the
> new trie code being intrinsically faster).
>
>
> commit 2e64971a6530d2645969bc489f564bfd3ce64993
> Author:     David Mitchell <davem@iabyn.com>
> AuthorDate: Mon May 3 13:57:58 2010 +0100
> Commit:     David Mitchell <davem@iabyn.com>
> CommitDate: Mon May 3 14:29:54 2010 +0100
>
>    tries: don't allocate memory at runtime
>
>    This is an indirect fix for
>        [perl #74484] Regex causing exponential runtime+mem usage
>
>    The trie runtime code was doing more SAVETMPS than FREETMPS and was thus
>    growing a large tmps stack on heavy backtracking. Rather than fixing this
>    directly, I rewrote part of the trie code so that it no longer needs to
>    allocate memory in S_regmatch (it still does in find_byclass()).
>
>    The basic issue is that multiple branches in the trie may trigger an
>    accept state; for example:
>
>        "abcd" =~ /xyz/abcd.*X|ab.*Y|/
>
>    here, words (branches) 2 and 3 are accept states. The original approach
>    was, at run time, to create a list of accepted word numbers and the
>    character positions of the end of each of those words. Then run the rest
>    of the pattern for each word in the list in turn (in word index order).
>    This requires memory for the list to be allocated and freed.
>
>    The new approach involves creating extra info at compile time; in
>    particular, for each word, a pointer to the previous accepted word (if
>    any) in the state tree. For example for the above pattern, part of the
>    state tree may be
>
>          q    b    c    d
>        1 -> 2 -> 3 -> 4 -> 5
>                (#3)       (#2)
>
>    (e.g. at state 1, if the next char is 'a', we transition to state 2).
>    Here, state 3 is an accept state with word #3, and 5 is an accept state
>    with word #2. So we build a table indexed by word number, which has
>    wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0.
>
>    At run time we run the trie to completion, and remember the word
>    associated with the longest accept state (word #2 above). Then by following
>    back the chain of .prev fields, we can produce a list of all accepting
>    words. We then iteratively find the smallest-numbered (ie LH-most) word in
>    the chain, and run with it. On failure and backtrack, we find the
>    next-smallest and so on.
>
>    Since we are no longer recording the end-position of each word in the
>    string, we have to recalculate this for each backtrack. We initially
>    record the end-position of the shortest accepting word, and given that we
>    know the length of each word, we can calculate the new position each time
>    as an offset from that first word. Depending on unicode and folding, that
>    calculation can be cheap or expensive.
>
>    This algorithm is optimised for the typical case where there are a small
>    number (<= 2) accepting states.
>
>    This patch creates a new compile-time array, trie->wordinfo[], indexed by
>    word number, which contains relevant info about each word. This also
>    supersedes the old trie->newword[] array, whose function of recording
>    "overspills" of multiple words per accept state, is now handled as part of
>    the wordinfo[].prev chain.

Hi Dave. Thanks for this. A very educational patch.

Cheers,
yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About