develooper Front page | perl.perl5.porters | Postings from May 2010

Re: [perl #74484] Regex causing exponential runtime+mem usage

Thread Previous | Thread Next
From:
Dave Mitchell
Date:
May 3, 2010 06:39
Subject:
Re: [perl #74484] Regex causing exponential runtime+mem usage
Message ID:
20100503133904.GA26313@iabyn.com
On Wed, Apr 21, 2010 at 12:24:21PM +0100, Dave Mitchell wrote:
> On Mon, Apr 19, 2010 at 01:34:02AM -0700, perl@0ne.us wrote:
> > I've been smoking CPAN modules for CPANTesters and noticed that my bots
> > got OOMKilled from time to time.
> 
> Thanks for the report.
> 
> It appears there are two (mostly) unrelated issues here. First is the
> performance, which is quadratic in string length.
> 
> This is due to String::Perl::Warnings creating a regexp that looks
> essentially like /.*?..../. This is equivalent to /^.*?.*?...../
> (note the anchor) and on repeated backtracking the two .*? make it go
> quadratic. The fix for the String::Perl::Warnings module's author is to
> either remove the initial .*?, or add an initial anchor.
> 
> The second issue is the memory usage, which again appears to be quadratic
> on string length. This is due to the trie code leaking SVs on the tmps
> stack (which then accumulate quadratically with the quadratic backtracking
> caused by the first issue). I've currently got some ideas on how to rework
> that bit of code to not need to create temp SVs to store the accept
> states, but in the meantime I've added a TODO test with commit
> 8bd05d90e59a554e79d73edf32bf35547dbc2c40

Now fixed in blead by the commit below. Running the OPs code for 400
iterations on unpatched and patched blead gave the following results:

    unpatched: Loop 400: 2.76304984092712s rss(214076 kB)
    patched:   Loop 400: 2.16697001457214s rss(6976 kB)

i.e. the memory leak is gone, and it's a bit faster (probably due to not
having to allocate lots of SVs in this pathological case, rather than the
new trie code being intrinsically faster).


commit 2e64971a6530d2645969bc489f564bfd3ce64993
Author:     David Mitchell <davem@iabyn.com>
AuthorDate: Mon May 3 13:57:58 2010 +0100
Commit:     David Mitchell <davem@iabyn.com>
CommitDate: Mon May 3 14:29:54 2010 +0100

    tries: don't allocate memory at runtime
    
    This is an indirect fix for
        [perl #74484] Regex causing exponential runtime+mem usage
    
    The trie runtime code was doing more SAVETMPS than FREETMPS and was thus
    growing a large tmps stack on heavy backtracking. Rather than fixing this
    directly, I rewrote part of the trie code so that it no longer needs to
    allocate memory in S_regmatch (it still does in find_byclass()).
    
    The basic issue is that multiple branches in the trie may trigger an
    accept state; for example:
    
        "abcd" =~ /xyz/abcd.*X|ab.*Y|/
    
    here, words (branches) 2 and 3 are accept states. The original approach
    was, at run time, to create a list of accepted word numbers and the
    character positions of the end of each of those words. Then run the rest
    of the pattern for each word in the list in turn (in word index order).
    This requires memory for the list to be allocated and freed.
    
    The new approach involves creating extra info at compile time; in
    particular, for each word, a pointer to the previous accepted word (if
    any) in the state tree. For example for the above pattern, part of the
    state tree may be
    
          q    b    c    d
        1 -> 2 -> 3 -> 4 -> 5
                (#3)       (#2)
    
    (e.g. at state 1, if the next char is 'a', we transition to state 2).
    Here, state 3 is an accept state with word #3, and 5 is an accept state
    with word #2. So we build a table indexed by word number, which has
    wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0.
    
    At run time we run the trie to completion, and remember the word
    associated with the longest accept state (word #2 above). Then by following
    back the chain of .prev fields, we can produce a list of all accepting
    words. We then iteratively find the smallest-numbered (ie LH-most) word in
    the chain, and run with it. On failure and backtrack, we find the
    next-smallest and so on.
    
    Since we are no longer recording the end-position of each word in the
    string, we have to recalculate this for each backtrack. We initially
    record the end-position of the shortest accepting word, and given that we
    know the length of each word, we can calculate the new position each time
    as an offset from that first word. Depending on unicode and folding, that
    calculation can be cheap or expensive.
    
    This algorithm is optimised for the typical case where there are a small
    number (<= 2) accepting states.
    
    This patch creates a new compile-time array, trie->wordinfo[], indexed by
    word number, which contains relevant info about each word. This also
    supersedes the old trie->newword[] array, whose function of recording
    "overspills" of multiple words per accept state, is now handled as part of
    the wordinfo[].prev chain.


Affected files ...
    
    M	ext/re/t/regop.t
    M	regcomp.c
    M	regcomp.h
    M	regexec.c
    M	regexp.h
    M	t/op/svleak.t


-- 
I don't want to achieve immortality through my work... I want to achieve
it through not dying.
    -- Woody Allen

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About