Re: [perl #74484] Regex causing exponential runtime+mem usage
From: demerphq
Date: June 1, 2010 05:19
Subject: Re: [perl #74484] Regex causing exponential runtime+mem usage
Message ID: AANLkTiku_buQ2UIGCwC1V9cy2yWIGRWgRN6QLMpVsUbf@mail.gmail.com
On 3 May 2010 15:39, Dave Mitchell <davem@iabyn.com> wrote:
> On Wed, Apr 21, 2010 at 12:24:21PM +0100, Dave Mitchell wrote:
>> On Mon, Apr 19, 2010 at 01:34:02AM -0700, perl@0ne.us wrote:
>> > I've been smoking CPAN modules for CPANTesters and noticed that my bots
>> > got OOMKilled from time to time.
>>
>> Thanks for the report.
>>
>> It appears there are two (mostly) unrelated issues here. First is the
>> performance, which is quadratic in string length.
>>
>> This is due to String::Perl::Warnings creating a regexp that looks
>> essentially like /.*?..../. This is equivalent to /^.*?.*?...../
>> (note the anchor) and on repeated backtracking the two .*? make it go
>> quadratic. The fix for the String::Perl::Warnings module's author is to
>> either remove the initial .*?, or add an initial anchor.
>>
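A minimal Perl sketch of that first issue, with an invented pattern and sizes (character classes are used only so the optimiser cannot reject the failing match via a substring check; this is not the String::Perl::Warnings regex):

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    for my $reps (2_000, 4_000, 8_000) {
        my $str = 'ab' x $reps;       # contains no 'c', so both matches fail

        my $t = time;
        $str =~ /.*?[bB][cC]/;        # behaves like /^.*?.*?[bB][cC]/ -- quadratic
        printf "len %6d  unanchored %.3fs", length($str), time - $t;

        $t = time;
        $str =~ /^.*?[bB][cC]/;       # explicit anchor, one start position -- linear
        printf "  anchored %.3fs\n", time - $t;
    }

Doubling the string length should roughly quadruple the unanchored time, while the anchored form grows only linearly.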
>> The second issue is the memory usage, which again appears to be quadratic
>> on string length. This is due to the trie code leaking SVs on the tmps
>> stack (which then accumulate quadratically with the quadratic backtracking
>> caused by the first issue). I've currently got some ideas on how to rework
>> that bit of code to not need to create temp SVs to store the accept
>> states, but in the meantime I've added a TODO test with commit
>> 8bd05d90e59a554e79d73edf32bf35547dbc2c40
>
> Now fixed in blead by the commit below. Running the OP's code for 400
> iterations on unpatched and patched blead gave the following results:
>
> unpatched: Loop 400: 2.76304984092712s rss(214076 kB)
> patched: Loop 400: 2.16697001457214s rss(6976 kB)
>
> i.e. the memory leak is gone, and it's a bit faster (probably due to not
> having to allocate lots of SVs in this pathological case, rather than the
> new trie code being intrinsically faster).
>
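A rough harness in the same spirit as those numbers (not the OP's actual script: the workload below is a placeholder and the RSS reading is Linux-specific):

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    # Report resident set size by reading /proc/self/status (Linux only).
    sub rss_kb {
        open my $fh, '<', '/proc/self/status' or return 'n/a';
        while (<$fh>) { return $1 if /^VmRSS:\s+(\d+)\s+kB/ }
        return 'n/a';
    }

    # Placeholder workload -- substitute the regex and input from the
    # ticket to see the pre-patch memory growth.
    my $str = "Use of uninitialized value in pattern match\n" x 500;
    my $t0  = time;
    for my $loop (1 .. 400) {
        $str =~ /.*?(?:uninitialized|deprecated|redefined)/ for 1 .. 20;
        printf "Loop %d: %ss rss(%s kB)\n", $loop, time - $t0, rss_kb()
            if $loop % 100 == 0;
    }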
>
> commit 2e64971a6530d2645969bc489f564bfd3ce64993
> Author: David Mitchell <davem@iabyn.com>
> AuthorDate: Mon May 3 13:57:58 2010 +0100
> Commit: David Mitchell <davem@iabyn.com>
> CommitDate: Mon May 3 14:29:54 2010 +0100
>
> tries: don't allocate memory at runtime
>
> This is an indirect fix for
> [perl #74484] Regex causing exponential runtime+mem usage
>
> The trie runtime code was doing more SAVETMPS than FREETMPS and was thus
> growing a large tmps stack on heavy backtracking. Rather than fixing this
> directly, I rewrote part of the trie code so that it no longer needs to
> allocate memory in S_regmatch (it still does in find_byclass()).
>
> The basic issue is that multiple branches in the trie may trigger an
> accept state; for example:
>
> "abcd" =~ /xyz/abcd.*X|ab.*Y|/
>
> here, words (branches) 2 and 3 are accept states. The original approach
> was, at run time, to create a list of accepted word numbers and the
> character positions of the end of each of those words. Then run the rest
> of the pattern for each word in the list in turn (in word index order).
> This requires memory for the list to be allocated and freed.
>
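A toy Perl rendering of that older scheme for the "abcd" example above (the real code is C in S_regmatch; the per-match array here is exactly the allocation the patch removes):

    # At run time: collect every accepting word together with its end offset ...
    my @accepts = (
        [ 3, 2 ],    # word #3 ("ab")   ends at offset 2 in "abcd"
        [ 2, 4 ],    # word #2 ("abcd") ends at offset 4
    );

    # ... then try them in word-number order, resuming the rest of the
    # pattern at the recorded offset each time.
    for my $acc (sort { $a->[0] <=> $b->[0] } @accepts) {
        my ($word, $end) = @$acc;
        print "try word #$word, rest of pattern resumes at offset $end\n";
    }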
> The new approach involves creating extra info at compile time; in
> particular, for each word, a pointer to the previous accepted word (if
> any) in the state tree. For example for the above pattern, part of the
> state tree may be
>
>      a      b      c      d
>   1  ->  2  ->  3  ->  4  ->  5
>               (#3)          (#2)
>
> (e.g. at state 1, if the next char is 'a', we transition to state 2).
> Here, state 3 is an accept state with word #3, and 5 is an accept state
> with word #2. So we build a table indexed by word number, which has
> wordinfo[2].prev = 3, wordinfo[3].prev = 0, thus building the word chain 2->3->0.
>
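Rendered as a toy Perl structure rather than the real C data (the field names below merely echo the commit message, they are not the actual struct layout):

    my %wordinfo = (
        1 => { prev => 0, len => 3 },   # "xyz"
        2 => { prev => 3, len => 4 },   # "abcd" -- accepts at state 5
        3 => { prev => 0, len => 2 },   # "ab"   -- accepts at state 3
    );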
> At run time we run the trie to completion, and remember the word
> associated with the longest accept state (word #2 above). Then by following
> back the chain of .prev fields, we can produce a list of all accepting
> words. We then iteratively find the smallest-numbered (i.e. leftmost) word in
> the chain, and run with it. On failure and backtrack, we find the
> next-smallest and so on.
>
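Continuing the toy Perl sketch above, the run-time side is roughly:

    # %wordinfo as above; the trie run remembered only the longest
    # accepted word, #2 in this example.
    my @accepted;
    for (my $w = 2; $w; $w = $wordinfo{$w}{prev}) {
        push @accepted, $w;              # collects 2, then 3
    }

    # Leftmost (smallest-numbered) word first; on failure and backtrack,
    # take the next-smallest, and so on.
    for my $w (sort { $a <=> $b } @accepted) {
        print "attempt word #$w\n";
    }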
> Since we are no longer recording the end-position of each word in the
> string, we have to recalculate this for each backtrack. We initially
> record the end-position of the shortest accepting word, and given that we
> know the length of each word, we can calculate the new position each time
> as an offset from that first word. Depending on unicode and folding, that
> calculation can be cheap or expensive.
>
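In the simple fixed-width, non-folding case that recalculation is just an offset from the shortest word; with the toy numbers from the example:

    # end(word) = end(shortest word) + len(word) - len(shortest word)
    my $end_shortest = 2;                      # "ab" ends at offset 2 in "abcd"
    my $end_word2    = $end_shortest + 4 - 2;  # "abcd" ends at offset 4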
> This algorithm is optimised for the typical case where there are a small
> number (<= 2) of accepting states.
>
> This patch creates a new compile-time array, trie->wordinfo[], indexed by
> word number, which contains relevant info about each word. This also
> supersedes the old trie->newword[] array, whose function of recording
> "overspills" of multiple words per accept state, is now handled as part of
> the wordinfo[].prev chain.
Hi Dave. Thanks for this. A very educational patch.
Cheers,
yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"