develooper Front page | perl.perl5.porters | Postings from January 2013

Re: [perl #116537] Regex: (*THEN) doesn't work as described

Thread Previous | Thread Next
From:
Karl Williamson
Date:
January 25, 2013 23:48
Subject:
Re: [perl #116537] Regex: (*THEN) doesn't work as described
Message ID:
510319A4.70508@khwilliamson.com
On 01/25/2013 03:17 PM, Ronald J Kimball wrote:
> On Fri, Jan 25, 2013 at 09:23:54AM -0800, Philip Hazel wrote:
>
>> My understanding of how (*THEN) works is that the test below should
>> match. The perlre page says "...this verb always matches, and when
>> backtracked into on failure, it causes the regex engine to try the next
>> alternation in the innermost enclosing group (capturing or otherwise)
>> that has alternations." Unless I am going mad, the examples below (one a
>> normal group, the other an assertion) fulfil the condition.
>>
>> $ perl -e 'print (("ac" =~ /^(?=ab|ac)/)? "yes\n":"no\n")'
>> yes
>> $ perl -e 'print (("ac" =~ /^(?=a(*THEN)b|ac)/)? "yes\n":"no\n")'
>> no
>>
>> $ perl -e 'print (("ac" =~ /^(ab|ac)/)? "yes\n":"no\n")'
>> yes
>> $ perl -e 'print (("ac" =~ /^(a(*THEN)b|ac)/)? "yes\n":"no\n")'
>> no
>
> These work in 5.10.1, but not in 5.14.1.
>
> These are the only tests involving (*THEN) that expect a successful match,
> from t/re/pat_advanced.t:
>
>      {
>          #Mindnumbingly simple test of (*THEN)
>          for ("ABC","BAX") {
>              ok /A (*THEN) X | B (*THEN) C/x, "Simple (*THEN) test";
>          }
>      }
>
> The key difference seems to be that in your tests, the two alternations
> begin with the same character.
>
> Ronald
>

I bisected this problem.  The offending commit is
commit 2e64971a6530d2645969bc489f564bfd3ce64993
Author: David Mitchell <davem@iabyn.com>
Date:   Mon May 3 13:57:58 2010 +0100

     tries: don't allocate memory at runtime

     This is an indirect fix for
         [perl #74484] Regex causing exponential runtime+mem usage

     The trie runtime code was doing more SAVETMPS than FREETMPS and was 
thus
     growing a large tmps stack on heavy backtracking. Rather than 
fixing this
     directly, I rewrote part of the trie code so that it no longer needs to
     allocate memory in S_regmatch (it still does in find_byclass()).

     The basic issue is that multiple branches in the trie may trigger an
     accept state; for example:

         "abcd" =~ /xyz/abcd.*X|ab.*Y|/

     here, words (branches) 2 and 3 are accept states. The original approach
     was, at run time, to create a list of accepted word numbers and the
     character positions of the end of each of those words. Then run the 
rest
     of the pattern for each word in the list in turn (in word index order).
     This requires memory for the list to be allocated and freed.

     The new approach involves creating extra info at compile time; in
     particular, for each word, a pointer to the previous accepted word (if
     any) in the state tree. For example for the above pattern, part of the
     state tree may be

           q    b    c    d
         1 -> 2 -> 3 -> 4 -> 5
                 (#3)       (#2)

     (e.g. at state 1, if the next char is 'a', we transition to state 2).
     Here, state 3 is an accept state with word #3, and 5 is an accept state
     with word #2. So we build a table indexed by word number, which has
     wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0.

     At run time we run the trie to completion, and remember the word
     associated with the longest accept state (word #2 above). Then by 
following
     back the chain of .prev fields, we can produce a list of all accepting
     words. We then iteratively find the smallest-numbered (ie LH-most) 
word in
     the chain, and run with it. On failure and backtrack, we find the
     next-smallest and so on.

     Since we are no longer recording the end-position of each word in the
     string, we have to recalculate this for each backtrack. We initially
     record the end-position of the shortest accepting word, and given 
that we
     know the length of each word, we can calculate the new position 
each time
     as an offset from that first word. Depending on unicode and 
folding, that
     calculation can be cheap or expensive.

     This algorithm is optimised for the typical case where there are a 
small
     number (<= 2) accepting states.

     This patch creates a new compile-time array, trie->wordinfo[], 
indexed by
     word number, which contains relevant info about each word. This also
     supersedes the old trie->newword[] array, whose function of recording
     "overspills" of multiple words per accept state, is now handled as 
part of
     the wordinfo[].prev chain.



Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About