develooper Front page | perl.perl5.porters | Postings from August 2010

Re: [perl #71942] *COMMIT bypassed by optimization?

Thread Previous | Thread Next
From:
demerphq
Date:
August 24, 2010 03:46
Subject:
Re: [perl #71942] *COMMIT bypassed by optimization?
Message ID:
AANLkTi=vq1N6BXcmNTqm3s_q4BZvmmt7CkjUYv3gpV2o@mail.gmail.com
On 8 January 2010 22:15, Bram <p5p@perl.wizbit.be> wrote:
>> # My guess is that there is some cunning optimization in Perl that  is
>> analyzing
>> # the pattern and figuring out that it's not going to match at the start
>> of
>> # the string, and this optimization is skipping along to the sixth
>> character
>> # before even starting to run the regex engine.
>
> That is not the case; it starts at the start of the string and then does the
> wrong thing.

Hate to nitpick, but it appears to do the wrong thing (its actually
not wrong, for some definition of not wrong) because it is the case
that the engine is doing "cunning optimisations"....

Specifically, there is some strange "optimisation" (im a little
skeptical that it actually is an improvement) that finds the first
character of the next item after the CURLY is executed, and if after
matching many items the expected following character is NOT found then
the engine reports it as a non match.

Yes your analysis that a JUMPABLE is at fault is correct. Probably
VERBS should be exempted from JUMPABLE, *however* I fear that
Jumpable is used in other places for different purposes, and that in
fact we need to add a new macro to distinguish the different cases. As
I recall JUMPABLE is used to determine if certain parts of the regex
engine can ignore a token because it does something that doesnt change
the match semantics - such as in the peephole-optimiser  phase we
might ignore a jumpable node when concatenating string together to
build the longest constant substring that must be present for a match
to be successful.

Going back to the optimisation, I am skeptical that this truely is an
optimisation as a rather a lot of work is done at RUN TIME to set up
for the "optimisation", which is suspect outweighs the benefit. I
wonder fi we would be faster removing the optimisation outright, and
im quite  sure we would win speed in a lot of places if the static
component of the analysis was moved to the regex compile phase.

I will try to come up with a patch for this later today or this week.

>
>
> Compiling REx "a+b?(*COMMIT)z"
> Final program:
>   1: PLUS (4)
>   2:   EXACT <a> (0)
>   4: CURLY {0,1} (8)
>   6:   EXACT <b> (0)
>   8: COMMIT (10)
>  10: EXACT <z> (12)
>  12: END (0)
> anchored "a" at 0 (checking anchored) plus minlen 2
> Guessing start of match in sv for REx "a+b?(*COMMIT)z" against "xababz"
> Found anchored substr "a" at offset 1...
> Starting position does not contradict /^/m...
> Guessed: match at offset 1
> Matching REx "a+b?(*COMMIT)z" against "ababz"
>   1 <x> <ababz>             |  1:PLUS(4)
>                                  EXACT <a> can match 1 times out of
> 2147483647...
>   2 <xa> <babz>             |  4:  CURLY {0,1}(8)
>                                    EXACT <b> can match 1 times out of 1...
>                                    failed...
>                                  failed...
>   3 <xab> <abz>             |  1:PLUS(4)
>                                  EXACT <a> can match 1 times out of
> 2147483647...
>   4 <xaba> <bz>             |  4:  CURLY {0,1}(8)
>                                    EXACT <b> can match 1 times out of 1...
>   5 <xabab> <z>             |  8:    COMMIT(10)
>   5 <xabab> <z>             | 10:      EXACT <z>(12)
>   6 <xababz> <>             | 12:      END(0)
> Match successful!
> Freeing REx: "a+b?(*COMMIT)z"
>
> - it started matching at the beginning of the string;
> - it matched 'a'
> - it, for some reason, failed to match 'b'
> => (*COMMIT) not reached
>
>
> Trying other patterns: (debug output attached at the end):
>
> (Look at the end of this mail for the details)
>
>
> Pattern: /a+b?(*COMMMIT)z/        => match succesfull
> Pattern: /a+b?(*COMMMIT)\s*z/     => match failed
> Pattern: /a+(?:b|)(*COMMMIT)z/    => match failed
> Pattern: /a+b{0,6}(*COMMMIT)z/    => match succesfull
> Pattern: /a+(bc){0,6}(*COMMIT)z/  => match succesfull (input was modified to
> xabcabcz)
> Pattern: /a+(bc*){0,6}(*COMMIT)z/ => match failed     (input was modified to
> xabcabcz)
>
>
> My (competly) random guess is that the optimalisation only happens when the
> regex has the format:
> a) CURLY{X,Y} is used with an EXACT
> b) the CURLY(X,Y) is followed by an EXACT
>
>
>
> Best regards,
>
> Bram
>
>
> Debugging output:
>
>
> * Pattern: /a+b?(*COMMMIT)\s*z/:
>
> $ perl5.11.0 -Mre=debug -wle '"xababz" =~ /a+b?(*COMMIT)\s*z/;'
> Compiling REx "a+b?(*COMMIT)\s*z"
> Final program:
>   1: PLUS (4)
>   2:   EXACT <a> (0)
>   4: CURLY {0,1} (8)
>   6:   EXACT <b> (0)
>   8: COMMIT (10)
>  10: STAR (12)
>  11:   SPACE (0)
>  12: EXACT <z> (14)
>  14: END (0)
> anchored "a" at 0 (checking anchored) plus minlen 2
> Guessing start of match in sv for REx "a+b?(*COMMIT)\s*z" against "xababz"
> Found anchored substr "a" at offset 1...
> Starting position does not contradict /^/m...
> Guessed: match at offset 1
> Matching REx "a+b?(*COMMIT)\s*z" against "ababz"
>   1 <x> <ababz>             |  1:PLUS(4)
>                                  EXACT <a> can match 1 times out of
> 2147483647...
>   2 <xa> <babz>             |  4:  CURLY {0,1}(8)
>                                    EXACT <b> can match 1 times out of 1...
>   3 <xab> <abz>             |  8:    COMMIT(10)
>   3 <xab> <abz>             | 10:      STAR(12)
>                                        SPACE can match 0 times out of
> 2147483647...
>                                        failed...
>                                      failed...
> Match failed
> Freeing REx: "a+b?(*COMMIT)\s*z"
>
>
> - it started matching at the beginning of the string;
> - it matched 'a'
> - it matched 'b'
> => (*COMMIT) is reached
>
>
>
> * Pattern: /a+(?:b|)(*COMMMIT)z/:
>
> $ perl5.11.0 -Mre=debug -wle '"xababz" =~ /a+(?:b|)(*COMMIT)z/;'
> Compiling REx "a+(?:b|)(*COMMIT)z"
> Final program:
>   1: PLUS (4)
>   2:   EXACT <a> (0)
>   4: TRIE-EXACT[b] (10)
>      <b>
>      <>
>  10: COMMIT (12)
>  12: EXACT <z> (14)
>  14: END (0)
> anchored "a" at 0 (checking anchored) plus minlen 2
> Guessing start of match in sv for REx "a+(?:b|)(*COMMIT)z" against "xababz"
> Found anchored substr "a" at offset 1...
> Starting position does not contradict /^/m...
> Guessed: match at offset 1
> Matching REx "a+(?:b|)(*COMMIT)z" against "ababz"
>   1 <x> <ababz>             |  1:PLUS(4)
>                                  EXACT <a> can match 1 times out of
> 2147483647...
>   2 <xa> <babz>             |  4:  TRIE-EXACT[b](10)
>   2 <xa> <babz>             |      State:    1 Accepted:    1 Charid:  1 CP:
>  62 After State:    2
>   3 <xab> <abz>             |      State:    2 Accepted:    2 Charid:  1 CP:
>   0 After State:    0
>                                    got 2 possible matches
>                                    got 0 (2) as best, looking at 1 (1)
>                                    trying alternation #1 <b> at node #9
>   3 <xab> <abz>             | 10:    COMMIT(12)
>   3 <xab> <abz>             | 12:      EXACT <z>(14)
>                                        failed...
>                                      failed...
>                                    only one match left: #2 <>
>   2 <xa> <babz>             | 10:  COMMIT(12)
>   2 <xa> <babz>             | 12:    EXACT <z>(14)
>                                      failed...
> Match failed
> Freeing REx: "a+(?:b|)(*COMMIT)z"
>
>
> * Pattern: /a+b{0,6}(*COMMMIT)z/:
>
> $ perl5.11.0 -Mre=debug -wle '"xababz" =~ /a+b{0,6}(*COMMIT)z/;'
> Compiling REx "a+b{0,6}(*COMMIT)z"
> Final program:
>   1: PLUS (4)
>   2:   EXACT <a> (0)
>   4: CURLY {0,6} (8)
>   6:   EXACT <b> (0)
>   8: COMMIT (10)
>  10: EXACT <z> (12)
>  12: END (0)
> anchored "a" at 0 (checking anchored) plus minlen 2
> Guessing start of match in sv for REx "a+b{0,6}(*COMMIT)z" against "xababz"
> Found anchored substr "a" at offset 1...
> Starting position does not contradict /^/m...
> Guessed: match at offset 1
> Matching REx "a+b{0,6}(*COMMIT)z" against "ababz"
>   1 <x> <ababz>             |  1:PLUS(4)
>                                  EXACT <a> can match 1 times out of
> 2147483647...
>   2 <xa> <babz>             |  4:  CURLY {0,6}(8)
>                                    EXACT <b> can match 1 times out of 6...
>                                    failed...
>                                  failed...
>   3 <xab> <abz>             |  1:PLUS(4)
>                                  EXACT <a> can match 1 times out of
> 2147483647...
>   4 <xaba> <bz>             |  4:  CURLY {0,6}(8)
>                                    EXACT <b> can match 1 times out of 6...
>   5 <xabab> <z>             |  8:    COMMIT(10)
>   5 <xabab> <z>             | 10:      EXACT <z>(12)
>   6 <xababz> <>             | 12:      END(0)
> Match successful!
> Freeing REx: "a+b{0,6}(*COMMIT)z"
>
>
> * Pattern: /a+(bc){0,6}(*COMMIT)z/
> $ perl5.11.0 -Mre=debug -wle '"xabcabcz" =~ /a+(bc){0,6}(*COMMIT)z/;'
> Compiling REx "a+(bc){0,6}(*COMMIT)z"
> Final program:
>   1: PLUS (4)
>   2:   EXACT <a> (0)
>   4: CURLYM[1] {0,6} (14)
>   8:   EXACT <bc> (12)
>  12:   SUCCEED (0)
>  13: NOTHING (14)
>  14: COMMIT (16)
>  16: EXACT <z> (18)
>  18: END (0)
> anchored "a" at 0 (checking anchored) plus minlen 2
> Guessing start of match in sv for REx "a+(bc){0,6}(*COMMIT)z" against
> "xabcabcz"
> Found anchored substr "a" at offset 1...
> Starting position does not contradict /^/m...
> Guessed: match at offset 1
> Matching REx "a+(bc){0,6}(*COMMIT)z" against "abcabcz"
>   1 <x> <abcabcz>           |  1:PLUS(4)
>                                  EXACT <a> can match 1 times out of
> 2147483647...
>   2 <xa> <bcabcz>           |  4:  CURLYM[1] {0,6}(14)
>   2 <xa> <bcabcz>           |  8:    EXACT <bc>(12)
>   4 <xabc> <abcz>           | 12:    SUCCEED(0)
>                                      subpattern success...
>                                    CURLYM now matched 1 times, len=2...
>   4 <xabc> <abcz>           |  8:    EXACT <bc>(12)
>                                      failed...
>                                    CURLYM trying tail with matches=1...
>                                    CURLYM trying tail with matches=0...
>                                    failed...
>                                  failed...
>   4 <xabc> <abcz>           |  1:PLUS(4)
>                                  EXACT <a> can match 1 times out of
> 2147483647...
>   5 <xabca> <bcz>           |  4:  CURLYM[1] {0,6}(14)
>   5 <xabca> <bcz>           |  8:    EXACT <bc>(12)
>   7 <xabcabc> <z>           | 12:    SUCCEED(0)
>                                      subpattern success...
>                                    CURLYM now matched 1 times, len=2...
>   7 <xabcabc> <z>           |  8:    EXACT <bc>(12)
>                                      failed...
>                                    CURLYM trying tail with matches=1...
>   7 <xabcabc> <z>           | 14:    COMMIT(16)
>   7 <xabcabc> <z>           | 16:      EXACT <z>(18)
>   8 <xabcabcz> <>           | 18:      END(0)
> Match successful!
> Freeing REx: "a+(bc){0,6}(*COMMIT)z"
>
>
> * Pattern: /a+(bc*){0,6}(*COMMIT)z/
>
> $ perl5.11.0 -Mre=debug -wle '"xabcabcz" =~ /a+(bc*){0,6}(*COMMIT)z/;'
> Compiling REx "a+(bc*){0,6}(*COMMIT)z"
> Final program:
>   1: PLUS (4)
>   2:   EXACT <a> (0)
>   4: CURLYX[0] {0,6} (16)
>   6:   OPEN1 (8)
>   8:     EXACT <b> (10)
>  10:     STAR (13)
>  11:       EXACT <c> (0)
>  13:   CLOSE1 (15)
>  15: WHILEM (0)
>  16: NOTHING (17)
>  17: COMMIT (19)
>  19: EXACT <z> (21)
>  21: END (0)
> anchored "a" at 0 (checking anchored) plus minlen 2
> Guessing start of match in sv for REx "a+(bc*){0,6}(*COMMIT)z" against
> "xabcabcz"
> Found anchored substr "a" at offset 1...
> Starting position does not contradict /^/m...
> Guessed: match at offset 1
> Matching REx "a+(bc*){0,6}(*COMMIT)z" against "abcabcz"
>   1 <x> <abcabcz>           |  1:PLUS(4)
>                                  EXACT <a> can match 1 times out of
> 2147483647...
>   2 <xa> <bcabcz>           |  4:  CURLYX[0] {0,6}(16)
>   2 <xa> <bcabcz>           | 15:    WHILEM(0)
>                                      whilem: matched 0 out of 0..6
>   2 <xa> <bcabcz>           |  6:      OPEN1(8)
>   2 <xa> <bcabcz>           |  8:      EXACT <b>(10)
>   3 <xab> <cabcz>           | 10:      STAR(13)
>                                        EXACT <c> can match 1 times out of
> 2147483647...
>   4 <xabc> <abcz>           | 13:        CLOSE1(15)
>   4 <xabc> <abcz>           | 15:        WHILEM(0)
>                                          whilem: matched 1 out of 0..6
>   4 <xabc> <abcz>           |  6:          OPEN1(8)
>   4 <xabc> <abcz>           |  8:          EXACT <b>(10)
>                                            failed...
>                                          whilem: failed, trying
> continuation...
>   4 <xabc> <abcz>           | 16:          NOTHING(17)
>   4 <xabc> <abcz>           | 17:          COMMIT(19)
>   4 <xabc> <abcz>           | 19:            EXACT <z>(21)
>                                              failed...
>                                            failed...
>                                          failed...
>                                    failed...
> Match failed
> Freeing REx: "a+(bc*){0,6}(*COMMIT)z"
>
>
>
>
>



-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About