develooper Front page | perl.perl5.porters | Postings from June 2013

Re: [perl #92898] (*THEN) broken inside condition subpattern

Thread Previous | Thread Next
From:
demerphq
Date:
June 22, 2013 16:44
Subject:
Re: [perl #92898] (*THEN) broken inside condition subpattern
Message ID:
CANgJU+UDV+BQjubuG3c35dp=hiKcnBi7aLCcffCZTSSDvjv=tg@mail.gmail.com
Sorry about the incredibly laggy reply. :-(

FC recently brought to my attention these threads, which I had managed
to overlook or forget or whatever.

I am doing my best to work my way through them and will reply as I go.

On 20 September 2011 18:32, Philip Hazel <ph10@hermes.cam.ac.uk> wrote:
> On Tue, 20 Sep 2011, Nicholas Clark wrote:
> I've been thinking about this some more. My naive understanding of *THEN
> is basically this: it is effectively just another way of doing what
> (?>...) does, but with possibly simpler syntax and the added feature of
> *THEN:NAME. The emphasis on alternation is really a red herring.
> Thus, if you have
>
>    A (*THEN) B
>
> (where A and B are complex patterns) the matching engine, having passed
> (*THEN) and subsequently failed in B, no longer backtracks into A. This
> would be the same:
>
>    (?>A)B
>
> If you go along with this, it follows that, if (*THEN) is within a
> group, for example,
>
>    C (A (*THEN) B)
>
> then a failure in B must backtrack into C, just as would happen with
>
>    C ((?>A) B)
>
> In other words, the effect of (*THEN) within a group does not propagate
> back beyond the start of the group. IMHO this should also apply to
> conditional groups which, after all, behave sort of like a group with
> only one branch (just that there's a choice of which one each time).
>
> Now, it seems that Perl thinks differently to me. There seems to be the
> concept of "group with no alternation" and "group with alternation" as
> two different things that are handled differently. (And a conditional
> group is of the former type.)

I think you are right that perl thinks differently to you. Probably
because of implementation details.

Perl's regex engine doesn't have a concept of a "group". It has a
concept of an "alternation" (BRANCH), and the special behavior you
associate with "group" parens perl actually associates with the BRANCH
operator "|".

There is no group parens in the pattern /foo|bar|baz/ however there is
an alternation.

The only part the group parens play is to tell the parser where the
alternation starts and ends.

You could say that (?: | ) are some kind of really complicated ternary
operator. IE, not different constructs, but optional part of a single
operator.

Ditto for the association of groups to quantifiers. In reality the
group parens are part of the quantifier and have no behavior of their
own.

So, when we have this:

(?:foo|bar)*

the (?: ) are part of the | operator in that they denote the beginning
and end of the alternation, and they are part of the * quantifier,
because they tell it what pattern that quantifier applies to.

Another aspect of this is that to perl /a(?:foo)b/ is the same as
/afoob/. Since there is neither a quantifier on the (?:) nor a
modifier in the ?: nor an alternation inside, it is a complete no-op.
In other words it is not the case that (?:foo) is a group with only
one alternation. It is simply "foo".

So thinking of a (?: ... ) as a "thing" is wrong from the Perl
perspective. It only becomes a thing when it has a quantifier
attached, or an alternation inside. Even when there is a modifier in
the ?: it still isnt a thing, it modifies the rules for interpreting
the ....

IOW, at an implementation level it is not that case that writing
/foo|bar|baz/ is a short hand way of writing /(?:foo|bar|baz)/.

Consider this output:

$ perl -Mre=debug -e'/[ac]|[bc]/'
Compiling REx "[ac]|[bc]"
Final program:
   1: BRANCH (13)
   2:   ANYOF[ac][] (25)
  13: BRANCH (FAIL)
  14:   ANYOF[bc][] (25)
  25: END (0)
minlen 1
Freeing REx: "[ac]|[bc]"

$ perl -Mre=debug -e'/(?:[ac]|[bc])/'
Compiling REx "(?:[ac]|[bc])"
Final program:
   1: BRANCH (13)
   2:   ANYOF[ac][] (26)
  13: BRANCH (FAIL)
  14:   ANYOF[bc][] (26)
  25: TAIL (26)
  26: END (0)
minlen 1
Freeing REx: "(?:[ac]|[bc])"

The program generated is the same. There is no operator for GROUP,
instead there is a chain of BRANCH operators ending in a TAIL.

(*THEN) affects how we transition from BRANCH to BRANCH, and if we
arent in a BRANCH acts like a (*PRUNE).

This gets hard to debug in Perl if you dont use charclasses with at
least two chars in them, as Perl will optimize to a TRIE.

Anyway, I can see how it would be reasonable to consider (?:foo) to be
an alternation with only one branch, but to Perl it isnt, and that is
how you should interpret the verb operators.

The intent of the (*THEN) operator is for something like this:

/ 1111 ( (?:101)+ (*THEN) 000 | (?: 110 | 011 )+ (*THEN) 000 ) 0101 /x

If we match say 1000 '101's, and we cannot match '000' no amount of
backtracking will allow the '000' to match. Perl is not good at
detecting this, and therefore we allow the user to use the (*THEN)
verb to tell Perl that the thing on the left side of the (*THEN) cant
match this branch through backtracking, and we should go to the next
BRANCH to see what it would do. If there is no branch then it is the
same thing as saying (*PRUNE), in that it effectively says that there
is no match at this starting position.

Here is an example of how this can help. wc -l is useful but crude
count of how much work the regex engine does for a given match. As you
can see, using the hints makes for a faster match.

$ for i in 1 2 4 8 16 32 64 128 256; do perl -Mre=Debug,EXECUTE
-le'("xx".("abcd" x $ARGV[0]). "gyy")=~/^x+ (?: [abcd]+ [pq] |
(?:ab|cd)+ [xy] | (?:abcd)+ [gh] ) y+ /x;' $i 2>&1 | wc -l; done;
61
97
169
313
601
1177
2329
4633
9241
$ for i in 1 2 4 8 16 32 64 128 256; do perl -Mre=Debug,EXECUTE
-le'("xx".("abcd" x $ARGV[0]). "gyy")=~/^x+ (?: [abcd]+ (*THEN) [pq] |
(?:ab|cd)+ (*THEN) [xy] | (?:abcd)+ (*THEN) [gh] ) y+ /x;' $i 2>&1 |
wc -l; done;
55
77
121
209
385
737
1441
2849
5665

The behavior of /A+? (*THEN) BC / appeared to be broken in some cases.
For instance for the literals A B and C, we /should/ match "ABC",
however we trigger the PREGf_SKIP optimization, which basically turns
/A+ B/ into /A+ (*SKIP) B/, which is not correct for /A+? (*THEN) BC/.
You can see this by adding (?{}) at the end of the pattern:

$ ./perl -le'"aaabc"=~/a+?(*THEN)bc/ and print "match: $&"'
$ ./perl -le'"aaabc"=~/a+?(*THEN)bc(?{})/ and print "match: $&"'
match: abc

The (?{}) disables this optimization, and make it do the expected thing.

Your original bug report was about this:

./perl -Ilib -Mre=Debug,All,FLAGS -le'"ba"=~/^.*?(?(?=a)a|b(*THEN)c)/
and print "match: $&"'

which does not match, probably because of an optimisation in .*? handling.

An unanchored match /P/ should match the same thing as  /^.*?P/, and
we see that Perl does match for that form:

./perl -le'"ba"=~/((?(?=a)a|b(*THEN)c))/ and print "match: $&\ncond: $1"'
match: a
cond: a

And if we removed the non-greedy quantifier modifier:

./perl -le'"ba"=~/^.*((?(?=a)a|b(*THEN)c))/ and print "match: $&\ncond: $1"'
match: ba
cond: a

Both of which match as expected, however this still doesnt match:

./perl -le'"ba"=~/^.*?((?(?=a)a|b(*THEN)c))/ and print "match: $&\ncond: $1"'

which I continue to investigate.

I have pushed b8f6efdd9cee0ae5cd4adf5a15aebb2984f57fda to fix the case
of /a+?(*THEN)b/ but I have more to do.

I also pushed 337ff3078c4082e843af19536e11f70d3d14bfe9 which shows the intflags.

You need to use -Mre=Debug,All,FLAGS to see the flags. I might change
that a bit later.

Yves

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About