develooper Front page | perl.perl5.porters | Postings from February 2021

Re: [Perl/perl5] pp_split: \s+ pattern does not hit RXf_WHITE branch(#18515)

February 2, 2021 21:49
Regarding the difference between regnext() and NEXTOPER().

In general, regexp operators are like sections of a railroad track.
Each node contains the information needed to find its successor. This is
either implicit, because the node is a specific size and its successor
sits that far after the current node, or explicit, via some kind of
offset in the regnode (which might be a different size depending on the
node type). regnext() is the function that returns the next node in the
"main line" of the pattern.

However, alternations are different. A BRANCH node in an alternation
is like a Y junction. The BRANCH holds the offset to the next
BRANCH node, or to the TAIL of the alternation sequence. To find the
first operator of the branch's own content (the node placed directly
*after* the BRANCH node) one uses NEXTOPER(), which skips to the next
operator on the assumption that the argument node is the size of a
BRANCH regnode.

Consider the following pattern. For a given branch, the content of
that branch is found at the node NEXTOPER(branch). regnext(branch)
would return the node at the offset shown in parentheses.

$ perl -Mre=debug -e'/x([ax]|[bx]|[cx])y/'
Compiling REx "x([ax]|[bx]|[cx])y"
Final program:
   1: EXACT <x> (3)
   3: OPEN1 (5)
   5:   BRANCH (17)
   6:     ANYOF[ax] (41)
  17:   BRANCH (29)
  18:     ANYOF[bx] (41)
  29:   BRANCH (FAIL)
  30:     ANYOF[cx] (41)
  41: CLOSE1 (43)
  43: EXACT <y> (45)
  45: END (0)

So regnext() is the "next node in the main line", and NEXTOPER() is
the first node inside the branch, and is only really valid on a
BRANCH-like node.
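
To make the distinction concrete, here is a minimal toy model in C of
the program dumped above. It is only a sketch: the toy_* names, the
struct layout and the one-slot BRANCH size are all hypothetical and are
not perl's real regnode definitions. Each slot stores the "main line"
successor shown in parentheses in the dump; toy_regnext() follows that
stored value, while toy_nextoper() just steps over a node assumed to be
BRANCH-sized.

```c
#include <assert.h>

/* Toy model only: hypothetical names, not perl's regcomp.h. */
enum toy_op { TOY_END, TOY_EXACT, TOY_OPEN, TOY_CLOSE, TOY_BRANCH, TOY_ANYOF };

struct toy_node {
    enum toy_op op;
    int next;               /* main-line successor index; 0 means none */
};

#define TOY_BRANCH_SIZE 1   /* assumption: a BRANCH fills one slot here */

/* Indices mirror the node numbers in the -Mre=debug dump. */
static const struct toy_node toy_prog[46] = {
    [1]  = { TOY_EXACT,  3 },
    [3]  = { TOY_OPEN,   5 },
    [5]  = { TOY_BRANCH, 17 },
    [6]  = { TOY_ANYOF,  41 },
    [17] = { TOY_BRANCH, 29 },
    [18] = { TOY_ANYOF,  41 },
    [29] = { TOY_BRANCH, 0 },   /* printed as (FAIL) in the dump */
    [30] = { TOY_ANYOF,  41 },
    [41] = { TOY_CLOSE,  43 },
    [43] = { TOY_EXACT,  45 },
    [45] = { TOY_END,    0 },
};

/* Analogue of regnext(): follow the successor stored in the node. */
static int toy_regnext(const struct toy_node *prog, int i)
{
    return prog[i].next;
}

/* Analogue of NEXTOPER(): step past a node assumed to be BRANCH-sized. */
static int toy_nextoper(int i)
{
    return i + TOY_BRANCH_SIZE;
}

/* Return the index of the first node inside the n-th branch (0-based)
 * of the alternation starting at `branch`, or -1 if there is none. */
static int toy_nth_body(const struct toy_node *prog, int branch, int n)
{
    assert(n >= 0);
    while (branch != 0 && prog[branch].op == TOY_BRANCH) {
        if (n-- == 0)
            return toy_nextoper(branch);     /* content of this branch */
        branch = toy_regnext(prog, branch);  /* hop to the next BRANCH */
    }
    return -1;
}
```

In this model toy_regnext(toy_prog, 5) gives 17, the second BRANCH,
whereas toy_nextoper(5) gives 6, the ANYOF[ax] inside the first branch.
Calling toy_nextoper() on a slot that is not BRANCH-sized would land on
an arbitrary index, which is the toy equivalent of why NEXTOPER() is
only valid on a BRANCH-like node.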


On Sun, 31 Jan 2021 at 15:02, Hugo van der Sanden wrote:
> It chooses the branch based on flags set at compile-time, so I looked at where RXf_WHITE is set: a shortish block in regcomp.c dedicated to spotting specific patterns. Stepping through that block in gdb showed that it wasn't being set because next and therefore nop were pointing to the wrong node; examining the blame history fairly quickly points out 122af31 as the cause.
> I've always been quite vague about precisely what regnext() is supposed to be able to do, so it isn't immediately clear to me whether it is a bug in regnext() that this doesn't work; in the rest of the code, though, we generally rely on knowing the size of the regop we're looking at, so fixing this might be better done by removing the next and nop variables - ie rolling back both 122af31 and half of c9d98c4, and maybe replacing them with a warning that OP(NEXTOPER(first)) will cause reading of uninitialized memory (irking valgrind) if used on an inappropriate regop.

perl -Mre=debug -e "/just|another|perl|hacker/"
