develooper Front page | perl.perl5.porters | Postings from February 2013

NWCLARK TPF grant report #72

Thread Next
From:
Nicholas Clark
Date:
February 8, 2013 13:30
Subject:
NWCLARK TPF grant report #72
Message ID:
20130208133011.GH5656@plum.flirble.org
[Hours]		[Activity]
2013/01/14	Monday
 3.75		SVt_REGEXP and sv_dump
 0.25		process, scalability, mentoring
 2.00		reading/responding to list mail
 3.25		regcomp/setjmp
=====
 9.25

2013/01/15	Tuesday
 0.50		process, scalability, mentoring
 3.50		reading/responding to list mail
 2.50		regcomp/setjmp
 1.00		smoke-me/jrobinson/configure-for-cross
=====
 7.50

2013/01/16	Wednesday
 0.50		Rhapsody
 2.25		SVt_REGEXP and sv_dump
 0.50		reading/responding to list mail
 5.75		regcomp/setjmp
=====
 9.00

2013/01/17	Thursday
 0.25		MOP
 1.50		reading/responding to list mail
 4.00		regcomp/setjmp
=====
 5.75

2013/01/18	Friday
 7.75		regcomp/setjmp
=====
 7.75

2013/01/19	Saturday
 2.00		regcomp/setjmp
=====
 2.00

2013/01/20	Sunday
 1.50		regcomp/setjmp
=====
 1.50

Which I calculate is 42.75 hours


So, the problems this week started with, what's with the return value of
S_regbranch(), and is it safe to ignore? The code in question dates from
November 1997, and is part of Ilya's "Jumbo regexp patch" (commit
c277df42229d99fe). More confusingly, it added code with two calls from
S_reg() to S_regbranch(), one of which carefully checks the return value and
generates a LONGJMP node if it returns NULL, the other of which is called in
void context, and so both ignores any return value, or the possibility that
it is NULL.

So, to try to figure this out...

As documented in pod/perlreguts.pod, the call graph for regex parsing
involves several levels of functions in regcomp.c, sometimes recursing more
than once.

The top level compiling function, S_reg(), calls S_regbranch() to parse each
single branch of an alternation. In turn, that calls S_regpiece() to parse a
simple pattern followed by quantifier, which calls S_regatom() to parse that
simple pattern. S_regatom() can call S_regclass() to handle classes, but can
also recurse into S_reg() to handle subpatterns and some other
constructions. Some other routines call call S_reg(), sometimes using an
alternative pattern that they generate dynamically to represent their input.

These routines all return a pointer to a regnode structure, and take a
pointer to an integer that holds flags, which is also used to return
information. After quite a bit of head scratching and figuring things out
some analysis I have untangled the possible return values from these
5 functions (and related functions which call S_reg()). The return value is
either a pointer to the last node added to the program, or NULL. The pointer
is to something already accessible elsewhere, so it's just a convenience
return. Nothing leaks by ignoring it. But what about those NULL returns?

The obvious cause of NULL returns is when S_reg() encounters a pragma-like
construction - an embedded pattern match operator, such as (?i). In this
case, it consumes it, acts on it, and sets the flags to TRYAGAIN and
returns NULL. But the code seemed to be prepared to cope with other NULL
returns, without the TRYAGAIN flag. So, can those happen?

Starting from the top:
S_reg() will return NULL and set the flags to TRYAGAIN at the end of pragma-
like constructions that it handles. Otherwise, historically it would return
NULL if S_regbranch() returned NULL. In turn, S_regbranch() would return
NULL if S_regpiece() returned NULL without setting TRYAGAIN. If S_regpiece()
returns TRYAGAIN, S_regbranch() loops, and ultimately will not return NULL.

S_regpiece() returns NULL with TRYAGAIN if S_regatom() returns NULL with
TRYAGAIN, but (historically) if S_regatom() returns NULL without setting
the flags to TRYAGAIN, S_regpiece() would to. Where S_regatom() calls
S_reg() it has similar behaviour when passing back return values, although
often it is able to loop instead on getting a TRYAGAIN.

Which gets us back to S_reg(), which can only *generate* NULL in conjunction
with TRYAGAIN. NULL without TRYAGAIN could only be returned if a routine it
called generated it. All other functions that these call that return regnode
structures cannot return NULL. Hence

1) in the loop of functions called, there is no source for a return value of
   NULL without the TRYAGAIN flag being set
2) a return value of NULL with TRYAGAIN set from an inner function does not
   propagate out past S_regbranch()

Hence the only return values that most functions can generate are non-NULL,
or NULL with TRYAGAIN set, and as S_regbranch() catches these, it cannot
return NULL. The longest sequence of functions that can return NULL (with
TRYAGAIN set) is S_reg() -> S_regatom() -> S_regpiece() -> S_regbranch().
Rapidly returning right round the loop back to S_reg() is not possible.

Hence code added by commit c277df42229d99fe to handle a NULL return from
S_regbranch(), along with some other code is dead.

More usefully, this means that all the twisted maze of code makes some sense
- the only "can happen" NULL return value is with the flags TRYAGAIN. Which
means that it is going to be possible to use a NULL return value with a
different flag to signal "restart with Unicode", and eliminate the
longjmp().

However, a couple of things still stood in the way. One was working out that
all other possible return paths from calls to S_reg() were covered. Various
routines in the parser for tricky constructions (eg case insensitive Unicode
folds) are implemented by building up a pattern that represents the
construction in question, and then making a call to S_reg() to parse that as
if it were written in place of the original. So I needed to check each code
path that could get to one of these, to work out whether it could trigger
the restart. Most couldn't, but three or four could, and two of them didn't
have tests. They do now.


As part of this, I discovered a recent regression in the SV dumping code.
Father Chrysostomos fixed problems with the interaction between regex
objects and LVALUE magic by changing the location in the the SV wrapper of
the pointer to the pattern. However, no-one noticed that the SV dumping code
wasn't aware of this change, hence dumping regexs no longer worked. Dumping
regexs is rather useful when you're trying to figure out what the regex
parser is doing :-(. So I added tests for the dumper, and fixed the bug.


Finally, of note this week, I killed the Rhapsody port. I've been forgetting
to note that I'd been removing code related to one dead OS each month. This
month was Rhapsody, an Apple OS that later evolved into Darwin and Mac OS
X. It was initially only released to developers, but later became Mac OS X
Server, wit releases in 1999 and 2000. It was obsoleted by Mac OS X 10.0,
released in March 2001. That's 142 lines gone, no longer getting in the way:

    Configure               |   2 +-
    Cross/Makefile-cross-SH |   2 +-
    MANIFEST                |   1 -
    Makefile.SH             |   2 +-
    hints/rhapsody.sh       | 138 ---------------------------------------------
    installperl             |   4 +-
    pod/perldelta.pod       |   6 +--
    t/op/stat.t             |   3 +-
    8 files changed, 8 insertions(+), 150 deletions(-)

Nicholas Clark

Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About