develooper Front page | perl.perl5.porters | Postings from February 2013

NWCLARK TPF grant January report

Thread Next
Nicholas Clark
February 8, 2013 14:02
NWCLARK TPF grant January report
Message ID:
As per my grant conditions, here is a report for the January period.

After catching up with some of the e-mail backlog, I tried having another
prod at the awkward clang/ASAN problem described last month.

It turned out that I missed something. Key to replicating the problem was that
one must turn on clang's optimiser. Of course, trying to debug a problem in the
C code, I'd gone off down the path of trying debugging builds, ie -g, which
by default don't enable the optimiser. (Debugging code with the optimiser
enabled tends to screw with your head, because the debugger reports execution
jumping backwards and forwards.) I hadn't tried compiling with -O on a local
machine, and that *does* replicate the problem. Better still, on that machine,
-O -g together still replicates the problem. So, now I can stick gdb on it:

    ok 1 - $^H{foo} doesn't exist initially
    ok 2 - $^H doesn't contain HINT_LOCALIZE_HH initially
    ok 3 - $^H{foo} is now 'a'
    ok 4 - $^H contains HINT_LOCALIZE_HH while compiling
    ok 5 - $^H{foo} is now 'b'
    ok 6 - $^H{foo} restored to 'a'
    ok 7 - $^H{foo} doesn't exist while finishing compilation
    ok 8 - $^H doesn't contain HINT_LOCALIZE_HH while finishing compilation
    Breakpoint 1, 0x000000000042e440 in __asan_report_error ()
    (gdb) up
    #1  0x000000000042f7a7 in __asan_report_load8 ()
    #2  0x00000000004fa030 in Perl_re_op_compile (patternp=<optimized out>,
        pat_count=-1292, expr=0xffffffffb6e, eng=<optimized out>, old_re=0x0,
        is_bare_re=0x7fffffffdba4, pm_flags=<optimized out>,
        orig_rx_flags=<optimized out>) at regcomp.c:5634
    5634        if (   old_re
    (gdb) p old_re
    $1 = (REGEXP * volatile) 0x0
    (gdb) p &old_re
    Address requested for identifier "old_re" which is in register $r8

Odd. How come a volatile variable thinks that it's in a register?
And it's an LVALUE, so why can't I take its address? What does the compiler
think - let's try some "printf" debugging:

    diff --git a/regcomp.c b/regcomp.c
    index a8b27dc..342e772 100644
    --- a/regcomp.c
    +++ b/regcomp.c
    @@ -5631,6 +5631,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_cou
         /* return old regex if pattern hasn't changed */
    +    PerlIO_printf(PerlIO_stderr(), "%p %p\n", old_re, &old_re);
         if (   old_re
             && !recompile
            && !!RX_UTF8(old_re) == !!RExC_utf8

and we stop at the same place:

    ok 1 - $^H{foo} doesn't exist initially
    ok 2 - $^H doesn't contain HINT_LOCALIZE_HH initially
    ok 3 - $^H{foo} is now 'a'
    ok 4 - $^H contains HINT_LOCALIZE_HH while compiling
    ok 5 - $^H{foo} is now 'b'
    ok 6 - $^H{foo} restored to 'a'
    ok 7 - $^H{foo} doesn't exist while finishing compilation
    ok 8 - $^H doesn't contain HINT_LOCALIZE_HH while finishing compilation
    0 7fffffffd760
    0 7fffffffd760
    Breakpoint 1, 0x000000000042e440 in __asan_report_error ()
    (gdb) up
    #1  0x000000000042f7a7 in __asan_report_load8 ()
    (gdb) up
    #2  0x00000000004fa060 in Perl_re_op_compile (patternp=<optimized out>, 
        pat_count=-1292, expr=0xffffffffb6e, eng=<optimized out>, 
        old_re=<optimized out>, is_bare_re=0x7fffffffdba4, 
        pm_flags=<optimized out>, orig_rx_flags=<optimized out>) at regcomp.c:5634
    5634        PerlIO_printf(PerlIO_stderr(), "%p %p\n", old_re, &old_re);
    (gdb) p old_re
    $2 = <optimized out>
    (gdb) p &old_re
    Can't take address of "old_re" which isn't an lvalue.

Can't take the address of it? That's despite the printf clearly showing that
its address is 0x7fffffffd760.

So something is clearly going badly wrong with the code generation here. It's
probably a bug in clang. But (a) it's not clear how to reduce it to a terser
test case (b) that's not actually helpful, as I really need to make this code
work on the clang we have here and now, as without this, we can't usefully
use ASAN any more to find bugs.

But some things started to come together. Whilst this specific failure is
probably a bug in clang, the function isn't exactly innocent. A couple of
people had already reported that they could get everything to pass if they
marked more variables as volatile. But playing whack-a-mole with volatile
is papering over the symptoms, not finding the cause. However, digging
further into the blame log for the function revealed a couple of previous
commits in the past few years adding volatile qualifiers, to fix compiler
warnings due to the addition of a all to setjmp(). So what's that all about?

The setjmp() relates to an optimisation added by commit bbd61b5ffb7621c2 in
September 2010. The backstory is that the compiled format for regular
expressions differs if it needs to store Unicode. (The documentation refers
to the human-readable form as the *pattern*, and the compiled form as the
*program*.) The Unicode representation for the program is larger, and cannot
be matched as efficiently, so it's preferable to use the tighter byte-based
format where possible. Unfortunately, it's not always possible to know for
sure which is the right decision until midway through parsing. If the
pattern contains literal Unicode, it's obvious that the program needs to
store Unicode. Otherwise, the parser optimistically assumes that the more
efficient representation can be used, and starts sizing on this
basis. However, if it then encounters something in the pattern which must be
stored as Unicode, such as an \x{...} escape sequence representing a
character literal, then this means that all previously calculated sizes need
to be redone, using values appropriate for the Unicode representation.

The problem is that the point where the parser discovers that it needs to
redo everything from the start is at least 4 functions down in the best
case, and possibly a lot further if the "problem" construction is within
parentheses groups. (The parser calls back to the top level to parse
parentheses.) Before the commit mentioned above, the recalculation was
implemented by setting a flag of "really, this needs Unicode", and simply
carrying on. The top level call spotted the flag, and restarted the parse.
Obviously, this is rather wasteful, particularly if the problem is detected
very early in the pattern.

So that commit refactored things so that the top level caller used setjmp()
to store a checkpoint, and upon detecting the need for a restart use
longjmp() to warp straight back to it. This skips doing needless work, which
is good.  The problem is the *warp* bit - it's far more "warp" than "jump",
as the C compiler is not expecting the perfectly innocent looking setjmp()
function to return twice. C doesn't have continuations, so this doesn't
happen. But of course, it does, and it is breaking the rules, introducing
control flow that the compiler has no idea about, so you have to start
(effectively) lying to the compiler left right and centre, by declaring
variables volatile, forbidding it to do any (sane, reasonable, normal)
optimisation, so that the non-local control flow doesn't come unstuck. And
it's easy to miss one that matters, as the compiler doesn't warn.

So the question then became, can I implement the early return in some way
other than longjmp()? It sort of looked plausible from an initial look at
the functions - they only return NULL for special cases, and everywhere
checks the return value and propagates NULL onward. So hook into that.

All, that is, except this call to S_regbranch():

        if (is_define) 
            vFAIL("(?(DEFINE)....) does not allow branches");
        lastbr = reganode(pRExC_state, IFTHEN, 0); /* Fake one for optimizer. */
        regbranch(pRExC_state, &flags, 1,depth+1);

So is that a bug I've just found? Or something more subtle that doesn't
matter. What's with the return value of S_regbranch(), and is it safe to
ignore? The code in question dates from November 1997, and is part of Ilya's
"Jumbo regexp patch" (commit c277df42229d99fe). More confusingly, it added
code with two calls from S_reg() to S_regbranch(), one of which carefully
checks the return value and generates a LONGJMP node if it returns NULL, the
other of which is called in void context, and so both ignores any return
value, or the possibility that it is NULL.

So, to try to figure this out...

As documented in pod/perlreguts.pod, the call graph for regex parsing
involves several levels of functions in regcomp.c, sometimes recursing more
than once.

The top level compiling function, S_reg(), calls S_regbranch() to parse each
single branch of an alternation. In turn, that calls S_regpiece() to parse a
simple pattern followed by quantifier, which calls S_regatom() to parse that
simple pattern. S_regatom() can call S_regclass() to handle classes, but can
also recurse into S_reg() to handle subpatterns and some other
constructions. Some other routines call call S_reg(), sometimes using an
alternative pattern that they generate dynamically to represent their input.

These routines all return a pointer to a regnode structure, and take a
pointer to an integer that holds flags, which is also used to return
information. After quite a bit of head scratching and figuring things out I
have untangled the possible return values from these 5 functions (and
related functions which call S_reg()). The return value is either a pointer
to the last node added to the program, or NULL. The pointer is to something
already accessible elsewhere, so it's just a convenience return. Nothing
leaks by ignoring it. But what about those NULL returns?

The obvious cause of NULL returns is when S_reg() encounters a pragma-like
construction - an embedded pattern match operator, such as (?i). In this
case, it consumes it, acts on it, and sets the flags to TRYAGAIN and
returns NULL. But the code seemed to be prepared to cope with other NULL
returns, without the TRYAGAIN flag. So, can those happen?

Starting from the top:
S_reg() will return NULL and set the flags to TRYAGAIN at the end of pragma-
like constructions that it handles. Otherwise, historically it would return
NULL if S_regbranch() returned NULL. In turn, S_regbranch() would return
NULL if S_regpiece() returned NULL without setting TRYAGAIN. If S_regpiece()
returns TRYAGAIN, S_regbranch() loops, and ultimately will not return NULL.

S_regpiece() returns NULL with TRYAGAIN if S_regatom() returns NULL with
TRYAGAIN, but (historically) if S_regatom() returns NULL without setting the
flags to TRYAGAIN, S_regpiece() would to. Where S_regatom() calls S_reg() it
has similar behaviour when passing back return values, although often it is
able to loop instead on getting a TRYAGAIN.

Which gets us back to S_reg(), which can only *generate* NULL in conjunction
with TRYAGAIN. NULL without TRYAGAIN could only be returned if a routine it
called generated it. All other functions that these call that return regnode
structures cannot return NULL. Hence

1) in the loop of functions called, there is no source for a return value of
   NULL without the TRYAGAIN flag being set
2) a return value of NULL with TRYAGAIN set from an inner function does not
   propagate out past S_regbranch()

Hence the only return values that most functions can generate are non-NULL,
or NULL with TRYAGAIN set, and as S_regbranch() catches these, it cannot
return NULL. The longest sequence of functions that can return NULL (with
TRYAGAIN set) is S_reg() -> S_regatom() -> S_regpiece() -> S_regbranch().
Rapidly returning right round the loop back to S_reg() is not possible.

Hence code added by commit c277df42229d99fe to handle a NULL return from
S_regbranch(), along with some other code is dead.

More usefully, this means that all the twisted maze of code makes some sense
- the only "can happen" NULL return value is with the flags TRYAGAIN. Which
means that it is going to be possible to use a NULL return value with a
different flag to signal "restart with Unicode", and eliminate the

However, a couple of things still stood in the way. One was working out that
all other possible return paths from calls to S_reg() were covered. Various
routines in the parser for tricky constructions (eg case insensitive Unicode
folds) are implemented by building up a pattern that represents the
construction in question, and then making a call to S_reg() to parse that as
if it were written in place of the original. So I needed to check each code
path that could get to one of these, to work out whether it could trigger
the restart. Most couldn't, but three or four could, and two of them didn't
have tests. They do now.

Finally I killed the longjmp(). Firstly by moving it up into the same
function as the setjmp() and then by replacing it with a goto. Yes,
unfortunately still evil. But definitely the lesser of two evils.
Particularly as clang was now happy, and address sanitiser reports no
errors. I pushed it to a smoke-me branch to let it stew.

As part of this, I discovered a recent regression in the SV dumping code.
Father Chrysostomos fixed problems with the interaction between regex
objects and LVALUE magic by changing the location in the the SV wrapper of
the pointer to the pattern. However, no-one noticed that the SV dumping code
wasn't aware of this change, hence dumping regexs no longer worked. Dumping
regexs is rather useful when you're trying to figure out what the regex
parser is doing :-(. So I added tests for the dumper, and fixed the bug.

During all this I'd had a couple of further insights into hashing, so spent
some more time investigating them. I sent an updated summary to the security
list. There's nothing I want to report publicly, other than the fact that
I'm still comfortable with the current state of all stable releases and

Finally, of note this month, I killed the Rhapsody port. I've been
forgetting to note that I'd been removing code related to one dead OS each
month. This month was Rhapsody, an Apple OS that later evolved into Darwin
and Mac OS X. It was initially only released to developers, but later became
Mac OS X Server, wit releases in 1999 and 2000. It was obsoleted by Mac OS X
10.0, released in March 2001. That's 142 lines gone, no longer getting in
the way:

    Configure               |   2 +-
    Cross/Makefile-cross-SH |   2 +-
    MANIFEST                |   1 -
    Makefile.SH             |   2 +-
    hints/       | 138 ---------------------------------------------
    installperl             |   4 +-
    pod/perldelta.pod       |   6 +--
    t/op/stat.t             |   3 +-
    8 files changed, 8 insertions(+), 150 deletions(-)

A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in
RT #... is a bug in
CPAN #... is a bug in
BBC is "bleadperl breaks CPAN" - Andreas K├Ânig's test reports for CPAN modules

[Hours]		[Activity]
  0.25		MOP
 36.50		PERL_HASH
  1.00		POSIX::strptime
  0.25		RT #24689
  0.75		Rhapsody
  6.00		SVt_REGEXP and sv_dump
  7.50		clang
  0.50		ext/B/t/optree_misc.t
  1.00		hashing
  1.00		investigating security tickets
  0.50		module evictions
  6.25		process, scalability, mentoring
 26.00		reading/responding to list mail
 42.25		regcomp/setjmp
		regcomp/setjmp (killed longjmp)
  1.00		smoke-me/jrobinson/configure-for-cross
  0.50		smoke-me/threads-shared-stress
  0.25		timely destruction
132.25 hours total

Nicholas Clark

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About