develooper Front page | perl.perl5.porters | Postings from October 2013

NWCLARK TPF grant report #100

Nicholas Clark
October 3, 2013 13:40
NWCLARK TPF grant report #100
Message ID:
[Hours]		[Activity]
2013/07/29	Monday
 2.00		HP-UX
 0.75		ID 20000804.001 (RT #3634)
 0.75		RT #119069/RT #119071/RT #119073/RT #119075
 0.25		RT #38307
 1.25		VMS lib/warnings.t
 0.50		postfix dereference
 2.00		reading/responding to list mail
 1.00		runperl
 0.50		smoke-me/no-LDLIBPTH-for-CC

2013/07/30	Tuesday
 4.00		HP-UX t/re/pat_thr.t
 2.50		Storable blessed big endian failure
 2.50		smoke-me/no-LDLIBPTH-for-CC

2013/07/31	Wednesday
 6.75		RT #119089
 4.50		Storable blessed big endian failure

2013/08/01	Thursday
 7.50		RT #119089

2013/08/02	Friday
 4.75		RT #119089

2013/08/04	Sunday
 0.50		reading/responding to list mail

Which I calculate is 42.00 hours

More by coincidence than design, the top two time sinks this week were

The first was a bug for threads::shared (RT #119089), where the test case
demonstrated that shared references returned from subroutines could
disappear on return. The submitter's concise test case enabled Dave Mitchell
to quickly determine that the reported problem had been fixed by a commit in
the core C code back in January 2012, and Dave also suggested a work around
to avoid the bug. Thus it seemed like case closed.

However, I had my suspicions about that change (6f48390ab209d16e), and
whether it was just fixing the symptoms rather than the cause:

    [perl #95548] Returned magical temps are not copied
    return and leavesub, for speed, were not copying temp variables with a
    refcount of 1, which is fine as long as the fact that it was not cop-
    ied is not observable.
    With magical variables, that *can* be observed, so we have to forego
    the optimisation and copy the variable if it's magical.
    This obviously applies only to rvalue subs.

What raised my concerns was that if assertions are enabled then the test
case triggers an assertion failure on v5.14.0. If assertions are not
enabled, it crashes. So I went looking back in history to find what
introduced the assertion failure, it turns out that there's a heck of a lot
more to this than first meets the eye.

So, what caused the assertions to start failing between v5.12 and v5.14? A
simple bisect suggests that it's a commit to use cBOOL for some casts, which
has nothing directly to do with threads shared, or (not?) copying values on
subroutine return. It turns out after a bit of digging that it's this
section of that commit matters:

diff --git a/mg.c b/mg.c
index 3fb8ec4..4a8d767 100644
--- a/mg.c
+++ b/mg.c
@@ -193,7 +193,7 @@ Perl_mg_get(pTHX_ SV *sv)
     const I32 mgs_ix = SSNEW(sizeof(MGS));
-    const bool was_temp = (bool)SvTEMP(sv);
+    const bool was_temp = cBOOL(SvTEMP(sv));
     bool have_new = 0;
     MAGIC *newmg, *head, *cur, *mg;
     /* guard against sv having being freed midway by holding a private

deep in the routine that deals with get magic.

It's not obvious from the code above, but SvTEMP(sv) returns either 0 or
0x00080000, and in 2010, C<bool> was typedef'd as C<char>, so the above code
always set was_temp to 0, due to truncation. The cast was added to avoid a
warning about the truncation. The (buggy) truncation itself was added when
the type of was_temp was changed from a suitably large integer to a bool as
part of this commit whose only description was "Adding const qualifiers" (ie
the commit message doesn't even note that some variables' types were

The variable in question is used to enable Perl_mg_get() to retain the
SVs_TEMP bit set on values it is processing. This is needed because
Perl_sv_2mortal() conflates "temporary" with "mortal" and automatically sets
the SVs_TEMP flag bit on everything. Hence this code seeks to remove
SVs_TEMP if Perl_sv_2mortal() set it. The upshot of the bug was that
SVs_TEMP was always being removed, so code in subroutine entry that does
*not* copy TEMPs is skipped because the value is becoming unmarked. Hence
the values are always copied.

So, all that conceals what actually going on. When that truncation bug was
first introduced, it didn't matter that TEMPs weren't copied. When that bug
was fixed, it mattered a lot that TEMPs weren't copied. What changed in the

Finding out that involved bisecting between the two commits, patching the
bug described above. Porting/ makes even this pretty easy, as it
can automatically apply user-supplied patches to the tree before building.
The true trigger of the behaviour regression was a change in March 2010 to
how magic is handled on elements from magical arrays and hashes, which fixed
bugs most visible with pos. The key part of that change was 1 line which got
rid of a copy of a value. Hence that change removes one place that copied
the return value, and the change with the cast fix removes the other place,
hence why both are needed to trigger the problem. And hence why the
identified patch fixes it, by re-instating a copy.

So, I assumed that taking blead, and reverting the patch that fixes the test
case, would cause the test case to fail again. This is easy to check, and so
I verified this assumption.

The assumption is wrong. More bisecting revealed that subsequently this
change also has an influence:

commit 4bac9ae47b5ad7845a24e26b0e95609805de688a
Author: Chip Salzenberg <>
Date:   Fri Jun 22 15:18:18 2012 -0700

    Magic flags harmonization.

because part of it removes the entire was_temp (micro?) optimisation logic
mentioned above, meaning that values will always be copied.

(Remember, the bug became exposed because the value ended up not being
copied. If it's copied once, or the twice because a copy is copied, then the
bug is hidden. And it's this insensitivity to the *number* of copies that
makes it such a crafty little critter)

So that explains the given test case. But the story is still not over.

Whilst temporary values should be copied (by default), LVALUEs returned from
:lvalue subs should not. Knowing this, I could easily convert the test case
to use :lvalue subroutines and make everything crash (v5.14.0, blead, all in

I believe that the problem is because threads::shared wrongly assumes that
the proxy for a shared aggregate will always outlive any LVALUE proxies for
its elements. This is generally true, but not for element proxies returned
from :lvalue subroutines for aggregate proxies which go out of scope with
the subroutine.

I believe that the fix is to change threads::shared to deal with cleanup
differently, so that the last magic struct which frees up the relevant
mg_obj is responsible for cleaning up shared space. This is a reasonably
involved change on some complex code, and it's still pending waiting for a
confident second opinion that I got it right and it's good to merge.

The full explanation is in the middle of RT #119089, and most easily viewed

The other fun this week was investigating why builds on HP-UX had started
failing one of the regression tests for regular expressions. Bisecting showed
that the failure started with commit f1e1b256c5c1773d, which fixed the rules
for parsing numeric escapes in regexes. So why did the C code fail? This is
a job for a debugger:

    $ gdb ./perl
    Detected 64-bit executable.
    Invoking /pro/local/bin/gdb64.
    /usr/lib/pa20_64/ Unable to find library ''.

Bother. That's not very useful. Try on the other machine:

    $ gdb ./perl
    Detected 64-bit executable.
    Invoking /pro/local/bin/gdb64.
    GNU gdb 6.6
    Copyright (C) 2006 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "hppa64-hp-hpux11.11"...
    (no debugging symbols found)
    (gdb) run
    Starting program: /perl/usr/nick/Perl/perl/perl
    Error trying to get information about dynamic linker.

A chocolate teapot would be about as much use.

So in theory, at this point I start having to figure out HP's own adb. Which
seems to be even more user-hostile than dbx on Solaris. This isn't looking
like it's going to be fun...

So it was a great relief to discover that I didn't need to fight various HP
debuggers (much). In that, the combination of SIGBUS and "the stack seems to
be corrupt" triggered something in my head - these are the typical symptoms
reported by a debugger when a C program has recursed too deeply and bust the
C stack. At which point I could replicate it (down to the same test) by
tweaking the stack size on x86_64 Linux. It's nothing directly to do with
architecture or OS. The default thread stack size on PA-RISC HP-UX is too
small. The regex compiler recurses (in C) for each () group, so 100 nested
() groups makes an abnormally large amount of recursion.

I could reproduce it by setting the thread stack size to 86016 on x86_64
(that number will vary with build options), and tune the failure point
earlier or later by changing the thread size. The regression test will also
fail during parsing even without the associated C code change. Huzzah.
Because that means that the solution is pure-Perl, simply changing the test
runner to set a larger stack size.

(Of course it wasn't quite that simple. The "big" stack size I chose turned
out to be smaller than the default used by gcc on Linux when building with
ASAN, so the first change was a regression elsewhere. And the default on OS
X is also too small, so it also needs an unconditional stack size increase.)

It's good that perl itself is "stackless", else we'd hit these stack issues
with threading whenever Perl code recurses too much.

Nicholas Clark Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About