perl.perl5.porters | Postings from October 2013

NWCLARK TPF grant August report

Nicholas Clark
October 3, 2013 14:58
As per my grant conditions, here is a report for the August period.

Multiple cores are the future! And we need to enjoy this, because we aren't
going to get any choice about it. The current hardware for the hot backup
has 24 cores, and even mobile phones are thinking
about going quad-core.

The upshot of this is that the more you can get the computer to do in
parallel, the faster the entire task will complete. On *nix the make has
been able to run in parallel for quite a while, and back in 2008 I hacked
away on TAP::Harness until we got the tests running in parallel too. (Sadly
only on *nix - Microsoft's divergent implementation of socket() frustrates
the same approach being used on Win32.)

Well, to be strictly accurate, the make has been able to run in parallel
except when it can't, due to race conditions we were previously unaware of.
(Just like the tests, actually.) We chip away at them as we find them, and I
guess that the best we can hope for is "no known problems", and an
increasingly long time since the last "unknown unknown" turned up.

Frustratingly there had been one "known problem" that we hadn't figured out
a way to solve. SDBM_File contains a subdirectory in which is the source
code to libsdbm.a, and the Makefile.PL for SDBM_File ends up generating a
Makefile which has two rules, both of which recurse into ext/SDBM_File/sdbm/
and try to build the same target. Run in parallel, and you can end up with
one process deleting libsdbm.a at just the same time that a second process
is trying to link it, resulting in that second process bailing out with an
error and the build failing. One rule was the default "build my
subdirectory" rule generated by ExtUtils::MakeMaker. The second was a custom
rule generated by the Makefile.PL to say "to build the dependency libsdbm.a,
you need to do this..." and the difficulty was that the rule that we needed
to keep was the second one.

What has eluded all of us for a very long time is that actually it's pretty
easy to stop ExtUtils::MakeMaker generating that default rule - if you know
*how*. So the fix turns out to be quite simple, once you know the secret,
and now we think that this bug is finally fixed, and as no others have
cropped up in the Makefile for a while, it's starting to look like the
Makefile might be bug free and reliable in parallel.

If multiplying is good for cores, then it must be good for BSDs too, as they
seem to be breeding like rabbits*. The problem with this is that each BSD ends
up having a hints file to tell Configure about its peculiarities and foibles,
those hints files keep getting bigger, and when (for example) Bitrig forks off
OpenBSD, then the Bitrig hints file starts out as a *near* copy of OpenBSD's.
(Although, of course, not identical, as inevitably something key will
change, such as the version numbers.)

The upshot of adding support for Bitrig was that now we had 3 hints files
(OpenBSD, MirBSD and Bitrig) each of which contained a 59 line test to
determine if the C library has broken 64 bit functions inherited from OpenBSD.

Triplication is never good, and as the code in question looked to be pretty
much portable and likely to pass on any sane system, I looked into moving it
into Configure, and hence running it everywhere.

Just for a change, this all went strangely smoothly. AIX and HP-UX, both of
which have choked on these sorts of things before, built it just fine, as
did the less problematic Solaris, FreeBSD and Linux. Various smoke-me
smokers reported it clean on Darwin, NetBSD and Cygwin, and BinGOs manually
tested on both 32 and 64 bit Cygwin without problems. So it went "in", and
the code base got (net) 124 lines shorter.

It would be wonderful to parameterise all the BSDs into one master hints
file and a set of options ("oh, it's basically FreeBSD-like, but they use
git for a VCS, the NetBSD ports system, but default to ZFS"), but I suspect
that that's more work than it's worth. Still, it remains that the hints
files for *BSD are already more than twice as long as the Linux hints file,
and sadly I don't think that that ratio is going to improve.

Less conclusive was my investigation into dl_unload_all_files(). This
feature of DynaLoader dates back 13 years, to just before v5.6.0 was
released, and the open ticket is of a similar vintage, "ID 20000402.003" in
the old system (migrated to RT as #2968).

The problem that dl_unload_all_files() is attempting to solve is that in an
embedded environment, the perl interpreter's exit may not always correspond
to the process exit. Worse still, the perl interpreter might be started a
second time in the same process, and this ought to work. So to make
everything work smoothly, it's important that when embedded, the interpreter
frees up all resources that it was responsible for allocating.

One of the more obscure resources that had got forgotten was shared
libraries that it had loaded, typically by DynaLoader when loading
extensions dynamically. So dl_unload_all_files() was part of an attempt to
have DynaLoader track all the shared objects that it loaded, and free them
all at interpreter exit. Only it didn't quite work out, and despite the
documentation claiming that it was active, it quietly got disabled unless
DynaLoader.xs was compiled with DL_UNLOAD_ALL_AT_EXIT.

The summary is that "it's a mess".

The curious good news is that even building the 13-year-old code with the
current gcc ASAN shows remarkably few problems. The quality of the core's
code has been consistently good for over a decade.

But back to the bad news. Even for the single-threaded case, there are
problems. The implementation of dl_unload_all_files() isn't robust. It uses
perl_atexit(), which is called after object destruction, but *before* all
the unblessed SVs get destroyed. The problem is that this is too early for this
use case - anything used later that has a pointer into something dynamically
loaded instead has a pointer into garbage. For example anything that has tie
magic will have the vtable point to static memory (in the sense that it's a
static object allocated by the C compiler). If the tie was set up by XS code
and that memory was in the shared object, that memory isn't there any more,
but the SV still points to where it was. When the full global destruction
happens it will attempt to call the relevant vtable routine to clear up the
tied scalar, follow the pointer into the abyss and then crash (or worse).
The actual problem that highlighted this was the OP free hook in File::Glob,
which again is a pointer from something "owned" by the interpreter into
memory "owned" by the shared object.

It might be possible to fix these global destruction problems by adding a
hook that gets run much much later, after all the SVs are destroyed, pretty
much as the last thing before perl_destruct() returns. However, this won't
fix the multi-threaded case. It's clear that no-one is building with
DL_UNLOAD_ALL_AT_EXIT and ithreads (or using it with Win32 pseudo-forks),
because there is an
explosion of fail if you do. The problem is that DynaLoader is using
perl_atexit() to unmap shared libraries, and that's per *interpreter* exit.
So as soon as an ithread is spawned, its exit triggers the unmapping of all
shared objects, and the next time the parent thread tries to use one, it
crashes.

I think to properly solve the "at interpreter exit" cleanup we'd really need
to change dl_load_file() to internally track what it's been called for
(tracking shared across threads), and (I think) how often, with ownership
being upped on thread clone. At which point, interpreter exit is safe to call
dl_unload_all_files(). But only really late, when the top level parent
interpreter exits. And given that we've lived with this bug for over 13 years
now, it doesn't seem to be a high priority to fix.

Something else that proved to have more "depth" than was initially obvious
was a bug for threads::shared (RT #119089), where the test case demonstrated
that shared references returned from subroutines could disappear on a return.
The submitter's concise test case enabled Dave Mitchell to quickly determine
that the reported problem had been fixed by a commit in the core C code back
in January 2012, and Dave also suggested a workaround to avoid the
bug. Thus it seemed like case closed.

However, I had my suspicions about that change (6f48390ab209d16e), and
whether it was just fixing the symptoms rather than the cause:

    [perl #95548] Returned magical temps are not copied
    return and leavesub, for speed, were not copying temp variables with a
    refcount of 1, which is fine as long as the fact that it was not
    copied is not observable.
    With magical variables, that *can* be observed, so we have to forego
    the optimisation and copy the variable if it's magical.
    This obviously applies only to rvalue subs.

What raised my concerns was that if assertions are enabled then the test
case triggers an assertion failure on v5.14.0. If assertions are not
enabled, it crashes. So I went looking back in history to find what
introduced the assertion failure, and it turns out that there's a heck of a
lot more to this than first meets the eye.

So, what caused the assertions to start failing between v5.12 and v5.14? A
simple bisect suggests that it's a commit to use cBOOL for some casts, which
has nothing directly to do with threads shared, or (not?) copying values on
subroutine return. It turns out after a bit of digging that it's this
section of that commit that matters:

diff --git a/mg.c b/mg.c
index 3fb8ec4..4a8d767 100644
--- a/mg.c
+++ b/mg.c
@@ -193,7 +193,7 @@ Perl_mg_get(pTHX_ SV *sv)
     const I32 mgs_ix = SSNEW(sizeof(MGS));
-    const bool was_temp = (bool)SvTEMP(sv);
+    const bool was_temp = cBOOL(SvTEMP(sv));
     bool have_new = 0;
     MAGIC *newmg, *head, *cur, *mg;
     /* guard against sv having being freed midway by holding a private

deep in the routine that deals with get magic.

It's not obvious from the code above, but SvTEMP(sv) returns either 0 or
0x00080000, and in 2010, C<bool> was typedef'd as C<char>, so the above code
always set was_temp to 0, due to truncation. The cast was added to avoid a
warning about the truncation. The (buggy) truncation itself was added when
the type of was_temp was changed from a suitably large integer to a bool as
part of a commit whose only description was "Adding const qualifiers" (ie
the commit message doesn't even note that some variables' types were
changed).

The variable in question is used to enable Perl_mg_get() to retain the
SVs_TEMP bit set on values it is processing. This is needed because
Perl_sv_2mortal() conflates "temporary" with "mortal" and automatically sets
the SVs_TEMP flag bit on everything. Hence this code seeks to remove
SVs_TEMP if Perl_sv_2mortal() set it. The upshot of the bug was that
SVs_TEMP was always being removed, so the code at subroutine exit that does
*not* copy TEMPs was skipped because the value had become unmarked. Hence
the values were always copied.

So, all that conceals what's actually going on. When that truncation bug
was first introduced, it didn't matter that TEMPs weren't copied. When that
bug was fixed, it mattered a lot that TEMPs weren't copied. What changed in
between?

Finding that out involved bisecting between the two commits, patching the
bug described above. Porting/ makes even this pretty easy, as it
can automatically apply user-supplied patches to the tree before building.
The true trigger of the behaviour regression was a change in March 2010 to
how magic is handled on elements from magical arrays and hashes, which fixed
bugs most visible with pos. The key part of that change was 1 line which got
rid of a copy of a value. Hence that change removes one place that copied
the return value, and the change with the cast fix removes the other place,
hence why both are needed to trigger the problem. And hence why the
identified patch fixes it, by re-instating a copy.

So, I assumed that taking blead, and reverting the patch that fixes the test
case, would cause the test case to fail again. This is easy to check, and so
I verified this assumption.

The assumption is wrong. More bisecting revealed that subsequently this
change also has an influence:

commit 4bac9ae47b5ad7845a24e26b0e95609805de688a
Author: Chip Salzenberg <>
Date:   Fri Jun 22 15:18:18 2012 -0700

    Magic flags harmonization.

because part of it removes the entire was_temp (micro?) optimisation logic
mentioned above, meaning that values will always be copied.

(Remember, the bug became exposed because the value ended up not being
copied. If it's copied once, or twice because a copy is copied, then the
bug is hidden. And it's this insensitivity to the *number* of copies that
makes it such a crafty little critter.)

So that explains the given test case. But the story is still not over.

Whilst temporary values should be copied (by default), LVALUEs returned from
:lvalue subs should not. Knowing this, I could easily convert the test case
to use :lvalue subroutines and make everything crash (v5.14.0, blead, and
everything in between).

I believe that the problem is because threads::shared wrongly assumes that
the proxy for a shared aggregate will always outlive any LVALUE proxies for
its elements. This is generally true, but not for element proxies returned
from :lvalue subroutines for aggregate proxies which go out of scope with
the subroutine.

I believe that the fix is to change threads::shared to deal with cleanup
differently, so that the last magic struct which frees up the relevant
mg_obj is responsible for cleaning up shared space. This is a reasonably
involved change on some complex code, and it's still pending waiting for a
confident second opinion that I got it right and it's good to merge.

The full explanation is in the middle of RT #119089.

Last month it was HP-UX doing something strange. This month AIX, as an
innocent looking test had started failing on it, but not other Unix systems.
Specifically, t/op/fork.t, which had not been modified, was now failing in
two places, when previously it used to pass. This implies that some
unrelated change had unfortunate side effects, but what? And why?

Fortunately this is just the sort of job that git bisect excels at, and now
that most of it is automated it only takes a small amount of setup effort to
get the computer busy figuring out the answer. Which turned out to be a
perfectly innocent looking commit to a harness function that t/op/fork.t was
using which fixed a frustrating bug:

    commit 684b0ecaad281760b04dd2da317ee0459cafebf6
    Author: Tony Cook <>
    Date:   Tue Jul 16 14:57:20 2013 +1000
        [perl #116190] feed an empty stdin to run_multiple_progs() programs
        Two tests for -a were attempting to read stdin and blocking with the -a
        implies -n change.

So why does this cause a problem? And why only on AIX?

run_multiple_progs() runs its programs using backticks. (backticks turn out
to be the only sufficiently portable way to easily capture output.) The code
it calls already had a feature to feed things to STDIN, which for various
reasons is done by using a pipeline from another perl process. So what that
commit does is change the generated backtick command from

    `./perl -Ilib -e ...`

to

    `./perl -e "print qq()" | ./perl -Ilib -e ...`

so that stdin is at EOF, and the thing we're testing can't end up hanging if
it inadvertently reads STDIN.

It turns out that this causes fun on AIX with two tests, because on AIX
/bin/sh is actually ksh, and ksh does pipes differently (with one fewer
process). With sh, for the latter command line the sh process forks two
children, which use exec to start the two perl processes. The parent shell
process persists for the duration of the pipeline, and the second perl
process starts with no children. With ksh (and zsh), the shell saves a
process by forking a child for just the first perl process, and execing
itself to start the second. This means that the second perl process starts
with one child which it didn't create. This breaks the tests, which assume
that wait (or waitpid) will only return information about processes started
within the test. One can replicate this on Linux:

$ sh -c 'pstree -p $$ | cat'
$ ksh -c 'pstree -p $$ | cat'

Aside from who gets the child process, the behaviour is identical, because
both structures end up with the exit status of the pipeline being that of
the last command - in the ksh case because the last command *is* the parent,
in the /bin/sh case because the sh process ensures that it calls exit() with
the exit status of the child that it reaped.

Problem is that the tests in t/op/fork.t *are* sensitive to who gets the
child process, because they are counting how many children get reaped. I
thought about fixing the tests to make them immune to unexpected extra child
processes, but then realised that it was probably easier to change the
generated backtick command to be

    `./perl </dev/null -Ilib -e ...`

given that all that mattered was that reads on STDIN avoid hanging, not
that they promptly return anything specific (ie eof instead of error).

I wasn't sure whether this approach would work on VMS, so I tested it there.
To be sure that only things that I expected to change changed, I first ran
the test suite unmodified. This revealed that t/lib/croak.t, lib/charnames.t
and lib/warnings.t failed. These failures are relatively new, and all look
like this:

    # From PTAC$DKA0:[NCLARK.I.perl-7d9633e5aba8.t.lib.croak]mg.
    # mg.c
    $SIG{_HUNGRY} = \&mmm_pie;
    warn "Mmm, pie";
    No such hook: _HUNGRY at - line 2.
    EXIT STATUS: != 0
    No such hook: _HUNGRY at - line 2.
    not ok 1 - Perl_magic_setsig
    t/[.lib]croak ... FAILED at test 1
    Failed 1 test out of 1, 0.00% okay.

ie the test runs a program, and whilst it is getting the output it expects, it
no longer gets the exit status it expects. Craig Berry explained why:

    The root cause of the problem on VMS is that command-line redirection
    is done by Perl and not by the shell.  Tony's addition of the stdin
    parameter to runperl gives us the equivalent of:
    $ perl -e "exit 2;" | perl -e "exit 0;"
    %NONAME-E-NOMSG, Message number 00000002
    $ show symbol $status
      $STATUS == "%X00000002"
    The Perl process with an exit value of 0 is a child of the one that
    has an exit value of 2 so the final status we see in runperl is the
    exit value of the parent, not of the child.  But the child is actually
    the test program whose exit value we're interested in and we don't get
    it this way.

Craig had already observed these failures resulting from Tony's (quite
reasonable) change and had come up with a fix involving another pipe to
capture the child's exit status, but because of the ugly complexity that
entailed he'd been "sitting on it for a bit hoping to think of something
better".

So it turned out that the solution of setting up STDIN as EOF actually got
us two fixes for the price of one. Strictly, we haven't "solved" the
incompatibility of the VMS pipe implementation when it comes to exit
statuses, because instead we've avoided the tests relying on it. But as the
VMS behaviour has been that way for over a decade now, it seems unlikely
that it's going to change any time soon.

I need to thank Zefram for helping diagnose the problem.

Finally, did you know that ${^MPEN} used to be treated as a synonym for
${^OPEN}? No, I didn't either. This was caused by a missing break;
statement in Perl_gv_fetchpvn_flags(), in the check for variables starting
with ^M. What should have happened is that if the name isn't ^MATCH it
behaves like any other unused multi-character control variable. What was
going wrong was that it was falling through into the code for variables
starting with ^O, so the check there for 'are the next 3 characters "PEN"'
would pass, and hence the odd behaviour. Now fixed in blead.

And this came about as a side effect of a bug report that Carp::confess
loses $! on Windows. Which it does, as it turns out, due to an unhelpful
combination of Microsoft's atoi() function being buggy (it resets errno;
that's naughty) and the regular expression match variables making a call to
atoi() each time they are read. No, I didn't know that. That seems mighty
inefficient, so I looked to fix it, hence I ended up also looking at the
code for ${^MATCH}, and spotted the above problem.

As to $1 and friends...

Well, it transpires that since forever (perl 5.0 alpha 2) runtime lookup of
the numbered match variables has been implemented by finding their name (as
a string) and converting it to a number (the aforementioned call to atoi())
to get the array index to use. Which works, but sure doesn't feel like the
most efficient way of doing it (even if it didn't also cause strange side
effects due to buggy libraries). I think that it was done this way because
it's the same approach as used by all the other punctuation variables - they
all have magic type '\0' attached to them, and then the routines called for
reads and writes on such variables use switches based on the name to
perform the correct action.

It happens that the "name" is stored as a pointer, length pair. It's also
the case that nothing (else) relies on the name, and that the name is never
NULL. All this is useful. We have an integer field (the "length") and no
existing flag value of NULL. So it was a S.M.O.P. to change the
representation such that for match variables the pointer is now NULL, and
the length is used to store the array index, with the string to number
conversion done once at compile time.

More pleasingly *everything* seemed to come together nicely. Other routines
that handle magic (such as thread duping) already work smoothly if the
length is non-zero even though the pointer is NULL, so nothing needed
changing there. Not needing to store the name until runtime is a memory
saving. Thanks to some refactoring Dave did for the regex engine, *all* the
match variables are actually now implemented as integers - $& is 0, $' is -1,
etc, which means that they too can be implemented this way, not just $1
upwards. Which means that the C code is 23 lines shorter as a result
(although the object code is about the same size - you can't have it all).

A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in
RT #... is a bug in
CPAN #... is a bug in
ID YYYYMMDD.### is a bug number in the old bug system. The RT # is given
  afterwards. You can look up the old IDs at

[Hours]		[Activity]
  0.50		${^MPEN}
  0.50		/[#$x]/
 10.25		AIX t/op/fork.t
		AIX t/op/fork.t (VMS runperl)
  0.25		CPAN #88258, CPAN #88259, CPAN #88260
  0.50		CPAN client and modules removed from core.
  0.50		HP-UX 64bit
  0.50		ID 20020301.005 (RT #8732)
  0.50		LVALUE macros
  9.50		NULL PL_curcop
  0.50		PerlIO_sprintf, PerlIO_vsprintf
  0.50		RT #107816
  0.25		RT #116118
  1.00		RT #116907
  0.75		RT #118987
		RT #118987 (SDBM_File's Makefile)
 13.25		RT #119089
  0.25		RT #119097
  1.50		RT #119155
  0.25		RT #119161
  0.25		RT #119181
  0.25		RT #119189
  1.00		RT #119195
		RT #119195, RT #119197
  0.25		RT #119351
  1.50		RT #119409
  0.50		RT #119437
  0.75		RT #119445
  0.25		RT #119497
  0.50		RT #119515
 20.00		RT #2968
		RT #2968 (dl_unload, etc)
		RT #2968 (dl_unload, etc) ID 20000402.003
  0.25		RT #55896
  0.25		RT #92446
  1.50		Test::PerlRun on VMS
  0.25		VMS
  0.50		\Q \E
  9.00		atoi
		atoi ($1, $2, $& etc)
		atoi (RT #116118)
		atoi (RT #81586)
  2.50		de-duping BSD Call Back Units
  1.50		dl_dld.xs
  0.50		global destruction
  1.00		lib/perl5db.t
  0.75		process, scalability, mentoring
 22.50		reading/responding to list mail
  2.50		runperl
  2.00		t/re/pat_thr.t
  0.25		threads::shared leaks
  2.50		untangling PAUSE permissions
114.50 hours total

Nicholas Clark

* credit Leon Timmermans for that analogy. It feels sadly apt :-(
