develooper Front page | perl.perl5.porters | Postings from October 2012

NWCLARK TPF grant September report

Nicholas Clark
October 3, 2012 06:52
NWCLARK TPF grant September report
In September I finally emerged from the rabbit hole of PL_main_start, which I
entered at some point in mid-July. Father Chrysostomos has refactored
OP allocation to always use a slab allocator, as part of fixing some very
long standing bugs to do with OPs. At the start of parsing, a new slab is
allocated. While parsing is ongoing, the slab is "owned" by the CV itself.
To avoid the need to create another pointer in each and every CV, he took
advantage of the fact that during compilation CvROOT() and CvSTART() are
both unused, and stored a pointer to the slab in CvSTART(). [Once compiled,
respectively they point to the root of the subroutine's optree, and to the
first OP to run. At that point, the pointer to the slab can be retrieved
from the OP pointed to by CvROOT()]. As he notes:

   When a CV has a reference count on its slab (CvSLABBED), it is
   responsible for making sure it is freed.  (Hence, no two CVs should ever
   have a reference count on the same slab.)  The CV only needs to reference
   the slab during compilation. Once it is compiled and CvROOT attached, it
   has finished its job, so it can forget the slab.
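
That refcounting discipline can be sketched in miniature C (hypothetical
names and structure; the real slab allocator in op.c chains multiple slabs
and handles freed ops too):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, much simplified model of the slab scheme: ops are
   carved sequentially out of a slab, and the slab carries a reference
   count held by whichever CV currently owns it. */
typedef struct {
    size_t refcnt;   /* held by the owning CV during compilation */
    size_t used;     /* bytes handed out so far */
    size_t size;
    char   buf[];
} OpSlab;

static OpSlab *slab_new(size_t size) {
    OpSlab *slab = malloc(sizeof(OpSlab) + size);
    slab->refcnt = 1;    /* the CV under compilation owns it */
    slab->used = 0;
    slab->size = size;
    return slab;
}

static void *slab_alloc_op(OpSlab *slab, size_t n) {
    if (slab->used + n > slab->size)
        return NULL;     /* the real allocator chains a fresh slab here */
    void *op = slab->buf + slab->used;
    slab->used += n;
    return op;
}

/* Called when the owner is done with the slab, e.g. once CvROOT is
   attached and the CV "forgets" it. */
static void slab_decref(OpSlab *slab) {
    if (--slab->refcnt == 0)
        free(slab);
}
```

The quoted invariant is the key property here: exactly one CV holds the
reference count at any time, so the slab can never be freed twice.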

For nearly every subroutine out there, this little internal game is
completely unnoticeable. The only point that it becomes exposed is with the
(hidden, internal) subroutine that represents the main program, PL_main_cv.
From the description above you would think that the first OP to run from the
main program would be found in CvSTART(PL_main_cv). Well, to be fair, from
all this build up you'd be pretty sure that it's *not*, and you'd be right.
PL_main_cv is a relatively late addition, added in 5.001 (as part of adding
closures), whereas PL_main_root and PL_main_start date from 5.000. Hence
the main program's optree root and first OP are stored in special variables
in the interpreter structure. Which means that when the compilation of the
main program finishes, nothing overwrites CvSTART(PL_main_cv) with a pointer
to the first OP to execute, so it's non-NULL and pointing to an OP slab, not
an OP.

So, I tried to work out (roughly)

a) whether this was a bad thing
b) whether it's possible to ensure that it's NULLed out
c) whether it's possible to finish the work of 5.001 and abolish
   PL_main_root and PL_main_start to save some space.

Given that the answer to (b) is "no", the answer to (a) is obviously "no,
it's not", because any other answer would be depressing :-)

(c) seems to be possible, and is currently on the branch
smoke-me/abolish-PL_main_start, but relies on a test refactoring for
t/op/gv.t that fails on Win32 in ways that I can't determine from logs
supplied by George Greer's smoke-me server.

As to why the answer to (b) is "no" - well, that's a bit more involved.

To be clear, global destruction cleans up everything, so nothing leaks.
It's just that it's not cleared up quite as soon as it might have been.
But could it be cleaned up earlier?

The obvious place to start looking is the location that sets PL_main_start.

    else {
        if (o->op_type == OP_STUB) {
            ...
        }
        PL_main_root = op_scope(sawparens(scalarvoid(o)));
        PL_curcop = &PL_compiling;
        PL_main_start = LINKLIST(PL_main_root);
        PL_main_root->op_private |= OPpREFCOUNTED;
        OpREFCNT_set(PL_main_root, 1);
        PL_main_root->op_next = 0;
        PL_compcv = 0;

That works just fine, until you encounter a program such as

    perl -e 'BEGIN {1}'

because that turns out to enter that block (o->op_type == OP_STUB) above.
The code for that is:

            PL_comppad_name = 0;
            PL_compcv = 0;
            S_op_destroy(aTHX_ o);

No comments explained anything.

There are now three paragraphs of comments, as a result of all my digging.
Here are the first two, added by commit 22e660b408c16433:

This block is entered if nothing is compiled for the main program. This will
be the case for a genuinely empty main program, or one which only has BEGIN
blocks etc, so already run and freed.

Historically (5.000) the guard above was !o. However, commit
f8a08f7b8bd67b28 (Jun 2001), integrated to blead as c71fccf11fde0068,
changed perly.y so that newPROG() is now called with the output of
block_end(), which returns a new OP_STUB for the case of an empty
optree. ByteLoader (and maybe other things) also take this path, because
they set up PL_main_start and PL_main_root directly, without generating an
optree.

So that's fine - add some more code there to fix up CvSTART(PL_main_cv),
and job's a good'un.

Only it's not. Whack-a-mole continues. The arms race expands.

    perl -e 'BEGIN {exit 1}'

doesn't even pass through Perl_newPROG(). The exit opcode is called during
the running of the BEGIN block, triggering a stack unwinding which flies
straight up all the routines involved in the top level parse phase, only
stopping in their caller, somewhat confusingly named perl_parse().

But that shouldn't make a difference to the main program (although code
running during global destruction might be surprised), because perl_run()
isn't called? After all, main() looks like this:

    exitstatus = perl_parse(my_perl, xs_init, argc, argv, (char **)NULL);
    if (!exitstatus)
        perl_run(my_perl);

Well, no, actually it can still matter. Because this program still enters
perl_run():

    perl -e 'BEGIN {exit 0}'

If a call to my_exit() was caught by perl_parse(), then the value it returns is
the exit status. Which in this case is indistinguishable from a successful
parse. So perl_run() is entered, but returns pretty quickly because
PL_main_start is still NULL.

("Can you tell what it is yet?")

Well, in that case PL_main_start is NULL in blead. But if you try to
s/PL_main_start/CvSTART(PL_main_cv)/ then the wheels fall off. You've merged
the two previously disjoint pointers. The (original) CvSTART(PL_main_cv) is
non-NULL because it's pointing to an OP slab, now the former PL_main_start
is the same thing, so now enter perl_run() with it non-NULL, and hilarity
(well, a SEGV) rapidly ensues.

So the moral of this story is that it's rather too complex to ensure that
in the corner cases the slab is freed prior to global destruction.

And the practical take-away is that when CvROOT() is NULL, "CvSTART() is not
the pointer you were looking for". Remember that, and you'll be fine.
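
As a conceptual sketch (hypothetical types and names, not the real perl
API), that rule amounts to a guard like this:

```c
#include <stddef.h>

/* Hypothetical miniature of the two CV fields in question.  During
   compilation `start` holds the slab pointer; after compilation it
   holds the first OP to run. */
typedef struct Op Op;               /* opaque stand-in for a real OP */
typedef struct {
    Op   *root;   /* CvROOT: NULL until compilation finishes        */
    void *start;  /* CvSTART: slab while compiling, first OP after  */
} MiniCV;

/* Only trust CvSTART as an OP once CvROOT proves compilation is done. */
static Op *first_op(const MiniCV *cv) {
    return cv->root ? (Op *)cv->start : NULL;
}
```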

While investigating PL_main_start and PL_main_root, I had reason to read the
mailing list archives from July 2003, and discovered a patch from Adrian
Enache which had gone overlooked, which removed PL_formfeed.

PL_formfeed is (well, was) the internal interpreter variable used as a
shortcut to C<$^L>, "What formats output as a form feed. The default is
C<\f>." The variable is only accessed in one place, in C<pp_leavewrite>, in
code relating to I/O, so it seems a strange trade off to dedicate a data
pointer per interpreter plus various support code to something used so
rarely, and not even in a speed critical location.

After 9 years the patch almost applied cleanly. However, builds would fail
with it for a couple of reasons. Firstly, in a subsequent (applied) patch,
Adrian Enache himself had supplied a wrapper, B::formfeed, to access the
variable for the compiler suite. Secondly, Larry had added code as part of
MAD which assigned the value of the CV under compilation to PL_formfeed, if
that CV was called "import". It turns out that this was vestigial debugging
code left over from when Larry was developing MAD - it was used as a way to
easily cause gdb to gain control in particular circumstances. Similar code
was removed in 2007, but this one was missed then.

So with these problems resolved, it was possible to apply the patch, and
slightly reduce the (per-thread) interpreter size, without hurting
performance.

In the middle of the month, black smoke started issuing from the smoke
testers after Dave merged a bug fix back to blead, a fix that "worked on my
machine".  However, it didn't work on several of the smokers, which build
with -DPERL_POISON. The intent of -DPERL_POISON is to try to scribble all
over freed memory, to spot use-after-free errors (by promptly SEGVing). Yes,
memory checkers such as Purify, valgrind or Address Sanitizer can spot
these, but not on all platforms, and at a cost on the platforms that they
do support.

So, is this a use after free bug, or is it a bug in the PERL_POISON code
itself? Turns out that it was a bug in the PERL_POISON code, specifically
in part of the state freeing code related to recursive regular expressions,
which had never previously been covered in the regression test suite.
Wind time back to commit 94010e71b67db040 in 2005 and it's possible to get
PERL_POISON then to SEGV with the same test case.
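
As a minimal sketch of the idea behind -DPERL_POISON (hypothetical helper;
perl's real macros live in perl.h and use fill patterns such as 0xEF):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the poisoning done under -DPERL_POISON:
   scribble a recognisable pattern over memory that is about to be
   freed, so stale reads return obvious garbage and stale pointers
   fetched from it tend to fault promptly when dereferenced. */
#define POISON_BYTE 0xEF

static void poison_free(void *p, size_t n) {
    memset(p, POISON_BYTE, n);
    /* free(p) would follow in real code; it is left to the caller
       here so the poisoned bytes can still be inspected. */
}
```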

So, easy, fix it...

Only that wasn't so easy, because the code in question was deeply tangled
with code enabled under PERL_OLD_COPY_ON_WRITE. This is conditionally enabled
code for a previous attempt at a copy-on-write scheme. It's not enabled by
default because (a) it still has at least one unexplained unresolved test
failure and (b) it wasn't measurably faster. But it's intentionally left in
as a placeholder because (a) it's zero code by default (b) it serves as a
prototype/placeholder for any future more successful copy-on-write scheme.

Both of these are my fault. So I figured it wasn't fair to ask Dave to fix
them. It turned out that Dave's recent improvements to avoid big copies in
regex matching had broken the build with PERL_OLD_COPY_ON_WRITE. So I fixed
these, and then fixed the PERL_POISON code. Good. Then I tried both
together. Kaboom. Spin time back, and it turns out that from day one a bug
meant
that the two never worked together. So I fixed that, and finally could
verify that there are no (known or obvious) bugs remaining in the save code
in question.

A fair chunk of the month was taken up trying out an idea about optimising
named method calls. I'm sure I'm going to get frowned upon for this, but at
the social meeting we actually had a conversation about Perl.
Specifically, Ash Berlin mentioned a blog post about embracing the dynamic
nature of JRuby, and optimising it by trying very hard to avoid hash
lookups.

The author, Charles Nutter, stated that they'd found that a single entry
cache at each method call point had turned out to be most effective. ie, for
code such as this:

    $obj->bark;

assume that at this location $obj is (nearly) always going to be the same
class as you saw last time, and so record which method C<bark> ended up
resolving to.

This tallied with something Chip had said - that he thought that we could
speed method calls up with some sort of caching. Thanks to the @ISA caching
added to support pluggable MROs, we have pretty much all the infrastructure
needed to detect when a cache invalidation is necessary, and a single entry
(monomorphic) cache is fairly easy, so I had a go at this while offline
visiting my parents. I took the simplest possible approach. First time
through, remember the object's stash, and cache the address of the method.
Next time through, if it's the same stash, and nothing has been redefined,
use the method you remembered. If something has been redefined in the class
(or its parents), repeat the lookup and remember the new address. However,
if the object is blessed into a different class, assume the worst, that this
call site will continue to be called with objects of different classes, and
force the cache to be disabled (ie assume that it's megamorphic). I think I
had it working in 3 to 4 hours.
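
A minimal model of that single-entry (monomorphic) call-site cache, with
hypothetical types standing in for stashes and the MRO generation counter:

```c
#include <stddef.h>

typedef struct { const char *name; } Stash;   /* stands in for a package */
typedef void (*Method)(void);

static unsigned global_generation;  /* bumped whenever a sub is redefined */
static int slow_lookups;            /* counts full lookups, for the demo  */

static void bark(void) {}

/* Stand-in for the full @ISA-walking method search. */
static Method slow_lookup(const Stash *stash, const char *name) {
    (void)stash; (void)name;
    slow_lookups++;
    return bark;
}

/* One of these per call site, as in the experiment described above. */
typedef struct {
    const Stash *last_stash;   /* class seen on the previous call      */
    Method       cached;       /* where the method resolved to         */
    unsigned     generation;   /* global_generation at the last lookup */
    int          megamorphic;  /* disabled after seeing a second class */
} CallSite;

static Method call_site_lookup(CallSite *cs, const Stash *stash,
                               const char *name) {
    if (!cs->megamorphic && cs->last_stash == stash
        && cs->generation == global_generation)
        return cs->cached;                   /* cache hit: no real lookup */
    Method m = slow_lookup(stash, name);     /* miss: do the full lookup  */
    if (cs->last_stash && cs->last_stash != stash)
        cs->megamorphic = 1;   /* second class at this site: give up */
    cs->last_stash = stash;
    cs->cached = m;
    cs->generation = global_generation;
    return m;
}
```

The two failure modes described above fall out directly: a redefinition
bumps the generation counter and forces one fresh lookup, while a second
class at the same site flips it to megamorphic and disables caching.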

*That* was the *easy* part.

The hard part was trying to assess how much faster it was, and more
importantly how much slower it was for a cache miss.

Easy. Build two copies (optimised), and run some benchmarks. Well, apart
from "what is a good benchmark?" (which is a separate thread on the list).

On one machine, the speed order was

fastest: cache hit
median:  cache miss
slowest: original code (no caching)

"cache miss" is more code and data than the original code, yet it is faster.

On another machine, the speed order was

fastest: original code
median:  cache hit
slowest: cache miss

ie from that it seems that the only thing that I can be really confident
about is the relative ordering *results from the same binary*. Which is less
than effective at comparing the effects of different source code. :-(

So, what to do? Well, I remembered that a while back I'd had a lot of "fun"
trying to assess the effectiveness of code tweaks with perlbench, and
discovering that functionality completely unrelated to anything I'd touched
would run in measurably different time (sometimes faster, sometimes slower).
This, I eventually suspected, was due to gcc defaulting to align functions,
loops, labels and jump targets heuristically - if a better alignment is
nearby, add padding to get there, else don't pad. I got far more consistent
results by forcing the alignment of everything and avoiding the pad-or-not
roulette. So I tried this. And still I had unexplainable answers.

"when you have eliminated the impossible, whatever remains, however
improbable, must be the truth"

Or at least, something like that. I noticed that the way I had originally
implemented my test case, I selected the two behaviours (cache miss/cache hit)
by an optional (true) command line argument. So I tried passing 0 as the
command line argument. And the times were *different* from no command line
argument, despite every part of the Perl code executed being identical.

At which point I discovered that there was a repeatable 10% speed difference
between no command line arguments and any command line arguments. I reduced
it down to this code, and ran it on a clean build of (then) blead:

    sub baz {
    }

    baz() for 0 .. 1e6;


Note that it doesn't even *read* @ARGV, yet it is influenced by it. For some
reason, the sub call is necessary.

valgrind shows that both have pretty much the same behaviour, with slightly
more work done by, and slightly more cache misses from, running with
arguments. As you would expect. Yet *that* is the case with the shorter run
time.

No, it makes no sense. But it did mean that I re-wrote my test case to
always take one command line argument, and call exactly the same number of
Perl OPs either way.

And yet my illogical results did not go away.

Finally, I found something that seemed to make a difference. I built all the
object files with functions aligned on 64 byte boundaries. That means that
they are all aligned with L1 cache lines. So order *should* not matter. (And
there is sufficient RAM that nothing is swapping.)

Yet order *does* matter.

I took the same object files that I'd built in the morning. Without touching
them, I re-linked them in the default order specified by make, and in
(forwards and reverse) lexical order. The same object files (same cache
alignment) linked in reverse lexical order were 4% faster than when linked
in make's default (mixed up) order.

No, I don't understand.

But yes, I *can* replicate this.

So I tried forcibly linking my build tree for the patched-under-test version
in reverse lexical order, and comparing it with the clean version, linked
in reverse lexical order. (Same compiler flags, obviously.)

Finally, I seem to have results that make some sense. For a tight loop
repeatedly calling the same empty method, a cache hit was a 6% speedup.
Forcing the call site to be treated as a miss (the megamorphic case) was a
1.5% slowdown. In lieu of a decent "typical program" benchmark, I used the
build tool C<mktables> and the install tool C<installman>, which showed 0.7%
to 1% speedups. Which is promising.

But I'm not sure how "typical" they are. 1.5% slowdown will hurt, if in
reality most call sites are megamorphic. So that's the next problem. How to
find "typical program" benchmarks. I asked perl5-porters for suggestions.

There have been several useful suggestions so far on which areas or modules
should form a reasonable area to work on. But I was hoping more for
something closer to "module X already has this load test script", than "get
a corpus of docs, build an inverted index out of it and then do some
searches". For that I *first* need to write a search engine, tune it,
assemble sane test data, test it, and only *then* can I start to turn it
into a benchmark.

[To be fair to Simon Wistow, it is actually a really good suggestion of the
right sort of things to test. It's just I was hoping for the final paragraph
to be "and here's where you download it from CPAN" :-)]

As well as looking at the impact of the method caching proposal, I also
tried benchmarking Chip's prototype patch for read only aliases ("views"). I
tried unmodified (then) blead, with Chip's original patch, and then two
variants of Chip's patch. Firstly I attempted to simplify the code by
assuming that the only flag bit allowed to be set on SVt_VIEW is
SVf_READONLY, removing the need for a bit masking operation. Secondly, I
used gcc's compiler hints to indicate to the code generator that scalars
were "unlikely" to be views.  The final tweak helped a bit, but overall the
patch causes a 15-20% slowdown for all code.

Using valgrind (on a machine with 3 levels of cache, so cachegrind is faking
things as LL "last level"), I observed that between patched and unpatched,
we don't have significantly different LLd cache misses. We have more L1 data
cache misses, and quite a few more data cache reads. I had assumed that most
of the SV heads would already be in the L1 cache, so it's a bit surprising
that this goes up.

However, we have massively more I refs, and massively more I1 misses. I
suspect that *this* is what is killing us, which I wasn't quite expecting.
We're maxing out the CPU. Not idling the CPU because we're thrashing the
data cache. That's interesting. It's not totally clear at the next level of
detail whether it's particularly just

a) too many instructions, or
b) thrashing the I1 cache, or
c) failing branch prediction, or
d) something I didn't think of.

But it's suggesting that speed improvements in the Perl interpreter
generally (if anything) are going to have to come from identifying which of
those four is the problem, and addressing it in a way that doesn't penalise
too badly any of the others.

In the process of thinking about the implications of read only aliases
versus copies, I discovered that whilst Perl 6 specifies read only aliases,
right now Rakudo actually implements a read only snapshot (ie copy). I've
reported this to their bug tracker, as a discrepancy between the spec and
the implementation, but it's not yet clear which is going to change to bring
Rakudo into conformance.

Near the end of the month, whilst offline visiting my parents, I happened to
look again at the method dispatch code, and something caught my eye about
PL_stashcache.

The background to this is that Perl 5 has rather quirky dispatch rules.
Because neither file handles nor stashes are first class objects, barewords
can refer to either (as well as (I think) 5 other things). Additionally, as
file handles accept method calls, there's an inherent ambiguity. What does
this mean?

    Foo->bar;

Well, it depends at runtime on whether the bareword Foo is a valid file
handle. If it's a file handle, that is used as the object for an instance
method call. If it's not, then that is used as the class name for a class
method call. Good eh? You can't even know until runtime whether a particular
syntax represents a class or instance method.

Somewhat surprisingly, file handles take priority over package names:

    $ ./perl -Ilib -le 'Foo->bar;'
    Can't locate object method "bar" via package "Foo" (perhaps you forgot to load "Foo"?) at -e line 1.

    $ ./perl -Ilib -le 'Foo->bar; open Foo'
    Can't locate object method "bar" via package "IO::File" at -e line 1.

    $ ./perl -Ilib -le 'sub Foo::bar {warn}; Foo->bar; open Foo'
    Can't locate object method "bar" via package "IO::File" at -e line 1.

Back in 2003 Artur Bergman noticed that this runtime check for a file handle
was a measurable part of class method dispatch. For any (intended) method
call on string class name, correctness required first verifying that no file
handle of that name existed. So he added a cache to the method dispatch
code, PL_stashcache, which recorded the stash if the code resolved a name to
a stash, so that subsequent calls on the same class would skip the checks
and be much faster. To preserve existing behaviour, the cache was flushed if
a new file handle was created (commit 081fc587427bbcef). Early bugs were
soon resolved (commit 7e8961ecc77ac069, 2 weeks later), and the code has
performed reliably ever since, first appearing in v5.8.1.
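
A toy model of the idea behind PL_stashcache (hypothetical names; the real
cache is a hash keyed by class name, and the file-handle check is the
expensive step being skipped):

```c
#include <string.h>

/* Toy single-entry model of the idea: class-method dispatch on a
   bareword must first check for a file handle of that name; a cache
   hit lets later calls skip both that check and the stash lookup. */
typedef struct { const char *name; } Stash;

static int handle_checks;          /* counts the expensive handle checks */
static const char *cached_name;    /* single-entry stand-in for the cache */
static const Stash *cached_stash;

static Stash the_stash = { "Foo" };

/* Stand-in for "does a file handle called `name` exist?" */
static int filehandle_exists(const char *name) {
    (void)name;
    handle_checks++;
    return 0;      /* assume no handle, so the name resolves to a class */
}

/* Resolve a bareword used as a method invocant. */
static const Stash *resolve_class(const char *name) {
    if (cached_name && strcmp(cached_name, name) == 0)
        return cached_stash;             /* cached: skip the handle check */
    if (filehandle_exists(name))
        return NULL;                     /* a handle wins over a class */
    cached_name = name;
    cached_stash = &the_stash;
    return cached_stash;
}

/* Creating a file handle - or, per the bug described below, assigning
   one to a typeglob - must flush, or dispatch keeps using the class. */
static void flush_stashcache(void) {
    cached_name = NULL;
    cached_stash = NULL;
}
```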

Looking at the code, I was wondering whether it's really necessary to flush
the cache when creating all file handles, as anonymous file handles assigned
to lexicals can't affect method resolution. So I wondered whether it would
be possible to move the flush from the "create PVIO" code to the points
where PVIOs were assigned to typeglobs. In the process of investigating
this, I noticed that there seemed to be a hole in the existing flushing
logic - if an existing file handle is assigned to a typeglob, nothing is
flushed. So I wrote a test for this, and then spent quite a while scratching
my head as to why it failed to fail.

Digging and bisecting revealed that the stash cache had inadvertently been
disabled as a result of commit da6b625f78f5f133 in August 2011. As it's only
a cache, quite correctly everything carried on working without it, hence
nobody noticed. Fixing that made no difference to existing tests, but my new
test cases then correctly started to fail. These are regressions between
v5.8.0, which didn't have the cache, and v5.8.1 and all releases until
maint-5.14, which do. So I then added code to flush PL_stashcache when
assigning a file handle to a typeglob. This fixes all the problems which I'm
aware of. In the process I also added more debugging for -Do, displaying the
behaviour of the stash cache, and a lot of tests for the interaction of file
handles and package names during method dispatch.

I haven't yet had time to take things further on moving the PVIO creation
flush code to more targeted locations, as (a) commit 7e8961ecc77ac069 fixed
bugs, but didn't add any test cases, so it's not obvious what it's actually
fixing, and whether other code also catches it and (b) the "impending doom"
of moving house is now much less "impending" than before.

After Florian Ragwitz shipped v5.17.4, I removed the MPE/iX port.
MPE/iX was a business-oriented minicomputer operating system made by
Hewlett-Packard. Support from HP terminated at the end of 2010.

 MANIFEST                      |   9 -
 Porting/exec-bit.txt          |   2 -
 Porting/ |   2 +-
 README.mpeix                  | 711 -------------------------------------
 ext/DynaLoader/dl_mpeix.xs    | 146 --------
 ext/File-Glob/t/basic.t       |   2 +-
 ext/POSIX/t/posix.t           |   2 -
 ext/re/hints/         |   3 -
 hints/                | 136 -------
 installperl                   |  27 +-
 lib/File/              |  10 +-
 lib/FileHandle.t              |   4 -
 lib/                |   3 +-
 mpeix/mpeix.c                 | 802 ------------------------------------------
 mpeix/mpeix_setjmp.c          | 355 -------------------
 mpeix/mpeixish.h              | 193 ----------
 mpeix/nm                      |  33 --
 mpeix/relink                  |  49 ---
 perl.c                        |   4 -
 perl.h                        |   5 -
 pod/perl.pod                  |   1 -
 pod/perl5005delta.pod         |   2 +-
 pod/perlport.pod              |  20 +-
 pp_sys.c                      |   4 +-
 t/io/pipe.t                   |   5 +-
 t/op/die_exit.t               |   2 -
 t/op/fork.t                   |   3 -
 t/op/magic.t                  |   3 +-
 t/op/pack.t                   |   2 -
 t/op/stat.t                   |   1 -
 t/op/sysio.t                  |   3 +-
 win32/Makefile                |  11 +-
 win32/             |  11 +-
 33 files changed, 30 insertions(+), 2536 deletions(-)

The list of suspected "special biologist word for stable" platforms is here:

It's worth noting that getting rid of stuff completely is actually hard.
It's easy to get *most* of it, but often some obscure small part remains.
Because of this, cruft slowly accumulates. This week, I noticed a couple of
lines still referencing MiNT, support for which was removed in August 2009
by cd86ed9d430a95bb. These may be the last, or possibly there are still some
tentacles hiding in obscure corners, waiting to scare innocent passers-by.

A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in
RT #... is a bug in
CPAN #... is a bug in
BBC is "bleadperl breaks CPAN" - Andreas König's test reports for CPAN modules
ID YYYYMMDD.### is a bug number in the old bug system. The RT # is given
  afterwards. You can look up the old IDs at

[Hours]		[Activity]
  4.00		$^L
  8.50		-DPERL_POISON failure
  1.75		ByteLoader
  1.25		COW
  0.75		MPE/iX
		MPE/iX, mint
 14.50		PL_main_start/PL_main_root
  6.00		PL_stashcache
  0.50		Perl_magic_setdbline
  0.25		RT #114312
 16.50		benchmarking
		benchmarking VIEWs
		benchmarking, optimising
  1.50		bootstrapping Will Braswell
  0.25		darwin SEGV
  1.00		failing smoke reports
  0.50		investigating security tickets
 16.50		method_named
		method_named benchmarking
  1.00		optimising sub entry
  1.25		process, scalability, mentoring
 44.50		reading/responding to list mail
  4.25		readonly ops
  0.50		state
125.25 hours total

Nicholas Clark
