perl.perl5.porters | Postings from September 2012

NWCLARK TPF grant report #54

Nicholas Clark
September 22, 2012 01:51
NWCLARK TPF grant report #54
[Hours]		[Activity]
2012/09/10	Monday
 2.25		method_named
 4.50		reading/responding to list mail
 0.50		state

2012/09/11	Tuesday
 0.50		method_named
 1.00		method_named benchmarking
 0.50		process, scalability, mentoring
 2.75		reading/responding to list mail

2012/09/12	Wednesday
 1.00		benchmarking
 9.00		method_named benchmarking
 0.50		reading/responding to list mail

2012/09/13	Thursday
 8.25		-DPERL_POISON failure

2012/09/14	Friday
 0.25		-DPERL_POISON failure
 5.75		benchmarking

2012/09/16	Sunday
 2.50		benchmarking

Two distinct activities dominated this week.

The smaller was dealing with the black smoke issuing from the smoke testers
after Dave merged a bug fix back to blead, a fix that "worked on my
machine". However, it didn't work on several of the smokers, which build
with -DPERL_POISON. The intent of -DPERL_POISON is to try to scribble all
over freed memory, to spot use-after-free errors (by promptly SEGVing). Yes,
memory checkers such as Purify, valgrind or Address Sanitizer can spot
these, but not on all platforms, and at a runtime cost on the platforms
where they do run.

So, is this a use after free bug, or is it a bug in the PERL_POISON code
itself? Turns out that it was a bug in the PERL_POISON code, specifically
in part of the state freeing code related to recursive regular expressions,
which had never previously been covered in the regression test suite.
Wind time back to commit 94010e71b67db040 in 2005 and it's possible to get
PERL_POISON to SEGV with the same test case even then.

So, easy, fix it...

Only that wasn't so easy, because the code in question was deeply tangled
with code enabled under PERL_OLD_COPY_ON_WRITE. This is conditionally enabled
code for a previous attempt at a copy-on-write scheme. It's not enabled by
default because (a) it still has at least one unexplained unresolved test
failure and (b) it wasn't measurably faster. But it's intentionally left in
as a placeholder because (a) it's zero code by default (b) it serves as a
prototype/placeholder for any future more successful copy-on-write scheme.

Both of these are my fault. So I figured it wasn't fair to ask Dave to fix
them. It turned out that Dave's recent improvements to avoid big copies
in regex matching had broken the build with PERL_OLD_COPY_ON_WRITE. So I
fixed these, and then fixed the PERL_POISON code. Good. Then I tried
both together. Kaboom. Spin time back, and it turns out that from day one
a bug meant that the two never worked together. So I fixed that, and finally
could verify that there are no (known or obvious) bugs remaining in the
save code in question.

The bigger activity of the week was trying out an idea about optimising
named method calls. I'm sure I'm going to get frowned upon for this, but
at the social meeting we actually had a conversation about Perl.
Specifically, Ash Berlin mentioned a blog post about embracing the dynamic
nature of JRuby, and optimising it by trying very hard to avoid hash lookups.

The author, Charles Nutter, stated that they'd found that a single entry
cache at each method call point had turned out to be most effective. ie, for
code such as this:

    $obj->bark();

assume that at this location $obj is (nearly) always going to be the same
class as you saw last time, and so record which method C<bark> ended up
resolving to.

This tallied with something Chip had said - that he thought that we could
speed method calls up with some sort of caching. Thanks to the @ISA caching
added to support pluggable MROs, we have pretty much all the infrastructure
needed to detect when a cache invalidation is necessary, and a single entry
(monomorphic) cache is fairly easy, so I had a go at this while offline
visiting my parents. I took the simplest possible approach. First time
through, remember the object's stash, and cache the address of the method.
Next time through, if it's the same stash, and nothing has been redefined,
use the method you remembered. If something has been redefined in the class
(or its parents), repeat the lookup and remember the new address. However,
if the object is blessed into a different class, assume the worst, that this
call site will continue to be called with objects of different classes, and
force the cache to be disabled (ie assume that it's megamorphic). I think I
had it working in 3 to 4 hours.

*That* was the *easy* part.

The hard part was trying to assess how much faster it was, and more
importantly how much slower it was for a cache miss.

Easy. Build two copies (optimised), and run some benchmarks. Well, apart
from "what is a good benchmark?", which is a separate thread in its own
right.

On one machine, the speed order was

fastest: cache hit
median:  cache miss
slowest: original code (no caching)

"cache miss" is more code and data than the original code, yet it is faster

On another machine, the speed order was

fastest: original code
median:  cache hit
slowest: cache miss

ie from that it seems that the only thing that I can be really confident
about is the relative ordering of *results from the same binary*. Which is
less than useful when it comes to comparing the effects of different source
code. :-(

So, what to do? Well, I remembered that a while back I'd had a lot of "fun"
trying to assess the effectiveness of code tweaks with perlbench, and
discovering that functionality completely unrelated to anything I'd touched
would run in measurably different time (sometimes faster, sometimes slower).
This, I eventually suspected, was due to gcc defaulting to aligning
functions, loops, labels and jump targets heuristically - if a better
alignment is nearby, add padding to reach it, else don't pad. I got far
more consistent results by forcing the alignment of everything (with
-falign-functions, -falign-loops, -falign-jumps and -falign-labels) and
thus avoiding the pad-or-not roulette. So I tried this. And still I had
unexplainable answers.

"when you have eliminated the impossible, whatever remains, however
improbable, must be the truth"

Or at least, something like that. I noticed that the way I had originally
implemented my test case, I selected the two behaviours (cache miss/cache hit)
by an optional (true) command line argument. So I tried passing 0 as the
command line argument. And the times were *different* from no command line
argument, despite every part of the Perl code executed being identical.

At which point I discovered that there was a repeatable 10% speed difference
between no command line arguments and any command line arguments. I reduced
it down to this code, and ran it on a clean build of (then) blead:

    sub baz { }

    baz() for 0 .. 1e6;

Note that it doesn't even *read* @ARGV, yet it is influenced by it. For some
reason, the sub call is necessary.

valgrind shows that both have pretty much the same behaviour, with slightly
more work done by, and slightly more cache misses from, running with
arguments. As you would expect. Yet *that* is the case with the shorter run
time.

No, it makes no sense. But it did mean that I re-wrote my test case to
always take one command line argument, and call exactly the same number of
Perl OPs either way.

And yet my illogical results did not go away.

Finally, I found something that seemed to make a difference. I built all the
object files with functions aligned on 64 byte boundaries. That means that
they are all aligned with L1 cache lines. So order *should* not matter. (And
there is sufficient RAM that nothing is swapping.)

Yet order *does* matter.

I took the same object files that I'd built in the morning. Without touching
them, I re-linked them in the default order specified by make, and in
(forwards and reverse) lexical order. The same object files (same cache
alignment) linked in reverse lexical order were 4% faster than when linked
in make's default (mixed up) order.

No, I don't understand.

But yes, I *can* replicate this.

So I tried forcibly linking my build tree for the patched-under-test version
in reverse lexical order, and comparing it with the clean version, linked
in reverse lexical order. (Same compiler flags, obviously.)

Finally, I seem to have results that make some sense. For a tight loop
repeatedly calling the same empty method, a cache hit was a 6% speedup.
Forcing the call site to be treated as a miss (the megamorphic case) was a
1.5% slowdown. In lieu of a decent "typical program" benchmark, I used the
build tool C<mktables> and the install tool C<installman>, which showed 0.7%
to 1% speedups. Which is promising.

But I'm not sure how "typical" they are. 1.5% slowdown will hurt, if in
reality most call sites are megamorphic. So that's the next problem. How to
find "typical program" benchmarks. I asked perl5-porters and for

There have been several useful suggestions so far on which areas or modules
should form a reasonable area to work on. But I was hoping more for
something closer to "module X already has this load test script", than "get
a corpus of docs, build an inverted index out of it and then do some
searches". For that I *first* need to write a search engine, tune it,
assemble sane test data, test it, and only *then* can I start to turn it
into a benchmark.

[To be fair to Simon Wistow, it is actually a really good suggestion of the
right sort of things to test. It's just I was hoping for the final paragraph
to be "and here's where you download it from CPAN" :-)]

Nicholas Clark
