perl.perl5.porters | Postings from November 2014

ANNOUNCE: Porting/

Dave Mitchell
November 29, 2014 23:30
ANNOUNCE: Porting/
On Tue, Oct 21, 2014 at 03:54:56PM +0100, Dave Mitchell wrote:
> I have not yet done (and don't intend to do immediately):
> create tools to run timings and cachegrinds of  t/perf/benchmarks code
>     snippets, possibly across multiple perls

Well, now I have.

(I can say that I am genuinely very excited by this.)

TL;DR: I've created a new tool, Porting/, loosely inspired by
perlbench, that measures the performance of perl code snippets across
multiple perls, by using the output of cachegrind, rather than timing.
It's very fast (and can run tests in parallel), and *very* precise (like
to five significant digits), in the sense that multiple runs and/or
similar perls give virtually the same values each time. It is sensitive
enough to allow bisecting of tiny performance changes in small code
snippets.

I'd recommend that anyone interested in benchmarking perl itself should
read the rest of this email.

For people who aren't aware, cachegrind is a tool in the valgrind suite,
available on a few common hardware/OS platform combinations. It works
by executing the code in a sort of emulation mode: much more slowly than
usual, but it records as it goes along how many instruction reads, branch
misses, cache misses etc., occur. It can later be used to annotate the
source code line by line, but for our purposes, it also prints an overall
summary for the execution. Since we are not doing time measurements,
nor using hardware performance counters (like perf), it is unaffected by
load, interrupts, or CPU swaps etc. In fact, as long as we set
PERL_HASH_SEED=0, the results are virtually identical across runs.
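Mechanically, those numbers come from cachegrind's end-of-run summary. As
a rough sketch of how one might scrape them, here is a small parser; the
exact line format is my assumption based on typical cachegrind output, not
something guaranteed by the tool:

```python
import re

def parse_cachegrind_summary(text):
    """Extract event counts from a cachegrind end-of-run summary.

    Assumes lines shaped like '==PID== I refs: 1,234,567', which is
    an assumption based on typical cachegrind output.
    """
    counts = {}
    for line in text.splitlines():
        m = re.match(r'==\d+==\s+([A-Za-z0-9 ]+?)\s*:\s+([\d,]+)', line)
        if m:
            counts[m.group(1).strip()] = int(m.group(2).replace(',', ''))
    return counts

# Illustrative sample; the counts are made up.
sample = """\
==1234== I refs:        1,149,618
==1234== I1  misses:        1,204
==1234== D refs:          418,599
"""
print(parse_cachegrind_summary(sample)['I refs'])  # 1149618
```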

We can use this to run an empty or active loop 10 or 20 times (so four
runs of cachegrind in total), and subtract out the startup and loop
overhead to get an exact set of counts for a single execution of the code
snippet.
Cachegrind isn't perfect of course; for example, it may detect a cache miss,
but won't tell you how long the stall was for. A real CPU may have
anticipated an upcoming stall, and may have already scheduled a read into
the cache, meaning that the actual stall is shortened. But overall, I've
found that as a surrogate outcome, it gives a good general indication of
expected performance, without the 5-10% noise you usually get in
timing-based benchmarks.

Anyway, the easiest way to explain it is via a demo. The file
t/perf/benchmarks contains the code snippets that this tool will
benchmark by default. A typical entry in this file looks like:

    'expr::assign::scalar_lex' => {
        desc    => 'lexical $x = 1',
        setup   => 'my $x',
        code    => '$x = 1',
    },

The key is the test name, arranged into a hierarchy (and is also used as
the package name to run the test in). Other than that, there's a
description, the setup code, and the actual code to run. It's easy to add
new tests, or to create a temporary benchmarks file with some custom code
you wish to profile. Currently that file just contains three tests, but I
expect it to expand hugely over time, and to have 100's or even 1000's of
tests.

(In the following, things like perl5201o are just -O perl binaries in my
path.)

So let's run the tests against the most recent released perls from each
recent branch:

    $ Porting/ -j 8 \
        perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o

Note that the -j 8 option means that it runs 8 cachegrinds in parallel.
On my laptop, the whole run took 8 seconds, or about 0.5s per test/perl
combination.

The output is similar to perlbench, in that it gives percentages for each
test, followed by an average. The difference is that by default it
displays 13 lines of results per test rather than just 1, since there are
multiple values being measured (instruction reads, write misses etc.). The
output it produces is as follows, except that for brevity I've
elided all the tests except one (but kept the explanatory header):

        Ir   Instruction read
        Dr   Data read
        Dw   Data write
        COND conditional branches
        IND  indirect branches
        _m   branch predict miss
        _m1  level 1 cache miss
        _mm  last cache (e.g. L3) miss
        -    indeterminate percentage (e.g. 1/0)

    The numbers represent relative counts per loop iteration, compared to
    perl5125o at 100.0%.
    Higher is better: for example, using half as many instructions gives 200%,
    while using twice as many gives 50%.

    lexical $x = 1

           perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
           --------- --------- --------- --------- --------- ---------
        Ir    100.00    107.05    101.83    106.37    107.74    103.73
        Dr    100.00    103.64    100.00    105.56    105.56    100.00
        Dw    100.00    100.00     96.77    100.00    100.00     96.77
      COND    100.00    120.83    116.00    126.09    126.09    120.83
       IND    100.00     80.00     80.00     80.00     80.00     80.00

    COND_m    100.00    100.00    100.00    100.00    100.00    100.00
     IND_m    100.00    100.00    100.00    100.00    100.00    100.00

     Ir_m1    100.00    100.00    100.00    100.00    100.00    100.00
     Dr_m1    100.00    100.00    100.00    100.00    100.00    100.00
     Dw_m1    100.00    100.00    100.00    100.00    100.00    100.00

     Ir_mm    100.00    100.00    100.00    100.00    100.00    100.00
     Dr_mm    100.00    100.00    100.00    100.00    100.00    100.00
     Dw_mm    100.00    100.00    100.00    100.00    100.00    100.00


    AVERAGE

           perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
           --------- --------- --------- --------- --------- ---------
        Ir    100.00    104.96    100.77    111.61    111.91    107.97
        Dr    100.00    103.78    100.61    113.12    111.11    106.55
        Dw    100.00    100.84     98.44    109.38    109.66    106.35
      COND    100.00    114.40    110.86    129.65    132.39    124.43
       IND    100.00     82.86     82.86     96.72     99.08     99.08

    COND_m    100.00     75.00     60.00     75.00     60.00     42.86
     IND_m    100.00    100.00     97.96    112.15    117.65    117.65

     Ir_m1    100.00    100.00    100.00    100.00    100.00    100.00
     Dr_m1    100.00    100.00    100.00    100.00    100.00    100.00
     Dw_m1    100.00    100.00    100.00    100.00    100.00    100.00

     Ir_mm    100.00    100.00    100.00    100.00    100.00    100.00
     Dr_mm    100.00    100.00    100.00    100.00    100.00    100.00
     Dw_mm    100.00    100.00    100.00    100.00    100.00    100.00

Note that a simple $x = 1 assignment seems to have gotten worse between
5.20.1 and 5.21.6: more instruction reads, data reads, data writes etc.
(Don't worry about the cache misses; for code this small the counts
tend to be very small (often just 0 or 1), and so the variations between
perls tend to remain stubbornly at exactly 100.00%.)
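(For the record, the "higher is better" percentages in these tables are
presumably just the baseline count divided by the candidate's count; a
minimal sketch of that arithmetic:)

```python
def relative_score(baseline_count, candidate_count):
    """Score a perl against the baseline: baseline count divided by
    candidate count, as a percentage. Fewer events => above 100%."""
    return 100.0 * baseline_count / candidate_count

print(relative_score(100, 50))   # 200.0: half as many instructions
print(relative_score(100, 200))  # 50.0: twice as many
```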

Now, you're probably thinking at this point (based on previous experiences
with perlbench) that these variations are merely noise. Well, let's run
just the '$x = 1' test on each of the 5.21.x releases, and see what we get.
This time we'll display the raw counts rather than % differences, and
we'll only display some fields of interest:

    $ Porting/ -j 8 --raw --fields=Ir,Dr,Dw \
       --tests=expr::assign::scalar_lex \
       perl5210o perl5211o perl5212o perl5213o perl5214o perl5215o perl5216o


    lexical $x = 1

       perl5210o perl5211o perl5212o perl5213o perl5214o perl5215o perl5216o
       --------- --------- --------- --------- --------- --------- ---------
    Ir     155.0     155.0     155.0     155.0     161.0     161.0     161.0
    Dr      54.0      54.0      54.0      54.0      57.0      57.0      57.0
    Dw      30.0      30.0      30.0      30.0      31.0      31.0      31.0

Note how the number of instruction reads remains completely constant at
155 reads until 5.21.4, at which point it consistently increases to 161.
Similarly for Dr, Dw etc. Can we bisect this? Of course we can :-)

Run this shell script:


    $D/Porting/              \
     --start=v5.21.3                  \
     --end=v5.21.4                    \
     -Doptimize=-O2                   \
     -j 16                            \
     --target=miniperl                \
     -- perl5201o $D/Porting/ \
          -j 8                             \
          --benchfile=$D/t/perf/benchmarks \
          --tests=expr::assign::scalar_lex \
          --perlargs='-Ilib'               \
          --bisect='Ir,153,157'

($D points to a perl directory outside the current one, so that access to
the benchmarking files etc. isn't affected by the bisecting.)

The important argument here is --bisect='Ir,153,157', which instructs the
benchmark tool to exit 0 only if the result for field Ir is in the range
153..157. Let's see when it goes outside that range. (5 minutes later...)

    commit 7a82ace00e3cf19205df455cc1ec8a444d99d79d
    Author: Jarkko Hietaniemi <>
    Date:   Thu Sep 11 21:19:07 2014 -0400

        Use -fstack-protector-strong, if available.

Hmmm, perhaps -fstack-protector-strong has a slight performance overhead
we need to consider? (I don't know, I haven't looked any closer at it
yet.)

But have you noticed what we just did? Successfully bisected a minute
performance regression in the micro code sample '$x=1'. Whoo hoo!
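Conceptually, the --bisect test reduces to a range check on one field,
mapped to an exit status for the bisect harness. A hypothetical sketch of
that logic (not the tool's actual code):

```python
def bisect_status(count, lo, hi):
    """Exit status for one bisect probe: 0 (good) while the measured
    count stays within [lo, hi]; 1 (bad) once it drifts outside."""
    return 0 if lo <= count <= hi else 1

# --bisect='Ir,153,157' applied to the raw Ir counts seen above:
print(bisect_status(155, 153, 157))  # 0: 5.21.3 and earlier, in range
print(bisect_status(161, 153, 157))  # 1: 5.21.4 onwards, out of range
```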

There are a few other command-line options of particular interest.
--write=<file> and --read=<file> allow you to write the raw data from the
run to a file, and read it back later. --sort allows you to sort the
order of the test output, based on a particular field/perl combination.
Combined with --read and --write, it allows you to run the tests just
once, then view the results in various ways:

    $ Porting/ --write=/tmp/results \
        perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
    $ Porting/ --read=/tmp/results --sort=Ir,perl5216o
    $ Porting/ --read=/tmp/results --sort=Dw,perl5125o

It's easy to create a temporary test while you're working on something.
For example, suppose you're trying to improve the performance of regexes 
that use '$': just create a small test file and run it:

    $ cat /tmp/benchmarks
        'foo' => {
            desc    => 'foo',
            setup   => '$_ = "abc"',
            code    => '/c$/',
        },

    $ Porting/ -j 8 --benchfile=/tmp/benchmarks ./perl.orig ./perl

Finally, I think it might be an idea to add a new step to the Release
Manager's Guide along the lines of "early on, use this tool to compare the
new release against the previous release, and see if anything has got
noticeably worse" - probably using the --sort option to spot the worst
offenders.
Dave's first rule of Opera:
If something needs saying, say it: don't warble it.
