
Re: ANNOUNCE: Porting/bench.pl

From: Jarkko Hietaniemi
Date: December 1, 2014 13:58
Subject: Re: ANNOUNCE: Porting/bench.pl
Message ID: CAJueppt1UJzoJ9RegTyh__V_2kiQ4Vnp_eprAHh9rQRJFT+wyw@mail.gmail.com

Excellent!

Though I must admit to the human flaw that I'd be even more enthusiastic
if you didn't keep finding my patches to be performance problems :-)

On Sat, Nov 29, 2014 at 6:29 PM, Dave Mitchell <davem@iabyn.com> wrote:
> On Tue, Oct 21, 2014 at 03:54:56PM +0100, Dave Mitchell wrote:
>> I have not yet done (and don't intend to do immediately):
> ...
>> create tools to run timings and cachegrinds of  t/perf/benchmarks code
>>     snippets, possibly across multiple perls
>
> Well, now I have.
>
> (I can say that I am genuinely very excited by this.)
>
> TL;DR: I've created a new tool, Porting/bench.pl, loosely inspired by
> perlbench, that measures the performance of perl code snippets across
> multiple perls, by using the output of cachegrind, rather than timing.
> It's very fast (and can run tests in parallel), and *very* precise (like
> to five significant digits), in the sense that multiple runs and/or
> similar perls give virtually the same values each time. It is sensitive
> enough to allow bisecting of tiny performance changes in small code
> snippets.
>
> I'd recommend that anyone interested in benchmarking perl itself should
> read the rest of this email.
>
> For people who aren't aware, cachegrind is a tool in the valgrind suite,
> available on a few common hardware/OS platform combinations. It works
> by executing the code in a sort of emulation mode: much more slowly than
> usual, but it records as it goes along how many instruction reads, branch
> misses, cache misses etc., occur. It can later be used to annotate the
> source code line by line, but for our purposes, it also prints an overall
> summary for the execution. Since we are not doing time measurements,
> nor using hardware performance counters (like perf), the counts are
> unaffected by system load, interrupts, CPU migrations and so on. In fact,
> as long as we set
> PERL_HASH_SEED=0, the results are virtually identical across runs.
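>
> (For anyone who wants to poke at cachegrind by hand, a bare-bones
> invocation looks roughly like this; bench.pl drives all of this for you,
> so it's purely for orientation:
>
>     $ PERL_HASH_SEED=0 valgrind --tool=cachegrind --branch-sim=yes \
>         ./perl -Ilib -e '$x = 1 for 1..1000'
>
> cachegrind then prints a summary of instruction reads, data reads and
> writes, cache misses and, with --branch-sim=yes, branch counts and
> mispredictions.)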
>
> We can use this to run both an empty loop and a loop containing the code
> snippet, each for 10 and 20 iterations (so four cachegrind runs in total),
> and then subtract out the startup and loop overhead to get an exact set of
> counts for a single execution of the code snippet.
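>
> (To make that arithmetic concrete, here's a rough sketch, with made-up
> numbers, of how the per-iteration counts fall out of those four runs;
> the real tool does the equivalent for each field it tracks:
>
>     # hypothetical raw Ir totals from the four cachegrind runs
>     my %Ir = (
>         empty_10  => 500_000,   # empty loop, 10 iterations
>         empty_20  => 500_150,   # empty loop, 20 iterations
>         active_10 => 501_700,   # loop plus snippet, 10 iterations
>         active_20 => 503_400,   # loop plus snippet, 20 iterations
>     );
>
>     # The extra 10 iterations of the bare loop versus the extra 10
>     # iterations of loop-plus-snippet differ by exactly 10 executions
>     # of the snippet; the startup overhead cancels out entirely.
>     my $loop_only = $Ir{empty_20}  - $Ir{empty_10};
>     my $with_code = $Ir{active_20} - $Ir{active_10};
>     printf "Ir per iteration: %.1f\n", ($with_code - $loop_only) / 10;
>
> which with those made-up numbers gives 155.0.)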
>
> Cachegrind isn't perfect of course; for example, it may detect a cache miss,
> but won't tell you how long the stall was for. A real CPU may have
> anticipated an upcoming stall, and may have already scheduled a read into
> the cache, meaning that the actual stall is shortened. But overall, I've
> found that, as a proxy measure, it gives a good general indication of
> expected performance, without the 5-10% noise you usually get in
> timing-based benchmarks.
>
> Anyway, the easiest way to explain it is via a demo. The file
> t/perf/benchmarks contains the code snippets that this tool will
> benchmark by default. A typical entry in this file looks like:
>
>     'expr::assign::scalar_lex' => {
>         desc    => 'lexical $x = 1',
>         setup   => 'my $x',
>         code    => '$x = 1',
>     },
>
> The key is the test name, arranged into a hierarchy (it is also used as
> the name of the package the test is run in). Other than that, there's a
> description, the setup code, and the actual code to run. It's easy to add
> new tests, or to create a temporary benchmarks file with some custom code
> you wish to profile. Currently that file contains just three tests, but I
> expect it to expand hugely over time, to hundreds or even thousands of
> tests.
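>
> (Purely as a hypothetical illustration of how further entries might slot
> into the hierarchy, and not something that's in the file yet, you could
> imagine things like:
>
>     'expr::assign::scalar_pkg' => {
>         desc    => 'package $x = 1',
>         setup   => 'our $x',
>         code    => '$x = 1',
>     },
>     'loop::for::empty_range' => {
>         desc    => 'empty foreach loop over a small range',
>         setup   => '',
>         code    => 'for (1..5) {}',
>     },
>
> and so on.)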
>
> (In the following, things like perl5201o are just -O perl binaries in my
> path.)
>
> So let's run the tests against the most recently released perl from each
> branch:
>
>     $ Porting/bench.pl -j 8 \
>         perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
>
> Note that the -j 8 option means that it runs 8 cachegrinds in parallel.
> On my laptop, bench.pl took 8 seconds to run, or about 0.5s per test/perl
> combination.
>
> The output is similar to perlbench, in that it gives percentages for each
> test, followed by an average. The difference is that by default bench.pl
> displays 13 lines of results per test rather than just 1, since there are
> multiple values being measured (instruction reads, write misses etc.). The
> output which bench.pl produces is as follows, except that for brevity I've
> elided all the tests but one (keeping the explanatory header):
>
>     Key:
>         Ir   Instruction read
>         Dr   Data read
>         Dw   Data write
>         COND conditional branches
>         IND  indirect branches
>         _m   branch predict miss
>         _m1  level 1 cache miss
>         _mm  last cache (e.g. L3) miss
>         -    indeterminate percentage (e.g. 1/0)
>
>     The numbers represent relative counts per loop iteration, compared to
>     perl5125o at 100.0%.
>     Higher is better: for example, using half as many instructions gives 200%,
>     while using twice as many gives 50%.
>
>     expr::assign::scalar_lex
>     lexical $x = 1
>
>            perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
>            --------- --------- --------- --------- --------- ---------
>         Ir    100.00    107.05    101.83    106.37    107.74    103.73
>         Dr    100.00    103.64    100.00    105.56    105.56    100.00
>         Dw    100.00    100.00     96.77    100.00    100.00     96.77
>       COND    100.00    120.83    116.00    126.09    126.09    120.83
>        IND    100.00     80.00     80.00     80.00     80.00     80.00
>
>     COND_m    100.00    100.00    100.00    100.00    100.00    100.00
>      IND_m    100.00    100.00    100.00    100.00    100.00    100.00
>
>      Ir_m1    100.00    100.00    100.00    100.00    100.00    100.00
>      Dr_m1    100.00    100.00    100.00    100.00    100.00    100.00
>      Dw_m1    100.00    100.00    100.00    100.00    100.00    100.00
>
>      Ir_mm    100.00    100.00    100.00    100.00    100.00    100.00
>      Dr_mm    100.00    100.00    100.00    100.00    100.00    100.00
>      Dw_mm    100.00    100.00    100.00    100.00    100.00    100.00
>
>     AVERAGE
>
>            perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
>            --------- --------- --------- --------- --------- ---------
>         Ir    100.00    104.96    100.77    111.61    111.91    107.97
>         Dr    100.00    103.78    100.61    113.12    111.11    106.55
>         Dw    100.00    100.84     98.44    109.38    109.66    106.35
>       COND    100.00    114.40    110.86    129.65    132.39    124.43
>        IND    100.00     82.86     82.86     96.72     99.08     99.08
>
>     COND_m    100.00     75.00     60.00     75.00     60.00     42.86
>      IND_m    100.00    100.00     97.96    112.15    117.65    117.65
>
>      Ir_m1    100.00    100.00    100.00    100.00    100.00    100.00
>      Dr_m1    100.00    100.00    100.00    100.00    100.00    100.00
>      Dw_m1    100.00    100.00    100.00    100.00    100.00    100.00
>
>      Ir_mm    100.00    100.00    100.00    100.00    100.00    100.00
>      Dr_mm    100.00    100.00    100.00    100.00    100.00    100.00
>      Dw_mm    100.00    100.00    100.00    100.00    100.00    100.00
>
>
> Note that a simple $x = 1 assignment seems to have gotten worse between
> 5.20.1 and 5.21.6: more instruction reads, data reads, data writes etc.
> (Don't worry about the cache misses; for code this small the counts
> tend to be very small (often just 0 or 1), and so the variations between
> perls tend to remain stubbornly at exactly 100.00%.)
>
> Now, you're probably thinking at this point (based on previous experiences
> with perlbench) that these variations are merely noise. Well, let's run
> just the '$x = 1' test on each of the 5.21.x releases and see what we get.
> This time we'll display the raw counts rather than % differences, and
> we'll only display some fields of interest:
>
>     $ Porting/bench.pl -j 8 --raw --fields=Ir,Dr,Dw \
>        --tests=expr::assign::scalar_lex \
>        perl5210o perl5211o perl5212o perl5213o perl5214o perl5215o perl5216o
>
>     ....
>
>     expr::assign::scalar_lex
>     lexical $x = 1
>
>        perl5210o perl5211o perl5212o perl5213o perl5214o perl5215o perl5216o
>        --------- --------- --------- --------- --------- --------- ---------
>     Ir     155.0     155.0     155.0     155.0     161.0     161.0     161.0
>     Dr      54.0      54.0      54.0      54.0      57.0      57.0      57.0
>     Dw      30.0      30.0      30.0      30.0      31.0      31.0      31.0
>
> Note how the number of instruction reads remains completely constant at
> 155 reads until 5.21.4, at which point it consistently increases to 161.
> Similarly for Dr, Dw etc. Can we bisect this? Of course we can :-)
>
> Run this shell script:
>
>     D=/home/davem/perl5/git/bleed
>
>     $D/Porting/bisect.pl              \
>      --start=v5.21.3                  \
>      --end=v5.21.4                    \
>      -Doptimize=-O2                   \
>      -j 16                            \
>      --target=miniperl                \
>      -- perl5201o $D/Porting/bench.pl \
>           -j 8                             \
>           --benchfile=$D/t/perf/benchmarks \
>           --tests=expr::assign::scalar_lex \
>           --perlargs='-Ilib'               \
>           --bisect='Ir,153,157'            \
>           ./miniperl
>
>
> (D points to a perl directory outside the current one, so that access to
> bisect.pl etc. isn't affected by the bisecting.)
>
> The important argument here is --bisect='Ir,153,157', which instructs
> bench.pl to exit 0 only if the result for field Ir is in the range
> 153..157. Let's see when it goes outside that range. (5 minutes later...)
>
>     commit 7a82ace00e3cf19205df455cc1ec8a444d99d79d
>     Author: Jarkko Hietaniemi <jhi@iki.fi>
>     Date:   Thu Sep 11 21:19:07 2014 -0400
>
>         Use -fstack-protector-strong, if available.
>
> Hmmm, perhaps -fstack-protector-strong has a slight performance overhead
> we need to consider? (I don't know; I haven't looked at it any more
> closely yet.)
>
> But have you noticed what we just did? We successfully bisected a minute
> performance regression in the micro code sample '$x = 1'. Whoo hoo!
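>
> (For reference, the pass/fail test that --bisect applies is nothing
> fancier than a range check on the raw per-iteration count of the named
> field; conceptually something like:
>
>     # conceptual sketch of the --bisect='Ir,153,157' check
>     my ($field, $min, $max) = split /,/, 'Ir,153,157';
>     my $count = 155.0;   # stand-in for the measured count for $field
>     exit($count >= $min && $count <= $max ? 0 : 1);
>
> so the bisect can home in on the first commit where the check starts
> failing.)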
>
> There are a few other command-line options of particular interest.
> --write=<file> and --read=<file> allow you to write the raw data from the
> run to a file, and read it back later. --sort allows you to sort the
> order of the test output, based on a particular field/perl combination.
> Combined with --read and --write, it allows you to run the tests just
> once, then view the results in various ways:
>
>     $ Porting/bench.pl --write=/tmp/results \
>         perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
>     $ Porting/bench.pl --read=/tmp/results --sort=Ir,perl5216o
>     $ Porting/bench.pl --read=/tmp/results --sort=Dw,perl5125o
>     ...
>
> It's easy to create a temporary test while you're working on something.
> For example, suppose you're trying to improve the performance of regexes
> that use '$': just create a small test file and run it:
>
>     $ cat /tmp/benchmarks
>     [
>         'foo' => {
>             desc    => 'foo',
>             setup   => '$_ = "abc"',
>             code    => '/c$/',
>         },
>     ];
>
>     $ Porting/bench.pl -j 8 --benchfile=/tmp/benchmarks ./perl.orig ./perl
>
>
> Finally, I think it might be an idea to add a new step to the Release
> Manager's Guide along the lines of "early on, use bench.pl to compare the
> new release against the previous release, and see if anything has got
> noticeably worse" - probably using the --sort option to spot the worst
> cases.
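>
> Something along these lines, say (the binary names and results file are
> purely illustrative):
>
>     $ Porting/bench.pl -j 8 --write=/tmp/results ./perl.old ./perl.new
>     $ Porting/bench.pl --read=/tmp/results --sort=Ir,perl.new
>
> and then eyeball the top and bottom of the sorted list.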
>
>
> --
> Dave's first rule of Opera:
> If something needs saying, say it: don't warble it.



-- 
There is this special biologist word we use for 'stable'. It is
'dead'. -- Jack Cohen
