From: Jarkko Hietaniemi
Date: December 1, 2014 13:58
Subject: Re: ANNOUNCE: Porting/bench.pl
Message ID: CAJueppt1UJzoJ9RegTyh__V_2kiQ4Vnp_eprAHh9rQRJFT+wyw@mail.gmail.com
Excellent!
Though I must admit to the human flaw that I'd be even more enthusiastic
if you didn't keep finding my patches to be performance problems :-)
On Sat, Nov 29, 2014 at 6:29 PM, Dave Mitchell <davem@iabyn.com> wrote:
> On Tue, Oct 21, 2014 at 03:54:56PM +0100, Dave Mitchell wrote:
>> I have not yet done (and don't intend to do immediately):
> ...
>> create tools to run timings and cachegrinds of t/perf/benchmarks code
>> snippets, possibly across multiple perls
>
> Well, now I have.
>
> (I can say that I am genuinely very excited by this.)
>
> TL;DR: I've created a new tool, Porting/bench.pl, loosely inspired by
> perlbench, that measures the performance of perl code snippets across
> multiple perls, by using the output of cachegrind, rather than timing.
> It's very fast (and can run tests in parallel), and *very* precise (like
> to five significant digits), in the sense that multiple runs and/or
> similar perls give virtually the same values each time. It is sensitive
> enough to allow bisecting of tiny performance changes in small code
> snippets.
>
> I'd recommend that anyone interested in benchmarking perl itself should
> read the rest of this email.
>
> For people who aren't aware, cachegrind is a tool in the valgrind suite,
> available on a few common hardware/OS platform combinations. It works
> by executing the code in a sort of emulation mode: much more slowly than
> usual, but it records as it goes along how many instruction reads, branch
> misses, cache misses etc., occur. It can later be used to annotate the
> source code line by line, but for our purposes, it also prints an overall
> summary for the execution. Since we are not doing time measurements,
> nor using hardware performance counters (like perf), it is unaffected by
> load, interrupts, CPU swaps, etc. In fact, as long as we set
> PERL_HASH_SEED=0, the results are virtually identical across runs.
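>
> (For the curious, what bench.pl does under the hood is roughly along the
> lines of the following -- just a sketch, and the exact flags it passes
> may well differ:
>
> $ PERL_HASH_SEED=0 valgrind --tool=cachegrind --branch-sim=yes \
>       ./perl -Ilib -e 'my $x; $x = 1 for 1..10'
>
> --branch-sim=yes is what enables the conditional/indirect branch counts,
> and the overall summary printed at the end is what gets used.)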
>
> We can use this to run an empty loop and an active loop, at both 10 and 20
> iterations (so four cachegrind runs in all), and subtract out the startup
> and loop overhead to get an exact set of counts for a single execution of
> the code snippet.
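>
> A minimal sketch of that arithmetic (illustrative names only, not
> bench.pl's actual code):
>
>     # count(n) = startup + n * (loop overhead + body), where body is 0
>     # for the empty loop; using two iteration counts cancels the startup
>     # term, and the empty loop cancels the loop overhead.
>     sub body_count_per_iteration {
>         my ($empty_10, $empty_20, $active_10, $active_20) = @_;
>         return (($active_20 - $active_10) - ($empty_20 - $empty_10)) / 10;
>     }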
>
> Cachegrind isn't perfect, of course; for example, it may detect a cache
> miss, but it won't tell you how long the resulting stall was. A real CPU
> may have anticipated an upcoming stall and already scheduled a read into
> the cache, shortening the actual stall. But overall I've found that, as a
> surrogate measure, it gives a good general indication of
> expected performance, without the 5-10% noise you usually get in
> timing-based benchmarks.
>
> Anyway, the easiest way to explain it is via a demo. The file
> t/perf/benchmarks contains the code snippets that this tool will
> benchmark by default. A typical entry in this file looks like:
>
>     'expr::assign::scalar_lex' => {
>         desc  => 'lexical $x = 1',
>         setup => 'my $x',
>         code  => '$x = 1',
>     },
>
> The key is the test name, arranged into a hierarchy (it is also used as
> the name of the package the test is run in). Other than that, there's a
> description, the setup code, and the actual code to run. It's easy to add
> new tests, or to create a temporary benchmarks file with some custom code
> you wish to profile. Currently that file just contains three tests, but I
> expect it to expand hugely over time, and to have 100's or even 1000's of
> tests.
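>
> Adding, say, a string-append test would just mean adding another entry of
> the same shape (this one is purely illustrative; it's not in the file):
>
>     'expr::concat::scalar_lex' => {
>         desc  => 'lexical $x .= "foo"',
>         setup => 'my $x = ""',
>         code  => '$x .= "foo"',
>     },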
>
> (In the following, things like perl5201o are just -O perl binaries in my
> path.)
>
> So let's run the tests against the most recently released perl from each
> branch:
>
> $ Porting/bench.pl -j 8 \
>       perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
>
> Note that the -j 8 option means that it runs 8 cachegrinds in parallel.
> On my laptop, bench.pl took 8 seconds to run, or about 0.5s per test/perl
> combination.
>
> The output is similar to perlbench, in that it gives percentages for each
> test, followed by an average. The difference is that by default bench.pl
> displays 13 lines of results per test rather than just 1, since there are
> multiple values being measured (instruction reads, write misses etc.). The
> output which bench.pl produces is as follows, except that for brevity I've
> elided all the tests except one (but kept the explanatory header):
>
> Key:
>     Ir    Instruction read
>     Dr    Data read
>     Dw    Data write
>     COND  conditional branches
>     IND   indirect branches
>     _m    branch predict miss
>     _m1   level 1 cache miss
>     _mm   last cache (e.g. L3) miss
>     -     indeterminate percentage (e.g. 1/0)
>
> The numbers represent relative counts per loop iteration, compared to
> perl5125o at 100.0%.
> Higher is better: for example, using half as many instructions gives 200%,
> while using twice as many gives 50%.
>
> expr::assign::scalar_lex
>     lexical $x = 1
>
>          perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
>          --------- --------- --------- --------- --------- ---------
>       Ir    100.00    107.05    101.83    106.37    107.74    103.73
>       Dr    100.00    103.64    100.00    105.56    105.56    100.00
>       Dw    100.00    100.00     96.77    100.00    100.00     96.77
>     COND    100.00    120.83    116.00    126.09    126.09    120.83
>      IND    100.00     80.00     80.00     80.00     80.00     80.00
>
>   COND_m    100.00    100.00    100.00    100.00    100.00    100.00
>    IND_m    100.00    100.00    100.00    100.00    100.00    100.00
>
>    Ir_m1    100.00    100.00    100.00    100.00    100.00    100.00
>    Dr_m1    100.00    100.00    100.00    100.00    100.00    100.00
>    Dw_m1    100.00    100.00    100.00    100.00    100.00    100.00
>
>    Ir_mm    100.00    100.00    100.00    100.00    100.00    100.00
>    Dr_mm    100.00    100.00    100.00    100.00    100.00    100.00
>    Dw_mm    100.00    100.00    100.00    100.00    100.00    100.00
>
> AVERAGE
>
>          perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
>          --------- --------- --------- --------- --------- ---------
>       Ir    100.00    104.96    100.77    111.61    111.91    107.97
>       Dr    100.00    103.78    100.61    113.12    111.11    106.55
>       Dw    100.00    100.84     98.44    109.38    109.66    106.35
>     COND    100.00    114.40    110.86    129.65    132.39    124.43
>      IND    100.00     82.86     82.86     96.72     99.08     99.08
>
>   COND_m    100.00     75.00     60.00     75.00     60.00     42.86
>    IND_m    100.00    100.00     97.96    112.15    117.65    117.65
>
>    Ir_m1    100.00    100.00    100.00    100.00    100.00    100.00
>    Dr_m1    100.00    100.00    100.00    100.00    100.00    100.00
>    Dw_m1    100.00    100.00    100.00    100.00    100.00    100.00
>
>    Ir_mm    100.00    100.00    100.00    100.00    100.00    100.00
>    Dr_mm    100.00    100.00    100.00    100.00    100.00    100.00
>    Dw_mm    100.00    100.00    100.00    100.00    100.00    100.00
>
>
> Note that a simple $x = 1 assignment seems to have gotten worse between
> 5.20.1 and 5.21.6: more instruction reads, data reads, data writes etc.
> (Don't worry about the cache misses; for code this small the counts
> tend to be very small (often just 0 or 1), so those rows tend to remain
> stubbornly at exactly 100.00% across all the perls.)
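>
> (To spell out the arithmetic behind those percentages: each figure is
>
>     100 * count(perl5125o) / count(this perl)
>
> per loop iteration, which is why using fewer instructions than the
> baseline shows up as a number above 100.)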
>
> Now, you're probably thinking at this point (based on previous experiences
> with perlbench) that these variations are merely noise. Well, let's run
> just the '$x = 1' test on each of the 5.21.x releases and see what we get.
> This time we'll display the raw counts rather than % differences, and
> we'll only display some fields of interest:
>
> $ Porting/bench.pl -j 8 --raw --fields=Ir,Dr,Dw \
>       --tests=expr::assign::scalar_lex \
>       perl5210o perl5211o perl5212o perl5213o perl5214o perl5215o perl5216o
>
> ....
>
> expr::assign::scalar_lex
>     lexical $x = 1
>
>          perl5210o perl5211o perl5212o perl5213o perl5214o perl5215o perl5216o
>          --------- --------- --------- --------- --------- --------- ---------
>       Ir     155.0     155.0     155.0     155.0     161.0     161.0     161.0
>       Dr      54.0      54.0      54.0      54.0      57.0      57.0      57.0
>       Dw      30.0      30.0      30.0      30.0      31.0      31.0      31.0
>
> Note how the number of instruction reads remains completely constant at
> 155 reads until 5.21.4, at which point it consistently increases to 161.
> Similarly for Dr, Dw etc. Can we bisect this? Of course we can :-)
>
> Run this shell script:
>
> D=/home/davem/perl5/git/bleed
>
> $D/Porting/bisect.pl \
>     --start=v5.21.3 \
>     --end=v5.21.4 \
>     -Doptimize=-O2 \
>     -j 16 \
>     --target=miniperl \
>     -- perl5201o $D/Porting/bench.pl \
>            -j 8 \
>            --benchfile=$D/t/perf/benchmarks \
>            --tests=expr::assign::scalar_lex \
>            --perlargs='-Ilib' \
>            --bisect='Ir,153,157' \
>            ./miniperl
>
>
> (D points to a perl directory outside the current one, so that access to
> bisect.pl etc isn't affected by the bisecting.)
>
> The important argument here is --bisect='Ir,153,157', which instructs
> bench.pl to exit 0 only if the result for field Ir is in the range
> 153..157. Let's see when it goes outside that range. (5 minutes later...)
>
> commit 7a82ace00e3cf19205df455cc1ec8a444d99d79d
> Author: Jarkko Hietaniemi <jhi@iki.fi>
> Date: Thu Sep 11 21:19:07 2014 -0400
>
> Use -fstack-protector-strong, if available.
>
> Hmmm, perhaps -fstack-protector-strong has a slight performance overhead
> we need to consider? (I don't know, I haven't looked any closer at it
> yet).
>
> But have you noticed what we just did? We successfully bisected a minute
> performance regression in the micro code sample '$x = 1'. Whoo hoo!
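>
> (Incidentally, you can check the --bisect predicate by hand as well; this
> is just a sketch reusing the options from the bisect run above, with the
> echoed words being purely illustrative:
>
> $ perl5201o $D/Porting/bench.pl --benchfile=$D/t/perf/benchmarks \
>       --tests=expr::assign::scalar_lex --perlargs='-Ilib' \
>       --bisect='Ir,153,157' ./miniperl \
>       && echo in-range || echo out-of-range
>
> All bisect.pl does with it is watch that exit status at each revision it
> builds.)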
>
> There are a few other command-line options of particular interest.
> --write=<file> and --read=<file> allow you to write the raw data from the
> run to a file, and read it back later. --sort allows you to sort the
> order of the test output, based on a particular field/perl combination.
> Combined with --read and --write, it allows you to run the tests just
> once, then view the results in various ways:
>
> $ Porting/bench.pl --write=/tmp/results \
>       perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
> $ Porting/bench.pl --read=/tmp/results --sort=Ir,perl5216o
> $ Porting/bench.pl --read=/tmp/results --sort=Dw,perl5125o
> ...
>
> It's easy to create a temporary test while you're working on something.
> For example, suppose you're trying to improve the performance of regexes
> that use '$': just create a small test file and run it:
>
> $ cat /tmp/benchmarks
> [
>     'foo' => {
>         desc  => 'foo',
>         setup => '$_ = "abc"',
>         code  => '/c$/',
>     },
> ];
>
> $ Porting/bench.pl -j 8 --benchfile=/tmp/benchmarks ./perl.orig ./perl
>
>
> Finally, I think it might be an idea to add a new step to the Release
> Manager's Guide along the lines of "early on, use bench.pl to compare the
> new release against the previous release, and see if anything has got
> noticeably worse" - probably using the --sort option to spot the worst
> cases.
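>
> Something along these lines, say (the binary names here are just
> placeholders):
>
> $ Porting/bench.pl --write=/tmp/new-vs-prev ./perl-new ./perl-prev
> $ Porting/bench.pl --read=/tmp/new-vs-prev --sort=Ir,./perl-new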
>
>
> --
> Dave's first rule of Opera:
> If something needs saying, say it: don't warble it.
--
There is this special biologist word we use for 'stable'. It is
'dead'. -- Jack Cohen