ANNOUNCE: Porting/bench.pl
From: Dave Mitchell
Date: November 29, 2014 23:30
Subject: ANNOUNCE: Porting/bench.pl
Message ID: 20141129232956.GE15713@iabyn.com
On Tue, Oct 21, 2014 at 03:54:56PM +0100, Dave Mitchell wrote:
> I have not yet done (and don't intend to do immediately):
...
> create tools to run timings and cachegrinds of t/perf/benchmarks code
> snippets, possibly across multiple perls
Well, now I have.
(I can say that I am genuinely very excited by this.)
TL;DR: I've created a new tool, Porting/bench.pl, loosely inspired by
perlbench, that measures the performance of perl code snippets across
multiple perls, by using the output of cachegrind, rather than timing.
It's very fast (and can run tests in parallel), and *very* precise (like
to five significant digits), in the sense that multiple runs and/or
similar perls give virtually the same values each time. It is sensitive
enough to allow bisecting of tiny performance changes in small code
snippets.
I'd recommend that anyone interested in benchmarking perl itself should
read the rest of this email.
For people who aren't aware, cachegrind is a tool in the valgrind suite,
available on a few common hardware/OS platform combinations. It works
by executing the code in a sort of emulation mode: much more slowly than
usual, but recording as it goes how many instruction reads, branch
misses, cache misses and so on occur. Those counts can later be used to
annotate the source code line by line, but for our purposes what matters
is the overall summary it prints for the run. Since we are not doing
time measurements,
nor using hardware performance counters (like perf), it is unaffected by
load, interrupts, or CPU swaps etc. In fact, as long as we set
PERL_HASH_SEED=0, the results are virtually identical across runs.
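(As a concrete aside, and not something bench.pl requires you to do by hand:
the kind of summary it works from is what cachegrind itself prints at the end
of a run, e.g.

$ PERL_HASH_SEED=0 valgrind --tool=cachegrind --branch-sim=yes \
      perl -e 'my $x; $x = 1 for 1..1000'

where --branch-sim=yes enables the branch-prediction simulation that the
COND/IND counts mentioned below come from.)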
We can use cachegrind to run an empty loop and an active loop (one containing
the code snippet), each for 10 and then for 20 iterations - so four cachegrind
runs in total - and subtract out the startup and loop overhead to get an exact
set of counts for a single execution of the code snippet.
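To make the arithmetic concrete, here's a minimal sketch of how I think of the
subtraction, with invented numbers (an illustration, not code lifted from
bench.pl):

    # Totals reported by cachegrind for some field (e.g. Ir); numbers invented:
    my ($empty_10, $empty_20)   = (10_000, 10_500);   # empty loop, 10 and 20 iterations
    my ($active_10, $active_20) = (10_200, 10_900);   # same loop wrapped around the snippet

    # Going from 10 to 20 iterations adds exactly 10 iterations' worth of
    # work, so the one-off startup cost cancels out:
    my $loop_per_iter   = ($empty_20  - $empty_10)  / 10;   # bare loop overhead
    my $active_per_iter = ($active_20 - $active_10) / 10;   # loop + snippet

    # ... and subtracting the bare loop leaves the snippet on its own:
    my $snippet_per_iter = $active_per_iter - $loop_per_iter;
    printf "snippet costs %d counts per execution\n", $snippet_per_iter;   # prints 20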
Cachegrind isn't perfect of course; for example, it may detect a cache miss,
but won't tell you how long the stall was for. A real CPU may have
anticipated an upcoming stall, and may have already scheduled a read into
the cache, meaning that the actual stall is shortened. But overall, I've
found that as a surrogate outcome, it gives a good general indication of
expected performance, without the 5-10% noise you usually get in
timing-based benchmarks.
Anyway, the easiest way to explain it is via a demo. The file
t/perf/benchmarks contains the code snippets that this tool will
benchmark by default. A typical entry in this file looks like:
    'expr::assign::scalar_lex' => {
        desc  => 'lexical $x = 1',
        setup => 'my $x',
        code  => '$x = 1',
    },
The key is the test name, arranged into a hierarchy; it is also used as the
name of the package the test is run in. Other than that, there's a
description, the setup code, and the actual code to run. It's easy to add
new tests, or to create a temporary benchmarks file with some custom code
you wish to profile. Currently that file contains just three tests, but I
expect it to expand hugely over time, to hundreds or even thousands of
tests.
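For example, a new entry for a hash store might look something like this (a
made-up illustration, not one of the three tests currently in the file):

    'expr::assign::hash_lex' => {
        desc  => 'lexical $h{foo} = 1',
        setup => 'my %h',
        code  => '$h{foo} = 1',
    },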
(In the following, things like perl5201o are just -O perl binaries in my
path.)
So let's run the tests against the most recent released perls from each
branch:
$ Porting/bench.pl -j 8 \
      perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
Note that the -j 8 option means that it runs 8 cachegrinds in parallel.
On my laptop, bench.pl took 8 seconds to run, or about 0.5s per test/perl
combination.
The output is similar to perlbench, in that it gives percentages for each
test, followed by an average. The difference is that by default bench.pl
displays 13 lines of results per test rather than just 1, since there are
multiple values being measured (instruction reads, write misses etc.). The
output which bench.pl produces is as follows, except that for brevity I've
elided all the tests except one (but kept the explanatory header):
Key:
    Ir      Instruction read
    Dr      Data read
    Dw      Data write
    COND    conditional branches
    IND     indirect branches
    _m      branch predict miss
    _m1     level 1 cache miss
    _mm     last cache (e.g. L3) miss
    -       indeterminate percentage (e.g. 1/0)
The numbers represent relative counts per loop iteration, compared to
perl5125o at 100.0%.
Higher is better: for example, using half as many instructions gives 200%,
while using twice as many gives 50%.
expr::assign::scalar_lex
lexical $x = 1
        perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
        --------- --------- --------- --------- --------- ---------
    Ir     100.00    107.05    101.83    106.37    107.74    103.73
    Dr     100.00    103.64    100.00    105.56    105.56    100.00
    Dw     100.00    100.00     96.77    100.00    100.00     96.77
  COND     100.00    120.83    116.00    126.09    126.09    120.83
   IND     100.00     80.00     80.00     80.00     80.00     80.00
COND_m     100.00    100.00    100.00    100.00    100.00    100.00
 IND_m     100.00    100.00    100.00    100.00    100.00    100.00
 Ir_m1     100.00    100.00    100.00    100.00    100.00    100.00
 Dr_m1     100.00    100.00    100.00    100.00    100.00    100.00
 Dw_m1     100.00    100.00    100.00    100.00    100.00    100.00
 Ir_mm     100.00    100.00    100.00    100.00    100.00    100.00
 Dr_mm     100.00    100.00    100.00    100.00    100.00    100.00
 Dw_mm     100.00    100.00    100.00    100.00    100.00    100.00
AVERAGE
        perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
        --------- --------- --------- --------- --------- ---------
    Ir     100.00    104.96    100.77    111.61    111.91    107.97
    Dr     100.00    103.78    100.61    113.12    111.11    106.55
    Dw     100.00    100.84     98.44    109.38    109.66    106.35
  COND     100.00    114.40    110.86    129.65    132.39    124.43
   IND     100.00     82.86     82.86     96.72     99.08     99.08
COND_m     100.00     75.00     60.00     75.00     60.00     42.86
 IND_m     100.00    100.00     97.96    112.15    117.65    117.65
 Ir_m1     100.00    100.00    100.00    100.00    100.00    100.00
 Dr_m1     100.00    100.00    100.00    100.00    100.00    100.00
 Dw_m1     100.00    100.00    100.00    100.00    100.00    100.00
 Ir_mm     100.00    100.00    100.00    100.00    100.00    100.00
 Dr_mm     100.00    100.00    100.00    100.00    100.00    100.00
 Dw_mm     100.00    100.00    100.00    100.00    100.00    100.00
Note that a simple $x = 1 assignment seems to have gotten worse between
5.20.1 and 5.21.6: more instruction reads, data reads, data writes etc.
(Don't worry about the cache misses; for code this small the counts
tend to be very small (often just 0 or 1), and so the variations between
perls tend to remain stubbornly at exactly 100.00%.)
Now, you're probably thinking at this point (based on previous experiences
with perlbench) that these variations are merely noise. Well, let's run
just the '$x = 1' test on each of the 5.21.x releases, and see what we get.
This time we'll display the raw counts rather than % differences, and
we'll only display some fields of interest:
$ Porting/bench.pl -j 8 --raw --fields=Ir,Dr,Dw \
      --tests=expr::assign::scalar_lex \
      perl5210o perl5211o perl5212o perl5213o perl5214o perl5215o perl5216o
....
expr::assign::scalar_lex
lexical $x = 1
        perl5210o perl5211o perl5212o perl5213o perl5214o perl5215o perl5216o
        --------- --------- --------- --------- --------- --------- ---------
    Ir      155.0     155.0     155.0     155.0     161.0     161.0     161.0
    Dr       54.0      54.0      54.0      54.0      57.0      57.0      57.0
    Dw       30.0      30.0      30.0      30.0      31.0      31.0      31.0
Note how the number of instruction reads remains completely constant at
155 reads until 5.21.4, at which point it consistently increases to 161.
Similarly for Dr, Dw etc. Can we bisect this? Of course we can :-)
Run this shell script:
D=/home/davem/perl5/git/bleed

$D/Porting/bisect.pl \
    --start=v5.21.3 \
    --end=v5.21.4 \
    -Doptimize=-O2 \
    -j 16 \
    --target=miniperl \
    -- perl5201o $D/Porting/bench.pl \
        -j 8 \
        --benchfile=$D/t/perf/benchmarks \
        --tests=expr::assign::scalar_lex \
        --perlargs='-Ilib' \
        --bisect='Ir,153,157' \
        ./miniperl
(D points to a perl directory outside the current one, so that access to
bisect.pl etc isn't affected by the bisecting.)
The important argument here is --bisect='Ir,153,157', which instructs
bench.pl to exit 0 only if the result for field Ir is in the range
153..157. Let's see when it goes outside that range. (5 minutes later...)
commit 7a82ace00e3cf19205df455cc1ec8a444d99d79d
Author: Jarkko Hietaniemi <jhi@iki.fi>
Date:   Thu Sep 11 21:19:07 2014 -0400

    Use -fstack-protector-strong, if available.
Hmmm, perhaps -fstack-protector-strong has a slight performance overhead
we need to consider? (I don't know, I haven't looked any closer at it
yet).
But have you noticed what we just did? Successfully bisected a minute
performance regression in the micro code sample '$x=1'. Whoo hoo!
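(One practical note: before kicking off a bisect like that, it's worth checking
the --bisect condition by hand against builds you already have, to confirm that
the "good" end really passes and the "bad" end really fails. Something along
these lines, using whatever binaries are lying around:

$ Porting/bench.pl --tests=expr::assign::scalar_lex \
      --bisect='Ir,153,157' perl5213o && echo in range || echo out of range
$ Porting/bench.pl --tests=expr::assign::scalar_lex \
      --bisect='Ir,153,157' perl5214o && echo in range || echo out of range

The first should report in range (Ir was 155 there) and the second out of
range (161).)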
There are a few other command-line options of particular interest.
--write=<file> and --read=<file> allow you to write the raw data from the
run to a file, and read it back later. --sort allows you to sort the
order of the test output, based on a particular field/perl combination.
Combined with --read and --write, this allows you to run the tests just
once, then view the results in various ways:
$ Porting/bench.pl --write=/tmp/results \
      perl5125o perl5144o perl5163o perl5182o perl5201o perl5216o
$ Porting/bench.pl --read=/tmp/results --sort=Ir,perl5216o
$ Porting/bench.pl --read=/tmp/results --sort=Dw,perl5125o
...
It's easy to create a temporary test while you're working on something.
For example, suppose you're trying to improve the performance of regexes
that use '$': just create a small test file and run it:
$ cat /tmp/benchmarks
[
    'foo' => {
        desc  => 'foo',
        setup => '$_ = "abc"',
        code  => '/c$/',
    },
];
$ Porting/bench.pl -j 8 --benchfile=/tmp/benchmarks ./perl.orig ./perl
Finally, I think it might be an idea to add a new step to the Release
Manager's Guide along the lines of "early on, use bench.pl to compare the
new release against the previous release, and see if anything has got
noticeably worse" - probably using the --sort option to spot the worst
cases.
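For example, something along these lines (the file name is arbitrary;
substitute the previous release and the release candidate for the two perls):

$ Porting/bench.pl --write=/tmp/relcmp perl5201o ./perl
$ Porting/bench.pl --read=/tmp/relcmp --sort=Ir,./perl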
--
Dave's first rule of Opera:
If something needs saying, say it: don't warble it.