WANTED: "whole program" benchmarks (was Re: CPU noise (was Re:optimising JRuby by avoiding hashes))

From: Nicholas Clark
Date: September 17, 2012 08:29
Subject: WANTED: "whole program" benchmarks (was Re: CPU noise (was Re:optimising JRuby by avoiding hashes))
Message ID: 20120917152922.GX94462@plum.flirble.org

tl;dr: 4% difference from link order of object files. "WTF"?
       need "whole program" Perl benchmarks.

On Fri, Sep 14, 2012 at 05:14:22PM +0100, Nicholas Clark wrote:
> On Tue, Sep 11, 2012 at 01:34:42PM +0100, Nicholas Clark wrote:

> Note that *no* command line arguments is measurably slower than having
> command line arguments. (This is repeatable on the same binary, and
> repeatable on a different binary.)
> 
> Yet cachegrind shows that the *faster* code actually executes *more
> instructions* and reads *more data*. Not the slower.

> And even the disabled case (*more work*) is faster than the comparable run
> above. Even though cachegrind shows that the "less work" case is fewer
> I and D references:

So, more fun. Keep @ARGV the same. Keep all the *code* the same...

I'm using the same object files (compiled yesterday morning), functions
aligned at 64 bytes (so a cache line), everything else at 8. But this shouldn't
matter that much, as all I'm doing is relinking them. The default Makefile
links like this:

rm -f libperl.a
/usr/bin/ar rcu libperl.a op.o perl.o   gv.o toke.o perly.o pad.o regcomp.o dump.o util.o mg.o reentr.o mro.o keywords.o hv.o av.o run.o pp_hot.o sv.o pp.o scope.o pp_ctl.o pp_sys.o doop.o doio.o regexec.o utf8.o taint.o deb.o universal.o globals.o perlio.o perlapi.o numeric.o mathoms.o locale.o pp_pack.o pp_sort.o   DynaLoader.o
ccache gcc -o perl  -fstack-protector -L/usr/local/lib -Wl,-E perlmain.o  libperl.a `cat ext.libs` -lnsl -ldl -lm -lcrypt -lutil -lc

and I can repeatedly get this:

[nicholas@dromedary perl4]$ dumbbench -- ./perl ~/test/test_method_cache.pl A
cmd: Ran 23 iterations (2 outliers).
cmd: Rounded run time per iteration: 3.5210e+00 +/- 9.4e-03 (0.3%)
[nicholas@dromedary perl4]$ dumbbench -- ./perl ~/test/test_method_cache.pl B
cmd: Ran 21 iterations (1 outliers).
cmd: Rounded run time per iteration: 3.6366e+00 +/- 7.1e-03 (0.2%)

(ie the 'B' case, with 17,611,172,167 I refs and 9,370,412,893 D refs
  is *slower* than the 17,901,159,575 I refs and 9,450,408,371 D refs
 of the 'A' case.)


but if I hack Makefile.SH like this
(warning: blatant GNU make-isms. Look away now if this offends):

diff --git a/Makefile.SH b/Makefile.SH
index 5194ecf..412cb89 100755
--- a/Makefile.SH
+++ b/Makefile.SH
@@ -795,7 +795,7 @@ $(LIBPERL): $& $(obj) $(DYNALOADER) $(LIBPERLEXPORT)
        true)
                $spitshell >>$Makefile <<'!NO!SUBS!'
        rm -f $@
-       $(LD) -o $@ $(SHRPLDFLAGS) $(obj) $(DYNALOADER) $(libs)
+       $(LD) -o $@ $(SHRPLDFLAGS) $(shell ls -1 $(obj) | tac);  $(DYNALOADER) $
 !NO!SUBS!
                case "$osname" in
                aix)
@@ -810,7 +810,7 @@ $(LIBPERL): $& $(obj) $(DYNALOADER) $(LIBPERLEXPORT)
        *)
                $spitshell >>$Makefile <<'!NO!SUBS!'
        rm -f $(LIBPERL)
-       $(AR) rcu $(LIBPERL) $(obj) $(DYNALOADER)
+       $(AR) rcu $(LIBPERL) $(shell ls -1 $(obj) | tac) $(DYNALOADER)
        @$(ranlib) $(LIBPERL)
 !NO!SUBS!
                ;;

so that the link line now looks like this:

rm -f libperl.a
/usr/bin/ar rcu libperl.a util.o utf8.o universal.o toke.o taint.o sv.o scope.o run.o regexec.o regcomp.o reentr.o pp_sys.o pp_sort.o pp_pack.o pp_hot.o pp_ctl.o pp.o perly.o perlio.o perlapi.o perl.o pad.o op.o numeric.o mro.o mg.o mathoms.o locale.o keywords.o hv.o gv.o globals.o dump.o doop.o doio.o deb.o av.o DynaLoader.o
ccache gcc -o perl  -fstack-protector -L/usr/local/lib -Wl,-E perlmain.o  libperl.a `cat ext.libs` -lnsl -ldl -lm -lcrypt -lutil -lc
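For anyone wanting to reproduce the reordering without GNU make, here is a minimal Python sketch of what `ls -1 $(obj) | tac` does to the object list: sort by name, then reverse. It assumes byte-wise (C locale) collation, which is my assumption about the environment; the function name is made up for illustration.

```python
# Object files in the order the default Makefile hands them to ar
# (copied from the link line in the post).
DEFAULT_ORDER = (
    "op.o perl.o gv.o toke.o perly.o pad.o regcomp.o dump.o util.o mg.o "
    "reentr.o mro.o keywords.o hv.o av.o run.o pp_hot.o sv.o pp.o scope.o "
    "pp_ctl.o pp_sys.o doop.o doio.o regexec.o utf8.o taint.o deb.o "
    "universal.o globals.o perlio.o perlapi.o numeric.o mathoms.o locale.o "
    "pp_pack.o pp_sort.o"
).split()


def reversed_link_order(objs):
    """Mimic `ls -1 ... | tac`: sort names, then reverse the list.

    Assumption: names compare by code point, as `ls` does in the C locale.
    Other locales may order '.' and '_' differently.
    """
    return sorted(objs, reverse=True)


if __name__ == "__main__":
    # DynaLoader.o is appended separately, as in the hacked Makefile rule.
    print(" ".join(reversed_link_order(DEFAULT_ORDER) + ["DynaLoader.o"]))
```

Under that collation assumption this gives the same order as the ar line above.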

Then the run times change:

[nicholas@dromedary perl4]$ dumbbench -- ./perl ~/test/test_method_cache.pl A
cmd: Ran 21 iterations (0 outliers).
cmd: Rounded run time per iteration: 3.559e+00 +/- 1.3e-02 (0.4%)
[nicholas@dromedary perl4]$ dumbbench -- ./perl ~/test/test_method_cache.pl B
cmd: Ran 22 iterations (2 outliers).
cmd: Rounded run time per iteration: 3.489e+00 +/- 1.0e-02 (0.3%)

Remember, with the other order, it was 3.6366e+00 +/- 7.1e-03 (0.2%) for B,

ie a 4% difference between the two link orders.
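To show where the 4% comes from, and that it is well outside the quoted noise, here is a small sketch of the arithmetic in Python. It treats dumbbench's "+/-" figures as independent one-sigma uncertainties, which is an assumption on my part; the helper names are invented for this example.

```python
import math

# 'B' case timings from the two link orders: (seconds, quoted uncertainty).
B_DEFAULT_ORDER = (3.6366, 7.1e-03)
B_REVERSED_ORDER = (3.489, 1.0e-02)


def relative_difference(slow, fast):
    """Fractional slowdown of `slow` relative to `fast`."""
    return (slow - fast) / fast


def sigma_separation(a, a_err, b, b_err):
    """Gap between two measurements in units of their combined uncertainty.

    Assumption: the two errors are independent and roughly Gaussian, so
    they combine in quadrature.
    """
    return abs(a - b) / math.hypot(a_err, b_err)


if __name__ == "__main__":
    rel = relative_difference(B_DEFAULT_ORDER[0], B_REVERSED_ORDER[0])
    sep = sigma_separation(*B_DEFAULT_ORDER, *B_REVERSED_ORDER)
    print(f"relative difference: {100 * rel:.1f}%")
    print(f"separation: {sep:.0f} combined sigma")
```

The gap is many times larger than the combined uncertainty, so this is not just run-to-run jitter.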

So, what on earth could be going on there?

Please note that the cachegrind output for the "B" case for the two link
orders is pretty much identical:

==29304== I   refs:      17,611,172,167
==29304== I1  misses:             7,922
==29304== L2i misses:             3,432
==29304== I1  miss rate:           0.00%
==29304== L2i miss rate:           0.00%
==29304==
==29304== D   refs:       9,370,412,893  (6,460,281,355 rd   + 2,910,131,538 wr)
==29304== D1  misses:             9,573  (        6,111 rd   +         3,462 wr)
==29304== L2d misses:             5,294  (        2,486 rd   +         2,808 wr)
==29304== D1  miss rate:            0.0% (          0.0%     +           0.0%  )
==29304== L2d miss rate:            0.0% (          0.0%     +           0.0%  )
==29304==
==29304== L2 refs:               17,495  (       14,033 rd   +         3,462 wr)
==29304== L2 misses:              8,726  (        5,918 rd   +         2,808 wr)
==29304== L2 miss rate:             0.0% (          0.0%     +           0.0%  )

vs

==15487== I   refs:      17,611,172,408
==15487== I1  misses:             7,743
==15487== L2i misses:             3,434
==15487== I1  miss rate:           0.00%
==15487== L2i miss rate:           0.00%
==15487== 
==15487== D   refs:       9,370,412,923  (6,460,281,385 rd   + 2,910,131,538 wr)
==15487== D1  misses:             9,584  (        6,116 rd   +         3,468 wr)
==15487== L2d misses:             5,297  (        2,488 rd   +         2,809 wr)
==15487== D1  miss rate:            0.0% (          0.0%     +           0.0%  )
==15487== L2d miss rate:            0.0% (          0.0%     +           0.0%  )
==15487== 
==15487== L2 refs:               17,327  (       13,859 rd   +         3,468 wr)
==15487== L2 misses:              8,731  (        5,922 rd   +         2,809 wr)
==15487== L2 miss rate:             0.0% (          0.0%     +           0.0%  )


ie they seem to be doing the same amount of CPU work, and suffering the same
cache misses. Yet there's a repeatable 4% difference in the elapsed time.
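To put a number on "the same amount of CPU work", here is a rough Python sketch that pulls the counters out of two cachegrind summaries and compares them. The regular expression is my own guess at a good-enough pattern for the lines shown above, not anything cachegrind defines, and only a few lines of each summary are included.

```python
import re

# Matches e.g. "==29304== I   refs:      17,611,172,167" and captures the
# label and the first integer. Percentage lines ("0.00%") are skipped
# because the digits must be followed by whitespace or end of line.
SUMMARY_LINE = re.compile(r"^==\d+==\s+(.+?):\s+([\d,]+)(?:\s|$)")

DEFAULT_ORDER_RUN = """\
==29304== I   refs:      17,611,172,167
==29304== I1  misses:             7,922
==29304== I1  miss rate:           0.00%
==29304== D   refs:       9,370,412,893  (6,460,281,355 rd   + 2,910,131,538 wr)
"""

REVERSED_ORDER_RUN = """\
==15487== I   refs:      17,611,172,408
==15487== I1  misses:             7,743
==15487== I1  miss rate:           0.00%
==15487== D   refs:       9,370,412,923  (6,460,281,385 rd   + 2,910,131,538 wr)
"""


def parse_counts(text):
    """Return {label: count} for the integer counters in a summary."""
    counts = {}
    for line in text.splitlines():
        m = SUMMARY_LINE.match(line)
        if m:
            label = " ".join(m.group(1).split())  # collapse padding
            counts[label] = int(m.group(2).replace(",", ""))
    return counts


if __name__ == "__main__":
    a = parse_counts(DEFAULT_ORDER_RUN)
    b = parse_counts(REVERSED_ORDER_RUN)
    for label in a:
        delta = b[label] - a[label]
        print(f"{label:10s} delta {delta:+d} ({delta / a[label]:+.2e} relative)")
```

The instruction and data reference counts differ by a few hundred at most out of billions, nowhere near enough to account for a 4% change in elapsed time.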


Anyway, if I then hack Makefile.SH of the tree with the monomorphic cache to
link the object files in an order identical to that of the clean build, I get:

[nicholas@dromedary perl]$ dumbbench -- ./perl ~/test/test_method_cache.pl A
cmd: Ran 21 iterations (1 outliers).
cmd: Rounded run time per iteration: 3.2797e+00 +/- 7.3e-03 (0.2%)
[nicholas@dromedary perl]$ dumbbench -- ./perl ~/test/test_method_cache.pl B
cmd: Ran 20 iterations (0 outliers).
cmd: Rounded run time per iteration: 3.540e+00 +/- 1.3e-02 (0.4%)

A hits the monomorphic cache and uses it 1e6 times.
B causes the cache to be disabled, so makes 1e6 lookups, "old style".
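For readers who haven't met the term, here is a toy Python model of a monomorphic call-site cache that disables itself once it sees a second class. It only illustrates the general idea behind the A and B cases; it is not the patch being benchmarked, and every name in it is made up.

```python
class CallSite:
    """One method-call site with a single-entry (monomorphic) cache."""

    def __init__(self, method_name):
        self.method_name = method_name
        self.cached_class = None
        self.cached_method = None
        self.enabled = True
        self.hits = 0
        self.slow_lookups = 0

    def _slow_lookup(self, cls):
        # Stand-in for the normal, uncached method resolution.
        self.slow_lookups += 1
        return getattr(cls, self.method_name)

    def call(self, obj):
        cls = type(obj)
        if self.enabled and cls is self.cached_class:
            self.hits += 1
            return self.cached_method(obj)
        method = self._slow_lookup(cls)
        if self.enabled:
            if self.cached_class is None:
                # First class seen at this site: remember it.
                self.cached_class, self.cached_method = cls, method
            else:
                # A second class turned up: the site is not monomorphic,
                # so stop caching rather than thrash the single entry.
                self.enabled = False
                self.cached_class = self.cached_method = None
        return method(obj)


class Dog:
    def speak(self):
        return "woof"


class Cat:
    def speak(self):
        return "meow"


if __name__ == "__main__":
    site_a = CallSite("speak")  # like 'A': one class only
    for _ in range(5):
        site_a.call(Dog())
    site_b = CallSite("speak")  # like 'B': classes alternate
    for animal in (Dog(), Cat()) * 3:
        site_b.call(animal)
    print(site_a.hits, site_a.slow_lookups, site_b.hits, site_b.slow_lookups)
```

In this model the disabled site pays for the enabled check on every call and never gets a hit, which is the kind of overhead the B timing is meant to expose.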

Compare "B" to that without any cache code:

[nicholas@dromedary perl4]$ dumbbench -- ./perl ~/test/test_method_cache.pl A
cmd: Ran 21 iterations (0 outliers).
cmd: Rounded run time per iteration: 3.559e+00 +/- 1.3e-02 (0.4%)
[nicholas@dromedary perl4]$ dumbbench -- ./perl ~/test/test_method_cache.pl B
cmd: Ran 22 iterations (2 outliers).
cmd: Rounded run time per iteration: 3.489e+00 +/- 1.0e-02 (0.3%)

(remember "A" is slower here due to the problem reported as RT #114864)

and it now seems to be a sane enough order to make sense:

No cache code:  3.489e+00 +/- 1.0e-02 (0.3%)
Cache miss:     3.540e+00 +/- 1.3e-02 (0.4%)
Cache hit:      3.2797e+00 +/- 7.3e-03 (0.2%)

so a roughly 1.5% slowdown for the cases where the cache misses and is
disabled, and a roughly 6% speedup for the cases where it hits.
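The percentages are just each timing taken relative to the "no cache code" baseline. A short Python sketch of that arithmetic, with the three timings copied from the table above and a helper name of my own choosing:

```python
# Timings in seconds from the three-line summary above.
NO_CACHE = 3.489
CACHE_MISS = 3.540
CACHE_HIT = 3.2797


def percent_change(baseline, measured):
    """Positive means slower than the baseline, negative means faster."""
    return 100.0 * (measured - baseline) / baseline


if __name__ == "__main__":
    print(f"cache miss: {percent_change(NO_CACHE, CACHE_MISS):+.1f}%")
    print(f"cache hit:  {percent_change(NO_CACHE, CACHE_HIT):+.1f}%")
```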

Without a cache:

mktables:   2.2561e+01 +/- 2.0e-02 (0.1%)
installman: 3.0756e+01 +/- 3.8e-02 (0.1%)

With a cache:

mktables:   2.2124e+01 +/- 1.5e-02 (0.1%)
installman: 3.0567e+01 +/- 3.0e-02 (0.1%)

So that seems to be roughly a 2% and a 0.6% speedup.
But I've no idea if that code is "typical", or whether most code is 
sufficiently polymorphic that it would bust the cache more than it wins,
and so run more slowly.
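Since these speedups are small, it is worth carrying the measurement uncertainty through to the percentage. Here is a sketch in Python that does so, assuming the dumbbench "+/-" values are independent one-sigma errors (my assumption) and using the standard first-order rule for the error on a ratio; the function name is hypothetical.

```python
import math

# (seconds, quoted uncertainty) without and with the cache, from the post.
TIMINGS = {
    "mktables": ((22.561, 2.0e-02), (22.124, 1.5e-02)),
    "installman": ((30.756, 3.8e-02), (30.567, 3.0e-02)),
}


def speedup_percent(without, with_cache):
    """Return (speedup %, uncertainty %) for with_cache versus without.

    Assumption: the errors are independent, so the relative error of the
    ratio is the quadrature sum of the two relative errors.
    """
    (t0, e0), (t1, e1) = without, with_cache
    ratio = t1 / t0
    ratio_err = ratio * math.hypot(e0 / t0, e1 / t1)
    return 100.0 * (1.0 - ratio), 100.0 * ratio_err


if __name__ == "__main__":
    for name, (without, with_cache) in TIMINGS.items():
        value, err = speedup_percent(without, with_cache)
        print(f"{name:11s} {value:.2f}% +/- {err:.2f}%")
```

Both speedups come out several times larger than their propagated uncertainty, so they look real for these two scripts, whatever they say about "typical" code.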

So, "typical program" benchmarks would be useful.

Nicholas Clark
