Re: Robin Hood Hashing for the perl core

From: Nicholas Clark
Date: September 12, 2021 08:43
Subject: Re: Robin Hood Hashing for the perl core
Message ID: YT29pD5Vfshpj2JJ@etla.org
On Fri, Sep 10, 2021 at 04:31:41PM +0200, Tomasz Konojacki wrote:
> On Fri, 10 Sep 2021 12:21:06 +0000
> Nicholas Clark <nick@ccl4.org> wrote:
> 
> > Aha, --branch-sim=yes

> Valgrind's branch prediction numbers are almost completely made up and
> shouldn't be trusted. According to their documentation, valgrind's

The cynic in me wants to say:
    So that fits quite nicely with benchmarks generally, doesn't it?

> simulated branch predictor is based on early Pentium 4. Modern CPUs work
> completely differently!
> 
> https://valgrind.org/docs/manual/cg-manual.html#branch-sim

Thanks for the reminder about how ancient this is, and hence the question of
whether it's really even useful.
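
For reference, the invocation I mean is something like the one below, where
"bench.pl" is just a stand-in name for whatever workload is under test:

    # cachegrind with the (Pentium-4-era) branch predictor simulation
    valgrind --tool=cachegrind --branch-sim=yes ./perl -Ilib bench.pl
    # then annotate the output file it writes (cachegrind.out.<pid>)
    cg_annotate cachegrind.out.<pid>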

> I'm also sceptical about its cache simulation. The only metrics I trust
> are instruction and branch counts.
> 
> perf and AMD uProf are better tools for measuring those things.
> Unfortunately, they aren't nearly as pleasant to use as valgrind.

I've not met uProf, but if it's like perf, one needs root access to run it?

Right now I have user-level access to powerful hardware owned by (nice)
other people, but root access only to a couple of small things at home.
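
In case anyone wants to try it, the sort of counting Tomasz describes looks
something like the command below; whether it runs unprivileged depends on the
kernel.perf_event_paranoid setting on the box ("bench.pl" again being a
stand-in workload):

    # count instructions and branches with perf, no simulation involved
    perf stat -e instructions,branches,branch-misses ./perl -Ilib bench.pl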

> BTW, while branch prediction numbers are very useful for figuring out
> what the CPU is doing, it's important to remember that, in the end,
> wall-clock time is the only metric that actually matters.

Yes, totally. In this case the "problem" is that tuning things (the code
or the benchmark) pushes the work between one end member of "more instruction
effort, but stress the data cache less" and the other of "less instruction
effort, but stress the data cache more".

Well, it's not even just those two - I guess the instruction side can also be
broken down into "more instruction dispatch but less branching" vs "less
dispatch but more branching", and the data side also looks like "trading more
reads for fewer writes" and even "more L1 work for fewer L3 misses".


So yes, wallclock is the only metric. And that needs benchmarks of the
appropriate shape and scale for the real workloads on the CPUs.

Dammit, which really means "benchmark something close to your production
code".
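
Concretely, that ends up as unglamorous as repeatedly timing the real thing
and watching the spread, e.g.:

    # "real_workload.pl" is a stand-in for a production-shaped script;
    # run it a few times and look at the wall-clock ("real") figures
    time ./perl -Ilib real_workload.pl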

Nicholas Clark
