perl.perl5.porters | Postings from November 2011

Re: Perl Performance Project?

Nicholas Clark
November 17, 2011 05:54
On Wed, Nov 16, 2011 at 11:56:53PM -0800, Michael G Schwern wrote:
> On 2011.11.16 11:16 PM, H.Merijn Brand wrote:
> >> * Make a realistic benchmark suite of both performance and memory [4]
> >> * Set up a smoker to run the benchmarks and report significant differences
> >>   and performance creep to p5p, like with tests

I think that at least part of this is what Steffen Schwigon is trying to
do with Benchmark::Perl::Formance.

> > I know we're slow on this, but the new setup of Test::Smoke will store
> > all core test run times in the database, so one can select runs for the
> > same machine and compare them over time.
> That's a good start.
> It's hard to tease useful information out of that as the test time is
> monolithic.  It's difficult to know what caused a performance change... which
> test got slower?  Did perl get slower, or did the test change?  How do you
> usefully compare the performance of test runs between different versions of
> Perl when the tests are constantly changing?
> A benchmark suite has to be:
> * Fine grained
> * Repeatable
> * Deterministic (i.e. each set of runs produces the same result)
> * Comparable between commits
> * Applicable to real world performance situations
> It should be able to automatically answer the questions:
> * What slowed down / sped up?
> * When did it slow down / speed up?
> It should provide the tools to answer:
> * Why did it slow down?
> Unfortunately the test suite can't tell us that.
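The "fine grained / what slowed down" requirements above could look something like this in practice. This is only a sketch: the snippet names, the baseline numbers, and the 1.5x "slower" threshold are all invented for illustration, not taken from any existing tool.

```perl
#!/usr/bin/perl
# Sketch: time named snippets individually and compare each against a
# stored baseline, so a regression points at a specific operation.
# The baselines and the 1.5x threshold are hypothetical.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical per-snippet baselines (seconds) from an earlier run.
my %baseline = ( sort_numeric => 0.02, regex_match => 0.01 );

my %bench = (
    sort_numeric => sub { my @s = sort { $a <=> $b } map { rand } 1 .. 10_000 },
    regex_match  => sub { my $n = () = ("abc " x 5_000) =~ /b/g },
);

for my $name (sort keys %bench) {
    my $t0 = [gettimeofday];
    $bench{$name}->();
    my $took  = tv_interval($t0);
    my $ratio = $took / $baseline{$name};
    printf "%-12s %.4fs (%.1fx baseline)%s\n",
        $name, $took, $ratio, $ratio > 1.5 ? "  <-- slower" : "";
}
```

Because each snippet is timed on its own, "which test got slower?" has a direct answer, which a monolithic test-suite wall-clock time cannot give.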

I agree totally. As I mailed the list about two months ago, in relation to
RT #98662:

Whilst it's potentially useful to know if the regression tests start
taking significantly different time, I'm still not convinced that they
make a good benchmark suite. They serve different purposes:

  regression tests
  * try to test obscure corner cases
  * focus on one thing in isolation
  * should run as quickly as possible, to avoid programmers getting bored
  * often end up being dominated by startup time

  benchmarks should
  * focus on common code
  * perform complex behaviour using multiple features
  * stress things with the scale of data needed to detect real problems
    [such as a change to O(bad) behaviour where previously it was O(acceptable)]
  * unless benchmarking startup time, strive to avoid it influencing the result

  and I don't think it's useful to try to make the *regression* tests pretend
  to be a benchmark.
  I've not looked at it yet, but I'm hoping that Steffen Schwigon's work
  on Benchmark::Perl::Formance is going to produce a more comprehensive
  benchmark than perlbench.

[I got his name wrong in the original. Corrected here. Gah.
I'd better add another drink to my budget for Erlangen.]

[Reading again, I think I might be conflicted about "fine grained". I can
see benefits of both fine-grained tests, and bigger tests that might catch
bad interactions between seemingly unrelated features.]
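The point above about stressing things at scale, to catch a change from O(acceptable) to O(bad), can be sketched with the core Benchmark module: run the same lookup two ways at more than one data size, so a complexity difference shows up as a rate gap that widens with size. The data sizes and snippet names here are arbitrary examples.

```perl
#!/usr/bin/perl
# Sketch: compare an O(n) linear scan against an O(1) hash lookup at
# two data sizes using the core Benchmark module. The sizes are
# arbitrary; the point is that the gap grows with scale.
use strict;
use warnings;
use Benchmark qw(cmpthese);

for my $size (1_000, 5_000) {
    my @data = map { "key$_" } 1 .. $size;
    my %index;
    @index{@data} = (1) x @data;
    my $needle = "key$size";    # worst case for the linear scan

    print "-- $size elements --\n";
    # Negative count: run each snippet for at least 0.2 CPU seconds.
    cmpthese(-0.2, {
        linear_scan => sub { my $hits = grep { $_ eq $needle } @data },
        hash_lookup => sub { my $hit  = exists $index{$needle} },
    });
}
```

At small sizes the two can look deceptively close, which is exactly why a benchmark run only on toy data can miss a real algorithmic regression.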

I'd love to see more work on this. I fear I can't offer much help other than
moral support and buying drinks at conferences as a "thank you".

Nicholas Clark