
Re: test randomization (Re: slowness of ext/XS-APItest/t/handy.t,utf8.t)

January 27, 2017 09:56
On 25 January 2017 at 11:33, Dave Mitchell <> wrote:
> On Tue, Jan 24, 2017 at 11:10:50AM -0600, Craig A. Berry wrote:
>> Setting expectations would help, but it's still a major sea change
>> that I don't think has been adequately discussed (sorry if it was and
>> I wasn't paying attention).  Things like BBC reports, CPAN smokes, and
>> even basic bisecting depend on everything being the same except the
>> one thing you want to vary.  Randomizing input data in tests takes
>> away people's choice about what gets varied.
>> I've always thought the purpose of the test suite was to validate that
>> things known to be good are still good with a different
>> platform/toolchain/configuration/version, etc.  There is nothing wrong
>> with exhaustively hunting down things whose goodness is not known, but
>> the core test suite, which is included with the release tarball to
>> certify the release, seems an odd place for that.
> I think random subset selection in the test suite should be a method of
> last resort.

I don't think there is any controversy over this point; if it is
possible and practical, we should do complete testing. The question is
what our policy should be when that is not true, and, to a certain
extent, how we define "practical". IMO something that would make the
test suite run too long counts as "impractical", though I can imagine
reasonable differences of opinion about how long is too long. On the
other hand, it is not always obvious that complete testing is not
possible.

> I believe we do it in one or two places already with Unicode
> stuff, since it would take far too long to test every permutation in that
> case.

I thought in the Unicode case the problem isn't permutation but rather
simply the vast scale of Unicode (over a million code points).

I bring this up because I think there is a big difference between
testing permutation issues and testing partitionable sets.

> When we do this, I think ideally:
> 1) the test space should be logically divided into N subsets, and one of
> those subsets is randomly chosen for testing - i.e. there is only a single
> random number generated at the start of the test suite, and that selects
> which set of tests to run - so no doing a 1000-times loop and for each
> iteration choosing a random character and testing it.

This sounds like partitioning not permuting.
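Dave's scheme, as I read it, could be sketched roughly like this (a
sketch only; the partition count, the env var name PERL_TEST_SEED, and
the stand-in test space are all hypothetical):

```perl
use strict;
use warnings;

# One random number is drawn up front; it selects which of N logical
# partitions of the test space gets exercised this run.
my $N    = 10;                                 # number of partitions (assumed)
my $seed = $ENV{PERL_TEST_SEED} // time();     # allow forcing a partition
srand($seed);
my $partition = int(rand($N));                 # the single random choice
print "# seed=$seed partition=$partition\n";   # always report it

# Only the chosen slice of the full test space runs.
my @space  = (0 .. 999);                       # stand-in for the real test space
my @subset = grep { $_ % $N == $partition } @space;
run_tests_on(@subset);

sub run_tests_on { print "# testing ", scalar(@_), " items\n" }
```

With the modulus split above, every item belongs to exactly one
partition, so over enough runs (or smokers) the whole space gets
covered.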

Permuting means we have to test every permutation of K items, and the
number of permutations (K!) grows far faster than quadratically.
Permuting even a moderately large set is extremely expensive. IMO in
cases where we have to test permutations we should be able to use
rand() and srand() to manage which permutations we handle.
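To illustrate the scale, and how rand()/srand() can manage it (a
sketch, not anything from the tree; PERL_TEST_SEED is a hypothetical
env var name): a seeded Fisher-Yates shuffle tests one reproducible
ordering out of K! per run.

```perl
use strict;
use warnings;

# K! orderings: even K = 10 gives 3,628,800, so exhaustively testing
# permutations is hopeless for any non-trivial K.
my $fact = 1;
$fact *= $_ for 1 .. 10;
print "# 10! = $fact orderings\n";

# A seeded Fisher-Yates shuffle picks one reproducible permutation.
my $seed = $ENV{PERL_TEST_SEED} // time();
srand($seed);
print "# seed=$seed\n";                 # report so the run can be replayed

my @items = ('a' .. 'e');
for (my $i = $#items; $i > 0; $i--) {
    my $j = int(rand($i + 1));
    @items[$i, $j] = @items[$j, $i];    # swap
}
print "# order under test: @items\n";
```

Re-running with the same seed yields the same ordering, which is what
makes a failure reproducible.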

> 2) if at least one test fails, then the random  number chosen should
> be displayed on stderr (e.g. using diag()) so that it can be seen in smoke
> reports etc. It should also be reported to stdout always;

I think this should be rephrased to be "must" not "should". IOW, if
you depend on randomness to test a feature the test must provide a
mechanism and the data required to recreate the test.

> 3) there should be a way to run a test script with a specified random number
> (e.g. via an environment variable);

Again this should be "must".
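Taken together, points 2 and 3 as "must" might look something like
this in a test script (a sketch; PERL_TEST_SEED is a hypothetical env
var name, not an existing knob):

```perl
use strict;
use warnings;
use Test::More;

# The seed is taken from the environment when supplied, generated
# otherwise, and always reported, so any failing run can be replayed
# exactly by exporting the same seed.
my $seed = defined $ENV{PERL_TEST_SEED} ? $ENV{PERL_TEST_SEED} : time() ^ $$;
srand($seed);
note("random seed is $seed");            # always goes to stdout

my $got = int(rand(100));
ok($got >= 0 && $got < 100, "randomly chosen value is in range")
    or diag("failing seed was $seed");   # surfaces on stderr in smoke reports

done_testing();
```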

> 4) N should be small enough (e.g. 10,20,30...) that all permutations are
> likely to have been tried by at least one smoker after a small number of
> days, so we don't get a sudden failure 6 months down the line.

I would inject a "where possible" in here.

> 5) there should be a method (e.g. via an env var) to make all test scripts
> in the test suite run all tests rather than a random subset.

This is a non-starter; it is neither possible nor practical. Anything
that requires permutation could become very expensive very fast, and
anything that might depend on hash key order could not be written to
test every possible order.

> (Now Karl's going to point out how Unicode makes that nice simple scheme
> impractical... ;-)

Since Karl didn't step up, I will.

Consider the fencepost error in the undefined-hash-key discovery logic
that you added and that was discovered by hash randomization.

In that case there was no way you could write good tests for the
feature without using randomization. Any fixed keyset would have
resulted in the keys ending up in a specific bucket. Obviously the
problem can't be solved by partitioning; it needs to be solved with
randomization.
On the other hand, if you had generated a largish volume of random
keys, you would have had very high confidence that your tests actually
exercised a key in every slot, and even if there was a non-zero chance
that they did not, you would have very high confidence that over time
every possibility would be tried. (I guess this assumes a good RNG,
but let's assume drand48() is "good enough".)
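The intuition can be sketched with a toy bucket model. To be clear,
the bucket count, key count, and toy hash function below are all made
up for illustration; this is not perl's real hash.

```perl
use strict;
use warnings;

# Toy model of "a largish volume of random keys hits every slot": map
# each random key to one of $B buckets and count how many distinct
# buckets were exercised.
my $B = 8;         # bucket count (assumed small, for the sketch)
my $N = 200;       # "largish" number of random keys
srand(12345);      # fixed seed so the sketch itself is reproducible

my %hit;
for (1 .. $N) {
    my $key = join '', map { chr(97 + int(rand(26))) } 1 .. 8;
    my $bucket = 0;
    $bucket = ($bucket * 33 + ord($_)) % $B for split //, $key;  # toy hash
    $hit{$bucket}++;
}
printf "# %d of %d buckets exercised\n", scalar(keys %hit), $B;
```

With 200 roughly uniform keys over 8 buckets, the chance of missing
any bucket is on the order of 8 * (7/8)**200, i.e. vanishingly small,
which is the "very high confidence" above in miniature.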

I think the rules should be your 1 through 4, with 2 and 3 becoming
"must", and 1 and 4 emphasizing that "should" means "where possible
and practical", not "must" in disguise.

I believe that what we saw as fallout from the hash-randomization case
demonstrates that a) subtle bugs can and will be overlooked without
randomization and b) the sky didn't fall when we started seeing
intermittent failures in various modules and code. On the contrary, we
ended up fixing several core bugs, and another handful of subtle bugs
in modules. (This ignores the places people were testing hash dumps,
and thus depending on undefined behavior.)


perl -Mre=debug -e "/just|another|perl|hacker/"
