Front page | perl.perl5.porters |
Postings from September 2013
Re: [perl #119635] deprecate and remove qsort?
Thread Previous
|
Thread Next
From:
John P. Linderman
Date:
September 12, 2013 15:46
Subject:
Re: [perl #119635] deprecate and remove qsort?
Message ID:
CAC0cEp9bFPYPiuu6oOQKWn8caJy3ZGW=up217Gr=+=KQDoGQBg@mail.gmail.com
I'm actually in complete agreement. I just wanted to acknowledge that
perl's mergesort was not always better than perl's quicksort on a fairly
extensive set of inputs. But if you know you are dealing with an input of
the class I mentioned, you'd be better off using a hash than with either
sort. In fact, I am so much in agreement about choosing algorithms being
less meaningful than choosing that I gave the algorithm-choosing subpragmas
funny names, starting with an underscore, and warned that they might not
persist beyond 5.8. They had better longevity than I would have
predicted. Obviously, if qsort is deprecated, so, too, must be the funny
subpragmas.
A line in Peter's SODA paper has always intrigued me
Versus the fastest readily available qsort, [a citation here to Jon
Bentley's and Doug Mcilroy's "Engineering a Sort Function" paper, which had
been submitted, but not yet accepted, for publication] the hybrid [Peter's
mergesort] is 25% slower on random integers; for random strings, time
differences are negligible.
Qsort is constrained to elements of fixed size, and the elements themselves
are moved about. As partitions get smaller and smaller, the elements are
brought closer and closer together, "finding" faster and faster cache
memory, if it exists. Mergesort sorts pointers to elements; the elements
themselves are never moved. Although the pointers will be packed together,
enjoying some of the benefits of whatever cache memory may exist, the
elements remain scattered, probably leading to a cache miss when
referenced. And the qsort comparison routine for comparing integers is
very short and efficient, so saving compares may not be so important. I
figured maybe what Peter reported was a consequence of cache memory and
simple comparisons. But if that were true for integers, why wasn't it also
true for strings? String comparison is more complicated than integer
comparison. Saving comparisons becomes more important. Or maybe because
integers are short, and easy to move about, and strings are (probably, it's
not stated) longer, the extra time spent moving data counterbalances the
advantages of cache memory. In any event, the observed behavior had
plausible explanations.
But now I wonder if there may not have been another contributing cause.
Qsort, unconstrained by stability, is free to ignore the position of equal
integers. Stable mergesort is not. As I mentioned elsewhere, this can
handicap mergesort when the input has many repeated items. If the "random
integers" were selected from a pool significantly smaller than the total
number of elements (we don't know), qsort would have another advantage.
But if (lots of "ifs" piling up here) that factors into the performance
difference Peter mentioned, and the performance difference alluded to in
pp_sort.c
The original merge sort, in use since 5.7, was as fast as, or faster than,
qsort on many platforms, but slower than qsort, conspicuously so, on others.
then the ability to ignore stability will take away that advantage of
quicksort, another reason why quicksort might be deprecated without much
pain. Since I got lots of +1s for taking a look at ignoring stability
(even though most of them came from one person :-), I'll do that. It looks
like the changes to the merge phase will be quite simple, but changes to
the setup of initial runs will be anything but.
On Thu, Sep 12, 2013 at 8:53 AM, Nicholas Clark <nick@ccl4.org> wrote:
> On Fri, Sep 06, 2013 at 02:59:44PM -0400, John P. Linderman wrote:
> > > Part of the reason, I think, was that v5.6 and earlier's sort was a
> > > quicksort, and hence not a stable sort.
> >
> > Stability was definitely a feature, but my best argument to Jarkko was
> that
> > far too many common sequences (organ pipe, sawtooth, etc.) went
> quadratic,
> > and Peter Mcilroy's merge sort did a particularly nice job on sequences
> > with pre-existing order (linear time on sorted input).
> >
> > The only sequences (that I could think of, anyway) where it was clear
> that
> > qsort would outperform merge sort were those containing many repetitions
> of
> > a small number (k) of keys. Qsort's "fat pivot" guaranteed that only k
> > passes would be required, where merge sort might need log N.
>
> The bit that keeps bugging me is "how many people know that their data is
> usually going to be in a particular form that qsort favours"?
>
> In that, the amount of use cases where it's correct to specify qsort
> instead
> of mergesort are small, the amount of effort needed to determine this is
> high, and the gain itself is small?
>
> In which case, seems that choosing a specific sort algorithm is not the
> first place to look at optimising.
>
> Nicholas Clark
>
Thread Previous
|
Thread Next