From: John P. Linderman
Date: September 9, 2013 14:55
Subject: Re: [perl #119635] deprecate and remove qsort?
Message ID: CAC0cEp_pZNtQ40pR6Js8opkMJvPq5Oo06BOFa1cx4_eh0iM7BA@mail.gmail.com
On Fri, 06 Sep 2013 18:54:12, Steffen Mueller <smueller@cpan.org> wrote:

>> +1 to see it go. I'm trying not to grandly suggest
>> that somebody (not me, likely not Nicholas) should try
>> implementing timsort for us now instead of the current
>> merge-alike. :)

I hope it's clear who copied whom.  If not, read on.

Anything indented following a "> " is text extracted from

http://svn.python.org/projects/python/trunk/Objects/listsort.txt

Tim Peters' description of "timsort".  Let's clear the air right away.
I'm not a fan of Tim Peters, mostly because of how he acknowledges that
one of the important ideas used in "timsort" originated in a paper by
Peter McIlroy:

> I first learned about the galloping strategy in a related context; see:
>
>     "Adaptive Set Intersections, Unions, and Differences" (2000)
>     Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro
>
> and its followup(s).  An earlier paper called the same strategy
> "exponential search":
>
>    "Optimistic Sorting and Information Theoretic Complexity"
>    Peter McIlroy
>    SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
>    467-474, Austin, Texas, 25-27 January 1993.
>
> and it probably dates back to an earlier paper by Bentley and Yao.  The
> McIlroy paper in particular has good analysis of a mergesort that's
> probably strongly related to this one in its galloping strategy.

In http://svn.python.org/projects/python/trunk/Objects/listobject.c,
the closest thing I can find to the official source for "timsort",
Peter's name is never mentioned.  Nor does it appear in Wikipedia's
entry for "timsort".  That may or may not be Tim's fault.

>   It turns out that Perl is moving to a stable mergesort

Seems that Tim was aware of the sort code Jarkko and I were kicking about.
I never attempted to hide the fact that all the good ideas came from
Peter's code.  Indeed, we came within a gnat's eyelash of rejecting
my implementation of Peter's algorithm because of licensing issues.

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-09/msg00027.html
http://www.xray.mpe.mpg.de/cgi-bin/w3glimpse2html/perl5-porters/2000-09/msg00207.html?56#mfs

Anyway, we have, at the very start

> Intro
> -----
> This describes an adaptive, stable, natural mergesort, modestly called
> timsort (hey, I earned it <wink>).  It has supernatural performance on many
> kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
> as few as N-1)

<wink> doesn't cut the mustard.  He borrowed some of the best parts
of timsort from Peter's paper, and, to my taste, doesn't adequately
acknowledge that.  OK, air cleared.  Let's move on, with the
understanding that I am not totally objective.

Since "Peter" can too easily be confused with "Peters", I'll refer to
the sorts as Tim's and Peter's, with the understanding that "Peter"
refers to Peter McIlroy.

I'm unaware of any algorithms whose performance is "supernatural".
The best algorithms I know of are the product of careful analysis.
Peter did that in the SODA paper, which, I admit, I mis-cited
in the perl sort code, but is accurately cited above.  He further
enhanced the algorithms, and shared them with me and Doug McIlroy
at Bell Labs.  I contributed only modestly to Peter's ideas (with
suggestions like, "stability is really worthwhile, and here's how
we can achieve it efficiently".)  One of the improvements Peter made,
which is in the perl source code, but not the SODA paper, was how to
detect "natural runs" without wasting a lot of comparisons.  That
deserves a paragraph or so of its own, because it differs from the
way "timsort" (as I understand it) works.  I get my understanding
of how "timsort" works from

http://code.google.com/p/timsort/source/browse/trunk/timSort.c?spec=svn17&r=17

(which I believe has a few bugs, but is at least comprehensible
to someone fluent in C but not in Python.)  The source I cited above
is too full of Python-stuff to be worth my effort to grok.

So let's start with some simple combinatorics, and let's assume that
no element to be sorted is identical with something else to be sorted,
since that simplifies the analysis without doing too much damage to
the truth.  With a sequence of two (distinct) elements,
we have two possibilities.  (Forgive me if this is dead obvious; I'd like
to make it clear to people to whom it is not dead obvious.)

1,2
2,1

in which case we either have a strictly increasing sequence or a
strictly decreasing sequence (where we have chosen two integers
whose relative values stand in for the way in which two distinct
objects compare to each other).  So any two distinct values are
either in strictly increasing or strictly decreasing order.

Add another (distinct) value, 3, to our two original values and we have

1,2,3    started increasing, continued to be
1,3,2    started increasing, then reversed
2,1,3    started decreasing, then reversed
2,3,1    started increasing, then reversed
3,1,2    started decreasing, then reversed
3,2,1    started decreasing, continued to be

In random (distinct) data, there's only 1 chance in 3 that looking for
an extension of the prevailing run will pay off.  Not so good.  But it
gets worse.  Of the sequences that have initial runs of 3

1,2,3,4  started increasing, continued to be
1,2,4,3  started increasing, then reversed
1,3,4,2  started increasing, then reversed
2,3,4,1  started increasing, then reversed
3,2,1,4  started decreasing, then reversed
4,2,1,3  started decreasing, then reversed
4,3,1,2  started decreasing, then reversed
4,3,2,1  started decreasing, continued to be

Now there are only 2 chances in 8 that the comparison will extend the
run.  Or, of the N! ways N distinct elements can appear in a sequence,
only two will yield a run of N.  Unless you have some reason to
believe the data are "orderly", looking for runs longer than 2 is
a sucker bet.  Tim knows that

>  ... timsort knows darned well it's doing compares that
>  won't pay on random data ...
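
Here is a small brute-force check of those counts (illustrative only; the
names are invented and nothing below comes from pp_sort.c or timsort).  It
enumerates all permutations of N distinct values, keeps those whose first
N-1 elements already form a monotone run, and counts how many of those runs
the final element extends.  With N = 3 it reports 2 of 6; with N = 4, 2 of 8.

    #include <stdio.h>

    #define N 4

    static int  a[N];
    static long runs, extended;

    /* do the first len elements form a strictly increasing or decreasing run? */
    static int monotone(int len)
    {
        int inc = 1, dec = 1, i;
        for (i = 1; i < len; i++) {
            if (a[i] < a[i - 1]) inc = 0;
            if (a[i] > a[i - 1]) dec = 0;
        }
        return inc || dec;
    }

    static void permute(int pos, unsigned used)
    {
        int v;
        if (pos == N) {
            if (monotone(N - 1)) {          /* initial run of length N-1 ...   */
                runs++;
                if (monotone(N))            /* ... extended by the last element */
                    extended++;
            }
            return;
        }
        for (v = 0; v < N; v++)
            if (!(used & (1u << v))) {
                a[pos] = v;
                permute(pos + 1, used | (1u << v));
            }
    }

    int main(void)
    {
        permute(0, 0);
        printf("%ld of %ld initial runs of length %d were extended\n",
               extended, runs, N - 1);
        return 0;
    }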

But, of course, if you want to detect runs longer than 2, you must
risk some comparisons that aren't going to pan out.  That's what
the "Optimistic" part of Peter's paper is about.  But the devil is
in the details.  Timsort appears to bust large sequences into shorter
ones of length between 32 and 64, ostensibly to better balance the
merge phase.  Then it always tries to extend the initial run of 2,
in ascending or descending order, to include as many additional
elements as possible.  If it extends beyond the fixed-length sequence,
it keeps on going, possibly extending to the very end of the sequence
to be sorted.  If the run ends before the fixed-length segment is
exhausted, timsort reverses the run if it was strictly decreasing,
then uses a binary insertion sort to put the entire fixed-length
sequence in sorted order.  Two problems.  As we observed (and Tim
knows), runs as long as 32 to 64 are, a priori, spectacularly
improbable.  Unless there's a lot of structure to the input sequence,
the setup stage is going to end up "wasting" a comparison that
doesn't extend a run, then doing an insertion sort on what's
left over.  The binary insertion sort holds the number of
comparisons to an acceptable level, but the number of moves needed
to insert additional elements into the sorted run can get ugly.
Second, and more important, runs are detected only if they
start at the beginning of a fixed-length segment.  A nice long
run might start somewhere else in the segment, but it will be
overlooked.  In fact, if you want to lay a world of hurt on timsort,
figure out an input length that will lead to fixed-length
segments of size, say, 64, take a strictly decreasing sequence
of length 64, move the last element to the front, and fill up
the input with repetitions of this sequence.  Each fixed length
sequence will start with an increasing sequence of length 2,
but the next (wasted) comparison will encounter a reversal,
so the decreasing sequence of length 62 will be insertion sorted.
Insertion sort on decreasing elements is worst case for moves.
Each element after the first one (which was, by construction,
the minimal element) will have to be moved each time a new element
is added.  OK, 1800-some moves per segment isn't going to take
terribly long, but if the idea behind fixed length segments was
to reduce moves in the merge phase, you have started out in a large
hole.
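
For concreteness, here is a sketch of that input, under the assumption that
timsort has settled on fixed-length segments of 64 (BLOCK and fill_hurt are
invented names, not anybody's sort code):

    /* One strictly decreasing sequence of length 64, with its last (smallest)
     * element moved to the front, repeated to fill the array. */
    #include <stdio.h>

    #define BLOCK 64

    static void fill_hurt(int *a, int n)    /* n should be a multiple of BLOCK */
    {
        int i;
        for (i = 0; i < n; i++) {
            int pos = i % BLOCK;
            /* position 0 of each block holds the smallest value (1);
             * positions 1..63 hold 64, 63, ..., 2 -- strictly decreasing */
            a[i] = (pos == 0) ? 1 : BLOCK + 1 - pos;
        }
    }

    int main(void)
    {
        int a[4 * BLOCK], i;

        fill_hurt(a, 4 * BLOCK);
        for (i = 0; i < 2 * BLOCK; i++)     /* show the first two blocks */
            printf("%d%c", a[i], ((i + 1) % BLOCK) ? ' ' : '\n');
        return 0;
    }

Each block opens with the ascending pair 1, 64; after that pair the values
keep falling all the way through the 1 that starts the next block, which is
exactly the boundary-crossing run described below.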

Peter's sort addressed both of these problems.  (The run setup
phase was not part of his Soda paper, but it has been in pp_sort.c
from the beginning, because he came up with it while he was working
at Bell Labs, and sharing the source with me.)  Rather than trying
to extend every run of length 2, we look at the sense of PTHRESH
(defaulting to 8) pairs in a row.  If they are all the same, either
increasing or strictly decreasing, we have fairly compelling evidence
of a run.  Only then do we "risk" a comparison to see if adjacent
runs can be "bridged" into a longer one.  If all PTHRESH pairs turn
out to form a run, we have a run of 16, a fine sign of orderly data.
An attempt is made to extend the run, one element at a time, until
the sense of a comparison is no longer compatible with extending the
run (or until we have reached the end of the input).  If we discover
a reversal, we keep whatever runs we got by bridging pairs, and push
ahead, inspecting the sense of adjacent pairs.  PTHRESH is a measure
of "risk aversion" about "wasting" a comparison.  Higher values
demand more evidence of a possible long run before taking that risk,
but also overlook any runs shorter than 2*PTHRESH.  The merge phase,
as described in the SODA paper (and as implemented by "galloping"
in timsort), benefits from having adjacent runs in nearly sorted order
(although it does not benefit as much as if the runs had previously
been combined into a single run).  So a conservative PTHRESH seems
appropriate.  I like to think about PTHRESH this way:  in random data,
a pair is equally likely to be in increasing or strictly decreasing
order.  PTHRESH pairs in a row is like flipping heads (or tails)
PTHRESH times in a row.  That can happen with a fair coin, and will
happen about once every 2**(PTHRESH-1) times, but even if we lose,
it cannot happen often enough to push us far from the theoretical
limit on comparisons used.  Note, too, that runs of length 2*PTHRESH
or more need not start on any particular boundary.  The same sequence
that gives timsort heartburn presents no problems to Peter's sort.
We get a run of length 2 at the start, then discover the run of
length 63 that follows it (lapping into timsort's next fixed length
segment), then a bunch of runs of length 64 (each lapping into the
next timsort segment), ending up with a run of length 63.  We've gotten
just one more sorted segment to deal with (the first one), and we got
them with many fewer comparisons and moves.  Of course, one shouldn't
use a single pathological input sequence to claim universal superiority,
but Peter's ability to find runs anywhere looks like a winner, unless
balanced merges can be shown to matter more.  I await non-supernatural
evidence.
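
Here is a deliberately simplified sketch of the pair-sampling idea (invented
names apart from PTHRESH, plain int comparisons, and none of pp_sort.c's care
about stability or about exactly which comparisons get risked); it only
reports the runs it would have found:

    #include <stdio.h>

    #define PTHRESH 8

    /* sense of the pair (x, y): 1 for ascending (x <= y), -1 for strictly
     * descending */
    static int sense_of(int x, int y) { return x <= y ? 1 : -1; }

    static void find_runs(const int *a, int n)
    {
        int i = 0;
        while (i + 1 < n) {
            int s = sense_of(a[i], a[i + 1]);
            int pairs = 1, j = i + 2;

            /* cheap evidence: the senses of disjoint adjacent pairs only */
            while (pairs < PTHRESH && j + 1 < n && sense_of(a[j], a[j + 1]) == s) {
                pairs++;
                j += 2;
            }

            if (pairs < PTHRESH) {          /* not enough evidence: keep runs of 2 */
                int k;
                for (k = i; k + 1 < n && k < j; k += 2)
                    printf("run of length 2 at %d\n", k);
                i = j;
            } else {                        /* risk extending element by element */
                int end = i + 1;
                while (end + 1 < n && sense_of(a[end], a[end + 1]) == s)
                    end++;
                printf("run of length %d at %d (%s)\n", end - i + 1, i,
                       s > 0 ? "ascending" : "descending");
                i = end + 1;
            }
        }
        if (i < n)
            printf("single leftover element at %d\n", i);
    }

    int main(void)
    {
        int a[40], i;
        for (i = 0; i < 40; i++)            /* one long ascending run */
            a[i] = i;
        find_runs(a, 40);                   /* reports "run of length 40 at 0" */
        return 0;
    }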

I won't go into much detail about the merge phase, which is really the
essence of Peter's SODA paper.  When merging two runs, the basic idea
is to determine which run has the greater leading element.  We then find *all*
the elements in the other run that it is also greater than, and add them
to the merged run.  Now we have found a greater element in the opposite
run, and, as before, we use it as a "wedge" in the other array to find
*all* the elements in the other array for which it is greater.  And so on.
The obvious way to "position the wedge" is to compare it to each element
in turn, and stop when we reach a greater one.  If the input is random,
long runs are improbable, much as long runs of adjacent pairs are
improbable in the original setup.  So if we haven't found a greater
element after several comparisons, we have evidence of order, and we
can justify skipping an element.  If we find the following element
is still not greater, we know the element we skipped couldn't have been
greater, either.  So we skip 2 elements, then 4, then 8, and so on,
until we either find a greater element, or we reach the end of the list.
We refer to this as "ramping up" in pp_sort.c.  If we encounter a greater
element before reaching the end of the list, we need to look back for the
first such element.  We go back half the size of the last step, then
half that size, until we encounter an element that is not greater.
We call this "ramping down".  Then we turn around again and look in
the other direction, reducing the step size by half, until we finally
home in on the least element where we compared greater.  Peter refers
to this in the SODA paper as "insertion sort with differential search".
Timsort refers to it as "galloping".  As with taking a chance on extending
a run in the initial setup, this sometimes "wastes" a comparison.
For example, if we skip an element, and find the next element is greater
than the wedge, we have to check the element we skipped.  If that element
is also greater, we would have found it immediately if
we had just considered every element.  So we don't want to start
ramping up too soon.  In Peter's sort, this is parameter RTHRESH,
defaulting to 6.  We check up to RTHRESH elements in a row before
ramping up.  That should avoid wasted compares on random data, but
go sub-linear on ordered data.  This is a fixed threshold in Peter's
sort, an adjustable value in timsort.  As far as I can tell, this is
a religious issue.  Should previous behavior be predictive of future
behavior?  If you believe yes, you should adjust the threshold up or
down.  If you think no, just leave it fixed.  Tim seems to agree it
doesn't matter much.  Pay particular attention to the last sentence.

> We can't guess in advance when it's going to win, though,
> so we do one pair at a time until the evidence seems strong
> that galloping may pay.  MIN_GALLOP is 7, and that's pretty
> strong evidence.  However, if the data is random, it simply
> will trigger galloping mode purely by luck every now and
> again, and it's quite likely to hit one of the losing cases
> next.  On the other hand, in cases like ~sort, galloping
> always pays, and MIN_GALLOP is larger than it "should
> be" then.  So the MergeState struct keeps a min_gallop
> variable that merge_lo and merge_hi adjust:  the longer
> we stay in galloping mode, the smaller min_gallop gets,
> making it easier to transition back to galloping mode (if
> we ever leave it in the current merge, and at the start
> of the next merge).  But whenever the gallop loop doesn't
> pay, min_gallop is increased by one, making it harder to
> transition back to galloping mode (and again both within a
> merge and across merges).  For random data, this all but
> eliminates the gallop penalty:  min_gallop grows large
> enough that we almost never get into galloping mode.
> And for cases like ~sort, min_gallop can fall to as low
> as 1.  This seems to work well, but in all it's a minor
> improvement over using a fixed MIN_GALLOP value.
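
To make the ramping concrete, here is the search in isolation, as a sketch of
the idea rather than the pp_sort.c code (plain ints and a strict "<", so it
ignores the stability wrinkle discussed next; RTHRESH is the only name
borrowed from the real code).  Given an ascending run b[0..n-1] and a wedge
value w, it returns how many leading elements of b the wedge is greater than:

    #include <stdio.h>
    #include <stddef.h>

    #define RTHRESH 6

    static size_t wedge_pos(int w, const int *b, size_t n)
    {
        size_t lo = 0, prev = 0, off = 1, l, h, m;

        /* check up to RTHRESH elements in a row, one comparison at a time */
        while (lo < n && lo < RTHRESH && b[lo] < w)
            lo++;
        if (lo < RTHRESH || lo == n)
            return lo;                  /* found an element >= w, or ran out */

        /* ramp up: probe with doubling gaps until we see an element >= w
         * or pass the end of the run */
        while (lo + off - 1 < n && b[lo + off - 1] < w) {
            prev = off;
            off *= 2;
        }

        /* ramp down: binary search between the last probe that was still
         * less than w and the first probe that was not (or the end) */
        l = lo + prev;
        h = (lo + off - 1 < n) ? lo + off - 1 : n;
        while (l < h) {
            m = l + (h - l) / 2;
            if (b[m] < w)
                l = m + 1;
            else
                h = m;
        }
        return l;
    }

    int main(void)
    {
        int b[32], i;

        for (i = 0; i < 32; i++)
            b[i] = 2 * i;                       /* 0, 2, 4, ..., 62 */
        printf("%zu\n", wedge_pos(25, b, 32));  /* prints 13: b[0..12] < 25 */
        return 0;
    }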

When using element w of one run as a "wedge" into the other run,
we cannot just use cmp(w,b) <= 0 as a means of determining whether
w should be placed ahead of element b from the other run.  That's ok if
w comes from the earlier run, and b from the later.  But if w
comes from the later run, then, to preserve stability over
equal elements, we must treat equal elements as though b,
the element from the earlier run, is strictly less than w.
In Peter's sort, we do this with the simple expedient of
setting integer "sense" to 0 if w comes from the earlier run,
and to -1 if it comes from the later run, then doing all
comparisons as

    cmp(w,b) <= sense

This technique is one of a very few contributions I made as Peter
and I worked on the source.  I believe that timsort could be greatly
simplified if it employed this technique, rather than having separate
(but closely related) routines for galloping left and galloping right
or merging left and merging right.  Maybe somebody should suggest
it to Tim.  He could call it timCmp and claim he earned it <wink>.
OK, no more snarky comments.
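
Spelled out, the trick amounts to this (illustrative only: strcmp stands in
for whatever comparison the sort was handed, and wedge_before is an invented
name, not anything from pp_sort.c):

    #include <stdio.h>
    #include <string.h>

    /* stand-in for the user-supplied comparison */
    static int cmp(const char *w, const char *b) { return strcmp(w, b); }

    /* Does wedge w belong ahead of element b?  sense is 0 when w came from
     * the earlier run (ties go to w), -1 when it came from the later run
     * (ties go to b, the earlier-run element), so equal elements always
     * keep their original order. */
    static int wedge_before(const char *w, const char *b, int sense)
    {
        return cmp(w, b) <= sense;
    }

    int main(void)
    {
        /* equal keys: the wedge wins only when it came from the earlier run */
        printf("%d %d\n", wedge_before("apple", "apple", 0),
                          wedge_before("apple", "apple", -1));   /* prints 1 0 */
        return 0;
    }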

My other contribution, done without Peter watching over my shoulder
to ensure I wasn't doing something stupid, was to merge the initial
set of runs using a divide-and-conquer approach.  The code is
extensively commented in pp_sort.c, so I won't repeat it here.
It accomplishes two goals.  It balances runs early, while they
are still short.  And it revisits newly created runs as soon as
balance allows, so it should be more cache-friendly than the
original "grand sweep" approach.

I don't pretend to know how much balanced run lengths contribute to
performance.  A huge run is, in an important sense, a celebration
of the reduced number of comparisons it took to create it, N-1
comparisons versus N lg N comparisons.  If you blow a few comparisons
(or, much less importantly, moves, which are done in line), you
are still probably well ahead.  Maybe some computer science major
can compare Peter's sort to Tim's sort.  Let the best sort win!
My money is on Peter's sort.

Here's a more accurate citation for Peter's sort, prior to our working
together.  Thanks, Tim!  Much more useful than mine.

   "Optimistic Sorting and Information Theoretic Complexity"
   Peter McIlroy
   SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
   467-474, Austin, Texas, 25-27 January 1993.
