develooper Front page | perl.perl5.porters | Postings from September 2013

unstable mergesort?

Thread Next
From:
John P. Linderman
Date:
September 11, 2013 16:01
Subject:
unstable mergesort?
Message ID:
CAC0cEp_nXAUqcu3SsCgMWeF=jCjqmDDW1kREcv=AKGcwv30vjA@mail.gmail.com
Writing my opinion piece about timsort

See Re: [perl #119635] deprecate and remove qsort

got me thinking about the cost of stability.  Consider the "stuttering
countdown sequence"

100 99 99 98 98 97 97 ... 3 3 2 2 1

It's decreasing, so if we were free to ignore stability, we could knock it
off in N-1 comparisons.  But stability obliges us to consider the first 99
as strictly less than the second 99, and so on.  We get 99 runs of length 2
(and, for those who waded through my analysis, we get a kind of worst-case
initial setup for our mergesort.  Every PTHRESH (8) pairs, we say, wow, 8
pairs in a row in descending order, this sure looks like it could be a long
descending run.  So we risk a compare to see if we can bridge the last two
pairs, and find we cannot.  So we start over, and find 8 more pairs with
the same sense, and risk another compare, and get disappointed again.
Every 16 elements, we make a comparison that doesn't pay off.  1/16th worse
than if we hadn't tried to find long runs.  But that's not so awful.)
Merging those runs of length 2 takes us more like NlgN comparisons rather
than N-1.  And if were really just sorting integers or floats, that would
be rather sad, because there's no way of looking at the final result and
knowing which 99 was which (more about this later).

As I mentioned in the analysis, when we compare two equal elements, a and
b, using

cmp(a,b)

then we can regard a as low if a preceded b in the input sequence, but, if
we want stability, we must regard a as greater than b if a came later in
the sequence than b.  In the merge phase, we'll be comparing the same a
against assorted b's, all of which either came before it, or all of which
come later.  So we do something like

if ( a comes before all the b's ) sense  = 0;
else sense = -1;

and then do our compares as

if (cmp(a,b) <= sense) treat a as low
else treat b as low

This "does the right thing" for all a and b, whether they are equal or
not.  But it occurs to me that we could "ignore stability" by doing

if (stable) sensestab = -1;
else sensestab = 0;

at the start of the sort, and then replace the "else sense = -1;" above with

else sense = sensestab;

Our comparison logic would then treat a as low to equal b, regardless of
which came first.  It looks like things will be a little less trivial in
the identification of initial runs phase, but probably not too bad.  After
which

no stable;

would allow us to ignore stability, potentially to considerable gain.  In
fact, one might be tempted to make it the default if the comparison routine
were { $a <=> $b }.  Alas, we cannot, because given input like qw( 2b 2a 1b
1a ), numerical comparison will work on the leading numerical part, but
carry along the string form as well, so you *can* tell which 2 is which.
But we can change the documentation (which we must anyway) to emphasize
that most numerical sorts can benefit from not enforcing stability.

If this gets enough +1s, then if someone points me at the documentation for
creating a git branch, I'll take a (leisurely) whack at making stability
optional.

Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About