Front page | perl.perl5.porters | Postings from September 2013

## unstable mergesort?

From:
John P. Linderman
Date:
September 11, 2013 16:01
Subject:
unstable mergesort?
Message ID:
CAC0cEp_nXAUqcu3SsCgMWeF=jCjqmDDW1kREcv=AKGcwv30vjA@mail.gmail.com
```Writing my opinion piece about timsort

See Re: [perl #119635] deprecate and remove qsort

got me thinking about the cost of stability.  Consider the "stuttering
countdown sequence"

100 99 99 98 98 97 97 ... 3 3 2 2 1

It's decreasing, so if we were free to ignore stability, we could knock it
off in N-1 comparisons.  But stability obliges us to consider the first 99
as strictly less than the second 99, and so on.  We get 99 runs of length 2
(and, for those who waded through my analysis, we get a kind of worst-case
initial setup for our mergesort.  Every PTHRESH (8) pairs, we say, wow, 8
pairs in a row in descending order, this sure looks like it could be a long
descending run.  So we risk a compare to see if we can bridge the last two
pairs, and find we cannot.  So we start over, and find 8 more pairs with
the same sense, and risk another compare, and get disappointed again.
Every 16 elements, we make a comparison that doesn't pay off.  1/16th worse
than if we hadn't tried to find long runs.  But that's not so awful.)
Merging those runs of length 2 takes us more like NlgN comparisons rather
than N-1.  And if were really just sorting integers or floats, that would
be rather sad, because there's no way of looking at the final result and

As I mentioned in the analysis, when we compare two equal elements, a and
b, using

cmp(a,b)

then we can regard a as low if a preceded b in the input sequence, but, if
we want stability, we must regard a as greater than b if a came later in
the sequence than b.  In the merge phase, we'll be comparing the same a
against assorted b's, all of which either came before it, or all of which
come later.  So we do something like

if ( a comes before all the b's ) sense  = 0;
else sense = -1;

and then do our compares as

if (cmp(a,b) <= sense) treat a as low
else treat b as low

This "does the right thing" for all a and b, whether they are equal or
not.  But it occurs to me that we could "ignore stability" by doing

if (stable) sensestab = -1;
else sensestab = 0;

at the start of the sort, and then replace the "else sense = -1;" above with

else sense = sensestab;

Our comparison logic would then treat a as low to equal b, regardless of
which came first.  It looks like things will be a little less trivial in
the identification of initial runs phase, but probably not too bad.  After
which

no stable;

would allow us to ignore stability, potentially to considerable gain.  In
fact, one might be tempted to make it the default if the comparison routine
were { \$a <=> \$b }.  Alas, we cannot, because given input like qw( 2b 2a 1b
1a ), numerical comparison will work on the leading numerical part, but
carry along the string form as well, so you *can* tell which 2 is which.
But we can change the documentation (which we must anyway) to emphasize
that most numerical sorts can benefit from not enforcing stability.

If this gets enough +1s, then if someone points me at the documentation for
creating a git branch, I'll take a (leisurely) whack at making stability
optional.

```