Front page | perl.perl5.porters | Postings from June 2004

## Re: optimizing Short circuiting sort method based on [M]

From:
Paul Fenwick
Date:
June 30, 2004 01:18
Subject:
Re: optimizing Short circuiting sort method based on [M]
Message ID:
40E2774F.1070106@perltraining.com.au
```G'day Glenn / P5P,

Glenn Linderman wrote:

> As an alternative to the copy cost above, I posit an algorithm I'll call
> scan-select (maybe someone has a name for it) where the original data
> isn't modified, but on the first pass you go through all N entries and
> look for the highest/lowest number, and keep track of where it was, and
> select it.  Then on each subsequent scan you go through all the data
> that hasn't already been selected, and look for the highest/lowest
> number, and select it, until you have accumulated M entries.  This would

This sounds just like Selection Sort to me, stopping once you have
the required number of elements.  Selection Sort is O(N^2) in general;
stopping after M selections makes it O(N*M).

In the case where M approaches N, Selection Sort will be a very poor
choice, and we'd be better off using an O(N*logN) sort and then
returning only the wanted elements.

Many of the more efficient sorts (eg, Quicksort and Mergesort) use a
divide-and-conquer approach.  I would propose that a good solution is to
use a modified version of one of these sorts.  Allow the algorithm to
divide, but only conquer each segment if it could possibly contain
elements that we care about in the output list.

Let's use Quicksort with an array of N=100 elements, and we desire the
smallest M=10 elements.

* Quicksort starts by choosing a pivot, and dividing elements
into those less than and those greater than this pivot.  On
average, this will break our array in half (50/50).  Since we're
only interested in the M=10 smallest elements, we can drop the
top 50 elements immediately.

* Next pass we choose another pivot, and repeat.  On average
this will split the array in half again (25/25).  Again,
we can dump the top 25 elements.

* Rinse, repeat.  Next step we get 12/13 on average, dump the
top.

* Next pass we end up dividing into two parts, both of which will
(on average) contain elements of our final list.  We keep all
the elements and continue sorting until done, then scoop up the
required M=10 elements.

The big advantage of this is that it discards on average half of the
elements in the array each pass, until we're only dealing with elements
that are candidates for our final array.  This is *very* fast.

The downside is that quicksort has pathological cases in which it
performs very poorly, such as sorting an already-sorted list.  In such
cases we don't get to discard any elements at all (except our pivots),
and performance plummets.  This is why Perl no longer uses quicksort
to implement sort().

I haven't thought about any other sorting algorithms yet.  I agree that
heapsort is definitely worth examining as an efficient way of gathering
only the first M sorted elements: at each stage the next element is
known (the root), and there are only two candidates for the element
after it (the immediate children of the root).

Cheers,

Paul

--
Paul Fenwick <pjf@perltraining.com.au> | http://perltraining.com.au/
Director of Training                   | Ph:  +61 3 9354 6001
Perl Training Australia                | Fax: +61 3 9354 2681

```
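Glenn's "scan-select" idea quoted at the top of the message can be sketched briefly. This is a minimal illustration (in Python rather than Perl, purely for brevity; the name `scan_select` is invented here): the original data is never modified, and each pass scans the not-yet-selected elements for the smallest remaining one.

```python
def scan_select(data, m):
    """Glenn's 'scan-select': repeatedly scan the unselected elements
    for the smallest one, stopping after m picks.  O(N*M) comparisons;
    the input list is left untouched."""
    selected = set()   # indices already picked
    result = []
    for _ in range(min(m, len(data))):
        best = None
        for i, x in enumerate(data):
            if i in selected:
                continue
            if best is None or x < data[best]:
                best = i
        selected.add(best)
        result.append(data[best])
    return result
```

As the message notes, this is just Selection Sort cut short after M selections, so it only pays off when M is small relative to N.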

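The "divide, but only conquer partitions that matter" Quicksort variant walked through in the bullet points above can be sketched as follows (a Python illustration under my own naming, not code from the thread). The upper partition is recursed into only when the lower partition and pivots together hold fewer than M elements; otherwise it is dropped wholesale, which is the step that discards half the array on average each pass.

```python
def partial_qsort(data, m):
    """Quicksort that divides as usual but only conquers partitions
    that could contribute to the m smallest elements.  Returns the
    m smallest elements in sorted order."""
    if m <= 0:
        return []
    if len(data) <= 1:
        return list(data)[:m]
    pivot = data[len(data) // 2]
    lower = [x for x in data if x < pivot]
    equal = [x for x in data if x == pivot]
    upper = [x for x in data if x > pivot]
    result = partial_qsort(lower, m)          # always worth conquering
    if len(result) < m:
        result += equal[: m - len(result)]    # pivots fill the gap
    if len(result) < m:
        # only now can the upper partition contain wanted elements
        result += partial_qsort(upper, m - len(result))
    return result[:m]
```

On random input the upper branch is pruned until the recursion is entirely inside the first M positions, matching the 50/25/13 discard sequence described above; on an already-sorted list (a bad pivot sequence) nothing gets pruned, which is the pathological case the message warns about.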
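The heap-based approach mentioned at the end of the message is also easy to sketch. Using Python's standard `heapq` module as a stand-in (again an illustration, not code from the thread): building the heap costs O(N), and each of the M pops costs O(log N), so the total is O(N + M log N), with the next smallest element always available at the root.

```python
import heapq

def heap_top_m(data, m):
    """Heapify a copy in O(N), then pop the root m times
    (O(M log N) total): at each stage the next smallest element
    is the heap's root, as described in the message."""
    heap = list(data)      # copy, so the original is untouched
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(min(m, len(heap)))]
```

Unlike the truncated Quicksort, this has no pathological inputs: already-sorted data costs the same as random data.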