develooper Front page | perl.perl5.porters | Postings from June 2004

Re: optimizing Short circuiting sort method based on [M]

Thread Previous | Thread Next
From:
Paul Fenwick
Date:
June 30, 2004 01:18
Subject:
Re: optimizing Short circuiting sort method based on [M]
Message ID:
40E2774F.1070106@perltraining.com.au
G'day Glenn / P5P,

Glenn Linderman wrote:

> As an alternative to the copy cost above, I posit an algorithm I'll call 
> scan-select (maybe someone has a name for it) where the original data 
> isn't modified, but on the first pass you go through all N entries and 
> look for the highest/lowest number, and keep track of where it was, and 
> select it.  Then on each subsequent scan you go through all the data 
> that hasn't already been selected, and look for the highest/lowest 
> number, and select it, until you have accumulated M entries.  This would 

This sounds just like Selection Sort to me, although stopping after you 
have the required number of elements.  It's O(N^2), although in this 
case it would be O(N*M).

In the case where M approaches N, Selection Sort will be a very poor 
choice, and we'd be better off using an O(N*logN) sort and then 
returning only the wanted elements.

Many of the more efficient sorts (eg, Quicksort and Mergesort) use a 
divide-and-conquer approach.  I would propose that a good solution is to 
use a modified version of one of these sorts.  Allow the algorithm to 
divide, but only conquer each segment if it could possibly contain 
elements that we care about in the output list.

Let's use Quicksort with an array of N=100 elements, and we desire the 
smallest M=10 elements.

	* Quicksort starts by choosing a pivot, an dividing elements
	  into greater than or less than this pivot.  On average,
	  this will break our array in half (50/50).  Since we know
	  that we're only interested in M=10 smallest elements, we can
	  drop the top 50 elements immediately.

	* Next pass we choose another pivot, and repeat.  On average
	  this will split the array in half again (25/25).  Again,
	  we can dump the top 25 elements.

	* Rinse, repeat.  Next step we get 12/13 on average, dump the
	  top.

	* Next pass we end up dividing into two parts, both which will
	  (on average) contain elements for our final list.  We
	  keep all the elements and continue to sort until done, and
	  scoop up the required M=10 elements.

The big advantage of this is that it discards on average half of the 
elements in the array each pass, until we're only dealing with elements 
that are candidates for our final array.  This is *very* fast.

The downside is that quicksort has pathological cases when it performs 
very poorly, such as sorting an already sorted list.  In this case we 
don't get to discard any elements at all (except our pivots), and 
performance plummets.  This is why Perl doesn't use quicksort any more 
to implement sort().

I haven't thought about any other sorting algorithms yet.  I agree that 
heapsort is definitely worth examining as an efficient way to gathering 
only the first sorted M elements, as at each stage the next element is 
known (the root), and there are only two choices for the next element 
(the immediate children of the root).

Cheers,

	Paul


-- 
Paul Fenwick <pjf@perltraining.com.au> | http://perltraining.com.au/
Director of Training                   | Ph:  +61 3 9354 6001
Perl Training Australia                | Fax: +61 3 9354 2681

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About