develooper Front page | perl.perl5.porters | Postings from June 2004

Re: optimizing Short circuiting sort method based on [M]

Glenn Linderman
June 30, 2004 11:27
Re: optimizing Short circuiting sort method based on [M]
Message ID:
On approximately 6/30/2004 1:18 AM, came the following characters from
the keyboard of Paul Fenwick:
> G'day Glenn / P5P,
> Glenn Linderman wrote:
>> As an alternative to the copy cost above, I posit an algorithm I'll 
>> call scan-select (maybe someone has a name for it) where the original 
>> data isn't modified, but on the first pass you go through all N 
>> entries and look for the highest/lowest number, and keep track of 
>> where it was, and select it.  Then on each subsequent scan you go 
>> through all the data that hasn't already been selected, and look for 
>> the highest/lowest number, and select it, until you have accumulated M 
>> entries.  This would 
> This sounds just like Selection Sort to me, although stopping after you 
> have the required number of elements.  It's O(N^2), although in this 
> case it would be O(N*M).
> In the case where M approaches N, Selection Sort will be a very poor 
> choice, and we'd be better off using an O(N*logN) sort and then 
> returning only the wanted elements.

Thanks for knowing the name.

Another possible way of saving the copy cost is to intergrate the copy 
into the first pass of the sort, when all the data has to be examined 
once anyway, and it could be moved (and possibly reordered) at the same 
time.  This would probably require a customized bit of code for the 
first pass, whereas subsequent passes would operate in place.

> Many of the more efficient sorts (eg, Quicksort and Mergesort) use a 
> divide-and-conquer approach.  I would propose that a good solution is to 
> use a modified version of one of these sorts.  Allow the algorithm to 
> divide, but only conquer each segment if it could possibly contain 
> elements that we care about in the output list.
> Let's use Quicksort with an array of N=100 elements, and we desire the 
> smallest M=10 elements.
>     * Quicksort starts by choosing a pivot, an dividing elements
>       into greater than or less than this pivot.  On average,
>       this will break our array in half (50/50).  Since we know
>       that we're only interested in M=10 smallest elements, we can
>       drop the top 50 elements immediately.
>     * Next pass we choose another pivot, and repeat.  On average
>       this will split the array in half again (25/25).  Again,
>       we can dump the top 25 elements.
>     * Rinse, repeat.  Next step we get 12/13 on average, dump the
>       top.
>     * Next pass we end up dividing into two parts, both which will
>       (on average) contain elements for our final list.  We
>       keep all the elements and continue to sort until done, and
>       scoop up the required M=10 elements.
> The big advantage of this is that it discards on average half of the 
> elements in the array each pass, until we're only dealing with elements 
> that are candidates for our final array.  This is *very* fast.
> The downside is that quicksort has pathological cases when it performs 
> very poorly, such as sorting an already sorted list.  In this case we 
> don't get to discard any elements at all (except our pivots), and 
> performance plummets.  This is why Perl doesn't use quicksort any more 
> to implement sort().

Because of the pathological cases, it is probably better not to choose 
Quicksort for obtaining just a few elements either.  But thanks for 
describing the technique one could use for it, and possibly for some 
other sorts as well.

> I haven't thought about any other sorting algorithms yet.  I agree that 
> heapsort is definitely worth examining as an efficient way to gathering 
> only the first sorted M elements, as at each stage the next element is 
> known (the root), and there are only two choices for the next element 
> (the immediate children of the root).

Yes, it was because Heapsort is also O(N lg N), as well as having 
producing a sorted element on each pass, that made me think it might 
work effectively for this scenario, hopefully being more efficient than 
bubble sort as M increases.  Of course, as M approaches N, it is 
probably more efficient to just sort the array with an efficient O(N lg 
N) sort algorithm, and extract the needed elements... and it also seems 
likely that as M approaches N, that it is less likely that there'll be a 
hint as to how many elements are needed, unless the coder uses a large 
slice to contrive the hint, knowing about the optimization.  And for 
small enough N, one should use the bubble sort anyway, so even if M is 
close to N, using the hint could be worthwhile.

> Cheers,
>     Paul

Glenn --
The best part about procrastination is that you are never bored,
because you have all kinds of things that you should be doing. Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About