G'day Glenn / P5P,

Glenn Linderman wrote:

> As an alternative to the copy cost above, I posit an algorithm I'll
> call scan-select (maybe someone has a name for it) where the
> original data isn't modified, but on the first pass you go through
> all N entries and look for the highest/lowest number, and keep
> track of where it was, and select it. Then on each subsequent scan
> you go through all the data that hasn't already been selected, and
> look for the highest/lowest number, and select it, until you have
> accumulated M entries. This would [...]

This sounds just like Selection Sort to me, although stopping after you have the required number of elements. Selection Sort is O(N^2) in full, but stopping after M selections makes this O(N*M). (There's a rough sketch of it at the end of this mail.)

In the case where M approaches N, Selection Sort will be a very poor choice, and we'd be better off using an O(N log N) sort and then returning only the wanted elements.

Many of the more efficient sorts (eg, Quicksort and Mergesort) use a divide-and-conquer approach. I would propose that a good solution is to use a modified version of one of these sorts: allow the algorithm to divide, but only conquer a segment if it could possibly contain elements that we care about in the output list.

Let's use Quicksort with an array of N=100 elements, from which we desire the smallest M=10 elements. (A sketch of this also appears at the end.)

  * Quicksort starts by choosing a pivot, and dividing elements into
    those greater than and those less than this pivot. On average,
    this will break our array in half (50/50). Since we know that
    we're only interested in the M=10 smallest elements, we can drop
    the top 50 elements immediately.

  * Next pass we choose another pivot, and repeat. On average this
    will split the array in half again (25/25). Again, we can dump
    the top 25 elements.

  * Rinse, repeat. Next step we get 12/13 on average, and dump the
    top.

  * Next pass we end up dividing into two parts, both of which will
    (on average) contain elements for our final list. We keep all the
    elements and continue to sort until done, then scoop up the
    required M=10 elements.

The big advantage of this is that it discards, on average, half of the remaining elements on each pass, until we're only dealing with elements that are candidates for our final array. This is *very* fast.

The downside is that quicksort has pathological cases where it performs very poorly, such as sorting an already sorted list with a naive pivot choice. In that case we don't get to discard any elements at all (except our pivots), and performance plummets. This is why Perl no longer uses quicksort to implement sort().

I haven't thought about any other sorting algorithms yet. I agree that heapsort is definitely worth examining as an efficient way to gather only the first M sorted elements, as at each stage the next element is known (the root), and there are only two choices for the element after that (the immediate children of the root). (A heap-based sketch is included below as well.)

Cheers,

Paul

-- 
Paul Fenwick <pjf@perltraining.com.au> | http://perltraining.com.au/
Director of Training                   | Ph:  +61 3 9354 6001
Perl Training Australia                | Fax: +61 3 9354 2681
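PS: Some rough, untested Perl sketches to make the above concrete. The function names, and the assumption that we want the *smallest* M of plain numeric data, are my own. First, scan-select, ie Selection Sort stopped after M picks:

use strict;
use warnings;

# Scan-select sketch: @data is never modified; M times over, scan
# the not-yet-selected entries for the minimum.  O(N*M) comparisons.

sub scan_select {
    my ($m, @data) = @_;
    my %taken;     # index => 1 once an element has been selected
    my @result;

    for (1 .. $m) {
        my $best;  # index of the smallest unselected element so far
        for my $i (0 .. $#data) {
            next if $taken{$i};
            $best = $i if !defined $best or $data[$i] < $data[$best];
        }
        last unless defined $best;   # ran out of elements (M > N)
        $taken{$best} = 1;
        push @result, $data[$best];
    }
    return @result;
}

# eg, the ten smallest of a hundred random numbers:
my @smallest = scan_select(10, map { int rand 1000 } 1 .. 100);
print "@smallest\n";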
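Second, the "divide, but only conquer what matters" Quicksort. The pivot choice here is deliberately naive (first element), purely for brevity; that naivety is exactly what triggers the pathological already-sorted case mentioned above:

use strict;
use warnings;

sub partial_qsort {
    my ($m, @data) = @_;
    return ()    if $m <= 0;
    return @data if @data <= 1;

    my $pivot = shift @data;          # naive pivot, for brevity
    my @lower = grep { $_ <  $pivot } @data;
    my @upper = grep { $_ >= $pivot } @data;

    # The "drop the top half" step: if @lower already holds at least
    # $m elements, the pivot and all of @upper are discarded without
    # ever being sorted.
    return partial_qsort($m, @lower) if @lower >= $m;

    my @sorted = (partial_qsort($m, @lower), $pivot);
    push @sorted, partial_qsort($m - @sorted, @upper);
    return @sorted;
}

# eg:  my @ten = partial_qsort(10, @hundred_elements);

It behaves exactly like an ordinary quicksort until a partition falls entirely outside the first $m positions, at which point that whole partition is thrown away unsorted.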
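Finally, a heap-based version: heapify all N elements in O(N), then pop the root M times at O(log N) apiece, for O(N + M*log N) overall. It's hand-rolled here since core Perl has no heap (CPAN does):

use strict;
use warnings;

# Restore the min-heap property by sifting element $i downwards.
sub sift_down {
    my ($heap, $i) = @_;
    while (1) {
        my $child = 2 * $i + 1;          # left child
        last if $child > $#$heap;
        $child++ if $child < $#$heap     # right child exists and
            and $heap->[$child + 1] < $heap->[$child];  # is smaller
        last if $heap->[$i] <= $heap->[$child];
        @$heap[$i, $child] = @$heap[$child, $i];        # swap
        $i = $child;
    }
}

sub smallest_m {
    my ($m, @heap) = @_;

    # Floyd's heapify: sift down every internal node, bottom-up.
    sift_down(\@heap, $_) for reverse 0 .. int(@heap / 2) - 1;

    my @result;
    while (@heap and @result < $m) {
        push @result, $heap[0];    # the root is always the minimum
        $heap[0] = $heap[-1];      # promote the last element...
        pop @heap;
        sift_down(\@heap, 0);      # ...and restore the heap property
    }
    return @result;
}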