On Wed, 2004-06-30 at 03:18, Paul Fenwick wrote: > Glenn Linderman wrote: > > > As an alternative to the copy cost above, I posit an algorithm I'll call > > scan-select (maybe someone has a name for it) where the original data > > isn't modified, but on the first pass you go through all N entries and > > look for the highest/lowest number, and keep track of where it was, and > > select it. Then on each subsequent scan you go through all the data > > that hasn't already been selected, and look for the highest/lowest > > number, and select it, until you have accumulated M entries. This would > > This sounds just like Selection Sort to me, although stopping after you > have the required number of elements. It's O(N^2), although in this > case it would be O(N*M). > In the case where M approaches N, Selection Sort will be a very poor > choice, and we'd be better off using an O(N*logN) sort and then > returning only the wanted elements. I am wondering if I was somehow unclear in my proposed algorithm, archived at http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2004-06/msg00592.html which describes both bubble and inserting to place new elements in the short list. I believe the algorithm there to be worst case N*M for bubble and N lg M for insertion placement mechanisms. Expected case for random data should be the sum of: M lg M # sort the first M elements N - M # the comparison of each element to the current gatekeeper element # a term that represents the base two log of M-1 multiplied by the probability that an element will compare left of the gatekeeper element, which itself gets adjusted left as the long array is traversed. The probability of an element belonging leftward from the gatekeeper falls to M/N for the last element considered, from some larger amount that represents the probability of $Large[$M] sorting leftward of the rightmost element in a @Short = sort @Large[0..(M-1)]. My grasp of combinatorics is currently not up to describing that term. And that's without considering the effect of stability in preventing placements of elements that are merely equivalent to the gatekeeper instead of leftward of it. Important points: * One pass over all elements in long array, comparing each to the "gatekeeper" element which is the leftmost element in the short array. * Additional comparisons are only required to determine placement of an element that belongs left of the gatekeeper. * We are only concerned with optimizing away unncessary sorting when M is smaller than N. * we only consider comparisons as the metric. After the short circuiting sort optimization is in place a hint about when the comparison function is cheap enough that we should care about all the data shifting required for insertion sorts and reoptimize short circuited sorts back to full sorts for sufficiently large M and sufficiently small N-M -- david nicol "cat and buttered toast"