develooper Front page | perl.perl5.porters | Postings from July 2004

Re: optimizing Short circuiting sort method based on [M]

david nicol
July 5, 2004 17:33
Re: optimizing Short circuiting sort method based on [M]
Message ID:
On Wed, 2004-06-30 at 03:18, Paul Fenwick wrote:
> Glenn Linderman wrote:
> > As an alternative to the copy cost above, I posit an algorithm I'll call 
> > scan-select (maybe someone has a name for it) where the original data 
> > isn't modified, but on the first pass you go through all N entries and 
> > look for the highest/lowest number, and keep track of where it was, and 
> > select it.  Then on each subsequent scan you go through all the data 
> > that hasn't already been selected, and look for the highest/lowest 
> > number, and select it, until you have accumulated M entries.  This would 
> This sounds just like Selection Sort to me, although stopping after you 
> have the required number of elements.  It's O(N^2), although in this 
> case it would be O(N*M).

> In the case where M approaches N, Selection Sort will be a very poor 
> choice, and we'd be better off using an O(N*logN) sort and then 
> returning only the wanted elements.

I am wondering if I was somehow unclear in my proposed algorithm,
archived at

which describes both bubble and inserting to place new elements in the
short list.  I believe the algorithm there to be worst case N*M for
bubble and N lg M for insertion placement mechanisms. Expected case for
random data should be the sum of:
         M lg M    # sort the first M elements

	 N - M     # the comparison of each element to the current 
                     gatekeeper element

         # a term that represents the base two log of M-1 multiplied
           by the probability that an element will compare left of the
           gatekeeper element, which itself gets adjusted left as the
           long array is traversed.  The probability of an element
           belonging leftward from the gatekeeper falls to M/N for the
           last element considered, from some larger amount that
           represents the probability of $Large[$M] sorting leftward of
           the rightmost element in a @Short = sort @Large[0..(M-1)].
           My grasp of combinatorics is currently not up to describing
           that term.  And that's without considering the effect of
           stability in preventing placements of elements that are
           merely equivalent to the gatekeeper instead of leftward
           of it.

Important points: 

* One pass over all elements in long array, comparing each to
  the "gatekeeper" element which is the leftmost element in the
  short array.

* Additional comparisons are only required to determine placement of
  an element that belongs left of the gatekeeper.

* We are only concerned with optimizing away unncessary sorting when
  M is smaller than N.

* we only consider comparisons as the metric.

After the short circuiting sort optimization is in place a hint about
when the comparison function is cheap enough that we should care about
all the data shifting required for insertion sorts and reoptimize short
circuited sorts back to full sorts for sufficiently large M and
sufficiently small N-M

david nicol
 "cat and buttered toast" Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About