develooper Front page | perl.perl5.porters | Postings from June 2004

Re: optimizing Short circuiting sort method based on [M]

Glenn Linderman
June 29, 2004 16:24
Re: optimizing Short circuiting sort method based on [M]
Message ID:
On approximately 6/29/2004 3:10 PM, came the following characters from
the keyboard of david nicol:
> On Fri, 2004-06-25 at 10:55, Glenn Linderman wrote:
>>One trivial way to take advantage of the hint is to recall that the 
>>lowly N-squared bubble sort (and variations) produces one high (or low) 
>>value per pass, and that for small M and large N, that M*N is smaller 
>>than N lg N.
> With regard to yesterday's post in which I'm thinking aloud,
> Considering bubbling or inserting the new, smaller element, based
> on expected average-case number of comparisons required, I think that
> bubbling will be better for M less than or equal to five.
> The reason is, binary search _always_ has to do lg(M-1) comparisons
> while bubbling can do anywhere from one to M-1 comparisons because
> bubbling stops when the new element has done like a mature executive and
> risen to the point where it can't rise any further, an expected number
> of (M-1)/2 comparisons.
> (M-1)/2 is equal to lg2(M-1) at M=5.

I'm not sure what you are calculating here?  The cost of inserting the 
data into the short list?  I'm not sure that is the most interesting 
part of the task.  Since M is small, whether entries are placed by 
bubble or insertion sort is somewhat ho-hum.  In fact, depending on how 
you generate the M elements, it may be that there is no work to sort 
them... they might get handed to you sorted.  On the other hand, maybe 
I'm missing a key unstated assumption that would make me understand what 
you are talking about.

So I find the task of obtaining the correct M elements from the N 
elements to be the more interesting task.  Depending on where the M 
elements are stored.... if already in a "temporary array", then they 
could be sorted "in place", and after M passes of the bubble sort, the 
first (or last, depending on how it is coded) M entries are sorted. 
This takes (N-1) + ... + (N-M)  comparison/swap operations.  If you sort 
the whole array, M approaches N, and the complete cost of sorting is 
O(N**2/2) IIRC.  But after M passes, the short array is just a slice of 
the partially sorted large array.  If it is not already in a temporary 
array, there might be a copy cost to put it in a temporary array, unless 
the result is to be stored replacing the original array, in which case 
we can do the same in-place sort.

With something like a heap sort, the highest/lowest elements can also be 
generated in order.  It being 25 years or more since I last analyzed the 
  costs of heap sort, I'm not going to attempt to regurgitate them in 
this email.  IIRC the heap sort is a better sort choice than quicksort 
if you only want a few of the top/bottom few entries.... IIRC both are 
o(N lg N), but for complete sorting, quicksort has a lower coefficient 
than heapsort.

I'm not sure offhand how many result elements are at the crossover 
between where bubble sort is best and heapsort becomes best, but I 
believe both of  those will generate entries in sorted order faster than 
sorting the whole array to throw away most of the results.

As an alternative to the copy cost above, I posit an algorithm I'll call 
scan-select (maybe someone has a name for it) where the original data 
isn't modified, but on the first pass you go through all N entries and 
look for the highest/lowest number, and keep track of where it was, and 
select it.  Then on each subsequent scan you go through all the data 
that hasn't already been selected, and look for the highest/lowest 
number, and select it, until you have accumulated M entries.  This would 
have a similar cost to the bubble sort, but higher overhead to avoid 
previously selected entries.  So that higher overhead has to be traded 
off against the cost of a copy + partial bubble sort or a copy + partial 
heap sort.

Glenn --
The best part about procrastination is that you are never bored,
because you have all kinds of things that you should be doing. Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About