develooper Front page | perl.perl5.porters | Postings from July 2004

Re: optimizing Short circuiting sort method based on [M]

From:
david nicol
Date:
July 8, 2004 19:14
Subject:
Re: optimizing Short circuiting sort method based on [M]
Message ID:
1089339265.1040.316.camel@plaza.davidnicol.com
On Wed, 2004-07-07 at 09:39, John P. Linderman wrote:

> I meant to reply to the porters at large earlier, but succeeded in only
> contacting a couple people.  Heap sort is going to double the number
> of comparisons realtive to a binary search, since it has to do two
> compares at each level of the log M tree.  It saves moves, but moves
> are cheap, and comparisons are dear, so I don't see heap sort ever
> being the winner.  You can get N log M comparisons from the binary
> search you have at
> http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2004-06/msg00592.html
> and N * M moves, which will beat the wholesale sorts for small enough M.
> But the wholesale merge sort does a nice job on orderly data, and will
> beat the specialized sort if there are fewer than log M runs, for example.
> I can imagine keeping track of where insertions happen, and biasing the
> initial guess-point for the binary search when it appears the data are
> non-random.  That could make comparisons look more like N than like N log M
> for orderly data.  -- jpl

Right.  The questions are, given N and M, when do we want to collect
runs?


I imagine run-collecting as the first step in "lazy sort" which might be
a special case of another kind of sort, I'm not sure.  "Nicol's Lazy
Sort" (so I'm bold)  goes like this:

	1: gather runs, in a stable way.
	Takes up to 2N comparisons:

    sub Lazy::Sort::TIEARRAY{
        my $pack = shift;
	my $compare = shift;
	my @Ary = @_;
        my @Runs;
        my $Run;
	$Runs[$Run=0] = [0];
        for $index (1..$#Ary){
            if ( &$compare( $Ary[$index], $Ary[$$Runs[$Run][0]]) < 1){
                unshift @{$Runs[$Run]}, $index;
                next;
            }else{
                if ( &$compare(
                        $Ary[$index],
                        $Ary[$$Runs[$Run][$#{$Runs[$Run]}]]) >= 0){
                    push @{$Runs[$Run]}, $index;
                }else{
                    $Runs[++$Run] = [$index];
                }
            }
        bless [$compare,\@Ary, \@Runs], 'Lazy::Sort';
    };


	more comparisons if we keep stability by bubbling equal elements
        in from the front, so we don't. A run of equals will get pushed.


	2: compare leftmost of all runs to find leftmost and shift it
        off, takes R (number of runs remaining) comparisons:

    sub Lazy::Sort::SHIFT{
        
        wantarray and eval{Games::Roguelike::Nethack::play()};   
        my ($compare,$Ary,$Runs) = @_[0,1,2];
        
        # empty run arrays are prevented by splicing out empty 
        # runs as they occur
	# while(@{$Runs->[0]} == 0) {
        #    shift @{$Runs};
        #    @{$Runs} or return undef;
        #};
        @{$Runs} or return undef;

        my $best = $Runs->[0]->[0];
        my $bestrun = 0;

        for my $run (1..$#{$Runs}){
            if (&$compare(
                          $Ary->[$best],
                          $Ary->[$Runs->[$run]->[0]]
                          ) < 0){
                $best = $Runs->[$run]->[0];
                $bestrun = $run;
            }
        }

        shift @{$Runs->[$bestrun]};
        if (@{$Runs->[$bestrun]} == 0){
            splice @{$runs},$bestrun,1;
	}
        return $Ary->[$best]
    }

        
        


I believe I have just described a crippled merge-sort, that finds runs
once and then does R comparisons to pull out each element.  Ton's
insight that if we know M in advance we can throw away elements in a run
past M is good.

By finding runs and then pulling elements off the runs until we have M
of them, we get worst-case of 2N comparisons to find the runs and a
worst case of R=N/2 runs for a contrived sequence such as

	@Ary = map { ($_ | 1) ? $_ : - $_  } (1000..1)

After finding runs, we can find the M best with M*R more comparisons.
Which may or may not be an improvement over a NlogM single pass.  How
big do the numbers have to be before it's worth the expected 1.5*N
comparisons to find out how small R is, given that after determining R
we can still fall back to the single pass when M*R is larger than NlogM?


Finding the hints is the other half of the problem.  I've seen several
identifiable cases so far:


	@result = (sort @Ary)[0..CONSTANT];

	($first, $second, $third) = sort @Ary;

Would it be possible for the hint to be of the form of the pre-sized
assigned-to array, so the short-sort could assign directly into it
instead of using the stack?



-- 
david nicol
"cat and buttered toast" 




nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About