perl.beginners | Postings from February 2009

Re: processing large datafiles

Jim Gibson
February 17, 2009 09:40
On Tue, Feb 17, 2009, 9:06 AM, "Pedro Soto" scribbled:

> Dear all,
> I need to read a huge file and then write only the columns that match
> with ids from another file (with less ids) in a sorted fashion.
> I made a script that does the work but it takes a lot of time. I tried
> the script with few columns from the huge and it took 5 sec to do the
> job. Because I have over 403 000 ids, I calculated more and less 3hr
> to run the complete files, but the script is taking longer than that.
> I wonder if someone has a better way to do this... I really need to
> write the huge file by sorted ids. Any help will be greatly
> appreciated
> Here is the code:
> #!usr/local/bin/perl/

That should probably be

#!/usr/local/bin/perl

> use warnings;
> use strict;
> open(MAP,"") || die;

It is better to use lexical variables for file handles, say why the open
fails if it does not succeed, and use the low-precedence 'or' instead of
'||':

open( my $map, "") or die("Can't open $!");
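A minimal sketch of that advice follows. The three-argument form of open and the helper name open_for_read are my additions, not something from the original script, and the temp file exists only so the example runs on its own:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open a file for reading with a lexical handle,
# dying with the file name and the OS error ($!) on failure.
sub open_for_read {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die("Can't open $path: $!");
    return $fh;
}

# Create a small temp file so the example is self-contained.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "Chromosome id pos\n1 rs1 100\n";
close($tmp_fh) or die("Can't close $tmp_name: $!");

# Read it back through the lexical handle.
my $map        = open_for_read($tmp_name);
my $line_count = 0;
while ( my $line = <$map> ) {
    $line_count++;
}
close($map) or die("Can't close $tmp_name: $!");
print "read $line_count lines\n";
```

Including the file name in the message makes a failure much easier to track down than a bare die.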

> my %map;
> my %locus;
> while(<MAP>) {

Indenting loops and conditionals makes your source file more readable to
those who are trying to help you.

while( <$map> ) {

> chomp;
> my @snp =split /\s+/;

You can use the default split if you don't have any leading whitespace.
Your program will be easier to understand if you use explicit variable names,
and faster if you only keep the fields you need, here fields 0, 2, and 3
(you should replace $field2 and $field3 with better names):

    my( $type, undef, $field2, $field3 ) = split;
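The leading-whitespace caveat can be seen in a short sketch. The sample line is invented for illustration and is not taken from the map file:

```perl
use strict;
use warnings;

# Invented sample line with two leading spaces.
my $line = "  1 rs123 0.5 1000";

# Explicit pattern: leading whitespace yields an empty first field.
my @explicit = split /\s+/, $line;

# Default split (on $_, awk-style): leading whitespace is skipped.
my @default;
for ($line) { @default = split; }

print scalar(@explicit), " fields with /\\s+/\n";
print scalar(@default),  " fields with default split\n";
```

When a line has no leading whitespace, both forms return the same list.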

> if ($snp[0] =~ /Chromosome/) {next};

If the first field of the header line is exactly 'Chromosome', the eq
operator will be faster than a regular expression:

    next if $type eq 'Chromosome';
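The two tests are not strictly interchangeable, so it is worth checking your data before swapping one for the other. A small sketch with made-up values (the sub names are hypothetical):

```perl
use strict;
use warnings;

# Exact comparison: true only for the string 'Chromosome'.
sub is_header_eq { return $_[0] eq 'Chromosome' ? 1 : 0 }

# Unanchored pattern: true for any string containing 'Chromosome'.
sub is_header_re { return $_[0] =~ /Chromosome/ ? 1 : 0 }

# Made-up first-column values.
for my $type ( 'Chromosome', 'Chromosome_2', '7' ) {
    printf "%-13s eq=%d regex=%d\n",
        $type, is_header_eq($type), is_header_re($type);
}
```

If the header field can carry extra characters, an anchored pattern or an exact comparison on a cleaned-up field would be needed instead.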

> push(@{$map{$snp[0]}},$snp[3]);
> $locus{$snp[3]} = $snp[2];

    $locus{$field3} = $field2;

(but see below).

> }
> close MAP;
> open(IN,"trialped.csv") || die;
> my @AoA =();
> while(<IN>) {
> chomp;
> my @temp =split/,/;
> push(@AoA,[@temp]);
> }
> close IN;
> $out1= "outfile.txt";
> open(OUT1,">$out1") || die;
> for (my $x=1;$x<=$#AoA;$x++) {

You are skipping the first element of AoA. Is that what you mean to do?

> print OUT1 "$x $AoA[$x][0] 0 0 0 1\t";
> foreach my $k (sort {$a <=>$b} keys%map) {
>  foreach my $val(sort {$a <=>$b} @{$map{$k}}){
>      for (my $y=1;$y <$sca;$y++) {

$sca seems to be undefined here.

>      if($locus{$val} eq $AoA[0][$y]) {

You are searching %locus for a value that matches $AoA[0][$y] (and once
again skipping the first element!). It is faster to look up a key than
search all (key,value) pairs for a matching value. That indicates that your
hash should be defined in an inverse manner:

    $locus{$field2} = $field3;

Then you can simply test whether $AoA[0][$y] exists as a key in %locus:

   foreach my $val ( sort {$a<=>$b} @{$map{$k}} ) {
      if( exists $locus{$AoA[0][$y]} ) {
        print "$AoA[$x][$y]";

>        print "$AoA[$x][$y]";
>       last;
>       }
> }
> }
> }
> print OUT1 "\n";
> }
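To illustrate the lookup-by-key point made above, here is a minimal sketch. It is not the original script: it assumes the first row of the data holds unique column ids, and the data and the names %col_of, @header, and @wanted are invented for the example. Building the index once replaces the inner scan over columns with a single hash lookup per id:

```perl
use strict;
use warnings;

# Made-up data: first row is the header of ids, later rows are samples.
my @AoA = (
    [ 'sample', 'rs3', 'rs1', 'rs2' ],
    [ 'ind1',   'AA',  'CC',  'GG'  ],
    [ 'ind2',   'AT',  'CG',  'GT'  ],
);

# Ids in the order they should be written (e.g. already sorted).
my @wanted = ( 'rs1', 'rs2', 'rs9' );

# Build id => column index once, skipping column 0 (the sample name).
my @header = @{ $AoA[0] };
my %col_of;
$col_of{ $header[$_] } = $_ for 1 .. $#header;

my @lines;
for my $x ( 1 .. $#AoA ) {
    my @cells;
    for my $id (@wanted) {
        next unless exists $col_of{$id};    # id not present in the data
        push @cells, $AoA[$x][ $col_of{$id} ];
    }
    push @lines, join( ' ', $AoA[$x][0], @cells );
}
print "$_\n" for @lines;
```

With this layout the work per row grows with the number of wanted ids rather than with ids times columns.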

You can close OUT1 explicitly and check for errors.

close(OUT1) or die("Error writing to $out1: $!");
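Since close returns true on success, the check needs 'or' so that die runs only on failure. A minimal sketch, using a temp file in place of the real output file (the sample record is made up):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temp file stands in for the real output file.
my ( $out, $out_name ) = tempfile( UNLINK => 1 );

# print also returns false on failure, so it can be checked the same way.
print {$out} "1 ind1 0 0 0 1\tCC GG\n"
    or die("Error writing to $out_name: $!");

# Buffered data is flushed at close, so write errors can surface here.
close($out) or die("Error closing $out_name: $!");

my $size = -s $out_name;
print "wrote $size bytes\n";
```

Checking close matters most for large outputs, where a full disk may only be reported when the last buffer is flushed.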
