perl.beginners | Postings from April 2022

Please help: perl run out of memory

From: wilson
Date: April 17, 2022 09:33
Subject: Please help: perl run out of memory
Message ID: c9c844df-2d5c-838a-07ae-65d9d3ce24f6@bigcount.xyz

Hello experts,

Could you help me check my script and suggest how to optimize it?
It currently fails with an "Out of memory!" error:

$ perl count.pl
Out of memory!
Killed


My script:
use strict;
use warnings;

my %hash;
my %stat;

# dataset: userId, itemId, rate, time
# AV056ETQ5RXLN,0000031887,1.0,1397692800

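# Three-argument open with a lexical filehandle; dies if the file cannot be opened.
open my $fh, '<', 'rate.csv' or die "Cannot open rate.csv: $!";
while (<$fh>) {
     my ($item, $rate) = (split /,/)[1,2];
     $hash{$item}{total} += $rate;    # running sum of rates per item
     $hash{$item}{count} += 1;        # number of rates per item
}
close $fh;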

for my $key (keys %hash) {
     $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
}

my $i = 0;
for (sort { $stat{$b} <=> $stat{$a}} keys %stat) {
     print "$_: $stat{$_}\n";
     last if $i == 99;
     $i ++;
}

The purpose is to compute the average rate for each itemId, sort the
items by that average in descending order, and print the top 100.
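
One variation that might lower the per-item overhead is to keep the sum
and the count in two flat hashes rather than one hash of hashes, since
every nested hash is a separate allocation. The following is only a
minimal sketch under that assumption, not a measured fix: the helper name
top_items is made up for illustration, the sample rows other than the
first are invented, and it has not been run against the full rate.csv.

```perl
use strict;
use warnings;

# Hypothetical helper: reads "userId,itemId,rate,time" rows from a filehandle
# and returns up to $limit [itemId, average] pairs, highest average first.
sub top_items {
    my ($fh, $limit) = @_;
    my (%total, %count);    # two flat hashes instead of one hash of hashes

    while (my $line = <$fh>) {
        chomp $line;
        my ($item, $rate) = (split /,/, $line)[1, 2];
        next unless defined $rate;    # skip blank or malformed rows
        $total{$item} += $rate;
        $count{$item}++;
    }

    # Reuse %total to hold the averages so no third hash is needed.
    $total{$_} /= $count{$_} for keys %total;

    # Ties on the average are broken by itemId to keep the order stable.
    my @ranked = sort { $total{$b} <=> $total{$a} or $a cmp $b } keys %total;
    splice @ranked, $limit if @ranked > $limit;
    return map { [ $_, $total{$_} ] } @ranked;
}

# Sample rows in the same layout as rate.csv (all but the first are invented).
my $sample = join "\n",
    'AV056ETQ5RXLN,0000031887,1.0,1397692800',
    'A1EXAMPLEUSER,0000031887,4.0,1397692801',
    'A2EXAMPLEUSER,0001061100,5.0,1397692802',
    '';

# Read the sample through an in-memory filehandle; a real run would open rate.csv.
open my $fh, '<', \$sample or die "cannot open sample: $!";
print "$_->[0]: $_->[1]\n" for top_items($fh, 100);
close $fh;
```

Whether this is enough depends on how many distinct itemIds the file
contains, which I have not counted; the full sort over all keys is also
kept as in the original script.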

The dataset has 80+ million rows:

$ wc -l rate.csv
82677131 rate.csv

And my memory is somewhat limited:

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           1992         152          76           0        1763        1700
Swap:          1023         802         221



What confuses me is that Apache Spark can get this job done with the same
limited memory. It produced the statistics within 2 minutes. But I want
to give Perl a try, since it is not always convenient to run a Spark job.

The Spark implementation:

scala> val schema="uid STRING,item STRING,rate FLOAT,time INT"
val schema: String = uid STRING,item STRING,rate FLOAT,time INT

scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ... 2 more fields]

scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
+----------+--------+
|      item|avg_rate|
+----------+--------+
|0001061100|     5.0|
|0001543849|     5.0|
|0001061127|     5.0|
|0001019880|     5.0|
|0001062395|     5.0|
|0000143502|     5.0|
|000014357X|     5.0|
|0001527665|     5.0|
|000107461X|     5.0|
|0000191639|     5.0|
|0001127748|     5.0|
|0000791156|     5.0|
|0001203088|     5.0|
|0001053744|     5.0|
|0001360183|     5.0|
|0001042335|     5.0|
|0001374400|     5.0|
|0001046810|     5.0|
|0001380877|     5.0|
|0001050230|     5.0|
+----------+--------+
only showing top 20 rows


I think it should be possible to optimize my Perl script so that it can
run this job as well, so I am asking for your help.

Thanks in advance.

wilson
