develooper Front page | perl.libwww | Postings from July 2001

Memory usage by HTML::TreeBuilder

From:
Curt Powell
Date:
July 19, 2001 10:31
Subject:
Memory usage by HTML::TreeBuilder
Message ID:
LPBBICBICFBDGHPLFHFMGEOIDJAA.cpowell@sierraridge.com
All,

I've run into a problem with HTML::TreeBuilder when processing large pages.
My process size jumps enormously, e.g. in the sample output below a process
of about 11MB grows by roughly 72MB while processing a 2.6MB web page, and
that memory is not released when processing finishes.  (Is this an artifact
of perl's memory management architecture?)  Furthermore, retrieving another
large page results in more process growth, although not as much as the first
time.  By commenting out each function call in the test script in turn, I've
been able to attribute by far the largest share of the memory growth to the
buildtree() function, which calls HTML::TreeBuilder.  (Test script and sample
output may be found at the end of this message.)
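
The per-call measurement described above could also be automated instead of commenting calls out by hand. The following is only a minimal sketch, assuming a Linux /proc/PID/stat layout in which field 23 is the virtual memory size in bytes. The helper names vsize() and growth_during() are hypothetical, and the anonymous sub is a stand-in workload, not the real buildtree().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Virtual memory size of this process in bytes (field 23 of /proc/PID/stat),
# or undef where /proc is not available.
sub vsize {
    open(my $fh, '<', "/proc/$$/stat") or return;
    my $line = <$fh>;
    close $fh;
    # Field 2 (the command name) is parenthesised and may contain spaces,
    # so drop everything up to the last ')' before splitting.
    $line =~ s/^.*\)\s+//;
    my @fields = split ' ', $line;
    return $fields[20];    # index 0 is field 3, so field 23 is index 20
}

# Run $code and return how many bytes the process grew while it ran
# (undef if the size could not be read).
sub growth_during {
    my ($code) = @_;
    my $before = vsize();
    $code->();
    my $after = vsize();
    return unless defined $before && defined $after;
    return $after - $before;
}

# Hypothetical workload standing in for buildtree(): build and drop a big list.
my $delta = growth_during(sub { my @big = (1) x 1_000_000; return });
print defined $delta ? "grew by $delta bytes\n" : "N/A\n";
```

Wrapping each suspect call in growth_during() gives one figure per call, which makes it easier to compare them than re-running the script with lines commented out.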

Is there a way to get this memory back after processing a large page?  Does
perl have a way to force garbage collection, à la Java?
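
One workaround that fits this question, sketched here under the assumption that each page can be processed independently and only a small result needs to come back, is to do the heavy work in a short-lived child process: whatever the child allocates is returned to the operating system when it exits. The helper name run_in_child() is hypothetical, and the anonymous sub is a stand-in workload rather than the real buildtree().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run $work in a forked child and return the text it sends back over a pipe.
# The parent never allocates the child's working memory.
sub run_in_child {
    my ($work) = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                  # child: do the work, report, exit
        close $reader;
        print {$writer} $work->();
        close $writer;
        exit 0;
    }
    close $writer;                    # parent: read the child's report
    my $result = do { local $/; <$reader> };
    close $reader;
    waitpid($pid, 0);
    return $result;
}

# Hypothetical stand-in for buildtree(): allocate a lot, return a small summary.
my $summary = run_in_child(sub {
    my @big = (1) x 1_000_000;
    return scalar(@big);
});
print "child reported: $summary\n";
```

The trade-off is the cost of a fork per page and the need to serialise whatever result the parent wants to keep.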

I've seen this behavior under perl 5.003 (RH Linux 6.1), 5.6 (RH Linux 7.0)
and 5.6.1 (RH Linux 7.1).

Curt
-------------------------

#!/usr/bin/perl
#usage: ./memtest < input_file

sub formattext  # called by buildtree()
{
	use HTML::FormatText;
	my $html = shift;
	my $formatter = HTML::FormatText->new(leftmargin=>0, rightmargin=>250);
	my $ascii = $formatter->format($html);
	return $ascii;
}
sub buildtree  # called by geturllength()
{
	my $Response = shift;
	use HTML::TreeBuilder;
	my $html = HTML::TreeBuilder->new();
	$html->parse($Response->content);
	$html->eof;  # signal end of document so the parser can finish the tree
	&formattext($html);
	$html = $html->delete;  # break the tree's circular references so it can be freed
}

sub geturllength  # called by main loop; takes the URL as its argument
{
	use LWP::UserAgent;
	use HTTP::Request;
	my $URL = shift;
	my $UA = LWP::UserAgent->new();
	my $Request = HTTP::Request->new(GET => $URL);
	my $Response = $UA->request($Request);
	print "Error retrieving $URL\n" if ($Response->is_error());
	&buildtree($Response);
	return length($Response->as_string);
}

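# return virtual memory size of this process in bytes (Linux /proc only)
sub memused
{
	local *memused_TMP_FILE;
	open(memused_TMP_FILE, "</proc/$$/stat") or return "N/A";
	my $line = <memused_TMP_FILE>;
	close memused_TMP_FILE;
	# avoid $a and @b as names: $a/$b are reserved for sort()
	my @fields = split(' ', $line);
	return $fields[22];  # field 23 of /proc/PID/stat is vsize
}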

print "resp. lngth\tMem. used\tChange\tURL retrieved\n";
while (<STDIN>)
{
	chomp $_;
	$length = &geturllength($_);
	sleep 5; # wait for proc file to be updated?
	$used = &memused();
	$delta = $used - $lastused;
	print "$length\t$used\t$delta\t$_\n";
	$lastused = $used;
}

-----------------

results from arbitrarily selected web pages:

resp. lngth	Mem. used	Change	URL retrieved
204210	8081408	8081408	http://www.iawa.org/members.html
119468	9183232	1101824	http://www.iaw.on.ca/~fridguy/cgi-bin/db.cgi
345981	11227136	2043904	http://www.ibabowl.com/LocalData.htm
123037	11137024	-90112	http://www.ibac.org/Bulletins/ibac_b00-2.htm
2641177	82894848	71757824	http://www.furman.edu/admin/alumni/registry/visitors2.html
2580581	82657280	-237568	http://www.furman.edu/admin/alumni/registry/visitors.html
164452	75087872	-7569408	http://www.i-base.org.uk/publications/bulletins/htb2/htb2.html
152871	74932224	-155648	http://www.ibasis.net/news/pr01302001.htm
463743	75554816	622592	http://www.ibasis.net/news/pr07192000a.htm
100676	74719232	-835584	http://www.ibat.org/Vend2.htm
170515	74891264	172032	http://www.ibb.hr/komponente.html
188279	74809344	-81920	http://www.ibcmc.com/browsedb2.asp
123066	74866688	57344	http://www.ibcsports.com/west_virginia_state_bb_2_27.htm
Error retrieving http://www.ibegcom.com/company.htm
120	74620928	-245760	http://www.ibegcom.com/company.htm
Error retrieving http://www.ibegcom.com/units.htm
120	74620928	0	http://www.ibegcom.com/units.htm
159397	74780672	159744	http://www.iberbyte.es/iberbyte/F_Productos_todos.html
140987	74764288	-16384	http://www.ibertel.com/atlantis/tarcon.html
199722	75022336	258048	http://www.ibfnet.de/katalog/software/softwarekommunikation.htm
125819	74870784	-151552	http://www.ib.hu-berlin.de/~wumsta/uk/plan.html
103928	74723328	-147456	http://www.ibia.org/news.htm
135435	74891264	167936	http://www.ibia.org/policy.htm
121747	75218944	327680	http://www.ibiblio.org/london/agriculture/faqs/1/msg00027.html
200740	75657216	438272	http://www.ibiblio.org/london/permaculture/mailarchives/sanet2/maillist.html
149005	75476992	-180224	http://www.ibiblio.org/london/permaculture/mailarchives/sanet2/threads.html
121177	75587584	110592	http://www.ibiblio.org/pub/academic/agriculture/agronomy/AGMODELS-L/199602xx.agm.html
197070	75722752	135168	http://www.ibiblio.org/pub/academic/agriculture/agronomy/AGMODELS-L/log9503.agmodels-l.html
117475	75325440	-397312	http://www.ibisnet.org/200102/index.html
106425	75325440	0	http://www.iblcham.ch/promo/market.htm
208249	75743232	417792	http://www.ibl.com/worldinfo/appc.html
203534	75530240	-212992	http://www.ibl.com/writerinfo/caribbean/dominicanrepublic.htm
134846	75567104	36864	http://www.ibmlink.ibm.com/cgi-bin/master?xu=guest&xp=&xh=logon&request=announcements&parms=G_294-519
100902	75427840	-139264	http://www.ibmlink.ibm.com/usalets&parms=H_200-288
117587	75325440	-102400	http://www.ibmlink.ibm.com/usalets&parms=H_299-023
161490	75489280	163840	http://www.ibo-ny.com/members3.htm
135041	75325440	-163840	http://www.ibo-ny.com/members4.htm
110164	75436032	110592	http://www.ibo-ny.com/members.htm
115960	75563008	126976	http://www.ibpinetsp.com.br/rede/not_informe.html
147154	75845632	282624	http://www.ibpmt.com/search_0.htm
103657	75325440	-520192	http://www.ibrc.indiana.edu/affiliates.html
133738	75460608	135168	http://www.i-b-r.org/ir00020b.htm
128832	75325440	-135168	http://www.ibss.iuf.net/common/irsbabs.html
147379	75575296	249856	http://www.ibt.ku.dk/nsfk/Newsletter/nk242.html
121759	75325440	-249856	http://www.ibt.ku.dk/nsfk/Newsletter/nk243.htm
151322	75755520	430080	http://www.ibt.ku.dk/nsfk/Newsletter/nk251.htm
111194	75452416	-303104	http://www.ibt.ku.dk/nsfk/Newsletter/nk252.htm
140327	75730944	278528	http://www.ibt.ku.dk/nsfk/newsletter/nk253.htm
142067	75325440	-405504	http://www.ibt.ku.dk/nsfk/newsletter/nk261.htm
189773	75722752	397312	http://www.ibunka.com/translation/dataj.html
107048	75325440	-397312	http://www.ibuyer.net/rate_list.html?cid=338
271704	76124160	798720	http://www.ibw.com.ni/~chiste/
134602	75403264	-720896	http://www.icamo.ind.br/revend_sudeste.htm
118548	75444224	40960	http://www.icann.org/correspondence/cerf-testimony-08feb01.htm
136866	75583488	139264	http://www.icann.org/registrars/accreditation-qualified-list.html
2595444	80711680	5128192	http://www.icann.org/tlds/africa1/APPLICATION  AND REGISTRY OPERATOR'S PROPOSAL.htm

