Lessons learned: writing a linkcheck script w/ LWP

Phil Mitchell
February 20, 2001 14:21
I recently wrote a script to validate a list of about 10,000 URLs that are
embedded in the Harvard Library catalog. LWP out of the box does fine on the
vast majority of these, but it misses a few percent -- which in my case
added up to hundreds of spurious bad-URL reports. Here is what I learned
while chasing down that few percent; I thought others might find it useful:

1. As I have previously posted to this list, there is some kind of
interaction between Solaris and certain HTTP servers by which the
termination character of the HTTP response is dropped. LWP then sits waiting
for the termination character until it times out, so you need some way to
recover what is already in the response buffer at that point. What I did was
use Net::Telnet in these cases to re-send the GET, b/c Telnet exposes the
input_log even when it times out.
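
A minimal sketch of the same idea, for illustration only. It swaps
Net::Telnet for a plain core-Perl socket (IO::Socket::INET plus IO::Select),
so it is not the script described above. The names build_request,
get_partial and status_from_partial are hypothetical. The point is that a
timeout is treated as "done reading", and whatever bytes arrived are kept:

```perl
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

# Build a bare HTTP/1.0 GET (assumption: a Host header is sent as well).
sub build_request {
    my ($host, $path) = @_;
    return "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
}

# Send the GET and collect bytes until EOF, or until nothing arrives for
# $timeout seconds. A timeout is not an error here: whatever was received
# so far is returned. Returns undef only if the connection itself fails.
sub get_partial {
    my ($host, $port, $path, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout,
    ) or return;
    $sock->syswrite(build_request($host, $path));

    my $sel = IO::Select->new($sock);
    my $buf = '';
    while (my @ready = $sel->can_read($timeout)) {
        my $n = sysread($sock, $buf, 4096, length $buf);
        last unless $n;    # 0 means EOF, undef means read error
    }
    close $sock;
    return $buf;
}

# Pull the status code out of a possibly truncated response.
sub status_from_partial {
    my ($buf) = @_;
    return $buf =~ m{\AHTTP/\d\.\d\s+(\d{3})} ? $1 : undef;
}
```

Something like status_from_partial(get_partial($host, 80, $path, 30)) would
then give a status code even when the server never terminates the response,
provided at least the status line arrived.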

2. It took me a while to realize that when you create a GET request using 
the HTTP module, the default is not HTTP/1.0. A fair number of spurious 
errors result from not using HTTP/1.0.

3. I spent some effort determining the best combination of timeout and 
retry parameters. My conclusions are that your agent timeout should be 
about 30 sec. Increasing it to, say, 60 sec doesn't really help, and it can 
add a lot to the running time of your script. OTOH, setting it as low as 10 
sec will cause a lot of spurious errors. It is important to spread retries
over a fairly long period -- preferably more than 24 hours. My script
currently rechecks errors about twenty times, spread over about 24 hours.
This gives a very high degree of protection from spurious error reports, at
the risk of increasing unreported errors; the tradeoff is inevitable.
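
The retry arithmetic can be sketched roughly as follows. Evenly spaced
rechecks are an assumption (the spacing is not stated above), and
retry_offsets and recheck are hypothetical names. The checker and the wait
function are passed in as callbacks, so the sketch does not depend on LWP:

```perl
use strict;
use warnings;

# Offsets (in seconds from the first failure) at which to recheck a URL:
# $retries attempts spread evenly over $span_seconds.
sub retry_offsets {
    my ($retries, $span_seconds) = @_;
    my $step = $span_seconds / $retries;
    return map { int($_ * $step) } 1 .. $retries;
}

# Recheck $url at each offset. $check->($url) returns true if the URL is
# reachable; $wait->($seconds) sleeps (or schedules) until the next attempt.
# Returns 1 as soon as one attempt succeeds, 0 if every attempt fails.
sub recheck {
    my ($url, $check, $offsets, $wait) = @_;
    my $elapsed = 0;
    for my $offset (@$offsets) {
        $wait->($offset - $elapsed);
        $elapsed = $offset;
        return 1 if $check->($url);
    }
    return 0;
}
```

With 20 retries over 24 hours, retry_offsets(20, 24 * 3600) spaces the
attempts 4320 seconds (72 minutes) apart. A URL is reported bad only when
recheck returns 0, which is where the protection against spurious reports
comes from.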

4. Although the standard LWP user agent request will follow redirects for 
you, b/c of the problem mentioned in (1), I wound up using simple requests 
and handling all redirection myself. This proved non-trivial. There are two
different types of redirects and both can lead to page cycles (i.e. closed
loops of URLs). I identified five cases:
a. simple redirect using HTTP header;
b. redirect to same page, using HTTP header, to set cookies;
c. simple redirect using <meta http-equiv="refresh"...> tag;
d. refresh to same page using <meta http-equiv="refresh"...> tag;
e. refresh to a series of pages using <meta http-equiv="refresh"...> tag.
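
One way the five cases could be handled in a single loop is sketched below.
This is an illustration, not the original script. The names
meta_refresh_target and follow are hypothetical, and so is the $fetch
callback, which stands in for a single non-redirecting request and returns
(status, Location header, body). The sketch assumes redirect targets are
absolute URLs and that http-equiv precedes content in the meta tag. To let
case (b) through, each URL may be visited twice before it counts as a cycle:

```perl
use strict;
use warnings;

# Return the target of a <meta http-equiv="refresh"> tag, or nothing if the
# page has none. A refresh with no url= part reloads the current page.
sub meta_refresh_target {
    my ($html, $current) = @_;
    return unless $html =~ m{<meta\s+http-equiv\s*=\s*["']?refresh["']?\s+
                             content\s*=\s*["']?\s*\d+\s*
                             (?:;\s*url\s*=\s*['"]?([^"'>\s]+))?}xi;
    return defined $1 ? $1 : $current;
}

# Follow header redirects and meta refreshes from $url.
# Returns (result, chain of URLs), where result is the final HTTP status,
# 'cycle', or 'too_many_hops'.
sub follow {
    my ($url, $fetch, $max_visits, $max_hops) = @_;
    $max_visits ||= 2;     # second visit allowed for cookie-setting redirects
    $max_hops   ||= 10;    # cap for long non-repeating refresh chains
    my %visits;
    my @chain = ($url);
    while (1) {
        return ('too_many_hops', @chain) if @chain > $max_hops;
        return ('cycle', @chain) if ++$visits{$url} > $max_visits;
        my ($status, $location, $body) = $fetch->($url);
        my $next;
        if ($status =~ /\A30[1237]\z/ && defined $location) {
            $next = $location;                                # cases a, b
        }
        elsif ($status == 200) {
            $next = meta_refresh_target($body // '', $url);   # cases c, d, e
        }
        return ($status, @chain) unless defined $next;
        $url = $next;
        push @chain, $url;
    }
}
```

Relative targets would need resolving against the current URL before being
compared or fetched; that step is left out here.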