develooper Front page | perl.libwww | Postings from December 2001

Extracting Titles from Multiple URLs

From: Michael Bauer
Date: December 4, 2001 00:13
Subject: Extracting Titles from Multiple URLs
Message ID: Pine.LNX.4.10.10112031830520.6011-100000@proxima.michaelbauer.com

Hi.  I'm trying to take a list of URLs:

  www.foo.com
  www.bar.edu
  www.baz.gov

and get the titles.  The problem is that some of the URLs don't exist.  From
what I understand, a low timeout on the user agent doesn't take effect until
after a connection has been made and data is being transferred, so a request
for a non-existent host can still hang for quite a while before it times out.
Here's the code I was using (copied from the LWP examples):

<code>
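#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use URI::Heuristic;

# Input file: one host name or URL per line.
my $ifile = shift(@ARGV) or die "usage: $0 file\n";

open(my $in, '<', $ifile) or die "can't open $ifile - $!\n";

$| = 1;    # unbuffered output, so each URL is printed before its fetch starts

# One user agent is enough for all requests.
my $ua = LWP::UserAgent->new();
$ua->agent("LoverlyBrower");
$ua->timeout(10);

while (my $raw_url = <$in>) {
    chomp $raw_url;                  # strip the newline only (chop removes any last character)
    next unless length $raw_url;     # skip blank lines
    my $url = URI::Heuristic::uf_urlstr($raw_url);
    print $url."\t";
    my $req = HTTP::Request->new(GET => $url);
    $req->referer("http://www.toto.oz");
    my $response = $ua->request($req);
    if ($response->is_error()) {
        print $response->status_line."\n";
    } else {
        my $title = $response->title();
        print((defined $title ? $title : "")."\n");   # some pages have no title
    }
}

close($in);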
</code>

Should I wrap this code in something that first checks whether the host can
be found, before trying to get the title from a web site running on that
host?
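
For illustration, a pre-check of that kind might look roughly like the sketch
below.  It is only a sketch under some assumptions: the helper names host_of
and host_resolves are made up for this example, the host is pulled out of the
entry with a simple pattern rather than a full URL parser, and the lookup uses
inet_aton from the core Socket module, which only covers IPv4 and goes through
the system resolver, so the lookup itself can still block for as long as the
resolver is configured to wait.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Socket qw(inet_aton);

# Hypothetical helper: pull the host part out of an entry such as
# "www.foo.com" or "http://www.foo.com/index.html".
# Simplification: userinfo ("user@host") is not handled.
sub host_of {
    my ($entry) = @_;
    return undef unless defined $entry;
    my ($host) = $entry =~ m{^(?:[a-z][a-z0-9+.-]*://)?([^/:?#\s]+)}i;
    return $host;
}

# Hypothetical helper: returns 1 if the name resolves to an IPv4 address
# (or already is a dotted-quad address), 0 otherwise.
sub host_resolves {
    my ($host) = @_;
    return 0 unless defined $host && length $host;
    return defined inet_aton($host) ? 1 : 0;
}

# Report on each entry given on the command line.
for my $entry (@ARGV) {
    my $host = host_of($entry);
    if (host_resolves($host)) {
        print "$entry\tresolves\n";
    } else {
        print "$entry\tno such host\n";
    }
}
```

In the loop of the script above, the same check could be done on each line
before building the request, printing a "no such host" message and moving on
to the next line when host_resolves returns 0.  A host that resolves may still
refuse or ignore connections, so this would only filter out the names that do
not exist at all.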

--------------------------------------------------------------------------
Michael Bauer      http://www.michaelbauer.com      bauer@michaelbauer.com


