Front page | perl.libwww | Postings from January 2002

Re: simple robot?

From:
Robert Barta
Date:
January 6, 2002 14:43
Subject:
Re: simple robot?
Message ID:
20020107083540.B6976@namod.qld.bigpond.net.au
On Sun, Dec 30, 2001 at 07:05:03PM +0100, Matej Kovacic wrote:
> I have a question... does anyone have - and is willing to give - a program
> which gets the URL of a website as an input parameter, and then builds a tree or
> lists all HTML files within that site.

Yes, I have written a module I tentatively called WWW::Analyze; it subclasses
WWW::Robot. It never made it onto CPAN because it is fairly undocumented (or wrongly
documented) and has no decent test suite. Yet. Volunteers welcome.
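The subclassing is nothing magic, by the way. A bare WWW::Robot script (no
WWW::Analyze needed) looks roughly like this; the hook names and the required
NAME/VERSION/EMAIL attributes are from WWW::Robot's documentation as I remember
it, so double-check them before relying on this:

```perl
# Rough sketch of driving WWW::Robot directly.  The hook names
# ('follow-url-test', 'invoke-on-contents') and the callback
# signatures are assumptions taken from WWW::Robot's docs.
use strict;
use WWW::Robot;

my $robot = WWW::Robot->new(
    NAME    => 'analyze-sketch',
    VERSION => '0.01',
    EMAIL   => 'you@example.org',   # placeholder contact address
);

# Stay on one host.
$robot->addHook('follow-url-test', sub {
    my ($robot, $hook, $url) = @_;
    return $url->host eq 'www.example.org';
});

# Do something with each fetched page.
$robot->addHook('invoke-on-contents', sub {
    my ($robot, $hook, $url, $response, $structure) = @_;
    print $url, "\n";
});

$robot->run('http://www.example.org/');
```

WWW::Analyze mostly adds bookkeeping and reporting on top of hooks like these.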

NAME
WWW::Analyze - Perl extension for web site analysis

SYNOPSIS
use WWW::Analyze;

my $a = new WWW::Analyze ();
$a->run ('http://www.example.org/');

DESCRIPTION
WWW::Analyze is a specialized robot that analyzes a
given web site for inconsistencies and also gathers general
statistics (see the STATISTICS section). When started, the
robot iterates over the site (with a given set of pre-
defined policies) and gathers information about the
pages and the site structure. This structure can then be
output.

http://people.telecoma.net/rho/cpan/
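If you only need the core of it (fetch a page, extract links, recurse within the
same site), that part can be sketched with stock LWP::UserAgent, HTML::LinkExtor
and URI from CPAN; those module names are real, but the code itself is an
untested sketch:

```perl
# Untested sketch: a tiny same-host crawler.  LWP::UserAgent,
# HTML::LinkExtor and URI are real CPAN modules; the rest is
# illustrative.
use strict;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my %seen;

# Fetch $url, print it, and recurse into same-host HTML links
# until $depth runs out.
sub crawl {
    my ($ua, $url, $depth) = @_;
    return if $depth < 0 || $seen{$url}++;

    my $res = $ua->get($url);
    return unless $res->is_success
        && $res->content_type eq 'text/html';
    print "$url\n";

    my @links;
    my $p = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' && $attr{href};
    });
    $p->parse($res->content);

    for my $link (@links) {
        my $abs = URI->new_abs($link, $url);
        next unless $abs->scheme eq 'http'
            && $abs->host eq URI->new($url)->host;
        crawl($ua, $abs->as_string, $depth - 1);
    }
}

# Only crawl when a start URL is given on the command line.
if (@ARGV) {
    my $ua = LWP::UserAgent->new;
    crawl($ua, $ARGV[0], 2);    # depth 2; raise it for deeper trees
}
```

A real robot should of course also honour robots.txt and throttle itself, which
is exactly what WWW::Robot does for you.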

> If possible, to an arbitrary depth.

Yes, it can do that, although the collected data gets pretty big fast.
As this was written to check student web pages for plagiarism, there is
even an option to check every page against Google and find similarities
using LCS (longest common subsequence). Perl/CPAN is simply amazing.
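The similarity idea itself is small enough to show inline. Below is a plain
dynamic-programming LCS over word lists, with a similarity score defined (my
choice here, not necessarily what WWW::Analyze does) as the shared fraction of
the shorter document; Algorithm::Diff on CPAN does the LCS part faster:

```perl
# Sketch of LCS-based similarity between two texts.
use strict;

# Length of the longest common subsequence of two word lists
# (array refs), via the classic DP table.
sub lcs_length {
    my ($a, $b) = @_;
    my @t = map { [ (0) x (@$b + 1) ] } 0 .. @$a;
    for my $i (1 .. @$a) {
        for my $j (1 .. @$b) {
            $t[$i][$j] = $a->[$i-1] eq $b->[$j-1]
                ? $t[$i-1][$j-1] + 1
                : ($t[$i-1][$j] > $t[$i][$j-1]
                    ? $t[$i-1][$j] : $t[$i][$j-1]);
        }
    }
    return $t[-1][-1];
}

# Similarity as the shared fraction of the shorter document.
sub similarity {
    my ($x, $y) = @_;
    my @wx = split ' ', $x;
    my @wy = split ' ', $y;
    my $min = @wx < @wy ? scalar @wx : scalar @wy;
    return $min ? lcs_length(\@wx, \@wy) / $min : 0;
}
```

For real plagiarism checking you would first normalize case and markup, then
compare every pair of pages (or a page against Google hits) with something like
this.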

\rho

PS: I'm open to suggestions for a better name!