
Re: simple robot?

From: Robert Barta
Date: January 6, 2002 14:43
Subject: Re: simple robot?
Message ID: 20020107083540.B6976@namod.qld.bigpond.net.au
On Sun, Dec 30, 2001 at 07:05:03PM +0100, Matej Kovacic wrote:
> I have a question... does anyone have - and is willing to give - a
> program which takes the URL of a website as an input parameter and then
> builds a tree or lists all HTML files within that site?

Yes. I have written a module, tentatively called WWW::Analyze, which
subclasses WWW::Robot. It never made it onto CPAN because it is fairly
undocumented (or wrongly documented) and has no decent test suite. Yet.
Volunteers welcome.
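
Subclassing WWW::Robot mostly means registering callbacks: the base
class fetches the pages and fires hooks, and the subclass decides which
URLs to follow and what to record. A minimal sketch of that pattern
(hook names and callback signatures are from memory of WWW::Robot's
POD, so treat them as assumptions and check the docs):

  use strict;
  use warnings;
  use WWW::Robot;

  # WWW::Robot requires the robot to identify itself.
  my $robot = WWW::Robot->new(
      NAME    => 'analyze-demo',
      VERSION => '0.01',
      EMAIL   => 'you@example.org',
  );

  # Policy hook: only follow URLs below the start point.
  $robot->addHook('follow-url-test', sub {
      my ($robot, $hook, $url) = @_;
      return $url =~ m!^http://www\.example\.org/!;
  });

  # Gather per-page information as each page arrives.
  $robot->addHook('invoke-on-contents', sub {
      my ($robot, $hook, $url, $response) = @_;
      printf "%s: %d bytes\n", $url, length $response->content;
  });

  $robot->run('http://www.example.org/');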

NAME
       WWW::Analyze - Perl extension for web site analysis

SYNOPSIS
         use WWW::Analyze;

         my $a = WWW::Analyze->new();

         $a->run ('http://www.example.org/');


DESCRIPTION
       WWW::Analyze is a specialized robot that analyzes a given
       web site for inconsistencies and also gathers general
       statistics (see the STATISTICS section). When started, the
       robot iterates over the site (with a given set of
       predefined policies) and gathers information about the
       pages and the site structure. This structure can then be
       output.
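
For readers who just want the behaviour the original question asks for
- listing the HTML pages of one site - the core crawl loop can be
sketched with stock CPAN modules (LWP::UserAgent, HTML::LinkExtor,
URI). This is only an illustration, not WWW::Analyze's actual code:

  use strict;
  use warnings;
  use LWP::UserAgent;
  use HTML::LinkExtor;
  use URI;

  my $start = URI->new(shift @ARGV || 'http://www.example.org/');
  my $ua    = LWP::UserAgent->new(agent => 'simple-robot/0.1');
  my (%seen, @queue);
  push @queue, $start;

  while (my $url = shift @queue) {
      next if $seen{$url}++;
      my $resp = $ua->get($url);
      next unless $resp->is_success
               && $resp->content_type eq 'text/html';
      print "$url\n";

      # The base URL argument makes HTML::LinkExtor return the
      # extracted links as absolute URIs.
      my $extor = HTML::LinkExtor->new(undef, $url);
      $extor->parse($resp->content);
      for my $link ($extor->links) {
          my ($tag, %attr) = @$link;
          next unless $tag eq 'a' && $attr{href};
          my $abs = URI->new($attr{href})->canonical;
          next unless $abs->scheme eq 'http';
          $abs->fragment(undef);              # ignore #anchors
          push @queue, $abs if $abs->host eq $start->host;
      }
  }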

I have uploaded it to

   http://people.telecoma.net/rho/cpan/

> If possible, to arbitrary depth.

Yes, it can do that, although the collected data gets big pretty fast.
As this was written to check student web pages for plagiarism, there is
even an option to check every page against Google and find similarities
using LCS (longest common subsequence). Perl/CPAN is simply amazing.
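
To give a flavour of the LCS part: with Algorithm::Diff from CPAN, a
crude word-level similarity score between two pages is only a few
lines. The scoring below is my own illustration, not necessarily what
WWW::Analyze computes:

  use strict;
  use warnings;
  use Algorithm::Diff qw(LCS);

  # Ratio of the longest common word subsequence to the shorter
  # document; 1.0 means one text is a subsequence of the other.
  sub similarity {
      my ($text_a, $text_b) = @_;
      my @a = split /\s+/, lc $text_a;
      my @b = split /\s+/, lc $text_b;
      my @common = LCS(\@a, \@b);
      my $min = @a < @b ? scalar @a : scalar @b;
      return $min ? @common / $min : 0;
  }

  print similarity('the quick brown fox jumps',
                   'the slow brown fox sleeps'), "\n";   # prints 0.6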

\rho

PS: I'm open to suggestions for a better name!
