
Still can't extract data using HTML::TokeParser

From:
Daniel Falkenberg
Date:
February 24, 2002 20:00
Subject:
Still can't extract data using HTML::TokeParser
Message ID:
3ACA70B144BD6D45B994CAC2CA4B9F98017CE5@opal.vintek.local
Hey all,

Just wondering why I still can't get HTML::TokeParser to either download
the page I am looking for or at least store the HTML from the requested
page.  I know I could quite easily do this if I used HTML::TableExtract,
except the data I want is only about 3 lines of HTML and there are no
tables at all in there.  Therefore I cannot use HTML::TableExtract.  So
I was wondering how I would go about extracting data from the following
HTML...

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML><HEAD><TITLE>Get all data from H1</TITLE> </HEAD><BODY
BGCOLOR="FFFFFF"><h1>I want all if this data extracted from heading 1
(h1)</h1> </BODY></HTML>

So using the following code I figured it would be really simple to
extract the data I wanted.  Just a note that the pages I want will
change with the different CGI parameters I pass to the requested URL.
Does anyone have any ideas?


use LWP::UserAgent;
use HTML::TableExtract;
use HTML::TreeBuilder;
use HTML::TokeParser;
use CGI qw(:all);
use CGI::Carp qw(fatalsToBrowser);

my $ua = LWP::UserAgent->new;

$inputSite = "<URL HERE>";
$address = "http://" . $inputSite;
$request = HTTP::Request->new('GET', $address);
$response = $ua->request($request);
my $found = 0;

my $content = $response->content;
$p = HTML::TokeParser->new($content) || die "Can't open: $!";
while ($stream->get_tag("h1")) { $data = get_trimmed_text("/h1");}

Thx,

Dan
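
A minimal sketch of how the script in the message above might be adjusted, offered as one possible reading rather than a definitive fix. It assumes three things go wrong in the original: HTML::TokeParser->new treats a plain scalar as a file name, so the page content has to be passed by reference; the parser is stored in $p but the loop calls $stream; and get_trimmed_text is a method that has to be called on the parser. The h1_texts helper, the command-line URL argument, and the is_success check are hypothetical additions, not part of the original script.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser;

# Return the trimmed text of every <h1> element in an HTML string.
sub h1_texts {
    my ($html) = @_;

    # Passing a reference makes TokeParser parse the string itself;
    # a plain scalar would be treated as the name of a file to open.
    my $p = HTML::TokeParser->new(\$html)
        or die "Can't create parser: $!";

    my @found;
    while ($p->get_tag('h1')) {
        # get_trimmed_text is a method, so it is called on the parser.
        push @found, $p->get_trimmed_text('/h1');
    }
    return @found;
}

# Hypothetical usage: the full URL, CGI parameters included, is the
# first command-line argument.
if (my $address = shift @ARGV) {
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($address);

    # Report a failed download instead of parsing an error page.
    die "Can't fetch $address: ", $response->status_line, "\n"
        unless $response->is_success;

    print "$_\n" for h1_texts($response->decoded_content);
}
```

Keeping the parsing in its own sub means it can be tried on a saved copy of the page before the download step is involved at all.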

-----Original Message-----
From: Chris Ball [mailto:chris@void.printf.net]
Sent: Friday, 22 February 2002 9:49 PM
To: Daniel Falkenberg
Cc: beginners@perl.org
Subject: Re: What would take care of this?...


>>>>> "Daniel" == Daniel Falkenberg <dan@vintek.net> writes:

    Daniel> Would I now have to go ahead and use HTML::Parser or
    Daniel> something of a similar nature to extract headings?

Yeah, go with HTML::TokeParser.

    Daniel> <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
    Daniel> <HTML><HEAD><TITLE>Get all data from H1</TITLE> </HEAD><BODY
    Daniel> BGCOLOR="FFFFFF"><h1>I want all if this data extracted from
    Daniel> heading 1 (h1)</h1> </BODY></HTML>

while ($stream->get_tag("h1")) { $data = $stream->get_trimmed_text("/h1"); }

(Also see perldoc HTML::TokeParser, once it's installed.)
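
As a rough illustration of that loop, here is a small self-contained sketch run against the sample document quoted above. It assumes $stream is an HTML::TokeParser object built from a reference to a string, and that get_trimmed_text is called as a method on it. The heredoc and variable names are only for demonstration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# The sample document from the message, held in a string.
my $html = <<'HTML';
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML><HEAD><TITLE>Get all data from H1</TITLE> </HEAD><BODY
BGCOLOR="FFFFFF"><h1>I want all if this data extracted from heading 1
(h1)</h1> </BODY></HTML>
HTML

# \$html makes the parser read the string itself rather than a file.
my $stream = HTML::TokeParser->new(\$html)
    or die "Can't create parser: $!";

my $data;
while ($stream->get_tag("h1")) {
    # Text up to the closing </h1>, with whitespace runs collapsed.
    $data = $stream->get_trimmed_text("/h1");
}

print "$data\n";   # prints: I want all if this data extracted from heading 1 (h1)
```

Note that the line break inside the heading comes back as a single space, since get_trimmed_text collapses runs of whitespace and trims both ends.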

- Chris.
-- 
$a="printf.net"; Chris Ball | chris@void.$a | www.$a | finger: chris@$a
         "In the beginning there was nothing, which exploded."

