develooper Front page | perl.libwww | Postings from February 2001

Trouble understanding how HTML::TokeParser works

From: Gary Nielson
Date: February 12, 2001 21:00
Subject: Trouble understanding how HTML::TokeParser works
Message ID: Pine.LNX.4.21.0102122353500.1038-100000@nielson.dynip.com

I can get by programming in Perl, but my head hurts trying to
understand how object-oriented modules such as TokeParser work.
Basically, I want to parse an HTML file where each entry looks like
this:

<DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
<A HREF="/rc/news/docs/07073706.htm">Junk DNA may not be such junk,
genome studies find</A>
</B></FONT></DT>
<DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>WASHINGTON -
&#0151; The first in-depth look into the
human genome shows it is much more complicated.. <P>
</FONT></DD>

I want to print a file that has the URL, the headline and the summary
paragraph, separated by a double-pipe (||) delimiter, as in:

/rc/news/docs/07073706.htm||Junk DNA may not be such junk, genome
studies find||The first in-depth look into the
human genome shows it is much more complicated.

I have cobbled together a script in two steps, following the man page
examples for TokeParser and some online Web page examples. But as you
shall see, there are problems:

use CGI;
use LWP::Simple;
use HTML::TokeParser;
$webPage = "digestChunk.htm";
&head;
&font;

sub head {
    # parse and output URLs and headlines
    $p = HTML::TokeParser->new(shift || "digestChunk.htm");
    while (my $token = $p->get_tag("a")) {
        my $url = $token->[1]{href} || "-";
        my $text = $p->get_trimmed_text("/a");
        print "$url\t$text\n";
    }
}

sub font {
    # parse and output summaries
    $parser = HTML::TokeParser->new("digestChunk.htm");
    while ($parser->get_tag("font")) {
        print $parser->get_text . "\n\n";
    }
}

The big problem is that I do not know how to parse the entire document
in one go. Each subroutine finds the text within specific tags. But
what if, as in this case, the tags are in separate parts of the
document? How do I "splice" the results together? My output is like so:

/rc/news/docs/07076556.htm	Bush Urged to Roll Back Patients' Privacy
Rules
/rc/news/docs/07076395.htm	Four Dead After Texas Standoff

WASHINGTON -  The first in-depth look into the
human genome shows it is much more complicated than the clear
blueprint of how to make a human that scientists had hoped for.. 

I tried sucking in the entire document, which begins with a <DL> and
ends with a </DL>, but that did not work. I also tried using <DD> as
the tag to parse by, but that did not work either. This I don't
understand. Why wouldn't TokeParser give me everything between the
<DD>...</DD> tags? I wound up using the "font" tag, but I don't
understand why that works the way it does either. It does what I would
like, pulling in the summary paragraph, but "font" is also used before
the URL and headline line. Does it not show up there because it is all
on one line with no text to parse on that same line?
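
One possible reading of the behavior above, offered as a sketch and
not as a tested fix: per the HTML::TokeParser documentation, get_text
called with no argument returns only the text up to the next tag, so
calling it right after <DD> (or after the first <FONT>, which is
followed immediately by <B>) yields an empty string. Passing an end
tag such as "/dd" asks for all the text up to that tag instead. Using
a single parser object for both lookups keeps each summary paired with
its link. The inline sample markup and the @records array are
assumptions made for illustration; a real run would pass
"digestChunk.htm" to new() instead. Stripping the "WASHINGTON -"
dateline is left out.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Assumption: sample markup copied from the entry shown earlier in this post.
my $html = <<'HTML';
<DL>
<DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
<A HREF="/rc/news/docs/07073706.htm">Junk DNA may not be such junk,
genome studies find</A>
</B></FONT></DT>
<DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>WASHINGTON -
&#0151; The first in-depth look into the
human genome shows it is much more complicated.. <P>
</FONT></DD>
</DL>
HTML

# A reference to a scalar is parsed as a literal document;
# a plain string would be treated as a file name.
my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my @records;    # hypothetical name: one "url||headline||summary" line per entry
while (my $tag = $p->get_tag("a")) {
    my $url      = $tag->[1]{href} || "-";
    my $headline = $p->get_trimmed_text("/a");

    # The parser remembers its position, so the next <dd> found
    # is the one that follows this link.
    $p->get_tag("dd") or last;

    # With an end tag given, text is collected up to </dd>,
    # reading through the <font> and <p> tags in between.
    my $summary = $p->get_trimmed_text("/dd");

    push @records, join("||", $url, $headline, $summary);
}

print "$_\n" for @records;
```

The main design choice is one loop over one parser, instead of two
subroutines that each reopen the file and so cannot tell which summary
belongs to which headline.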

If I sound confused, I am :) I am taking this opportunity with
TokeParser to try using modules more, rather than writing procedural
code, which I have gotten semi-good at (graduated to the 2nd grade!).
But this is slow to sink in. Any help explaining how this works and
how I can get this script to do what I want would be much appreciated.

Gary








