Re: use of HTML::Parser, HTML::FormatText

From: drieux
Date: March 31, 2002 12:15
Subject: Re: use of HTML::Parser, HTML::FormatText
Message ID: 0986CE41-44E4-11D6-A6D7-0030654D3CAE@wetware.com

On Sunday, March 31, 2002, at 11:50 , M z wrote:

> hello,
>
> in conjunction, I was looking into this module HTML to
> take out all the HTML I have in several files.
> Namely, the data I want is between tags
> <tag>data</tag>

I would look at getting the HTML::TreeBuilder module. It sounds
like you need to get a copy of nmake to build it, or find a ppm
that installs these modules where they belong.

As for code illustrations, try:

http://www.wetware.com/drieux/src/unix/perl/OK.UglyCode.txt

That is an illustration of the full-on wackaDoodle code, where I was
working on an 'all singing, all dancing' CGI and command line tool.

You would want to look at:

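  sub parseTreeBack {

      ....

      # $res comes from the elided code above
      my $tree = HTML::TreeBuilder->new;    # empty tree
      $tree->parse($res->content);
      $tree->eof;                           # signal end of input to the parser

      # every <title> element in the document
      my @title = $tree->look_down("_tag", "title");

      my $page = '';

      foreach my $t (@title) {
          foreach my $item_r ( $t->content_refs_list ) {
              next if ref $$item_r;         # skip child elements, keep plain text
              $page .= "$$item_r \n";
          }
          $page .= "\n";
      }

      ....
  }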

That basic structure is how I get the 'content' of the 'title'
out of the HTML.

I repeat that basic set of tricks to parse out the tables and rows
for other stuff, since from a page like this:

  <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
  <HTML><HEAD><TITLE>List Grovellor Says</TITLE>
  </HEAD><BODY><H1 ALIGN="center">List Grovellor Says</H1><hr>
  <TABLE WIDTH="60%" ALIGN="center">
  <TR VALIGN="TOP"><TH ALIGN="center" COLSPAN="2"> = Frodo found in hobbits =</TH></TR>
  <TR VALIGN="TOP"><TD>Frodo Baggins</TD> <TD>frodo@shire.com</TD></TR>
  </TABLE><br><hr align="center" width="50%"><br></BODY></HTML>

I need to pull out the fact that I found "Frodo" on the hobbits
mailing list, and that he has the email address frodo@shire.com.
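
For the rows, a minimal sketch of the same look_down trick might look
like the following. The helper name table_rows and the trimmed-down
sample page are made up for illustration; they are not taken from
OK.UglyCode.txt, and the real code may well do this differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample page, trimmed down from the one quoted above (illustrative only).
my $html = <<'END_HTML';
<HTML><HEAD><TITLE>List Grovellor Says</TITLE></HEAD>
<BODY><TABLE>
<TR><TH COLSPAN="2"> = Frodo found in hobbits =</TH></TR>
<TR><TD>Frodo Baggins</TD> <TD>frodo@shire.com</TD></TR>
</TABLE></BODY></HTML>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # no more input

# Hypothetical helper: returns one array ref of cell texts
# for every <tr> that contains at least one <td>.
sub table_rows {
    my ($root) = @_;
    my @rows;
    foreach my $tr ( $root->look_down( '_tag', 'tr' ) ) {
        my @cells = map { $_->as_text } $tr->look_down( '_tag', 'td' );
        push @rows, \@cells if @cells;    # header-only rows are skipped
    }
    return @rows;
}

my @rows = table_rows($tree);
print join( ' | ', @$_ ), "\n" for @rows;

$tree->delete;    # free the tree when done
```

The header row is skipped simply because it has no <td> cells, so
what is left is the name and address pair.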


Which is to say, I found TreeBuilder simpler to use than trying
to work out the HTML::Parser and HTML::FormatText stuff directly.
It provides some 'class extensions', and the specific trick above
is bootlegged from the POD. But it works.
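
If the goal is only to strip the markup and keep the text, a rough
sketch that hands a TreeBuilder tree to HTML::FormatText could look
like this. The sample string and the margin values are assumptions
picked for the example, and how FormatText lays out a given page
(tables in particular) is worth checking against its POD.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Made-up sample input, for illustration only.
my $html = '<html><body><h1>List Grovellor Says</h1>'
         . '<p>Frodo Baggins, frodo@shire.com</p></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # no more input

# Margins are arbitrary choices for this sketch.
my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 50 );
my $text = $formatter->format($tree);

print $text;

$tree->delete;    # free the tree when done
```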

ciao
drieux
