Re: HTML::TreeBuilder, and lookdown tips for <P>

From: Sean M. Burke
Date: March 2, 2003 15:17
Subject: Re: HTML::TreeBuilder, and lookdown tips for <P>
Message ID: 5.1.0.14.1.20030302135559.00a0f650@mail.spinn.net

At 12:44 PM 2003-03-02 -0500, Ed Halley wrote:
>I've got a quick-n-dirty script almost working, which uses 
>HTML::TreeBuilder, et al, to find plain text paragraphs.  I hoped to get a 
>bunch of text from a number of sources, so I can't get too finicky about 
>each site's idiomatic use of HTML.  However, the <P> tag is so loose in 
>its semantics, it can be hard to see how I can get all the text I can.

Yup, I remember that problem from when I was doing Pod::HTML2Pod.  It's 
nasty.  I've tried writing general-purpose routines for implicating more P 
elements, but it's very tricky.  For example, consider parsing this:
   <blockquote>
   Foo
   <p>Bar
   <p>Baz
   </blockquote>

as if it were this:

   <blockquote>
   <p>Foo</p>
   <p>Bar</p>
   <p>Baz</p>
   </blockquote>

For some purposes and users, that's rightheaded and right.  For other 
purposes and users, it's surprising and scary -- two things that make me 
cringe when I think about putting code into a module and holding it up as 
The Solution.

>[...]Is there a good clean way of traversing to the "previous child",

You can check $element->left
(in scalar context)
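
A minimal sketch of that check, assuming the blockquote snippet from earlier in this message as input (the one-line HTML string and the @report array are made up for illustration). In scalar context, ->left returns the immediately preceding sibling, which may be an element object, a plain text string, or undef when there is none:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input: a text node followed by two unclosed <p> elements.
my $root = HTML::TreeBuilder->new_from_content(
    '<blockquote>Foo<p>Bar<p>Baz</blockquote>');

my @report;
foreach my $p ($root->find_by_tag_name('p')) {
    # Scalar assignment puts ->left in scalar context, so it returns
    # only the immediate left sibling (or undef if there is none).
    my $prev = $p->left;
    my $kind = !defined $prev ? 'nothing'
             : ref $prev      ? 'element <' . $prev->tag . '>'
             :                  "text '$prev'";
    push @report, $p->as_text . " is preceded by $kind";
}
print "$_\n" for @report;
$root->delete;    # free the tree when done
```

Text siblings come back as ordinary strings, not objects, so test with ref before calling any method on the result.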

>or tagging plain text ending with <P> as a bona-fide <P>text</P> span?

Well, you could always do something like this, to make text siblings of p's 
into p's themselves:

my %parents; # a hash used as a set
foreach my $p ($root->find_by_tag_name('p')) {
   my $parent = $p->parent or next; # skip a p with no parent
   $parents{$parent} = $parent;
}
foreach my $parent (values %parents) {
   foreach my $node (@{ $parent->content || next }) {
     # for each text node that has a p sister, replace the
     # node with a new paragraph containing it
     next if ref $node;
     my $para = HTML::Element->new('p',
       '_parent' => $parent, '_content' => [$node]);
     $node = $para;
   }
}

I'm just writing that code off the top of my head, and I'm not sure it'll 
work.  Also, using ->content and assigning directly to _parent and _content 
like this is sort of a "don't try this at home, kids" thing, not for any 
technical reason, but just because the content_list (etc) interface is 
friendlier in many ways.  But in this case, using ->content happens to be 
the easy way, since if you iterate over it with a for, the for variable is 
directly aliased to the node, so assignment alters the node in-place.
Plus I'm just in an old skool mood today.
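
For comparison, here is a rough sketch of the same idea using only the public interface (detach_content and push_content) rather than touching _parent and _content directly. The helper name wrap_stray_text, the sample HTML string, and the choice to skip whitespace-only text are all assumptions made for illustration, not part of HTML::Tree or of the code above:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# wrap_stray_text is a hypothetical helper name, not an HTML::Tree method.
sub wrap_stray_text {
    my ($root) = @_;
    my %parents;    # set of elements that have at least one <p> child
    foreach my $p ($root->find_by_tag_name('p')) {
        my $parent = $p->parent or next;    # a parentless p has no siblings
        $parents{$parent} = $parent;
    }
    foreach my $parent (values %parents) {
        # Unlink all children, wrap the bare text ones, reattach in order.
        my @kids = $parent->detach_content;
        foreach my $kid (@kids) {
            next if ref $kid;            # already an element
            next unless $kid =~ /\S/;    # assumption: leave blank text alone
            my $para = HTML::Element->new('p');
            $para->push_content($kid);
            $kid = $para;                # $kid aliases the slot in @kids
        }
        $parent->push_content(@kids);
    }
    return $root;
}

my $root = HTML::TreeBuilder->new_from_content(
    '<blockquote>Foo<p>Bar<p>Baz</blockquote>');
wrap_stray_text($root);
my @texts = map { $_->as_text } $root->find_by_tag_name('p');
print join(', ', @texts), "\n";
$root->delete;
```

This costs an extra pass over each parent's children, but push_content takes care of the parent links, so nothing has to be set by hand.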

So tell me, what kinda stuff are you doing with HTML::Tree ?  I'm always 
curious.

Here, I'll CC the libwww list, since sometimes there's not enough talk 
about HTML::Tree there.

--
Sean M. Burke    http://search.cpan.org/~sburke/



