develooper Front page | perl.libwww | Postings from February 2001

Hacking HTML::TreeBuilder and HTML::Element

From: Jason Henry Parker
Date: February 1, 2001 03:38
Subject: Hacking HTML::TreeBuilder and HTML::Element
Message ID: 87k87askg0.fsf@freezer.home

I'm working on a module that will be used to intelligently extract the
content from HTML pages like those on Slashdot, LWN, or CNN: sites that
use large tables to sandwich content between columns of mostly static
and uninteresting text.

I had great success with version 0.01, but am very unhappy with the
way I've designed it.  I've created a new class, HTML::Extract, whose
source file contains some additions to the HTML::Element class to add
a `weight' attribute with a getter/setter method and an add_weights()
method which calculates scores for the subtree it's called upon.
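
For illustration only, additions of that kind might look roughly like
the sketch below.  It is not the actual HTML::Extract code: the scoring
rule (a node's weight is the number of text characters beneath it) and
the _weight hash key are assumptions made up for this example.  Only
new(), push_content(), content_list() and tag() are existing
HTML::Element methods.

```perl
use strict;
use warnings;
use HTML::Element;

{
    # Reopen HTML::Element and add methods to it directly, which is
    # the approach the post describes (and is unhappy with).
    package HTML::Element;

    # Getter/setter for a hypothetical per-node weight.
    sub weight {
        my $self = shift;
        $self->{_weight} = shift if @_;
        return $self->{_weight};
    }

    # Assumed scoring rule: weight = total length of the text below
    # this node.  Recurses into child elements and stores each result.
    sub add_weights {
        my $self  = shift;
        my $total = 0;
        for my $child ($self->content_list) {
            $total += ref($child) ? $child->add_weights : length($child);
        }
        $self->weight($total);
        return $total;
    }
}

# Build a tiny row by hand: a short navigation cell and a longer
# content cell.
my $nav = HTML::Element->new('td');
$nav->push_content('Home');
my $story = HTML::Element->new('td');
$story->push_content('A much longer story body');
my $row = HTML::Element->new('tr');
$row->push_content($nav, $story);

$row->add_weights;
printf "%s: %d\n", $_->tag, $_->weight for $row, $nav, $story;
```

Under that assumed rule the cell holding the story gets a higher weight
than the navigation cell, which is the kind of signal an extract()
method could use to pick the best subtree.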

So far, so good, right?  Well, not really, because that means standard
usage of my module looks something like:

        my $x = HTML::Extract->new($html_input);
        $x->tree->add_weights();    # score every node in the parsed tree
        my $best = $x->extract();   # the highest-scoring subtree
        # operate on the HTML::Element referenced by $best

And so the user isn't presented with an everyday object, HTML::Element
is altered without being subclassed, and Extract.pm contains changes
to another module's code.  All in all, it's pretty unsatisfactory.

However, looking at TreeBuilder.pm, I see there is an internal
attribute which appears to control what sort of objects a TreeBuilder
object creates.  Great.  Subclass TreeBuilder, and we're away, except
that won't work very well either, because the Element-derived class
then really needs to have its weights calculated at object creation
time, which I'm not sure is possible without serious surgery to either
or both of these modules.

In short, I don't think I can do everything I want to by simply
subclassing or trivially altering HTML::TreeBuilder.  I can't subclass
HTML::Element without at least trivially altering HTML::TreeBuilder,
and I don't want to have to rewrite the excellent HTML::TreeBuilder
module's support for parsing not-so-tidy HTML.

Has anyone on the list been here before?

jason
-- 
``Just because one proposes a measure to prevent promotion
        of a risk-filled and controversial sexual behavior
                     doesn't make them divisive or bigoted.''
                                     -- Nicholas J. Yonker,
                    Concerned Citizens for Sound Education


