
Re: HTML::Parser callbacks

From: Gisle Aas
Date: April 23, 2001 14:59
Subject: Re: HTML::Parser callbacks
Message ID: lrk84b8eju.fsf@caliper.ActiveState.com

Cris Bailiff <c.bailiff@awayweb.com> writes:

> 
> 	I appreciate your work on HTML::Parser performance, and I think the
> 'ignore_tags' and 'report_tags' feature in 1.22 is a great idea, but I can't
> currently make use of it :-(. Here's the scenario:
> 
> I'm trying to edit just some tags in a page, so currently, I've got callbacks for
> start events and a default handler for everything else. The default handler just
> outputs the original text and returns. (Actually, sometimes I have text and/or
> end events, but I don't think it's relevant.)
> 
> When my start event gets called, it looks in a hash to see if it's interested in
> this tag. If it's not of interest, then the start event again just outputs the
> original text and returns. If the tag is 'of interest', then the handler performs
> its processing, outputs the modified tag text and then returns.
> 
> My problem is that if I set up 'report_tags' with just the 'interesting' tags,
> which could greatly speed up the processing by skipping callbacks I don't need,
> the 'skipped' text doesn't get collected for output by the default (or a text)
> handler, so the tag text is lost.
> 
> Would it be feasible to implement a parser option (like 'unbroken_text') to make
> the skipped tags be treated as 'text' and returned by the next text or default
> event? I haven't dug around inside Parser.xs (only skimmed the surface) so I'm
> not really in a position to have a go myself straight away, but if you want a
> tester ... :-)
> 
> Perhaps I missed something, and I can do it with the existing options? I couldn't
> get it to work so far but if you have any suggestions...

The only sensible thing I can think of would be an option that passes
ignored tags/elements to the default handler, but then you would get
exactly the same number of callbacks, and you might just as well do
the filtering in the start/end handlers themselves.
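
For comparison, filtering in the start handler itself might look
roughly like the sketch below.  The %interesting hash, the $out buffer
and the "<XXX>" replacement are hypothetical names chosen for
illustration, not taken from your code:

```perl
#!/usr/bin/perl -w
use strict;
use HTML::Parser;

my %interesting = (img => 1);   # hypothetical set of tags to rewrite
my $out = '';                   # collected output

my $p = HTML::Parser->new(api_version => 3);

# Start tags: replace the interesting ones, pass the rest through.
$p->handler(start => sub {
    my ($tagname, $text) = @_;
    $out .= $interesting{$tagname} ? "<XXX>" : $text;
}, "tagname,text");

# Everything else is copied through unchanged.
$p->handler(default => sub { $out .= shift }, "text");

$p->parse("<foo>\n<img>\n<bar>\n")->eof;
print $out;
```

Every tag still costs one callback here, which is the overhead
'report_tags' is meant to avoid.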

An alternative approach could be to not have any default handler, but
to just note the offset/length when you find tags you want to change.
Using this information you should be able to output the text not
covered by callbacks using substr.  Something like this:

#!/usr/bin/perl -w

my $str = <<'EOT';
<foo>
<img>
<bar>
EOT

use HTML::Parser;
# Only <img> tags generate start/end events.
my $p = HTML::Parser->new(report_tags => qw(img));

# End of the part of $str that has already been printed.
my $last_offset = 0;

# Print the untouched source text between $last_offset and $new_offset.
sub text_out {
    my $new_offset = shift;
    if ($new_offset > $last_offset) {
        print substr($str, $last_offset, $new_offset - $last_offset);
        $last_offset = $new_offset;
    }
}

# Start handler: flush the text before the tag, then print the replacement.
sub s_handler {
    my($tag, $attr, $offset, $length) = @_;
    text_out($offset);
    print "<XXX>";
    $last_offset = $offset + $length;  # skip over the original tag
}

$p->handler(start => \&s_handler, "tag,attr,offset,length");
$p->parse($str)->eof;
text_out(length($str));  # get the rest
__END__


