Re: documentation of ignore_tags in HTML::Parser 3.19_94

From: Gisle Aas
Date: April 2, 2001 12:35
Subject: Re: documentation of ignore_tags in HTML::Parser 3.19_94
Message ID: lrae5z6qp4.fsf@caliper.ActiveState.com

Nathaniel Irons <beppo@bumppo.net> writes:

> I updated to the latest available HTML::Parser, hoping the new
> ignore_tags() method would alleviate some distressingly rare
> interruptions by rogue P tags in my documents:
> 
>     <b>merrily merrily merrily until <p> none of this is seen by
>     the text event following the initial B tag.</b>
> 
> P tags don't figure into my parsing of these files, so by my initial
> reading of ignore_tags("p"), I expected to get all of the above in the
> first text event.  As I discovered, though the start event won't fire
> for the P tag, the text event ends in the same place regardless.

If you also enable the 'unbroken_text' option you should get it in one
piece.
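
A minimal sketch of that combination, assuming the handler only needs to collect text. The variable names and the inline sample string are illustrative, not from the original message:

```perl
use strict;
use warnings;
use HTML::Parser;

my @chunks;    # one entry per text event

my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @chunks, shift }, "text" ],
);
$p->ignore_tags("p");     # suppress start/end events for P
$p->unbroken_text(1);     # hold text back until the next reported event

$p->parse("<b>merrily merrily merrily until <p> none of this is "
        . "seen by the text event following the initial B tag.</b>");
$p->eof;

print scalar(@chunks), " text event(s)\n";
print "[$_]\n" for @chunks;
```

With both options on, the text on either side of the ignored P tag should arrive in a single text event; without 'unbroken_text' it is split where the tag was.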

> I assume this was the desired outcome, and I suggest the ignore_tags()
> description would benefit from a sentence spelling it out in greater
> detail.  The word "suppressed" in particular implies to me that the
> event has been completely scrubbed.  Perhaps:
> 
>     Any start and end events involving any of the given tags will
>     not fire (but will continue to interact with neighboring events
>     as if they had).
> 
> Of course, I think it'd be great if ignored tags acted as if they'd been
> deleted from the page en masse before parsing began, but that
> undertaking is beyond my ken.

Do you really want the 'line', 'column' and 'offset' to be reported as
if these tags were edited out first?  I think that would be wrong and
make this feature less useful.
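
To illustrate the point about positions, here is a small sketch that records the 'offset' of each end tag while P tags are ignored. The sample string and the %seen hash are made up for the example:

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = "<b>one <p> two</b>";
my %seen;    # tag name => offset reported for its end tag

my $p = HTML::Parser->new(
    api_version => 3,
    end_h       => [ sub { my ($tag, $offset) = @_; $seen{$tag} = $offset },
                     "tagname, offset" ],
);
$p->ignore_tags("p");
$p->parse($html);
$p->eof;

printf "</b> reported at offset %d, found in source at %d\n",
    $seen{b}, index($html, "</b>");
```

The offset reported for the closing B tag should match its position in the original string, ignored P tag included, so positions can still be mapped back to the source file.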

If you actually want this you could always run a separate pass over
your documents where you remove these elements first.  Something
like this (untested) should do it:

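  my $p = HTML::Parser->new(...);  # the real parser

  # Filter pass: forward the source text of every event to $p, except
  # for the P start and end tags, whose events are suppressed.
  HTML::Parser->new( default_h => [sub { $p->parse(shift); }, "text" ],
                     ignore_tags => ["p"],
                   )->parse_file("index.html")->eof;

  $p->eof;  # tell the real parser that the document has ended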

Regards,
Gisle