
Re: Bug in Marek::Pod::HTML

From:
Sean M. Burke
Date:
December 4, 2000 10:33
Subject:
Re: Bug in Marek::Pod::HTML
Message ID:
3.0.6.32.20001204112936.00871da0@mail.spinn.net
At 08:49 AM 2000-12-04 +0100, Marek Rouchal DAT CAD HW Tel 25849 wrote:
>[...]
>In order to include raw HTML the user supplies with =for and =begin, I
>need to parse it with HTML::Treebuilder to turn it into nodes of type
>HTML::Element for inclusion in what Pod::HTML produces (namely a
>[...]

That reminds me of a more general question:
I generally say that TreeBuilder is for parsing only whole documents -- in
the same sense that a hammer is for banging on things.  It's okay to try to
use TreeBuilder to parse document-fragments, the same as it's okay to try
to use a hammer as a forceps -- but in both cases it will take some
improvisation and cleverness on your part.

But one thing I think might be helpful is a method I wrote, and keep
meaning to put in the next version of Element:

sub HTML::Element::highest_explicits {
  my(@stack) = ($_[0]);
  my @out;
  my $this;

  while(@stack) { # idiom for preorder traversal
    if(
      ref($this = shift @stack)
      and $this->{'_implicit'}
    ) {
      unshift @stack, @{$this->{'_content'} || next};
       # traverse it
    } else {
      push @out, $this; # and don't traverse under this
    }
  }
  return @out;
}

When you say
  $treelet->eof(); #don't forget to do this!
  @docfrag = $treelet->highest_explicits;
you get the list of the highest non-implicit (=explicit) element nodes in
the tree.  It is possible to get really odd results out of this, but only
with nonsensical input code, I think.  (This might be followed by something
like:  for(@docfrag) { $_->detach if ref($_) }; )
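
To make the traversal concrete, here is a minimal sketch that runs the same logic over a hand-built tree instead of real TreeBuilder output. The MockNode package and its node() helper are hypothetical stand-ins that only mimic the '_tag', '_implicit' and '_content' hash keys; they are not part of HTML::Element, and the tree shape is only roughly what a parse of "<p>foo</p>bar<b>baz</b>" might look like.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML::Element: just a blessed hash carrying
# the '_tag', '_implicit' and '_content' keys that the traversal reads.
package MockNode;

sub node {
  my ($tag, $implicit, @content) = @_;
  return bless {
    '_tag'      => $tag,
    '_implicit' => $implicit,
    '_content'  => [@content],
  }, 'MockNode';
}

# Same logic as the method above, installed on the mock class.
sub highest_explicits {
  my @stack = ($_[0]);
  my @out;
  while (@stack) {
    my $this = shift @stack;
    if (ref($this) and $this->{'_implicit'}) {
      # implicit wrapper: look at its children instead, in document order
      unshift @stack, @{ $this->{'_content'} || [] };
    } else {
      # explicit element or text segment: keep it, don't descend
      push @out, $this;
    }
  }
  return @out;
}

package main;

# Assumed shape: implicit html, head and body wrapping explicit content.
my $tree = MockNode::node('html', 1,
  MockNode::node('head', 1),
  MockNode::node('body', 1,
    MockNode::node('p', 0, 'foo'),
    'bar',
    MockNode::node('b', 0, 'baz'),
  ),
);

my @docfrag = $tree->highest_explicits;
my @labels  = map { ref($_) ? '<' . $_->{'_tag'} . '>' : $_ } @docfrag;
print join(' ', @labels), "\n";
```

The implicit html, head and body wrappers drop out, and what remains is the p element, the bare text segment, and the b element, in document order. Text segments come back as plain strings, which is why the detach loop checks ref($_) first.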

Everything is happier, BTW, if input code is zero or more self-contained
elements (as opposed to ending on an incomplete element anywhere in there).

If you needed to see /whether/ that's the case, see what the $treelet->pos
is.  In theory, if it points to an explicit element, the source
code-fragment wasn't complete.  However, this would scream in the case of:

  <p>foo

because the pos is still on the explicit p there.  So consider something
where you forgive left-open elements whose end tags are normally omissible,
like maybe:

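  require HTML::Tagset;
  my $pos = $treelet->pos;              # where the parser stopped
  my @up_pos = ($pos, $pos->lineage);   # that element plus all its ancestors
  my $saw_incomplete;
  foreach my $e (@up_pos) {
    # count every explicit element whose end tag is not omissible
    ++$saw_incomplete unless $e->implicit
                      or $HTML::Tagset::optionalEndTag{$e->tag};
  }
  die "GLEIVEN! GLAH $saw_incomplete!" if $saw_incomplete;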

or maybe you could get away with just:

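  require HTML::Tagset;
  my $pos = $treelet->pos;              # where the parser stopped
  my @up_pos = ($pos, $pos->lineage);   # that element plus all its ancestors
  my $saw_incomplete;
  foreach my $e (@up_pos) {
    last if $e->implicit;   # stop climbing once we reach the implicit wrappers
    ++$saw_incomplete unless $HTML::Tagset::optionalEndTag{$e->tag};
  }
  die "GLEIVEN! GLAH $saw_incomplete!" if $saw_incomplete;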

I'm not sure there'd be a practical difference, assuming sane code.
But I'm not sure how it'd behave with mildly strange code, like any of:

 <td><li>hoohah!</li>

 <td><li>hoohah!</li></td>

 <li><td>hoohah!</td>

 <li><td>hoohah!</td></li>


But having to assume sane input is not too much of a problem -- assuming
that everyone agrees on what sanity is, and that it /is/ tested on some
examples of reasonably sane code.
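
As a rough illustration of how the two checks could be compared on test cases, here is a sketch that works on hand-written lineages rather than real parse trees. Everything in it is an assumption for illustration: %optional_end is a small hypothetical stand-in for %HTML::Tagset::optionalEndTag, each lineage is a guess at what ($pos, $pos->lineage) might contain, and the second variant is taken to stop climbing at the first implicit element.

```perl
use strict;
use warnings;

# Hypothetical stand-in for %HTML::Tagset::optionalEndTag (a small subset).
my %optional_end = map { $_ => 1 } qw(p li td tr tbody html head body);

# Each lineage is listed from the current position upward, as
# ($pos, $pos->lineage) would be; entries are [tag, implicit?].

# Variant 1: count every explicit ancestor whose end tag is required.
sub count_all_explicit {
  my @up_pos = @_;
  my $saw_incomplete = 0;
  foreach my $e (@up_pos) {
    my ($tag, $implicit) = @$e;
    ++$saw_incomplete unless $implicit or $optional_end{$tag};
  }
  return $saw_incomplete;
}

# Variant 2: same, but stop at the first implicit ancestor.
sub count_until_implicit {
  my @up_pos = @_;
  my $saw_incomplete = 0;
  foreach my $e (@up_pos) {
    my ($tag, $implicit) = @$e;
    last if $implicit;
    ++$saw_incomplete unless $optional_end{$tag};
  }
  return $saw_incomplete;
}

# "<p>foo": left open, but p's end tag is omissible.
my @p_open = (['p', 0], ['body', 1], ['html', 1]);

# "<b>foo": left open, and b's end tag is required.
my @b_open = (['b', 0], ['body', 1], ['html', 1]);

# An assumed lineage for a strange case: an open explicit div sitting
# above an implicit tr/table that a td forced into existence.
my @strange = (['td', 0], ['tr', 1], ['table', 1],
               ['div', 0], ['body', 1], ['html', 1]);

for my $case (['<p>foo', \@p_open], ['<b>foo', \@b_open], ['strange', \@strange]) {
  my ($label, $lineage) = @$case;
  printf "%-8s all=%d until_implicit=%d\n",
    $label, count_all_explicit(@$lineage), count_until_implicit(@$lineage);
}
```

Under these assumptions the two variants agree on the first two cases and disagree on the third, where an explicit element sits above implicit ones. That is the kind of mildly strange input where the choice between them would matter.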


--
Sean M. Burke  sburke@cpan.org  http://www.spinn.net/~sburke/



