odd cases of tables with HTML::TreeBuilder

From: Sean M. Burke
Date: December 17, 2000 11:50
Subject: odd cases of tables with HTML::TreeBuilder
Message ID: 3.0.6.32.20001217125018.0087e1e0@mail.spinn.net

Someone recently reported a problem with HTML::TreeBuilder.  It was parsing
code like this:

<table>
<form><input type="HIDDEN" name="A">
<tr><td><input type="HIDDEN" name="B"></td></tr>
</form>
</table>

into a tree like this:

<html> @0
  <head> @0.0 (IMPLICIT)
  <body> @0.1
    <table> @0.1.0
      <tr> @0.1.0.1 (IMPLICIT)
        <td> @0.1.0.1.0 (IMPLICIT)
          <form> @0.1.0.1.0.0
            <input name="A" type="HIDDEN"> @0.1.0.1.0.0.0
      <tr> @0.1.0.2
        <td> @0.1.0.2.0
          <input name="B" type="HIDDEN"> @0.1.0.2.0.0

which makes perfect sense given the HTML spec's content model for TABLE:
<!ELEMENT TABLE - -
     (CAPTION?, (COL*|COLGROUP*), THEAD?, TFOOT?, TBODY+)>
...and TreeBuilder has to rely on that content model to tell when to
implicate what, so that <table>foo<td>bar<tr>baz<td>quux</table> parses right.
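
To see what your own copy of TreeBuilder does with this markup, something
like the following should work.  It's only a minimal sketch, assuming an
HTML-Tree release recent enough to provide look_down and lineage; %path_of
is just a name made up for this example, and the result may not match the
tree above exactly, depending on the version installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The markup from the report, unchanged.
my $html = <<'END_HTML';
<table>
<form><input type="HIDDEN" name="A">
<tr><td><input type="HIDDEN" name="B"></td></tr>
</form>
</table>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # tell the parser there is no more input

# For each <input>, record its chain of ancestors, outermost first.
# (%path_of is a hypothetical name used only in this sketch.)
my %path_of;
for my $input ($tree->look_down(_tag => 'input')) {
    my @ancestors = reverse map { $_->tag } $input->lineage;
    $path_of{ $input->attr('name') } = join ' > ', @ancestors;
}

print "$_: $path_of{$_}\n" for sort keys %path_of;

$tree->dump;      # print the whole tree, with addresses, to STDOUT
$tree->delete;    # break circular references so the tree can be freed
```

Comparing the two ancestor chains shows at a glance whether input B ended
up inside the FORM element or beside it.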

Now, the user who wrote to me seemed to have in mind a parse like:

<table>
  <form>
    <input type="HIDDEN" name="A">
    <tr>
      <td>
        <input type="HIDDEN" name="B"></td></tr>
  </form>
</table>

...which is in fact what MS IE's parse tree for this looks like -- but
then, IE can put whatever nonsense in its parse trees, since it's the only
application that has to look at them.

And I think what underlay the user's expectation is that while
"<table><hr>" should implicate "<table><tr><td><hr>", some kinds of
elements, presumably including FORM and hidden INPUT, should be exempt from
the normal what-can-be-where rules that apply to things like HR.  It's not
/too/ crazy an idea -- I considered it when I was last rewriting
TreeBuilder; but I couldn't see any way to implement it sanely, and I had
the impression that the phenomenon was rare.
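
To make the proposed exemption concrete, here is a toy sketch of just the
decision involved.  It is not TreeBuilder's code: implied_under_table, its
arguments, and the %table_child table are all invented for illustration,
and it assumes (as in the tree above) that TR is accepted directly under
TABLE.

```perl
use strict;
use warnings;
use 5.010;    # for the // (defined-or) operator

# Children the HTML 4 content model allows directly under TABLE, plus TR,
# which the tree shown above places there without a TBODY.
my %table_child = map { $_ => 1 }
    qw(caption col colgroup thead tfoot tbody tr);

# Hypothetical helper: which elements must be implicated between an open
# <table> and a new child $tag?  $exempt switches on the proposed rule.
sub implied_under_table {
    my ($tag, $attr, $exempt) = @_;
    return ()     if $table_child{$tag};
    return ('tr') if $tag eq 'td' || $tag eq 'th';
    if ($exempt) {
        # The proposal: non-rendering elements skip the usual rules.
        return () if $tag eq 'form';
        return () if $tag eq 'input'
                  && lc($attr->{type} // '') eq 'hidden';
    }
    return ('tr', 'td');    # everything else gets wrapped, like HR
}

for my $case (
    [ 'hr',    {},                   0 ],
    [ 'form',  {},                   0 ],
    [ 'form',  {},                   1 ],
    [ 'input', { type => 'HIDDEN' }, 1 ],
    [ 'input', { type => 'text' },   1 ],
) {
    my ($tag, $attr, $exempt) = @$case;
    my @implied = implied_under_table($tag, $attr, $exempt);
    printf "%-5s exempt=%d implies: %s\n",
        $tag, $exempt, @implied ? "@implied" : '(nothing)';
}
```

This covers only the question of whether to wrap an element.  It says
nothing about what should happen to a FORM that is still open when the
next TR arrives, so it shouldn't be read as a workable design.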

However, it was just an impression.  If anyone else has run into cases
where bad parses would have come out right if I'd treated non-rendering
elements differently, do email me.



PS, actually, MSIE's parse tree looks like:

<table>
  <form>
    <input type="HIDDEN" name="A">
    <tbody>
      <tr>
        <td>
          <input type="HIDDEN" name="B"></td></tr></tbody>
  </form>
</table>

i.e., it implicates a tbody.

--
Sean M. Burke  sburke@cpan.org  http://www.spinn.net/~sburke/



