perl.beginners | Postings from August 2009

HTML::TreeBuilder - handle invalid html gracefully

From: Roman Makurin
Date: August 23, 2009 03:57
Subject: HTML::TreeBuilder - handle invalid html gracefully
Message ID: 20090823105643.GA4188@blizzard

Hi All!

How can I tell HTML::TreeBuilder to parse invalid HTML files
gracefully? Here is an example:

-----
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_file(*DATA);

print +($root->look_down(_tag=>'div', class=>'text'))->as_text, $/;


__DATA__
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
</head>
<body>
<div class="body">
  <div class="doc">
    <p>some text
    <div class="text">
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
      </head>
      <p> some other text      
    </div>
  </div>
</div>
</body>
</html>
--------

For some reason someone put a head tag inside a div :)
All browsers handle this case correctly, but HTML::TreeBuilder
returns an undefined text value when I call the as_text method on
<div class="text">. Without the inner head section, everything works
as expected.
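
A minimal sketch of one possible workaround, assuming it is acceptable to
preprocess the markup before parsing: drop every head section after the
first one with a regex, then hand the cleaned string to HTML::TreeBuilder.
The helper name strip_extra_heads is hypothetical and not part of
HTML::TreeBuilder, and a regex like this is probably only suitable for
simple cases such as the example above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TreeBuilder): keep the first
# <head>...</head> section and delete any later ones.
sub strip_extra_heads {
    my ($html) = @_;
    my $seen = 0;
    # /s lets . match newlines, /i ignores case, /e evaluates the replacement
    $html =~ s{(<head\b[^>]*>.*?</head\s*>)}{ $seen++ ? '' : $1 }gise;
    return $html;
}

# Shortened sample input with a stray head section inside a div.
my $html = <<'END_HTML';
<html>
<head><title>t</title></head>
<body>
<div class="text">
  <head><meta name="x" content="y"/></head>
  <p> some other text
</div>
</body>
</html>
END_HTML

my $clean = strip_extra_heads($html);

# Parse only if HTML::TreeBuilder is installed, so the sketch still runs without it.
if ( eval { require HTML::TreeBuilder; 1 } ) {
    my $root = HTML::TreeBuilder->new_from_content($clean);
    # In scalar context look_down returns the first match, or undef if none.
    my $div = $root->look_down( _tag => 'div', class => 'text' );
    print defined $div ? $div->as_text : '(div not found)', "\n";
    $root->delete;
}
else {
    print $clean;
}
```

Checking the look_down result before calling as_text also avoids a fatal
error when the div is not found at all.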

Is there any way to tell HTML::TreeBuilder to handle
such situations?


Thanks.

-- 
If you think of MS-DOS as mono, and Windows as stereo,
 then Linux is Dolby Digital and all the music is free...
