perl.beginners | Postings from August 2009

HTML::TreeBuilder encodes symbols as HTML entities

Roman Makurin
August 14, 2009 06:06
Hi All.

I have a problem with HTML::TreeBuilder. Here is sample code, with error handling omitted:

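use LWP::UserAgent;
use Encode qw(decode);
use HTML::TreeBuilder;

my $ua   = LWP::UserAgent->new(timeout => 10);
my $resp = $ua->get($url);    # $url is defined elsewhere

# 'encoding_of_web_page' stands for the page's real encoding
my $content = decode('encoding_of_web_page', $resp->content);

my $r = HTML::TreeBuilder->new_from_content($content);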


The dump output contains HTML-encoded entities:

<h4> @
  <a class="a01" href="hidden_url" rel="bookmark"
title="&#x421;&#x441;&#x44B;&#x43B;&#x43A;&#x430; ">@

All of the HTML entities are valid Unicode code points for the original
characters. But why does HTML::TreeBuilder convert the characters to entities?

If I just do
print $content, $/;
everything is OK: all characters are printed as characters, not as HTML-encoded entities.
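
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the entities may come from how the tree is printed, not from how it is stored. As far as I can tell, dump() and a plain as_HTML() call entity-encode non-ASCII characters on output, while attr() returns the stored character string. The markup below is a hypothetical stand-in for the real page, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical stand-in for the decoded page; the title is written with
# \x{...} escapes for the same code points that appear in the dump.
my $word    = "\x{421}\x{441}\x{44B}\x{43B}\x{43A}\x{430}";
my $content = qq{<h4><a class="a01" href="hidden_url" title="$word">text</a></h4>};

my $tree = HTML::TreeBuilder->new_from_content($content);
my $link = $tree->look_down(_tag => 'a');

# The value stored in the tree is a plain character string.
my $title = $link->attr('title');
print "attr:    $title\n";

# With no arguments, as_HTML() encodes all "unsafe" characters,
# which includes everything outside ASCII.
my $encoded = $link->as_HTML;
print "default: $encoded\n";

# Passing an explicit set limits encoding to those characters only.
my $plain = $link->as_HTML('<>&');
print "limited: $plain\n";

$tree->delete;
```

If attr() prints the expected characters here, the tree itself is fine and only the serialization step needs the explicit '<>&' argument.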

I should add that the problem HTML page begins with a <feff> code point
 (a byte order mark) and contains lots of '^M' characters (0x0D, or \r).


PS: The same scheme works well for a dozen other HTML pages.
PPS: Described url -
If you think of MS-DOS as mono, and Windows as stereo,
 then Linux is Dolby Digital and all the music is free...
