perl.beginners | Postings from August 2009

HTML::TreeBuilder encode symbols as html entities

From: Roman Makurin
Date: August 14, 2009 06:06
Subject: HTML::TreeBuilder encode symbols as html entities
Message ID: 46e5b4ee0908140606o87d129q9854a0cb103f99e4@mail.gmail.com

Hi All.

I have a problem with HTML::TreeBuilder. Here is some sample code, without any
error checking:

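use LWP::UserAgent;
use Encode qw(decode);
use HTML::Entities qw(decode_entities);
use HTML::TreeBuilder;

# the constructor option is spelled "timeout", without a leading dash
my $ua   = LWP::UserAgent->new(timeout => 10);
my $resp = $ua->get($url);    # $url is the page address (see PPS)

# decode the raw bytes using the encoding of the web page
my $content = decode('encoding_of_web_page', $resp->content);
decode_entities($content);    # replaces entities in $content in place

my $r = HTML::TreeBuilder->new_from_content($content);

$r->look_down(_tag => 'h4')->dump;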

The dump output contains HTML-encoded entities:

<h4> @0.1.5.1
  <a class="a01" href="hidden_url" rel="bookmark" title="&#x421;&#x441;&#x44B;&#x43B;&#x43A;&#x430; ">@0.1.5.1.0

All of these HTML entities are valid Unicode code points for the original
characters. But why does HTML::TreeBuilder convert the characters to entities?

If I just do
print $content, $/;
everything is OK: all characters are printed as characters, not as HTML entities.
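
One way to narrow this down is to read the attribute straight from the tree instead of going through dump. The sketch below is hedged: it assumes (not confirmed anywhere in this post) that dump prints start tags via HTML::Element's starttag, which entity-encodes attribute values for display only. The sample markup is hypothetical, built from the title shown in the dump above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical stand-in for the decoded page: the same six Cyrillic
# characters (plus trailing space) that appear as entities in the dump.
my $expected = "\x{421}\x{441}\x{44B}\x{43B}\x{43A}\x{430} ";
my $html     = qq{<h4><a class="a01" href="hidden_url" title="$expected">x</a></h4>};

my $tree = HTML::TreeBuilder->new_from_content($html);
my $link = $tree->look_down(_tag => 'a');

# attr() returns the value as stored in the tree, without display encoding
my $title = $link->attr('title');

print "title: $title\n";
print $title =~ /&#x/
    ? "the tree stores entities\n"
    : "the tree stores characters\n";

$tree->delete;    # free the tree explicitly
```

If the attribute comes back as plain characters, the entities would be an artifact of how dump renders the tag rather than a change to the parsed data.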

I should add that the problem HTML page begins with a <feff> code point
(U+FEFF) and contains lots of '^M' characters (0x0D, i.e. \r).
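
To rule these two things out, the leading U+FEFF and the carriage returns could be removed right after decoding. This is only a sketch, and it does not claim that either one causes the entity output. The sample bytes are hypothetical and stand in for $resp->content; UTF-8 is assumed as the page encoding purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample bytes: a UTF-8 byte order mark, then markup
# with CRLF line endings and one Cyrillic character (U+0421).
my $bytes = "\xEF\xBB\xBF<h4>\r\n<a title=\"\xD0\xA1\">x</a>\r\n</h4>\r\n";

my $content = decode('UTF-8', $bytes);

$content =~ s/\A\x{FEFF}//;    # drop a leading byte order mark, if present
$content =~ s/\r\n?/\n/g;      # turn CRLF and bare CR into LF

printf "starts with BOM: %s\n", ($content =~ /\A\x{FEFF}/ ? 'yes' : 'no');
printf "contains CR: %s\n",     ($content =~ /\r/         ? 'yes' : 'no');
```

The cleaned $content can then be passed to HTML::TreeBuilder->new_from_content as before.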

Thanks

PS: This scheme works well for a dozen other HTML pages.
PPS: Described url - http://www.no.oskol-news.ru/
-- 
If you think of MS-DOS as mono, and Windows as stereo,
 then Linux is Dolby Digital and all the music is free...
