LWP: Warning with utf8 data in HTML head section

From: libwww
Date: August 2, 2006 15:48
Subject: LWP: Warning with utf8 data in HTML head section
Message ID: 20060802154616.E86395@spiral.corp.yahoo.com

There seems to be a bug in LWP that triggers a warning in
HTML::HeadParser when a fetched web document contains UTF-8 encoded
data in its HTML head section.

Example:

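    use strict;
    use LWP;
    use 5.008;

    # The test page carries UTF-8 encoded text in its <head>.
    my $url = 'http://perlmeister.com/test/utf8.html';
    my $ua  = LWP::UserAgent->new();

    # LWP runs HTML::HeadParser on text/html responses by default
    # (parse_head), so the fetch alone is enough to trigger the warning.
    my $res = $ua->get($url);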

This snippet produces the warning

    Parsing of undecoded UTF-8 will give garbage when decoding
    entities at /home/y/lib/perl5/site_perl/5.8/LWP/Protocol.pm line
    114.

with LWP-5.805 and HTML-Parser-3.55.

HTML::HeadParser issues this warning if it finds UTF-8 encoded data
but the string handed in doesn't have the UTF-8 flag set.
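
To make the flag distinction concrete, here is a minimal sketch using
only core modules. The sample string is made up for illustration and
is not taken from the test page; it contrasts raw octets, as they
arrive off the wire, with the same data after decoding:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "cafe" with an e-acute, written as raw UTF-8 octets (illustrative data).
my $octets = "caf\xC3\xA9";

# Decoding turns the octets into a character string.
my $chars = decode('UTF-8', $octets);

# The octet string has no UTF-8 flag and counts 5 bytes;
# the decoded string has the flag set and counts 4 characters.
printf "octets: flag=%d length=%d\n",
    (utf8::is_utf8($octets) ? 1 : 0), length $octets;
printf "chars:  flag=%d length=%d\n",
    (utf8::is_utf8($chars) ? 1 : 0), length $chars;
```

The first form is what the parser appears to be handed in the failing
case above.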

Setting the UTF-8 flag on web server responses that declare UTF-8
content in a Content-Type header like 'text/html; charset=utf-8'
seems to be one possible solution, but the same declaration might also
occur in the HTML head section, which HTML::HeadParser is supposed
to parse:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

in which case the warning probably needs to be suppressed until
HTML::HeadParser is done and has verified that there's no such setting
in the HTML head.
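
As a rough sketch of that two-step lookup (HTTP header first, then the
meta tag), something along these lines might work. This is not how LWP
does it: the helpers charset_from_content_type and guess_charset are
hypothetical names, the sample document is made up, and the regex scan
of the head is a crude stand-in for a real parser:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: extract the charset parameter from a
# Content-Type value such as 'text/html; charset=utf-8'.
sub charset_from_content_type {
    my ($value) = @_;
    return unless defined $value;
    return $value =~ /;\s*charset\s*=\s*"?([\w.:-]+)"?/i ? lc $1 : undef;
}

# Hypothetical helper: prefer the HTTP header, then fall back to a
# <meta http-equiv="Content-Type"> found in the raw document bytes.
sub guess_charset {
    my ($http_content_type, $octets) = @_;
    my $charset = charset_from_content_type($http_content_type);
    return $charset if defined $charset;

    # Crude scan; assumes http-equiv comes before content in the tag.
    if ($octets =~ /<meta\s+http-equiv\s*=\s*["']?Content-Type["']?\s+content\s*=\s*["']([^"']+)["']/i) {
        return charset_from_content_type($1);
    }
    return;
}

# Illustrative document: the HTTP header names no charset, the meta tag does.
my $html = qq{<html><head>\n}
         . qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>\n}
         . qq{<title>caf\xC3\xA9</title></head><body></body></html>\n};

my $charset = guess_charset('text/html', $html);

# Decode only when a charset was found; otherwise keep the raw octets.
my $text = defined $charset ? decode($charset, $html) : $html;

printf "charset=%s flag=%d\n",
    (defined $charset ? $charset : 'unknown'),
    (utf8::is_utf8($text) ? 1 : 0);
```

The catch is the one described above: the meta tag is only known after
the head has been scanned, so the decision to decode cannot be made
before parsing starts.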

-- Mike

Mike Schilli
m@perlmeister.com


