HTML::Entities and unicode

From: Vangelis Katsikaros
Date: January 8, 2013 11:12
Subject: HTML::Entities and unicode
Message ID: 50EBFF25.5070409@yahoo.gr

Hi

First, many thanks for the whole family of excellent LWP and HTML 
modules and the work invested in them.



My question concerns decode_entities, unicode and *some* HTML 
entities (the ones in the 128-255 chr() range).

The manual says for decode_entities "This routine replaces HTML entities 
found in the $string with the corresponding Unicode character"

So I was expecting that if I decode the nbsp entity I would get the 
U+00A0 character (in perl \x{A0})



I do:
================================================================
perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str = 
"&nbsp;"; HTML::Entities::decode_entities( $str ); print Dumper($str)'

$VAR1 = '�';
================================================================
On my terminal I see the replacement character (a black diamond with a 
question mark), whereas I would expect to see something like:
$VAR1 = "\x{a0}";




If I do the same with the euro entity:
================================================================
perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str = 
"&euro;"; HTML::Entities::decode_entities( $str ); print Dumper($str)'

$VAR1 = "\x{20ac}";
================================================================
I do get the expected result (the perl U+20AC unicode character)





Trying to dig a bit more I noticed the following:
================================================================
$ perl -e 'use HTML::Entities; $str = "&nbsp;"; 
HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
00000000  a0                                                |.|
00000001

perl -e 'use HTML::Entities; $str = "&euro;"; 
HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
Wide character in print at -e line 1.
00000000  e2 82 ac                                          |...|
00000003

perl -e 'use Encode; use HTML::Entities; $str = "&euro;"; 
HTML::Entities::decode_entities( $str ); $t = 
Encode::encode("UTF-8",$str); print $t' | hexdump -C
00000000  e2 82 ac                                          |...|
00000003
================================================================

In the nbsp case I get the byte 'a0' whereas I would expect the bytes 
'c2 a0' (for utf-8).

In the 1st euro case I do get the bytes 'e2 82 ac', which are the proper 
bytes for U+20AC in utf-8. I also get a "Wide character in print" 
warning from print(), because the string is not encoded before printing.

In the 2nd euro case I get the same bytes (correct U+20AC in utf-8) and 
no warning from print(), since I encode the string explicitly.




So to rephrase my question: why don't I see "\x{a0}" (in the perl 
string), or 'c2 a0' in the bytes streamed, when I decode the nbsp HTML 
entity? Wouldn't these be the expected results?

Regards
Vangelis

PS Forgive my ignorance if I have said something stupid. I think I do 
understand some aspects of unicode handling in perl, but I haven't run 
out of room for improvement.
