perl.perl5.porters

Re: Encode::Encoder heuristics (was Re: [perl.git] branch blead, updated. v5.13.7-106-g719245b)

From: Gisle Aas
Date: November 26, 2010 15:08
Message ID: D6B4E052-B3C5-40AE-8CDA-70A51D9B4D3E@activestate.com

On Nov 26, 2010, at 11:58, Nicholas Clark wrote:

> OK, the cause is heuristics in Encode::Encoder:
> 
> sub new {
>    my ( $class, $data, $encname ) = @_;
>    unless ($encname) {
>        $encname = Encode::is_utf8($data) ? 'utf8' : '';
>    }
>    else {
>        my $obj = find_encoding($encname)
>          or croak __PACKAGE__, ": unknown encoding: $encname";
>        $encname = $obj->name;
>    }
>    my $self = {
>        data     => $data,
>        encoding => $encname,
>    };
>    bless $self => $class;
> }
> 
> 
> Looking at the documentation, Encode::is_utf8() is true if SvUTF8() is true.
> So, if I take the *same sequence of ords* and change the internal
> representation, that changes.
> 
> However, if $encname is set to utf8, then Encode::Encoder assumes that that
> sequence of ords is a valid UTF-8 sequence. Which, well, it isn't. Because
> that's not what Encode::is_utf8() checks. So, my minimal test case, of a
> no-op

Based on your test case I fixed the Encoder bug in
<https://github.com/gisle/p5-encode/commit/36578d3bb1deb6d7e546ce9cf0d454ee68b74257>.

--Gisle
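
To make the point about Encode::is_utf8() concrete, here is a minimal sketch (not taken from the commit above) showing that the flag follows the internal representation rather than the characters in the string. The variable names $plain and $upgraded are only illustrative.

```perl
use strict;
use warnings;
use Encode ();

# One character, U+00A3, stored internally as the single byte 0xA3.
my $plain = chr 163;

# The same character, but stored internally as the two bytes 0xC2 0xA3.
my $upgraded = $plain;
utf8::upgrade($upgraded);

# The two strings compare equal, because eq compares characters ...
printf "equal: %s\n", $plain eq $upgraded ? 'yes' : 'no';

# ... yet is_utf8() differs, because it only reports the SvUTF8 flag.
printf "is_utf8: plain=%d upgraded=%d\n",
    ( Encode::is_utf8($plain)    ? 1 : 0 ),
    ( Encode::is_utf8($upgraded) ? 1 : 0 );
```

This should print "equal: yes" followed by "is_utf8: plain=0 upgraded=1", which is why a heuristic keyed on is_utf8() can treat two equal strings differently.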


> 
> use strict;
> use warnings;
> 
> {
>    package Encode::noop;
> 
>    use parent 'Encode::Encoding';
>    __PACKAGE__->Define('noop');
> 
>    sub encode {
>        my ($obj, $data) = @_;
>        return $data;
>    }
> 
>    sub decode {
>        my ($obj, $data) = @_;
>        return $data;
>    }
> }
> 
> use Encode::Encoder qw(encoder);
> use Devel::Peek;
> 
> my $a = chr 163;
> my $b = $a . chr 256;
> chop $b;
> 
> for my $in ($a, $b) {
>    Dump($in);
>    my $e = encoder($in);
>    printf "Encoding is '%s'\n", $e->encoding;
>    my $out = $e->noop();
>    Dump($out . '');
> }
> 
> __END__
> 
> 
> You can see that the heuristic means that my no-operation "encoder" actually
> mangles anything that isn't ASCII, if it happens to have become upgraded
> at some point.
> 
> $ ./perl -Ilib encoder.pl
> SV = PV(0x84517d4) at 0x84bb2a4
>  REFCNT = 2
>  FLAGS = (PADMY,POK,pPOK)
>  PV = 0x846d5cc "\243"\0
>  CUR = 1
>  LEN = 12
> Encoding is ''
> SV = PV(0x84cb00c) at 0x8528844
>  REFCNT = 1
>  FLAGS = (PADTMP,POK,pPOK)
>  PV = 0x84f6a04 "\243"\0
>  CUR = 1
>  LEN = 12
> SV = PV(0x84517ec) at 0x848b574
>  REFCNT = 2
>  FLAGS = (PADMY,POK,pPOK,UTF8)
>  PV = 0x8471ca4 "\302\243"\0 [UTF8 "\x{a3}"]
>  CUR = 2
>  LEN = 12
> Encoding is 'utf8'
> SV = PV(0x84cb00c) at 0x8528844
>  REFCNT = 1
>  FLAGS = (PADTMP,POK,pPOK,UTF8)
>  PV = 0x84f6a04 "\357\277\275"\0 [UTF8 "\x{fffd}"]
>  CUR = 3
>  LEN = 12
> 
> 
> Nicholas Clark
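
The U+FFFD in the last dump is what one would expect if the character U+00A3 is handed to a UTF-8 decoder as though it were an octet. The following sketch reproduces that effect with Encode::decode() directly. It does not call Encode::Encoder and is not a claim about its exact internal code path, only an illustration of the misinterpretation described above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A single octet 0xA3. On its own this is a UTF-8 continuation byte,
# so it is not a well-formed UTF-8 sequence.
my $octets = chr 163;

# With the default CHECK value, the lax 'utf8' decoder substitutes
# U+FFFD (REPLACEMENT CHARACTER) for the malformed byte.
my $decoded = decode( 'utf8', $octets );

printf "U+%04X\n", ord $decoded;
```

This should print "U+FFFD", matching the "\357\277\275" bytes in the final dump.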

