
Encode::Encoder heuristics (was Re: [perl.git] branch blead, updated. v5.13.7-106-g719245b)

From: Nicholas Clark
Date: November 26, 2010 02:58
Subject: Encode::Encoder heuristics (was Re: [perl.git] branch blead, updated. v5.13.7-106-g719245b)
Message ID: 20101126105818.GM24189@plum.flirble.org

On Fri, Nov 26, 2010 at 09:13:29AM +0000, Nicholas Clark wrote:
> On Fri, Nov 26, 2010 at 01:47:55AM +0100, Chris 'Bingos' Williams wrote:

> >     Update MIME-Base64 to CPAN version 3.12
> >     
> >       [DELTA]
> >     
> >       2010-10-25   Gisle Aas <gisle@ActiveState.com>
> >     
> >        Release 3.12
> >     
> >        Don't change SvUTF8 flag on the strings encoded [RT#60105]
> >     
> >        Documentation tweaks
> 
> This causes a test failure in Encode. (I think, for all configurations):
> 
> ./perl -MTestInit cpan/Encode/t/Encoder.t
> 
> ...
> 
> ok 260 - decode
> Wide character in subroutine entry at cpan/Encode/t/Encoder.t line 25.
> # Looks like you planned 516 tests but ran 260.
> # Looks like your test exited with 255 just after 260.
> 
> 
> No, I don't know why. The collateral work of trying to maintain a
> coherent distribution of modules...

OK, the cause is heuristics in Encode::Encoder:

sub new {
    my ( $class, $data, $encname ) = @_;
    unless ($encname) {
        $encname = Encode::is_utf8($data) ? 'utf8' : '';
    }
    else {
        my $obj = find_encoding($encname)
          or croak __PACKAGE__, ": unknown encoding: $encname";
        $encname = $obj->name;
    }
    my $self = {
        data     => $data,
        encoding => $encname,
    };
    bless $self => $class;
}


Looking at the documentation, Encode::is_utf8() is true if SvUTF8() is true.
So, if I take the *same sequence of ords* and change only the internal
representation, the result of Encode::is_utf8() changes.
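
To make that concrete, here is a minimal sketch (not taken from Encode or its
tests) that builds the same one-character string twice and upgrades one copy
with the core utf8::upgrade() function. The variable names are made up for
illustration.

```perl
use strict;
use warnings;
use Encode ();

# Two strings holding the same single ord, 163.
my $bytes    = chr 163;      # stored internally as one byte
my $upgraded = chr 163;
utf8::upgrade($upgraded);    # same ord, now stored as UTF-8 internally

# String comparison only looks at the ords, so the two are equal...
printf "equal:             %s\n", $bytes eq $upgraded ? 'yes' : 'no';

# ...but Encode::is_utf8() reports the internal flag, so it differs.
printf "is_utf8(bytes):    %s\n", Encode::is_utf8($bytes)    ? 'yes' : 'no';
printf "is_utf8(upgraded): %s\n", Encode::is_utf8($upgraded) ? 'yes' : 'no';
```

The two strings compare equal, yet only the upgraded copy has the flag set,
so any decision keyed on Encode::is_utf8() can differ between them.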

However, if $encname is set to utf8, then Encode::Encoder assumes that the
sequence of ords is a valid UTF-8 sequence. Which, well, it isn't, because
that's not what Encode::is_utf8() checks. So, here is my minimal test case,
using a no-op encoding:

use strict;
use warnings;

{
    package Encode::noop;

    use parent 'Encode::Encoding';
    __PACKAGE__->Define('noop');

    sub encode {
        my ($obj, $data) = @_;
        return $data;
    }

    sub decode {
        my ($obj, $data) = @_;
        return $data;
    }
}

use Encode::Encoder qw(encoder);
use Devel::Peek;

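my $a = chr 163;         # one character, stored as a single byte (SvUTF8 off)
my $b = $a . chr 256;    # appending a wide character upgrades the string
chop $b;                 # same ords as $a again, but SvUTF8 stays on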

for my $in ($a, $b) {
    Dump($in);
    my $e = encoder($in);
    printf "Encoding is '%s'\n", $e->encoding;
    my $out = $e->noop();
    Dump($out . '');
}

__END__


The output shows that, because of the heuristic, my no-operation "encoder"
actually mangles anything that isn't ASCII, if it happens to have become
upgraded at some point.

$ ./perl -Ilib encoder.pl
SV = PV(0x84517d4) at 0x84bb2a4
  REFCNT = 2
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x846d5cc "\243"\0
  CUR = 1
  LEN = 12
Encoding is ''
SV = PV(0x84cb00c) at 0x8528844
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK)
  PV = 0x84f6a04 "\243"\0
  CUR = 1
  LEN = 12
SV = PV(0x84517ec) at 0x848b574
  REFCNT = 2
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x8471ca4 "\302\243"\0 [UTF8 "\x{a3}"]
  CUR = 2
  LEN = 12
Encoding is 'utf8'
SV = PV(0x84cb00c) at 0x8528844
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK,UTF8)
  PV = 0x84f6a04 "\357\277\275"\0 [UTF8 "\x{fffd}"]
  CUR = 3
  LEN = 12
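
The "\x{fffd}" in the last dump is what you would expect if the single ord 163
were handed to a UTF-8 decoder. The sketch below only illustrates that effect
with a direct Encode::decode() call; it is an assumption that this mirrors what
Encode::Encoder does internally once the encoding has been guessed as 'utf8',
and the variable names are made up.

```perl
use strict;
use warnings;
use Encode ();

# The same upgraded one-character string as $b in the test case above.
my $upgraded = chr 163;
utf8::upgrade($upgraded);

# Treat the ords as if they were UTF-8 encoded octets. A lone 0xA3 is
# not a well-formed UTF-8 sequence, so with the default CHECK value the
# decoder substitutes a replacement character instead of dying.
my $decoded = Encode::decode('utf8', $upgraded);

printf "input ord:  U+%04X\n", ord $upgraded;
printf "output ord: U+%04X\n", ord $decoded;
```

The input ord is U+00A3 and the output is the replacement character, which is
the same mangling as in the second half of the dump.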


Nicholas Clark


