Smack!

From: Tom Christiansen
Date: April 18, 2007 04:34
Subject: Smack!
Message ID: 6678.1176896058@chthon

Right.  Nobody else saw it.  That's what I thought, but I wanted to 
give it an overnight before I let the other shoe fall.

The bug is that this module won't work in a program that
has a "use encoding" pragma in it.

    % perl -Mencoding=utf8 -MSmack -e 'smack && snarf && print "hurray"'
    Wide character in print at Smack.pm line 50.
    Wide character in print at Smack.pm line 51.
    bindata is size 24 (should be 20)
    Wide character in $/ at Smack.pm line 63.
    Exit 255

It needs a "use bytes" in Smack.pm for it to work correctly.

That's because otherwise the non-scoped use encoding reaches into the
module and alters how Perl deals with data.  The encoding::warnings pragma
can be useful for diagnosing this, but one shouldn't have to.  It's a bug.
I was hoping that use encoding had become correctly scoped, but it hasn't.

Now, in this simple example, which can be written out this way:

    use encoding 'utf8';
    use Smack;
    smack() && snarf() && print "hurray!\n";

one need but move the use encoding to after the use Smack:

    use Smack;
    use encoding 'utf8';
    smack() && snarf() && print "hurray!\n";

Also in this simple example, if you put "use encoding::warnings" into
the Smack.pm module, it diagnoses and cures the problem.

However, in more elaborate scenarios, neither of those works.
For example:

    #!/usr/bin/perl 
    use strict;
    use warnings;
    use Image::ExifTool qw(ImageInfo);
    use encoding 'utf8';  # comes after, but still screwed

    if (!@ARGV) {
	die "usage: $0 filename ...\n";
    }

    for my $filename (@ARGV) { 
	my $info = ImageInfo($filename);

	if (my $error = $info->{Error}) {
	    warn "Can't parse image info on file $filename: $error\n";
	    next;
	} 

	if (my $oops = $info->{Warning}) {
	    warn "WARNING: Can't parse image info on file $filename: $oops\n";
	    # fallthrough
	}

	printf "%s is size %s\n", $filename, $info->{ImageSize};
    }

Run that on a JPEG file and it will bomb if there's a 

    use encoding 'utf8';

in the main program, even one that comes after the module load.
You have to put "use bytes" into the Image::ExifTool module for
it to work.  What's happening is that its

    "\xff" . chr($marker) 

code (amongst other places) is being shamelessly "promoted" into 
a Unicode-encoded string, starting from an assumed ISO-8859-1.
This now produces a 4-byte string, "\357\277\275\0".
This is of course completely nuts.  
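
Here is a minimal sketch of where those four bytes can come from.  It does
not use the encoding pragma at all.  It assumes the implicit conversion
behaves like a UTF-8 decode of the raw bytes, and does that decode
explicitly with Encode.  The marker value 0 is only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two raw bytes the module meant to produce (marker 0 for illustration).
my $raw = "\xff" . chr(0);

# Assumption: the implicit conversion acts like a UTF-8 decode of raw bytes.
# 0xFF is never valid in UTF-8, so Encode substitutes U+FFFD for it.
my $chars = decode('UTF-8', $raw);

# Writing those characters back out as UTF-8 yields four bytes.
my $bytes = encode('UTF-8', $chars);
my $octal = join '', map { sprintf '\\%o', ord } split //, $bytes;

my $code_points = join ' ', map { ord } split //, $chars;
print "code points: $code_points\n";
print "octal bytes: $octal\n";
```

That prints code points 65533 and 0, and the octal bytes \357\277\275\0,
which is the UTF-8 encoding of U+FFFD followed by a NUL.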

The bug IMHO is that use encoding is not lexically scoped.  It
affects code everywhere.  Don't count on what its documentation says
about "no encoding" doing you much good, nor on placing the "use
encoding" after the modules are sucked in.  That's not always good
enough.  It isn't here.

Audrey's "encoding::warnings" pragma will find these for you.  Add "use
encoding::warnings" instead of "use bytes" to the ExifTool module, and you
find this:

    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 748
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 749
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2178
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2179
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2317
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2320
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2320
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2489
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2522
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2529
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2579
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2607
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2608
    Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2658

I believe one might be able to do something at all those points to get it
to behave better involving calls to specific encode or decode routines from
Encode, but it's easiest to just say use bytes and be done.  But you 
should not have to do this!
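
For what doing something explicit at one of those points might look like,
here is a rough sketch.  It swaps in pack and utf8::downgrade rather than
Encode routines, the marker_bytes helper is hypothetical and is not
ExifTool's code, and I have not verified that it survives a "use encoding"
in the caller.

```perl
use strict;
use warnings;

# Hypothetical helper, not ExifTool's actual code: build the two-byte
# "\xff" . chr($marker) sequence with pack instead of concatenation.
sub marker_bytes {
    my ($marker) = @_;
    my $bytes = pack 'C2', 0xFF, $marker;   # two unsigned octets
    # Croaks if the string somehow holds a character above 0xFF;
    # otherwise makes sure the internal UTF8 flag is off.
    utf8::downgrade($bytes);
    return $bytes;
}

my $soi = marker_bytes(0xD8);   # 0xFFD8 is the JPEG start-of-image marker
my $hex = join ' ', map { sprintf '%02X', ord } split //, $soi;
printf "%d bytes: %s\n", length($soi), $hex;
```

Run on its own, that prints "2 bytes: FF D8".  Repeating this at every
line encoding::warnings flags is far more work than one "use bytes".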

Now, it's not *always* enough to just place a use bytes in your own
module code.  For example:

    # Module BadEnc.pm
    use bytes;
    my $i = 0;
    sub main::func { return "\xff" . chr($i) }
    1;

BTW, you get a different answer writing 

    sub main::func { return "\xff" . chr(0) }

than the chr($i), because it gets optimized into what is effectively

    sub main::func { return "\xff\x00" }

and so gets encoded up differently in the implicit conversion.
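
If you want to see the folding for yourself, B::Deparse will show it.  This
is just a sketch.  The exact text Deparse prints varies between Perl
versions, so only look at whether a chr call survives in each sub.

```perl
use strict;
use warnings;
use B::Deparse;

my $i = 0;
my $folded   = sub { return "\xff" . chr(0) };    # both operands constant
my $unfolded = sub { return "\xff" . chr($i) };   # chr waits until run time

# Turn each compiled sub back into source text.
my $deparse      = B::Deparse->new;
my $folded_src   = $deparse->coderef2text($folded);
my $unfolded_src = $deparse->coderef2text($unfolded);

print "constant version:\n$folded_src\n";
print "variable version:\n$unfolded_src\n";
```

The constant version should come back as a single string literal with no
chr in it, while the variable version still concatenates and calls chr($i).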

You can then run this:

    use BadEnc;
    use encoding "utf8";
    $word = func();
    for $i ( 0 .. (bytes::length($word)-1) ) {
	printf "char #%d has code point %d\n", $i, bytes::ord (bytes::substr($word, $i, 1))
    }
    for $i ( 0 .. (length($word)-1) ) {
	printf "char #%d has code point %d\n", $i, ord (substr($word, $i, 1))
    }
    print $word;

And you'll see that you still have a problem:

    char #0 has code point 255
    char #1 has code point 0
    char #0 has code point 65533
    char #1 has code point 0

I've omitted the last line of output, but it is what gave me the
string that I ran through "od -c" to find that it's "\357\277\275\0".

BTW, placing use encoding::warnings into BadEnc.pm gives this now as output:

    Bytes implicitly upgraded into wide characters as iso-8859-1 at BadEnc.pm line 4
    char #0 has code point 195
    char #1 has code point 191
    char #2 has code point 0
    char #0 has code point 255
    char #1 has code point 0

Which shows you would still have a problem.  A different problem, but still
a problem. 
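
The 195 and 191 in that output are what you get when the byte 0xFF is
read as the ISO-8859-1 character U+00FF and then stored as UTF-8.  A small
sketch that does the same steps explicitly with Encode, again without the
encoding pragma loaded:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw = "\xff" . chr(0);

# Treat each byte as the ISO-8859-1 character with the same number...
my $chars = decode('ISO-8859-1', $raw);
# ...then look at the bytes of its UTF-8 encoding.
my $utf8 = encode('UTF-8', $chars);

my $code_points = join ' ', map { ord } split //, $chars;
my $utf8_bytes  = join ' ', map { ord } split //, $utf8;
print "code points: $code_points\n";
print "UTF-8 bytes: $utf8_bytes\n";
```

That prints code points 255 and 0, and UTF-8 bytes 195, 191 and 0: two
characters, three bytes.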

I did test the "use bytes" in Image/ExifTool.pm, and it makes you suddenly
able to read JPEGs correctly again.  So the module author has to do that.
And so does anyone, I guess.  That sucks.  A program like this needs to be
sure that when it uses a string like "\xFF\xFD" and jumps around a binary
file, its data won't be mutilated on it.  That's a disaster.

Supposedly there are plans to make "use encoding" a properly lexically
scoped pragma in 5.9, but I don't know how far along that is.

I *did* leave clues: the strange record separator, the strange way I
constructed it, my reference to Audrey, and speaking pragmatically.  It
shows that not even perl5-porters are sensitized to this issue.  Since you
are not, it's not all that reasonable to expect all module writers to be
sensitive to it.

--tom


