develooper Front page | perl.perl5.porters | Postings from September 2009

Re: [perl #69414] Case-insensitive utf8 matching problem

Thread Previous | Thread Next
From:
Tom Christiansen
Date:
September 29, 2009 17:55
Subject:
Re: [perl #69414] Case-insensitive utf8 matching problem
Message ID:
16934.1254272088@chthon
>Tom Christiansen wrote:
> > [snip]
> >
> > That's enough.  I won't ask anyone to guess how to *reliably* write
> >
> >     if ($data =~ s/^$BOM//)  { $byte_order = XXX; }
> >
> > where BOM is the two-byte sequence FF FE or FE FF, depending.  It's
> > probably not what you may think it is :(, since C<use encoding "utf8">
> > renders that otherwise straightforward problem pathetically tortuous.
> >

>I would hope that
>       use charnames 'short';
>       if ($data =~ s/^\N{BOM}//) { $byte_order = XXX; }
>would work.

Karl, I do like your use of \N{BOM}.  That's much better than
a hard-coded 0xFF and 0xFE (or vice versa) since it's symbolic.
You'll see I use it in the attached program, once it gets
sent through a couple of encode/decode passes (really).

Now there are several reasons why you can't go at the problem I have
in mind with what you've written.  One is that we've no mnemonic for
a flipped-endian BOM--which you're not even allowed to THINK about.

    $ perl -wle 'my $x = chr(0xFFFE)'
    Unicode character 0xfffe is illegal at -e line 1.

Strangely enough, that's a run-time warning, not a compile-time one:

    $ perl -cwle 'my $x = chr(0xFFFE)'
    -e syntax OK

But another is how you'd write the data:

    $ perl -Mcharnames=:short -wle 'print "\xFE\xFF\x00A" =~ /^N{BOM}/ || "big lose"'
    big lose

    $ perl -Mcharnames=:short -wle 'print "\xFF\xFEA\x00" =~ /^N{BOM}/ || "little lose"'
    little lose

Perl won't let you specify literal octets this way; it's pretty
exasperating, actually.  Those two high-bit octets get implicity
upgraded into something you don't mean, so now you have the wrong
characters there. Trying the obvious thing is even worse.

    $ perl -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^N{BOM}/ || "big lose"'
    big lose

    $ perl -Mcharnames=:short -wle 'print "\x{FFFE}A\x00" =~ /^N{BOM}/ || "little lose"'
    Unicode character 0xfffe is illegal at -e line 1.
    little lose

    $ perl -M-encoding -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
    1

Oddly, that last one finally let me do it--and I'm sorry, but
it's hardly intuitive to my thick skull.  

Oh, and don't you go trying it the other way, either.

    $ perl -M-encoding -Mcharnames=:short -wle 'print "\x{FFFE}A\x00" =~ /^N{BOM}/ || "little lose"'
    Unicode character 0xfffe is illegal at -e line 1.
    little lose

Other things are worse. Don't even dream of trying with C<use
encoding ...>, as that road leads to more misery than you'd
believe.  It's either terribly broken, or else it's terribly
wrong; mostly both.

    $ perl -Mencoding -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
    Use of uninitialized value $name in string eq at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/encoding.pm line 107.
    Use of uninitialized value $name in string eq at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/encoding.pm line 115.
    Use of uninitialized value $name in exists at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode.pm line 105.
    Use of uninitialized value $find in exists at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode/Alias.pm line 25.
    Use of uninitialized value $find in hash element at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode/Alias.pm line 26.

[ 45 lines deleted ]

    Use of uninitialized value $find in pattern match (m//) at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode/Alias.pm line 31.
    Use of uninitialized value $find in string eq at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode/Alias.pm line 44.
    Use of uninitialized value $find in hash element at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode/Alias.pm line 57.
    Use of uninitialized value $find in hash element at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode/Alias.pm line 77.
    Use of uninitialized value $name in string ne at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode.pm line 111.
    Use of uninitialized value $name in hash element at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode.pm line 115.
    Use of uninitialized value $name in concatenation (.) or string at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/encoding.pm line 121.
    encoding: Unknown encoding '' at -e line 0
    BEGIN failed--compilation aborted.

Which, while not a record in mismanaged error recovery, is neither
a lovely tribute to the same. Adding an encoding brings no joy,
either--nor reasonable diagnostics.

    $ perl -Mencoding=UTF-16 -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
    Can't locate object method "cat_decode" via package "Encode::Unicode".

    $ perl -Mencoding=UTF-16BE -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
    Can't locate object method "cat_decode" via package "Encode::Unicode".

    $ perl -Mencoding=UTF-16BE,STDOUT,latin1 -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
    encoding: Unknown encoding for STDOUT, 'latin1' at -e line 0
    BEGIN failed--compilation aborted.

    $ perl -Mencoding=UTF-16BE,STDOUT,ISO8859-1 -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
    encoding: Unknown encoding for STDOUT, 'ISO8859-1' at -e line 0
    BEGIN failed--compilation aborted.

    $ perl -Mencoding=UTF-16BE,STDOUT,iso-8859-1 -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
    Can't locate object method "cat_decode" via package "Encode::Unicode".

And no, Yves, there's no bug report on this.  When *everything* 
that I try in this area seems to lead to Perl projectile-hurling 
at me with and and/or ALL of these, depending...

	**** core dumps
	 *** panics
	  ** reams and reams of cascading errors
	  ** internal complaints in core modules
		* mysterious warnings
		* mysterious errors
		* mysterious failures

Well, sheesh!  I'm sorry, but with all that popping out at me,
I just don't stop and bug-report each one of them; that takes even
more analysis with the entire system nastily unstable beyond my
patience.  Why just this mail alone probably deserves to generate
several real bug reports, but I'm not even sure HOW many it merits!
I realize a lot of this could be my brain damage--but not all of it.

So why should one care about matching BOMs, you might ask.

Suppose you're processing a binary data record created on some other
system than one doing the processing.  You know that the character
data in that record are in some 16-bit encoding, but you don't know
what the endianness is, and annoyingly enough there may not even be
a BOM there at all.

And yes, this *is* a real-world problem.  An EXIF record's commment
field is either ASCII or it's--well, something like UCS-2, but this
is so poorly defined that in practice you have to try several ways
because all occur in the wild.

I *have* come up with something ungainly that works.

First, create the data files:

    #!/bin/sh -v
    perl -Mcharnames=Latin        -le 'no encoding;
           print        "tsch\N{u with diaeresis}\N{sharp s}"' > tschuess.dat.0
    perl -Mcharnames=Latin        -le 'use encoding "Latin1";
           print        "tsch\N{u with diaeresis}\N{sharp s}"' > tschuess.dat.1
    perl -Mcharnames=Latin        -le 'use utf8;
           print        "tsch\N{u with diaeresis}\N{sharp s}"' > tschuess.dat.2
    perl -Mcharnames=Latin        -le 'use encoding "utf8";
           print        "tsch\N{u with diaeresis}\N{sharp s}"' > tschuess.dat.3
    perl -Mcharnames=Latin        -le 'use encoding "Latin1", STDOUT => "UTF-8";
           print        "tsch\N{u with diaeresis}\N{sharp s}"' > tschuess.dat.4
    perl -Mcharnames=Latin,:short -le 'use encoding "Latin1", STDOUT => "UTF16-BE";
           print "\N{BOM}tsch\N{u with diaeresis}\N{sharp s}"' > tschuess.dat.5
    perl -Mcharnames=Latin,:short -le 'use encoding "Latin1", STDOUT => "UTF16-LE";
           print "\N{BOM}tsch\N{u with diaeresis}\N{sharp s}"' > tschuess.dat.6
    perl -Mcharnames=Latin        -le 'use encoding "Latin1", STDOUT => "UTF16";
           print        "tsch\N{u with diaeresis}\N{sharp s}"' > tschuess.dat.7
    perl -Mcharnames=Latin        -le 'use encoding "Latin1", STDOUT => "UTF16-BE";
           print        "tsch\N{u with diaeresis}\N{sharp s}"' > tschuess.dat.8

The test program, rddat, follows--and remarkably, it *does* work
correctly.  I decided not to use pack with n/v formats and instead
fight my way through the encode/decode bits.  I'm doing a lot of
things oddly below partly because you really do have to bend over
backwards for this, and partly to try to isolate just what's
causing problems and when they're really happening.

You MUSTN'T say use encoding no matter the flavor, even ASCII or
Latin1 in this program (ahnd, I suspect, nearly any program).  You can
see what happens when you try this way.  It actually causes Perl to
take a panic, although which one you trigger is version-dependent.
This too I believe to be an ipso facto bug in Perl itself, just like
the coredump you get from mistakenly saying C<use encoding Unicode>.

Watch:

    $ env PERL_encoding_test=latin1 perl5.10.0 rddat *[5-8]
    ... NOW RUNNING PROGRAM ...
    found BOM of flavor #2 => 0xfe.0xff at rddat line 80.
    tschuess.dat-5's big endian at rddat line 95.
    tschuess.dat-5 contains => "\x{1f2b3ba4}" does not map to iso-8859-1 at rddat line 104, <$imp> line 1.
    panic: sv_setpvn called with negative strlen at rddat line 104, <$imp> line 1.

And

    $ env PERL_encoding_test=latin1 perl5.10.1 rddat *[5-8]
    ... NOW RUNNING PROGRAM ...
    found BOM of flavor #2 => 0xfe.0xff at rddat line 80.
    tschuess.dat-5's big endian at rddat line 95.
    tschuess.dat-5 contains => "\x{1f2b3ba4}" does not map to iso-8859-1 at rddat line 104, <$imp> line 1.
    panic: sv_setpvn called with negative strlen at rddat line 104, <$imp> line 1.

But without that env setting which you'll see triggers a C<use encoding
...>, the program seems to run ok, both in ISO-8859-1 terminals and also
in ISO-10646-1 terminals where I have PERL_UNICODE set to S.

Since we only care about the 16-bit encodings, run it as 
follows to generate the cited output:

    $ perl rddat tschuess.dat.[5-8]
    ... NOW RUNNING PROGRAM ...
    found BOM of flavor #2 => 0xfe.0xff at rddat line 80.
    tschuess.dat-5's big endian at rddat line 95.
    tschuess.dat-5 contains => tschüß
    found BOM of flavor #1 => 0xff.0xfe at rddat line 80.
    tschuess.dat-6's little endian at rddat line 88.
    tschuess.dat-6 contains => tschüß
    found BOM of flavor #2 => 0xfe.0xff at rddat line 80.
    tschuess.dat-7's big endian at rddat line 95.
    tschuess.dat-7 contains => tschüß
    tschuess.dat-8's got no BOM; rewinding to big endian at rddat line 97.
    tschuess.dat-8 contains => tschüß

You may be surprised by just which BOM flavors the 1 and 2
it matched are.  Those are split-byte versions, not \N{BOM} 
versions.

I do sometimes C<use encoding> on one-liners to generate test data.
It's still remarkably pesky.  Notice the strange output of run #3:

    1$ perl -le 'use encoding ASCII, STDOUT => "UTF-7"; print "tsch\xFC\xDF"' | perl -lnE 'printf "line #$.: %vx\n", $_'
    line #1: 74.73.63.68.2b.2f.2f.33.2f.2f.51.2d

    2$ perl -le 'use encoding ASCII, STDOUT => "UTF-8"; print "tsch\xFC\xDF"' | perl -lnE 'printf "line #$.: %vx\n", $_'
    line #1: 74.73.63.68.ef.bf.bd.ef.bf.bd

    3$ perl -le 'use encoding ASCII, STDOUT => "Latin1"; print "tsch\xFC\xDF"' | perl -lnE 'printf "line #$.: %vx\n", $_'
    "\x{fffd}" does not map to iso-8859-1.
    "\x{fffd}" does not map to iso-8859-1.
    line #1: 74.73.63.68.5c.78.7b.66.66.66.64.7d.5c.78.7b.66.66.66.64.7d

    4$ perl -le 'use encoding Latin1, STDOUT => "Latin1"; print "tsch\xFC\xDF"' | perl -lnE 'printf "line #$.: %vx\n", $_'
    line #1: 74.73.63.68.fc.df

    5$ perl -le 'use encoding ASCII, STDOUT => "UCS-2"; print "tsch\xFC\xDF"' | perl -lnE 'printf "line #$.: %vx\n", $_'
    line #1: 0.74.0.73.0.63.0.68.ff.fd.ff.fd.0

    6$ perl -le 'use encoding ASCII, STDOUT => "UTF-16"; print "tsch\xFC\xDF"' | perl -lnE 'printf "line #$.: %vx\n", $_'
    line #1: fe.ff.0.74.0.73.0.63.0.68.ff.fd.ff.fd.0

    7$ perl -le 'use encoding ASCII, STDOUT => "UTF-16LE"; print "tsch\xFC\xDF"' | perl -lnE 'printf "line #$.: %vx\n", $_'
    line #1: 74.0.73.0.63.0.68.0.fd.ff.fd.ff
    line #2: 0

    8$ perl -le 'use encoding ASCII, STDOUT => "UTF-16BE"; print "tsch\xFC\xDF"' | perl -lnE 'printf "line #$.: %vx\n", $_'
    line #1: 0.74.0.73.0.63.0.68.ff.fd.ff.fd.0

    9$ perl -le 'use encoding ASCII, STDOUT => "UTF-32"; print "tsch\xFC\xDF"' | perl -lnE 'printf "line #$.: %vx\n", $_'
    line #1: 0.0.fe.ff.0.0.0.74.0.0.0.73.0.0.0.63.0.0.0.68.0.0.ff.fd.0.0.ff.fd.0.0.0

   10$ perl -le 'use encoding ASCII, STDOUT => "UTF-32LE"; print "tsch\xFC\xDF"' | perl -lnE 'printf "line #$.: %vx\n", $_'
    line #1: 74.0.0.0.73.0.0.0.63.0.0.0.68.0.0.0.fd.ff.0.0.fd.ff.0.0
    line #2: 0.0.0


--tom

##################################################################
#!/usr/bin/perl

BEGIN { die "needs file args" unless @ARGV }

use strict;

use charnames ":short";

use Encode qw< encode decode >;
use encoding::warnings; # "FATAL";
use           warnings; #  FATAL => "all";


use if $ENV{PERL_encoding_test} => encoding => $ENV{PERL_encoding_test};

our $ENC_BYTES;
BEGIN  {
    $| = 1;
    $ENC_BYTES = "ISO-8859-1";

    binmode(STDOUT, ":encoding($ENC_BYTES)")
                                                  || die "can't fix stdout"
        unless  $ENV{       PERL_UNICODE      }
            || grep { m< UTF | UCS | 10646 >i }
               grep {         defined         }
                @ENV{ qw[LC_ALL LC_TYPE LANG] };

    open(STDERR, $_=">&STDOUT") || do { my $__LINE__ = __LINE__;
        my %seen = ();
        for my $fh (*STDERR, *STDOUT) {
            next unless    my($dev,$ino) = stat($fh);
            next unless $seen{$dev,$ino}++;
            printf $fh "%s: couldn't dup STDERR$_: "
                     . "%s [errno=%d] at line %d of file %s\n",
                        $0, ($!) x 2,
                        $__LINE__, __FILE__
        }
        die;
    };

}

# easiest way to tell run-time warnings from compile-time ones:
INIT { warn  "... NOW RUNNING PROGRAM ...\n" }

our $BOM_BE = decode($ENC_BYTES, encode("UTF16-BE", "\N{BOM}"));
our $BOM_LE = decode($ENC_BYTES, encode("UTF16-LE", "\N{BOM}"));

process($_) for @ARGV;
exit();

sub process {

    my $infile = shift;
    open(my $imp, "< :raw", $infile) || die "can't open $infile: $!";

    die "read $infile: $!" unless 2 == read($imp, my $magicno, 2);

    # hm... this decoding doesn't seem to help anything at all
    ### $magicno = decode($ENC_BYTES, $magicno);

    SCOPE: {
        my $bom;
        # XXX: terrible, *terrible* things happen if $bom_flavor\'s my()ne
        our $bom_flavor = 0;

        my $suppress_warnings_only_during_compilation_not_execution = sub {
            no encoding::warnings;
            no warnings "utf8";  # oh, puh-LEASE
            ($bom) = $magicno =~ m{
                ^ (   \xFF \xFE (?{ $bom_flavor = 1 })
                    | \xFE \xFF (?{ $bom_flavor = 2 })
                    | \x{FFFE}  (?{ $bom_flavor = 3 })
                    | \x{FEFF}  (?{ $bom_flavor = 4 })
						  | \N{BOM}   (?{ $bom_flavor = 5 })
                  )
            }x;
        };
        &$suppress_warnings_only_during_compilation_not_execution;
        if ($bom) {
            warn sprintf "found BOM of flavor #%d => %#vx",
                            $bom_flavor, $bom;
        }
    }

    ###if ($magicno eq $BOM_LE) {
    ###if ($magicno =~ m{^$BOM_LE} ) {
    if ($magicno =~ m{ \A \Q$BOM_LE\E \z }x ) {
        warn "$infile\'s little endian";
        binmode($imp, ":encoding(utf16-le)") || die $!;
    } else {

        ###if ($magicno eq $BOM_BE) {
        ###if ($magicno =~ m{^$BOM_BE} ) {
        if ($magicno =~ m{ \A \Q$BOM_BE\E \z }x ) {
            warn "$infile\'s big endian";
        } else {
            warn "$infile\'s got no BOM; rewinding to big endian";
            seek($imp, 0, 0) || die "seek: $!";
        }
        binmode($imp, ":encoding(utf16-be)") || die $!;
    }

    print "$infile contains => ";
    print scalar <$imp>;
}

__END__

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About