Re: [perl #69414] Case-insensitive utf8 matching problem
From: demerphq
Date: October 1, 2009 14:18
Subject: Re: [perl #69414] Case-insensitive utf8 matching problem
Message ID: 9b18b3110910011418x385a6671sb9e26bd43b8ed5@mail.gmail.com
2009/9/30 Tom Christiansen <tchrist@perl.com>:
>>Tom Christiansen wrote:
>> > [snip]
>> >
>> > That's enough. I won't ask anyone to guess how to *reliably* write
>> >
>> > if ($data =~ s/^$BOM//) { $byte_order = XXX; }
>> >
>> > where BOM is the two-byte sequence FF FE or FE FF, depending. It's
>> > probably not what you may think it is :(, since C<use encoding "utf8">
>> > renders that otherwise straightforward problem pathetically tortuous.
I think you have managed to get yourself quite muddled.
A BOM is a codepoint which has a (relatively) unambiguous byte
sequence when encoded using any of the standard Unicode formats, and
which also has the defined rendering property of being "invisible". So
in order to know if you have a BOM /at all/ you have to inspect at the
/byte level/. Thinking of it as a "character" before you actually KNOW
it is a BOM is pretty silly: you don't even know it is a BOM until
you have inspected its bytes and determined what encoding the message
is in. For all you know it might not be Unicode at all.
>> >
>
>>I would hope that
>> use charnames 'short';
>> if ($data =~ s/^\N{BOM}//) { $byte_order = XXX; }
>>would work.
>
> Karl, I do like your use of \N{BOM}. That's much better than
> a hard-coded 0xFF and 0xFE (or vice versa) since it's symbolic.
> You'll see I use it in the attached program, once it gets
> sent through a couple of encode/decode passes (really).
>
> Now there are several reasons why you can't go at the problem I have
> in mind with what you've written. One is that we've no mnemonic for
> a flipped-endian BOM--which you're not even allowed to THINK about.
There is a reason. What you want makes no sense.
A BOM is a code-point in Unicode. Endianness is a property of /bytes/
in word based encodings. Bytes and codepoints are *different*.
The /bytes/ of a BOM are a property of its encoding, not its
codepoint. The two only coincide, incidentally, in UTF-16BE (and in
UTF-16 when it is written big-endian).
So for instance:
$ perl -MEncode -le'printf "%-10s: %s\n",$_, join " ",unpack "(H2)*",
Encode::encode($_, "\x{FEFF}") for qw(UTF-8 UTF-16LE UTF-16BE UTF-32BE
UTF-32LE UTF-32 UTF-16)'
UTF-8     : ef bb bf
UTF-16LE  : ff fe
UTF-16BE  : fe ff
UTF-32BE  : 00 00 fe ff
UTF-32LE  : ff fe 00 00
UTF-32    : 00 00 fe ff 00 00 fe ff
UTF-16    : fe ff fe ff
The last two are worth noting. There is some subtlety here: BOMs are
strictly speaking /required/ at the beginning of UTF-16 and UTF-32,
and in the rest of the encodings they are strictly speaking not BOMs
at all, but ZWNBSPs instead.
So when we ask encode() to encode as UTF-16 it does the "right thing"
and adds the BOM, even though in this case it's probably the wrong
thing, as we are already emitting a BOM ourselves. But it's easy to
see why Dan did it like that.
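
A minimal sketch of that asymmetry, assuming a stock Encode. The helper
name hex_octets is made up for illustration and is not part of Encode:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# hex_octets() is a hypothetical helper: it renders a byte string as
# space-separated hex pairs, high nybble first.
sub hex_octets { join " ", unpack "(H2)*", shift }

# "UTF-16" (no explicit endianness) prepends a BOM on encode ...
print hex_octets(encode("UTF-16",   "A")), "\n";   # fe ff 00 41
# ... while the explicit-endian encodings emit only what they are given.
print hex_octets(encode("UTF-16BE", "A")), "\n";   # 00 41
print hex_octets(encode("UTF-16LE", "A")), "\n";   # 41 00

# On decode the same asymmetry holds: "UTF-16" consumes the BOM,
# "UTF-16BE" hands it back as the character U+FEFF.
my $octets = "\xFE\xFF\x00\x41";
printf "%vX\n", decode("UTF-16",   $octets);       # 41
printf "%vX\n", decode("UTF-16BE", $octets);       # FEFF.41
```

So if you pick the endianness yourself, use the explicit names and
handle the BOM yourself; if you let "UTF-16" pick, leave the BOM to it.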
>
> $ perl -wle 'my $x = chr(0xFFFE)'
> Unicode character 0xfffe is illegal at -e line 1.
>
> Strangely enough, that's a run-time warning, not a compile-time one:
>
> $ perl -cwle 'my $x = chr(0xFFFE)'
> -e syntax OK
Hardly strange. The string containing the illegal character isn't
produced until the chr() function executes and converts the number
0xFFFE into the codepoint U+FFFE. That codepoint is illegal because
FFFE could be confused with a byte-swapped FEFF, and that would
defeat the entire purpose of using FEFF as a BOM at all.
> But another is how you'd write the data:
>
> $ perl -Mcharnames=:short -wle 'print "\xFE\xFF\x00A" =~ /^N{BOM}/ || "big lose"'
> big lose
>
> $ perl -Mcharnames=:short -wle 'print "\xFF\xFEA\x00" =~ /^N{BOM}/ || "little lose"'
> little lose
First, I guess you meant \N{BOM} there, right?
And second, you need to go back and rethink things here. Why the heck should:
"\x{FE}\x{FF}\x{00}\x{41}"
"\x{FF}\x{FE}\x{41}\x{00}"
match
"\x{FEFF}"
Do you see an "\x{FEFF}" in either of those strings? Do you expect
"\x{EF}\x{BB}\x{BF}" to match "\x{FEFF}" as well? Can you not see why
that makes no sense?
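
To spell it out, here is a small sketch (assuming only a stock Encode)
of the same two byte strings before and after decoding. It uses
\x{FEFF} rather than \N{BOM} so that it does not depend on charnames:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $be_octets = "\xFE\xFF\x00\x41";   # BOM + "A" in UTF-16BE
my $le_octets = "\xFF\xFE\x41\x00";   # BOM + "A" in UTF-16LE

# As raw octets, neither string contains the character U+FEFF.
print "raw BE matches\n" if $be_octets =~ /^\x{FEFF}/;   # prints nothing
print "raw LE matches\n" if $le_octets =~ /^\x{FEFF}/;   # prints nothing

# Once decoded with the right encoding, both start with U+FEFF.
my $be_chars = decode("UTF-16BE", $be_octets);
my $le_chars = decode("UTF-16LE", $le_octets);
print "decoded BE matches\n" if $be_chars =~ /^\x{FEFF}/;
print "decoded LE matches\n" if $le_chars =~ /^\x{FEFF}/;
```

The catch, of course, is that you have to know which encoding to decode
with, and that is exactly what inspecting the bytes tells you.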
> Perl won't let you specify literal octets this way; it's pretty
> exasperating, actually. Those two high-bit octets get implicity
> upgraded into something you don't mean, so now you have the wrong
> characters there. Trying the obvious thing is even worse.
>
> $ perl -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^N{BOM}/ || "big lose"'
> big lose
>
> $ perl -Mcharnames=:short -wle 'print "\x{FFFE}A\x00" =~ /^N{BOM}/ || "little lose"'
> Unicode character 0xfffe is illegal at -e line 1.
> little lose
>
> $ perl -M-encoding -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
> 1
I'm almost stunned that you think any of these are meaningful or obvious.
Go read the documentation for Encode. Read it carefully.
> Oddly, that last one finally let me do it--and I'm sorry, but
> it's hardly intuitive to my thick skull.
>
> Oh, and don't you go trying it the other way, either.
>
> $ perl -M-encoding -Mcharnames=:short -wle 'print "\x{FFFE}A\x00" =~ /^N{BOM}/ || "little lose"'
> Unicode character 0xfffe is illegal at -e line 1.
> little lose
>
> Other things are worse. Don't even dream of trying with C<use
> encoding ...>, as that road leads to more misery than you'd
> believe. It's either terribly broken, or else it's terribly
> wrong; mostly both.
>
> $ perl -Mencoding -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
> Use of uninitialized value $name in string eq at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/encoding.pm line 107.
> Use of uninitialized value $name in string eq at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/encoding.pm line 115.
> Use of uninitialized value $name in exists at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode.pm line 105.
> Use of uninitialized value $find in exists at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode/Alias.pm line 25.
> Use of uninitialized value $find in hash element at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode/Alias.pm line 26.
>
> [ 45 lines deleted ]
>
> Use of uninitialized value $find in pattern match (m//) at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode/Alias.pm line 31.
> Use of uninitialized value $find in string eq at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode/Alias.pm line 44.
> Use of uninitialized value $find in hash element at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode/Alias.pm line 57.
> Use of uninitialized value $find in hash element at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode/Alias.pm line 77.
> Use of uninitialized value $name in string ne at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode.pm line 111.
> Use of uninitialized value $name in hash element at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/Encode.pm line 115.
> Use of uninitialized value $name in concatenation (.) or string at /usr/local/lib/perl5/5.10.1/OpenBSD.i386-openbsd/encoding.pm line 121.
> encoding: Unknown encoding '' at -e line 0
> BEGIN failed--compilation aborted.
>
> Which, while not a record in mismanaged error recovery, is neither
> a lovely tribute to the same. Adding an encoding brings no joy,
> either--nor reasonable diagnostics.
>
> $ perl -Mencoding=UTF-16 -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
> Can't locate object method "cat_decode" via package "Encode::Unicode".
>
> $ perl -Mencoding=UTF-16BE -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
> Can't locate object method "cat_decode" via package "Encode::Unicode".
>
> $ perl -Mencoding=UTF-16BE,STDOUT,latin1 -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
> encoding: Unknown encoding for STDOUT, 'latin1' at -e line 0
> BEGIN failed--compilation aborted.
>
> $ perl -Mencoding=UTF-16BE,STDOUT,ISO8859-1 -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
> encoding: Unknown encoding for STDOUT, 'ISO8859-1' at -e line 0
> BEGIN failed--compilation aborted.
>
> $ perl -Mencoding=UTF-16BE,STDOUT,iso-8859-1 -Mcharnames=:short -wle 'print "\x{FEFF}\x00A" =~ /^\N{BOM}/ || "big lose"'
> Can't locate object method "cat_decode" via package "Encode::Unicode".
>
> And no, Yves, there's no bug report on this. When *everything*
> that I try in this area seems to lead to Perl projectile-hurling
> at me with and and/or ALL of these, depending...
Well, if Perl segfaults with a core module then it is worth a bug report.
If something distributed with the core spews out acres of warnings
about bizarre non-user-serviceable stuff when it should throw a clean
error, then file a bug report.
If something distributed with the core clearly does the opposite of
what it is documented to do, then file a bug report.
If it is just perl not doing what you expect when you randomly mix
pragmas together without reading the docs carefully, then you probably
should go ask on Perlmonks for a Unicode/Encoding tutorial. This is
not the right forum for that.
You have encountered several of the above; I'll leave it to you to decide
which is which. But I will say that many simple, clear bug reports will
get you further than long mails like this. Especially when you are
mixing apples and oranges all over the place and frankly making shit
up and expecting the computer to understand it.
>
> **** core dumps
> *** panics
> ** reams and reams of cascading errors
> ** internal complaints in core modules
> * mysterious warnings
> * mysterious errors
> * mysterious failures
>
> Well, sheesh! I'm sorry, but with all that popping out at me,
> I just don't stop and bug-report each one of them; that takes even
> more analysis with the entire system nastily unstable beyond my
> patience. Why just this mail alone probably deserves to generate
> several real bug reports, but I'm not even sure HOW many it merits!
> I realize a lot of this could be my brain damage--but not all of it.
>
> So why should one care about matching BOMs, you might ask.
No, personally I wouldn't, but thanks. I have worked enough with BOMs
to know why they are useful and why they are also often PITAs.
> Suppose you're processing a binary data record created on some other
> system than one doing the processing. You know that the character
> data in that record are in some 16-bit encoding, but you don't know
> what the endianness is, and annoyingly enough there may not even be
> a BOM there at all.
>
> And yes, this *is* a real-world problem. An EXIF record's commment
> field is either ASCII or it's--well, something like UCS-2, but this
> is so poorly defined that in practice you have to try several ways
> because all occur in the wild.
>
> # easiest way to tell run-time warnings from compile-time ones:
> INIT { warn "... NOW RUNNING PROGRAM ...\n" }
Unless your code is evaled at run time, in which case the INIT block
won't run at all and perl will warn that it is "Too late to run INIT
block".
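
A quick sketch of the string-eval case. The variable names are made up
for illustration, and the exact warning text may differ between perl
versions:

```perl
use strict;
use warnings;

our $ran = 0;
my @warnings;
{
    # Collect warnings instead of letting them go to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    # By the time this string is compiled the main program is already
    # running, so the INIT phase is long over.
    eval q{ INIT { $ran = 1 } 1 } or die $@;
}
print "INIT block ran: ", ($ran ? "yes" : "no"), "\n";
print "warning: $_" for @warnings;
```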
> our $BOM_BE = decode($ENC_BYTES, encode("UTF16-BE", "\N{BOM}"));
> our $BOM_LE = decode($ENC_BYTES, encode("UTF16-LE", "\N{BOM}"));
Ah so you do know Encode. Good.
Except, umm, why are you decoding it after you encode it? The output
of encode() is octets (aka "bytes"). Why are you decoding it into
ENC_BYTES?
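
Since encode() already hands back octets, they can be compared directly
against whatever was read from a :raw filehandle. A minimal sketch,
with hard-coded stand-ins for the two bytes read from the file:

```perl
use strict;
use warnings;
use Encode qw(encode);

# encode() returns octets, so no decode() step is needed before
# comparing against raw file bytes.
my $BOM_BE = encode("UTF-16BE", "\x{FEFF}");   # "\xFE\xFF"
my $BOM_LE = encode("UTF-16LE", "\x{FEFF}");   # "\xFF\xFE"

# $magicno stands in for the first two bytes read from a file.
for my $magicno ("\xFF\xFE", "\xFE\xFF", "A\x00") {
    my $verdict = $magicno eq $BOM_LE ? "little endian"
                : $magicno eq $BOM_BE ? "big endian"
                :                       "no BOM";
    printf "%s => %s\n", join(" ", unpack "(H2)*", $magicno), $verdict;
}
```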
> process($_) for @ARGV;
> exit();
>
> sub process {
>
> my $infile = shift;
> open(my $imp, "< :raw", $infile) || die "can't open $infile: $!";
>
> die "read $infile: $!" unless 2 == read($imp, my $magicno, 2);
>
> # hm... this decoding doesn't seem to help anything at all
> ### $magicno = decode($ENC_BYTES, $magicno);
>
> SCOPE: {
> my $bom;
> # XXX: terrible, *terrible* things happen if $bom_flavor\'s my()ne
> our $bom_flavor = 0;
>
> my $suppress_warnings_only_during_compilation_not_execution = sub {
> no encoding::warnings;
> no warnings "utf8"; # oh, puh-LEASE
> ($bom) = $magicno =~ m{
> ^ ( \xFF \xFE (?{ $bom_flavor = 1 })
> | \xFE \xFF (?{ $bom_flavor = 2 })
> | \x{FFFE} (?{ $bom_flavor = 3 })
> | \x{FEFF} (?{ $bom_flavor = 4 })
> | \N{BOM} (?{ $bom_flavor = 5 })
> )
> }x;
> };
> &$suppress_warnings_only_during_compilation_not_execution;
> if ($bom) {
> warn sprintf "found BOM of flavor #%d => %#vx",
> $bom_flavor, $bom;
> }
> }
>
> ###if ($magicno eq $BOM_LE) {
> ###if ($magicno =~ m{^$BOM_LE} ) {
> if ($magicno =~ m{ \A \Q$BOM_LE\E \z }x ) {
> warn "$infile\'s little endian";
> binmode($imp, ":encoding(utf16-le)") || die $!;
> } else {
>
> ###if ($magicno eq $BOM_BE) {
> ###if ($magicno =~ m{^$BOM_BE} ) {
> if ($magicno =~ m{ \A \Q$BOM_BE\E \z }x ) {
> warn "$infile\'s big endian";
> } else {
> warn "$infile\'s got no BOM; rewinding to big endian";
> seek($imp, 0, 0) || die "seek: $!";
> }
> binmode($imp, ":encoding(utf16-be)") || die $!;
> }
>
> print "$infile contains => ";
> print scalar <$imp>;
> }
>
> __END__
>
Assuming Encode doesn't already have support for this, I would do
something like the following:

# Guess the encoding from a leading BOM. Takes the first few raw bytes
# of the data and returns a PerlIO layer string, or "" if no BOM is found.
sub find_encoding {
    my $string= shift;
    # Pad so that there are always four bytes to look at.
    my @bytes= unpack "C*", substr( $string . ( " " x 4 ), 0, 4 );
    my $encoding= '';
    if ( $bytes[0] == 0xFF ) {
        if ( $bytes[1] == 0xFE ) {
            $encoding= 'UTF-16LE';
            # FF FE 00 00 is taken to be UTF-32LE rather than
            # UTF-16LE followed by a NUL.
            if ( $bytes[2] == 0x00 and $bytes[3] == 0x00 ) {
                $encoding= 'UTF-32LE';
            }
        }
    }
    elsif ( $bytes[0] == 0xFE ) {
        if ( $bytes[1] == 0xFF ) {
            $encoding= 'UTF-16BE';
        }
    }
    elsif ( $bytes[0] == 0xF7 ) {
        if ( $bytes[1] == 0x64 and $bytes[2] == 0x4C ) {
            $encoding= 'UTF-1';
        }
    }
    elsif ( $bytes[0] == 0xEF ) {
        if ( $bytes[1] == 0xBB and $bytes[2] == 0xBF ) {
            $encoding= 'UTF-8';
        }
    }
    elsif ( $bytes[0] == 0xDD ) {
        if ( $bytes[1] == 0x73 and $bytes[2] == 0x66 and $bytes[3] == 0x73 ) {
            $encoding= 'UTF-EBCDIC';
        }
    }
    elsif ( $bytes[0] == 0x2B ) {
        if (
            $bytes[1] == 0x2F
            and $bytes[2] == 0x76
            and (  $bytes[3] == 0x39
                or $bytes[3] == 0x38
                or $bytes[3] == 0x2F
                or $bytes[3] == 0x2B ) )
        {
            $encoding= 'UTF-7';
        }
    }
    elsif ( $bytes[0] == 0x00
        and $bytes[1] == 0x00
        and $bytes[2] == 0xFE
        and $bytes[3] == 0xFF )
    {
        $encoding= 'UTF-32BE';
    }
    return $encoding ? ":encoding($encoding)" : "";
}

# First pass: raw bytes only ($file is the path being inspected).
open my $raw_fh, "<:raw", $file or die "Failed to read '$file':$!";
my $maybe_bom= '';
sysread( $raw_fh, $maybe_bom, 4 ); #bytes
close $raw_fh;

# Now we know the encoding probably...
my $encoding= find_encoding($maybe_bom);
open my $fh, "<$encoding", $file
    or die "Failed to reopen '$file' with $encoding:$!";
# Skip the BOM itself, which decodes to a single character.
if ($encoding) {
    read( $fh, my $bom, 1 ); #chars
}
while (<$fh>) {
    ...
}
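
For what it's worth, the same two-pass idea can be sketched in a
table-driven form and tried end to end on a temporary file. This is
only an illustration, not something Encode provides: the name
sniff_bom, the @BOMS table and the test file are all made up here, and
only encodings that Encode ships with are listed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Encode qw(encode);

# Signatures are listed longest first so that UTF-32LE (FF FE 00 00)
# is tried before UTF-16LE (FF FE), which is a prefix of it.
my @BOMS = (
    [ "UTF-32BE" => "\x00\x00\xFE\xFF" ],
    [ "UTF-32LE" => "\xFF\xFE\x00\x00" ],
    [ "UTF-8"    => "\xEF\xBB\xBF"     ],
    [ "UTF-16BE" => "\xFE\xFF"         ],
    [ "UTF-16LE" => "\xFF\xFE"         ],
);

# sniff_bom() is a hypothetical name: it returns the encoding whose BOM
# starts the given byte string, or undef if none does.
sub sniff_bom {
    my ($head) = @_;
    for my $entry (@BOMS) {
        my ($name, $sig) = @$entry;
        return $name if index($head, $sig) == 0;
    }
    return undef;
}

# Write a small UTF-16LE file with a BOM to try it on.
my ($out, $file) = tempfile(UNLINK => 1);
binmode $out;
print {$out} encode("UTF-16LE", "\x{FEFF}hello\n");
close $out or die "close: $!";

# Pass 1: raw bytes only, just enough to cover the longest signature.
open my $raw, "<:raw", $file or die "open $file: $!";
read($raw, my $head, 4);
close $raw;

# Pass 2: reopen with the detected layer and drop the BOM character.
my $enc = sniff_bom($head) or die "no BOM found";
open my $fh, "<:encoding($enc)", $file or die "reopen $file: $!";
my $line = <$fh>;
$line =~ s/^\x{FEFF}//;
print "encoding=$enc line=$line";
```

The ordering of the table does the job of the nested UTF-32LE check
above: whichever signature is longest gets the first chance to match.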
--
perl -Mre=debug -e "/just|another|perl|hacker/"