Re: [perl #69414] Case-insensitive utf8 matching problem
From: demerphq
Date: September 27, 2009 10:27
Subject: Re: [perl #69414] Case-insensitive utf8 matching problem
Message ID: 9b18b3110909271027q3b195e19x3ffde41593b9b2ed@mail.gmail.com
2009/9/26 Christoph Bussenius <perlbug-followup@perl.org>:
> # New Ticket Created by Christoph Bussenius
> # Please include the string: [perl #69414]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=69414 >
>
>
> This is a bug report for perl from Christoph Bussenius <pepe@cpan.org>,
> generated with the help of perlbug 1.35 running under perl v5.8.8.
>
>
> -----------------------------------------------------------------
>
> If a regular expression is matched case-insensitively against an utf8-upgraded
> string, the case matching is usually done correctly with respect to Unicode
> case semantics, i.e.
>
> my $str = "hä"; utf8::upgrade($str); $str =~ /HÄ/i
>
> is true. However I found that
>
> my $str = "hä"; utf8::upgrade($str); $str =~ /Ä/i
>
> is false (only the regexes differ), which I believe to be a bug.
>
> As these tests require that the source-code be latin1-encoded, I made a more
> portable version that hex-encodes the literals. The second test fails
> due to the bug.
>
> This has been tested in 5.8.8, 5.10.0 and bleed
> 663bfafc78cf049036e7391ba11385234dcbe9ed.
>
>
> use strict;
> use warnings;
> use Devel::Peek;
> use Test::More tests => 3;
>
> my $lower = pack('H*', '68e4'); # hä
> my $upper = pack('H*', '48c4'); # HÄ
> my $Auml = pack('H*', 'c4'); # Ä
> my $Auml2 = pack('H*', 'c4'); # Ä
> utf8::upgrade($lower);
> utf8::upgrade($Auml2);
>
> warn "hä\n";
> Dump($lower);
> warn "HÄ\n";
> Dump($upper);
> warn "Ä\n";
> Dump($Auml);
> warn "Ä -- upgraded\n";
> Dump($Auml2);
>
> # We search for three regexes within the upgraded string "hä", ignoring
> # case.
>
> ok($lower =~ /$upper/i, 'the full string in upper-case is found');
> ok($lower =~ /$Auml/i, 'the single upper-case umlaut should be found'); # FAILS
> ok($lower =~ /$Auml2/i, 'it is found if it is utf8-encoded');
Can you please run the following perl script with no arguments? If it
doesn't output the expected result (also included here), please run it
again with an argument of 1, pipe the STDOUT/STDERR output to a
file, and then send us the file. Note that the script is coded awkwardly
for deliberate reasons. Please bear with me.
use strict;
use warnings;
use Devel::Peek;

# With no argument only the plain pass runs; an argument of 1 adds a
# second pass with "use re 'debug'" enabled.
my $re_debug = shift || 0;
my $a_umlaut = chr(0xe4);
my $A_umlaut = chr(0xc4);
my @yn = ("no", "yes");
select(STDERR);
for my $use_re_debug (0..$re_debug) {
    my $debug_code = "";
    $debug_code = "use re 'debug';"
        if $use_re_debug;
    # String eval, so the pragma is only compiled in on the debug pass.
    eval $debug_code . <<'EOFCODE' or die "Failed to eval!: $@";
    printf "%-10s %-10s %-10s %-10s\n", qw(Utf8-Pat Utf8-Str PfxLen Matches?);
    for my $utf8_pat (0..1) {
        my $pat = $A_umlaut;
        utf8::upgrade($pat) if $utf8_pat;
        for my $utf8_str (0..1) {
            my $base = $a_umlaut;
            utf8::upgrade($base) if $utf8_str;
            for my $prefix_len (0..2) {
                # Build from $base each time so the prefix is exactly
                # $prefix_len characters long.
                my $str = ("H" x $prefix_len) . $base;
                if ($use_re_debug) {
                    print "Str:\n";
                    Dump($str);
                    print "Pat: \n";
                    Dump($pat);
                }
                printf "%-10s %-10s %-10s %-10s\n",
                    $yn[$utf8_pat],
                    $yn[$utf8_str],
                    $prefix_len,
                    $yn[$str =~ /$pat/i];
            }
        }
    }
    1
EOFCODE
}
__END__
# You should see this:
Utf8-Pat   Utf8-Str   PfxLen     Matches?
no         no         0          no
no         no         1          no
no         no         2          no
no         yes        0          no
no         yes        1          no
no         yes        2          no
yes        no         0          yes
yes        no         1          yes
yes        no         2          yes
yes        yes        0          yes
yes        yes        1          yes
yes        yes        2          yes
Also, I'd like to point out that the way things are supposed to work is
that the utf8ness of the pattern decides the semantics. In this case
the pattern /isn't/ utf8 (presumably), and thus the A umlaut isn't an A
umlaut; it's a high-bit character with undefined semantics that Perl
will not case fold. When the pattern is utf8,
perl uses the utf8 case folding rules for the code point involved and
finds the match.
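
To check which of these cases a given pattern falls into, here is a
minimal sketch that inspects the internal utf8 flag before matching. The
storage() helper is hypothetical, introduced only for illustration, and
the match results it prints depend on the perl version, so no expected
output is given.

```perl
use strict;
use warnings;

# Hypothetical helper: report how a scalar is stored internally.
sub storage { utf8::is_utf8($_[0]) ? "utf8" : "bytes" }

my $pat    = chr(0xc4);   # A umlaut as a single byte, utf8 flag off
my $pat_up = $pat;
utf8::upgrade($pat_up);   # same character, utf8 flag on

my $str = chr(0xe4);      # a umlaut
utf8::upgrade($str);      # the upgraded target string from the report

# Same character in both patterns; only the internal storage differs.
printf "pattern %-5s vs string %-5s: %s\n",
    storage($_), storage($str), ($str =~ /$_/i ? "match" : "no match")
    for $pat, $pat_up;
```

The two patterns compare equal with eq, so any difference in the last
column comes from the storage of the pattern and not from its content.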
So your report is a little troubling on the face of it, and more info
is needed. One thing that would also be interesting is to see your
code run on your perl with the -Mre=debug output. However, that is
likely to be long and confusing, for the same reason I took the awkward
approach I did here: you really want to load the utf8 regex
metastructures BEFORE you load re=debug, otherwise you have to search
through whacks of unrelated patterns.
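
A stripped-down sketch of that ordering, assuming a perl where
"use re 'debug'" is lexically scoped: do one ordinary match first, so
that whatever utf8 support the regex engine needs is already loaded, and
only then repeat the match inside a block with debugging enabled. The
variable names are illustrative only, and how much noise the warm-up
actually saves depends on the perl version.

```perl
use strict;
use warnings;

my $str = chr(0xe4);    # a umlaut
my $pat = chr(0xc4);    # A umlaut
utf8::upgrade($_) for $str, $pat;

# Warm-up match without debugging.
my $warm = $str =~ /$pat/i ? 1 : 0;

my $matched;
{
    # Debug output (on STDERR) is limited to patterns compiled and run
    # inside this block.
    use re 'debug';
    $matched = $str =~ /$pat/i ? 1 : 0;
}

print STDERR "warm-up: $warm, debug pass: $matched\n";
```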
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"