In-Reply-To: Message from Eric Brine <ikegami@adaelis.com>
of "Sat, 26 Sep 2009 22:34:24 EDT."
<f86994700909261934n1558febcn22367f594f4f9527@mail.gmail.com>
>> This is a bug report for perl from Christoph Bussenius <pepe@cpan.org>,
>> generated with the help of perlbug 1.35 running under perl v5.8.8.
>> If a regular expression is matched case-insensitively against an utf8-
>> upgraded string, the case matching is usually done correctly with
>> respect to Unicode case semantics, i.e.
>> my $str = "hä"; utf8::upgrade($str); $str =~ /HÄ/i
>> is true. However I found that
>> my $str = "hä"; utf8::upgrade($str); $str =~ /Ä/i
>> is false (only the regexes differ), which I believe to be a bug.
> I can reproduce the bug with 5.8.0:
> perl -e"binmode STDOUT, ':encoding(iso-latin-1)'; print qq{my \$str = qq{h\x{E4}}; utf8::upgrade(\$str); print \$str =~ /H\x{C4}/i ?1:0}" | perl -l
> 0
> But not with 5.8.8, 5.10.0 and 5.10.1:
> perl -e"binmode STDOUT, ':encoding(iso-latin-1)'; print qq{my \$str = qq{h\x{E4}}; utf8::upgrade(\$str); print \$str =~ /H\x{C4}/i ?1:0}" | perl -l
> 1
> I suspect your source is encoded using UTF-8, but you told
> Perl it's encoded using iso-latin-1 (by not telling it
> otherwise). 5.8.0, 5.8.8, 5.10.0 and 5.10.1:
That's not quite right: "not telling it otherwise" is *not* equivalent
to having told it that your string data are actually character data in
your native character encoding--which defaults to (EBCDIC or) ISO8859-1 (=Latin1).
In particular, absent any encoding directive, Perl will (usually)
fail to apply correct and expected character semantics to high-bit
octets. It depends on whether the string got upgraded. So for reliability,
you really do have to tell it so, *even* if your source code
contains no high-bit literals.
And "use utf8" isn't even good enough, either.
(using simple examples to avoid quoting ick)
$ perl -le 'print "H\xC4" =~ /h\xE4/i || "loses"'
loses
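What separates that losing case from a winning one is only the string's
internal representation, not its characters. A minimal sketch, assuming a
perl with no encoding pragma and no Unicode-strings feature in effect (as in
the one-liner above); the helper name ci_match is hypothetical, made up for
illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: case-insensitive match of "h\xE4" against $s.
sub ci_match { my ($s) = @_; return $s =~ /h\xE4/i ? 1 : 0 }

my $bytes    = "H\xC4";      # native 8-bit string, not upgraded
my $upgraded = "H\xC4";
utf8::upgrade($upgraded);    # same characters, stored as UTF-8 internally

my $same   = $bytes eq $upgraded ? 1 : 0;   # the two still compare equal
my $before = ci_match($bytes);              # byte string: no Unicode folding
my $after  = ci_match($upgraded);           # upgraded string: Unicode folding

print "same=$same before=$before after=$after\n";
```

Under those assumptions the two strings compare equal, yet only the upgraded
one should match.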
$ perl -le 'print "\x{397}\xC4" =~ /\x{397}\xE4/i || "loses"'
1
That worked because the two ETA's force upgrades, whereas
those two diaeresied A's don't. Perl can sometimes carp
insightfully at you on such occasions:
$ perl -Mencoding::warnings -le 'print "\x{397}\xC4" =~ /\x{397}\xE4/i || "loses"'
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Carp.pm line 1
Bytes implicitly upgraded into wide characters as iso-8859-1 at -e line 1
But... in Carp?
$ perl -WMencoding::warnings -le 'print "\xC4" =~ /\xE4/i || "loses"'
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Carp.pm line 1
Bytes implicitly upgraded into wide characters as iso-8859-1 at -e line 1
Bytes implicitly upgraded into wide characters as iso-8859-1 at -e line 1
1
Hm. I nearly wonder whether that's correct in Carp; after all,
what's the user expected to do about the carping Carp code?
$ perl -WMstrict -Mencoding::warnings -wle \
'no encoding defined; print "\x{397}\xC4" =~ /\x{397}\xE4/i || "loses"'
1
Er, hm; perhaps. Or perhaps not?
Best be back to simpler bits:
$ perl -Mcharnames=Latin -le 'print "\N{H}\xC4" =~ /\N{h}\xE4/i || "loses"'
1
\N{iftily} promoting the kitten's kaboodle. This, too:
$ perl -Mencoding=Latin1 -le 'print "H\xC4" =~ /h\xE4/i || "loses"'
1
That works. But then, so does nearly anything, even this:
$ perl -Mencoding=ASCII -le 'print "H\xC4" =~ /h\xE4/i || "loses"'
1
But running rather contrary to expectation, *NOT* this:
$ perl -Mutf8 -le 'print "H\xC4" =~ /h\xE4/i || "loses"'
loses
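A likely reason: "use utf8" only declares that the *source text* is UTF-8,
and \xC4 written as an escape is plain ASCII in the source, so the literal
stays an un-upgraded byte string. A small sketch that checks the internal
flag with utf8::is_utf8, assuming nothing else in scope changes string
semantics (the variable names are illustrative only):

```perl
use strict;
use warnings;
use utf8;    # declares the source encoding; does not upgrade "\xC4"

my $narrow = "H\xC4";          # every code point fits in one byte
my $wide   = "\x{397}\xC4";    # ETA is above 0xFF, which forces an upgrade

# utf8::is_utf8 reports whether the string is stored as UTF-8 internally.
my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;

print "narrow=$narrow_flag wide=$wide_flag\n";
```

The narrow string should report 0 and the wide one 1, which is the same split
seen between the losing and the winning matches.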
So you have to write it this way:
$ perl -Mencoding=unicode -le 'print "H\xC4" =~ /h\xE4/i || "loses"'
Segmentation fault (core dumped)
Exit 139
WHOOPS!?
Program received signal SIGSEGV, Segmentation fault.
0x1c0979a2 in Perl_delete_eval_scope ()
(gdb) bt
#0 0x1c0979a2 in Perl_delete_eval_scope ()
#1 0x1c025d15 in Perl_call_sv ()
#2 0x1c028be8 in Perl_call_list ()
#3 0x1c01e69f in S_process_special_blocks ()
#4 0x1c01e05f in Perl_newATTRSUB ()
#5 0x1c01aff7 in Perl_utilize ()
#6 0x1c04683e in Perl_yyparse ()
#7 0x1c024c7d in S_parse_body ()
#8 0x1c0246d0 in perl_parse ()
#9 0x1c015e3d in main ()
Er, ok; so I *meant*:
$ perl -Mencoding=utf8 -le 'print "H\xC4" =~ /h\xE4/i || "loses"'
1
But GEE, it sure shouldn't drop a big ol' *core* *dump*!!
$ limit core 0
Much better. Not.
--tom
--
He sat and sang a melody
his errantry a-tarrying;
he begged a pretty butterfly
that fluttered by to marry him.
She laughed at him, deluded him,
eluded him unpitying;
so long he studied wizardry
and sigaldry and smithying.