Re: [perl #69414] Case-insensitive utf8 matching problem

From:
Tom Christiansen
Date:
September 26, 2009 23:47
Subject:
Re: [perl #69414] Case-insensitive utf8 matching problem
Message ID:
903.1254033985@chthon
In-Reply-To: Message from Eric Brine <ikegami@adaelis.com>
   of "Sat, 26 Sep 2009 22:34:24 EDT."
   <f86994700909261934n1558febcn22367f594f4f9527@mail.gmail.com>

>> This is a bug report for perl from Christoph Bussenius <pepe@cpan.org>,
>> generated with the help of perlbug 1.35 running under perl v5.8.8.

>> If a regular expression is matched case-insensitively against an utf8-
>> upgraded string, the case matching is usually done correctly with
>> respect to Unicode case semantics, i.e.

>>   my $str = "hä"; utf8::upgrade($str); $str =~ /HÄ/i

>> is true.  However I found that

>>   my $str = "hä"; utf8::upgrade($str); $str =~ /Ä/i

>> is false (only the regexes differ), which I believe to be a bug.

> I can reproduce the bug with 5.8.0:

>     perl -e"binmode STDOUT, ':encoding(iso-latin-1)'; print qq{my \$str = qq{h\x{E4}}; utf8::upgrade(\$str); print \$str =~ /H\x{C4}/i ?1:0}" | perl -l
>     0

> But not with 5.8.8, 5.10.0 and 5.10.1:

>     perl -e"binmode STDOUT, ':encoding(iso-latin-1)'; print qq{my \$str = qq{h\x{E4}}; utf8::upgrade(\$str); print \$str =~ /H\x{C4}/i ?1:0}" | perl -l
>     1

> I suspect your source is encoded using UTF-8, but you told
> Perl it's encoded using iso-latin-1 (by not telling it
> otherwise). 5.8.0, 5.8.8, 5.10.0 and 5.10.1:

That's not quite right: "not telling it otherwise" is *not* equivalent
to having told it that your string data are actually character data in
your native character encoding, which defaults to (EBCDIC or)
ISO 8859-1 (= Latin1).

In particular, absent any encoding directive, Perl will (usually)
fail to apply correct and expected character semantics to high-bit
octets.  It depends on whether the string got upgraded.  So for reliability,
you really do have to tell it so, *even* if your source code
contains no high-bit literals.

And "use utf8" isn't even good enough, either.

(Using simple examples to avoid quoting ick.)

  $ perl -le 'print "H\xC4" =~ /h\xE4/i || "loses"'
  loses

  $ perl -le 'print "\x{397}\xC4" =~ /\x{397}\xE4/i || "loses"'
  1

That worked because the two ETA's force upgrades, whereas 
those two diaeresied A's don't.  Perl can sometimes carp 
insightfully at you on such occasions:

  $ perl -Mencoding::warnings -le 'print "\x{397}\xC4" =~ /\x{397}\xE4/i || "loses"'
  Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Carp.pm line 1
  Bytes implicitly upgraded into wide characters as iso-8859-1 at -e line 1

But... in Carp?

  $ perl -WMencoding::warnings -le 'print "\xC4" =~ /\xE4/i || "loses"'
  Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Carp.pm line 1
  Bytes implicitly upgraded into wide characters as iso-8859-1 at -e line 1
  Bytes implicitly upgraded into wide characters as iso-8859-1 at -e line 1
  1

Hm.  I nearly wonder whether that's correct in Carp; after all, 
what's the user expected to do about the carping Carp code?

  $ perl -WMstrict -Mencoding::warnings -wle \
      'no encoding defined; print "\x{397}\xC4" =~ /\x{397}\xE4/i || "loses"'
  1

Er, hm; perhaps.  Or perhaps not?

Best be back to simpler bits:

  $ perl -Mcharnames=Latin -le 'print "\N{H}\xC4" =~ /\N{h}\xE4/i || "loses"'
  1

\N{iftily} promoting the kitten's kaboodle.  This, too:

  $ perl -Mencoding=Latin1 -le 'print "H\xC4" =~ /h\xE4/i || "loses"'
  1

That works.  But then, so does nearly anything, even this:

  $ perl -Mencoding=ASCII -le 'print "H\xC4" =~ /h\xE4/i || "loses"'
  1

But running rather contrary to expectation, *NOT* this:

  $ perl -Mutf8 -le 'print "H\xC4" =~ /h\xE4/i || "loses"'
  loses

So you have to write it this way:

  $ perl -Mencoding=unicode -le 'print "H\xC4" =~ /h\xE4/i || "loses"'
  Segmentation fault (core dumped)
  Exit 139

WHOOPS!?

  Program received signal SIGSEGV, Segmentation fault.
  0x1c0979a2 in Perl_delete_eval_scope ()
  (gdb) bt
  #0  0x1c0979a2 in Perl_delete_eval_scope ()
  #1  0x1c025d15 in Perl_call_sv ()
  #2  0x1c028be8 in Perl_call_list ()
  #3  0x1c01e69f in S_process_special_blocks ()
  #4  0x1c01e05f in Perl_newATTRSUB ()
  #5  0x1c01aff7 in Perl_utilize ()
  #6  0x1c04683e in Perl_yyparse ()
  #7  0x1c024c7d in S_parse_body ()
  #8  0x1c0246d0 in perl_parse ()
  #9  0x1c015e3d in main ()

Er, ok; so I *meant*:

  $ perl -Mencoding=utf8 -le 'print "H\xC4" =~ /h\xE4/i || "loses"'
  1

But GEE, sure shouldn't drop a big ol' *core* *dump* !!

  $ limit core 0

Much better.  Not.

--tom

-- 
  He sat and sang a melody
    his errantry a-tarrying;
  he begged a pretty butterfly
    that fluttered by to marry him.
  She laughed at him, deluded him,
    eluded him unpitying;
  so long he studied wizardry
    and sigaldry and smithying.
