Re: [perl #69414] Case-insensitive utf8 matching problem
From: demerphq
Date: September 27, 2009 15:01
Subject: Re: [perl #69414] Case-insensitive utf8 matching problem
Message ID: 9b18b3110909271501h5114c2dak1f6106e812c90e9f@mail.gmail.com
2009/9/27 Christoph Bussenius <pepe@cpan.org>:
> Hi,
>
> first I want to clarify two issues about my original message:
>
> On Sun, Sep 27, 2009 at 07:27:37PM +0200, demerphq wrote:
>> 2009/9/26 Christoph Bussenius <perlbug-followup@perl.org>:
>> > my $str = "hä; utf8::upgrade($str); $str =~ /HÄ/i
>> >
>> > is true. However I found that
>> >
>> > my $str = "hä; utf8::upgrade($str); $str =~ /Ä/i
>
> Of course there were quotes missing, it should have been "hä" (with quotes) in
> both cases.
>
>> > my $lower = pack('H*', '68e4'); # hä
>> > my $upper = pack('H*', '48c4'); # HÄ
>> > my $Auml = pack('H*', 'c4'); # Ä
>> > my $Auml2 = pack('H*', 'c4'); # Ä
>> > utf8::upgrade($lower);
>> > utf8::upgrade($Auml2);
>> >
>> > warn "hä\n";
>> > Dump($lower);
>> > warn "HÄ\n";
>> > Dump($upper);
>> > warn "Ä\n";
>> > Dump($Auml);
>> > warn "Ä -- upgraded\n";
>> > Dump($Auml2);
>
> I should have included the output of these debug lines. For brevity
> I'll only show the PV lines of the Devel::Peek output:
>
> hä
> PV = 0x9b4ccf0 "h\303\244"\0 [UTF8 "h\x{e4}"]
> HÄ
> PV = 0x9b50e50 "H\304"\0
> Ä
> PV = 0x9b509e8 "\304"\0
> Ä -- upgraded
> PV = 0x9a74eb8 "\303\204"\0 [UTF8 "\x{c4}"]
>
> (same on 5.8.8, 5.10.0 and bleed 326df896fec9493c512db76eb6738c3ce3ba9097).
>
> This should dispel any doubts that my source code was utf8-encoded. Only those
> strings are utf8 that were explicitly upgraded, and they have the UTF8 flag on.
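To make the point about the flag concrete, here is a minimal sketch (my own illustration, not part of the original report) showing that utf8::upgrade changes only the internal storage of a string, not its characters. The variable names are arbitrary, and bytes::length is used only to peek at the internal byte count.

```perl
use strict;
use warnings;
use bytes ();    # makes bytes::length available without enabling byte semantics

my $plain    = "h\xe4";    # hä, stored as two Latin-1 bytes
my $upgraded = "h\xe4";
utf8::upgrade($upgraded);  # same two characters, now stored as UTF-8

for my $pair ([plain => $plain], [upgraded => $upgraded]) {
    my ($name, $s) = @$pair;
    # Flag state, length in characters, and length of the internal buffer.
    printf "%-8s flag=%s chars=%d bytes=%d\n",
        $name,
        (utf8::is_utf8($s) ? 'on' : 'off'),
        length($s),
        bytes::length($s);
}

# The two strings still compare equal as character strings.
print "equal: ", ($plain eq $upgraded ? 'yes' : 'no'), "\n";
```

Both strings hold the same two characters and compare equal; only the flag and the internal byte count differ, which is what the PV lines above show.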
>
>
>> Can you please run the following perl script with no arguments. If it
>> doesn't output the expected thing (also included here) please run it
>> again with an argument of 1, and pipe the STDOUT/STDERR output to a
>> file and then send us the file. Note the script is coded awkwardly for
>> deliberate reasons. Please bear with me.
>>
>> use strict;
>> use warnings;
>> use Devel::Peek;
>>
>> my $re_debug= shift || 1;
>
> Hmm, I hope I do the right thing by changing the 1 to 0 because otherwise the
> output would be the same with no arguments and with "1" as argument.
Yes, you got it right. Sorry.
>
>
> Your script shows the expected output with 5.10.0 and bleed, however I get
> different output with 5.8.8. This is the diff between the expected output and
> 5.8.8 output:
>
> --- expected 2009-09-27 20:17:23.000000000 +0200
> +++ actual588 2009-09-27 21:01:33.000000000 +0200
> @@ -5,9 +5,9 @@
> no yes 0 no
> no yes 1 no
> no yes 2 no
> -yes no 0 yes
> -yes no 1 yes
> -yes no 2 yes
> +yes no 0 no
> +yes no 1 no
> +yes no 2 no
> yes yes 0 yes
> yes yes 1 yes
> yes yes 2 yes
>
> (See below for the full output with argument 1.)
>
> Unfortunately I don't have 5.8.9 around so I can't test the behaviour there.
> All my tests were on Gentoo Linux.
OK, this 5.8.8 output is definitely wrong.
>>
>> Also, I'd like to point out that the way things are supposed to work is
>> that the utf8ness of the pattern decides the semantics.
>
> However, if that is the case, would you regard it as a bug that all tested perl
> versions (5.8.8, 5.10.0, bleed) print "1" for:
>
> perl -lwe 'my $str = "h\xe4"; utf8::upgrade($str); print $str =~ /H\xc4/i'
>
> ?
No, I was wrong. I was not thinking clearly, and after checking the
facts you are right. It should be true if either the string or the
pattern is utf8, and in fact in 5.12 it probably should be true, period.
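A minimal sketch for checking this on a given perl, assuming nothing beyond core utf8::upgrade. It is not the script quoted in this thread, and matches_ci is a hypothetical helper name. It prints one row per combination of upgraded and non-upgraded pattern and string, so the rows where either side is utf8 can be compared against the rule above. I deliberately give no expected output, since the result depends on the perl version.

```perl
use strict;
use warnings;

# Hypothetical helper, not the script from this thread: match Ä against hä
# case-insensitively, optionally upgrading the pattern and/or the string.
sub matches_ci {
    my ($pat_utf8, $str_utf8) = @_;
    my $pat = "\xc4";     # Ä
    my $str = "h\xe4";    # hä
    utf8::upgrade($pat) if $pat_utf8;
    utf8::upgrade($str) if $str_utf8;
    return $str =~ /$pat/i ? 'yes' : 'no';
}

print "Utf8-Pat Utf8-Str Matches?\n";
for my $pat_utf8 (0, 1) {
    for my $str_utf8 (0, 1) {
        printf "%-8s %-8s %s\n",
            ($pat_utf8 ? 'yes' : 'no'),
            ($str_utf8 ? 'yes' : 'no'),
            matches_ci($pat_utf8, $str_utf8);
    }
}
```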
> Here, unicode semantics are activated by the string (not the pattern) having
> utf8ness.
>
> As the script from my original message showed, the utf8ness of the pattern only
> comes in when I change
>
> /H\xc4/i
> to
> /\xc4/i
>
> which is, at least, not very consistent.
No no no. My bad, you are right: this is a bug in the startclass
logic. If you modify the script to disable the startclass logic, this
pattern matches just fine. I don't have time to fix it now, though.
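Until that is fixed, one possible workaround (my assumption, based only on the "yes yes" rows in the output below, which match on every perl tested here) is to upgrade both the string and the pattern before matching. The helper name match_ci_upgraded is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical workaround helper: upgrade copies of both operands so that
# the pattern and the string are both utf8 when the match runs.
sub match_ci_upgraded {
    my ($str, $pat) = @_;    # copies; the caller's scalars are left untouched
    utf8::upgrade($str);
    utf8::upgrade($pat);
    return $str =~ /$pat/i ? 1 : 0;
}

print match_ci_upgraded("h\xe4", "\xc4") ? "match\n" : "no match\n";
```

This only sidesteps the inconsistent cases; it does nothing about the startclass bug itself.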
>> One thing that would also be interesting is to see your
>> code run on your perl with the -Mre=debug output.
>> However that is likely to be long and confusing
>
> In fact it is some 3000 lines long. Because other people have confirmed the
> above behaviour on bleedperl, I think/hope it can be reproduced anywhere, so I
> will omit the output for now.
Yeah.
>
> Please let me know if you need more information. I hope I didn't make this
> report more confusing than necessary :)
Nope, all good. This is a confirmed bug in pretty much every version
tested, and your 5.8.8 is even worse.
>
> The rest of this mail is the full output of your script with argument "1" on
> perl 5.8.8:
Right. The trick here is to do things twice. The first time we load all
the utf8 gunk, the second time we enable the debug engine.
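For anyone who wants to reproduce the two-pass idea without the full script, here is a rough sketch. It is an assumption about the general shape, not the actual script: run_pass and $code are hypothetical names, and the debug engine is switched on by prepending "use re 'debug';" to a string eval.

```perl
use strict;
use warnings;

# The code under test, kept as a string so it can be compiled twice.
my $code = <<'EOC';
my $str = "h\xe4";
my $pat = "\xc4";
utf8::upgrade($str);
$str =~ /$pat/i ? 'yes' : 'no';
EOC

# Hypothetical helper: evaluate the snippet, with regex debugging on or off.
sub run_pass {
    my ($debug) = @_;
    my $result = eval(($debug ? "use re 'debug';\n" : '') . $code);
    die $@ if $@;
    return $result;
}

# First pass loads whatever utf8 support the match needs, silently.
# Second pass repeats the same match with the debug engine writing to STDERR.
for my $debug (0, 1) {
    my $result = run_pass($debug);
    print "debug=$debug matched=$result\n";
}
```

Run it with STDERR redirected to a file to capture the trace from the second pass.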
> Utf8-Pat Utf8-Str PfxLen Matches?
> no no 0 no
> no no 1 no
> no no 2 no
> no yes 0 no
> no yes 1 no
> no yes 2 no
> yes no 0 no
> yes no 1 no
> yes no 2 no
> yes yes 0 yes
> yes yes 1 yes
> yes yes 2 yes
> Utf8-Pat Utf8-Str PfxLen Matches?
> Str:
> SV = PV(0x935ce18) at 0x9355b78
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> PV = 0x937ba20 "\344"\0
> CUR = 1
> LEN = 4
> Pat:
> SV = PV(0x92cf970) at 0x9318fc0
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> PV = 0x937b928 "\304"\0
> CUR = 1
> LEN = 4
> Compiling REx `Ä'
> size 3 Got 28 bytes for offset annotations.
> first at 1
> 1: EXACTF <Ä>(3)
> 3: END(0)
> stclass "EXACTF <Ä>" minlen 1
> Offsets: [3]
> 1[1] 0[0] 2[0]
> Matching REx "Ä" against "ä"
> Matching stclass "EXACTF <Ä>" against "ä"
> Contradicts stclass...
> Match failed
> no no 0 no
> Str:
> SV = PV(0x935ce18) at 0x9355b78
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> PV = 0x937ba20 "H\344"\0
> CUR = 2
> LEN = 4
> Pat:
> SV = PV(0x92cf970) at 0x9318fc0
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> PV = 0x937b928 "\304"\0
> CUR = 1
> LEN = 4
> Matching REx "Ä" against "Hä"
> Matching stclass "EXACTF <Ä>" against "Hä"
> Contradicts stclass...
> Match failed
> no no 1 no
> Str:
> SV = PV(0x935ce18) at 0x9355b78
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> PV = 0x937ba20 "HHH\344"\0
> CUR = 4
> LEN = 8
> Pat:
> SV = PV(0x92cf970) at 0x9318fc0
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> PV = 0x937b928 "\304"\0
> CUR = 1
> LEN = 4
> Matching REx "Ä" against "HHHä"
> Matching stclass "EXACTF <Ä>" against "HHHä"
> Contradicts stclass...
> Match failed
> no no 2 no
> Str:
> SV = PV(0x935ce18) at 0x9355b78
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
> PV = 0x935f0c0 "\303\244"\0 [UTF8 "\x{e4}"]
> CUR = 2
> LEN = 3
> Pat:
> SV = PV(0x92cf970) at 0x9318fc0
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> PV = 0x937b928 "\304"\0
> CUR = 1
> LEN = 4
> Matching REx "Ä" against "\x{e4}"
> Matching stclass "EXACTF <Ä>" against "ä"
> Contradicts stclass...
> Match failed
> no yes 0 no
> Str:
> SV = PVMG(0x92fb080) at 0x9355b78
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
> IV = 0
> NV = 0
> PV = 0x935f0c0 "H\303\244"\0 [UTF8 "H\x{e4}"]
> CUR = 3
> LEN = 4
> MAGIC = 0x93a2218
> MG_VIRTUAL = &PL_vtbl_utf8
> MG_TYPE = PERL_MAGIC_utf8(w)
> MG_LEN = 2
> Pat:
> SV = PV(0x92cf970) at 0x9318fc0
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> PV = 0x937b928 "\304"\0
> CUR = 1
> LEN = 4
> Matching REx "Ä" against "H\x{e4}"
> Matching stclass "EXACTF <Ä>" against "Hä"
> Contradicts stclass...
> Match failed
> no yes 1 no
> Str:
> SV = PVMG(0x92fb080) at 0x9355b78
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
> IV = 0
> NV = 0
> PV = 0x935f0c0 "HHH\303\244"\0 [UTF8 "HHH\x{e4}"]
> CUR = 5
> LEN = 8
> MAGIC = 0x93a2218
> MG_VIRTUAL = &PL_vtbl_utf8
> MG_TYPE = PERL_MAGIC_utf8(w)
> MG_LEN = 4
> Pat:
> SV = PV(0x92cf970) at 0x9318fc0
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> PV = 0x937b928 "\304"\0
> CUR = 1
> LEN = 4
> Matching REx "Ä" against "HHH\x{e4}"
> Matching stclass "EXACTF <Ä>" against "HHHä"
> Contradicts stclass...
> Match failed
> no yes 2 no
> Str:
> SV = PVMG(0x92fb080) at 0x9355b78
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> IV = 0
> NV = 0
> PV = 0x935f0c0 "\344"\0
> CUR = 1
> LEN = 8
> Pat:
> SV = PV(0x92cf970) at 0x9318fc0
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
> PV = 0x937ba20 "\303\204"\0 [UTF8 "\x{c4}"]
> CUR = 2
> LEN = 3
> Freeing REx: `"\304"'
> Compiling REx `Ä'
> size 3 Got 28 bytes for offset annotations.
> first at 1
> 1: EXACTF <\x{e4}>(3)
> 3: END(0)
> stclass "EXACTF <\x{e4}>" minlen 1
> Offsets: [3]
> 1[2] 0[0] 3[0]
> Matching REx "\x{c4}" against "ä"
> Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe4) in pattern match (m//) at (eval 3) line 16.
Interesting warning.
> Matching stclass "EXACTF <\\x{e4}>" against "\x{0}"
> Contradicts stclass...
> Match failed
> yes no 0 no
> Str:
> SV = PVMG(0x92fb080) at 0x9355b78
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> IV = 0
> NV = 0
> PV = 0x935f0c0 "H\344"\0
> CUR = 2
> LEN = 8
> Pat:
> SV = PVMG(0x92fb0a0) at 0x9318fc0
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
> IV = 0
> NV = 0
> PV = 0x937ba20 "\303\204"\0 [UTF8 "\x{c4}"]
> CUR = 2
> LEN = 3
> MAGIC = 0x93a2218
> MG_VIRTUAL = &PL_vtbl_utf8
> MG_TYPE = PERL_MAGIC_utf8(w)
> MG_LEN = 1
> Matching REx "\x{c4}" against "Hä"
> Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe4) in pattern match (m//) at (eval 3) line 16.
And here it is again...
> Matching stclass "EXACTF <\\x{e4}>" against "H\x{0}"
> Contradicts stclass...
> Match failed
> yes no 1 no
> Str:
> SV = PVMG(0x92fb080) at 0x9355b78
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK)
> IV = 0
> NV = 0
> PV = 0x935f0c0 "HHH\344"\0
> CUR = 4
> LEN = 8
> Pat:
> SV = PVMG(0x92fb0a0) at 0x9318fc0
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
> IV = 0
> NV = 0
> PV = 0x937ba20 "\303\204"\0 [UTF8 "\x{c4}"]
> CUR = 2
> LEN = 3
> MAGIC = 0x93a2218
> MG_VIRTUAL = &PL_vtbl_utf8
> MG_TYPE = PERL_MAGIC_utf8(w)
> MG_LEN = 1
> Matching REx "\x{c4}" against "HHHä"
> Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe4) in pattern match (m//) at (eval 3) line 16.
And again. Something is not right in the debug output in 5.8.x.
> Matching stclass "EXACTF <\\x{e4}>" against "HHH\x{0}"
> Contradicts stclass...
> Match failed
> yes no 2 no
> Str:
> SV = PVMG(0x92fb080) at 0x9355b78
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
> IV = 0
> NV = 0
> PV = 0x935f118 "\303\244"\0 [UTF8 "\x{e4}"]
> CUR = 2
> LEN = 3
> MAGIC = 0x9381b50
> MG_VIRTUAL = &PL_vtbl_utf8
> MG_TYPE = PERL_MAGIC_utf8(w)
> MG_LEN = 1
> Pat:
> SV = PVMG(0x92fb0a0) at 0x9318fc0
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
> IV = 0
> NV = 0
> PV = 0x937ba20 "\303\204"\0 [UTF8 "\x{c4}"]
> CUR = 2
> LEN = 3
> MAGIC = 0x93a2218
> MG_VIRTUAL = &PL_vtbl_utf8
> MG_TYPE = PERL_MAGIC_utf8(w)
> MG_LEN = 1
> Matching REx "\x{c4}" against "\x{e4}"
> Matching stclass "EXACTF <\\x{e4}>" against "\x{e4}"
> Setting an EVAL scope, savestack=172
> 0 <> <\x{e4}> | 1: EXACTF <\x{e4}>
> 2 <\x{e4}> <> | 3: END
> Match successful!
> yes yes 0 yes
> Str:
> SV = PVMG(0x92fb080) at 0x9355b78
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
> IV = 0
> NV = 0
> PV = 0x935f118 "H\303\244"\0 [UTF8 "H\x{e4}"]
> CUR = 3
> LEN = 4
> MAGIC = 0x9381b50
> MG_VIRTUAL = &PL_vtbl_utf8
> MG_TYPE = PERL_MAGIC_utf8(w)
> MG_LEN = 2
> Pat:
> SV = PVMG(0x92fb0a0) at 0x9318fc0
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
> IV = 0
> NV = 0
> PV = 0x937ba20 "\303\204"\0 [UTF8 "\x{c4}"]
> CUR = 2
> LEN = 3
> MAGIC = 0x93a2218
> MG_VIRTUAL = &PL_vtbl_utf8
> MG_TYPE = PERL_MAGIC_utf8(w)
> MG_LEN = 1
> Matching REx "\x{c4}" against "H\x{e4}"
> Matching stclass "EXACTF <\\x{e4}>" against "H\x{e4}"
> Setting an EVAL scope, savestack=172
> 1 <H> <\x{e4}> | 1: EXACTF <\x{e4}>
> 3 <H\x{e4}> <> | 3: END
> Match successful!
> yes yes 1 yes
> Str:
> SV = PVMG(0x92fb080) at 0x9355b78
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
> IV = 0
> NV = 0
> PV = 0x935f118 "HHH\303\244"\0 [UTF8 "HHH\x{e4}"]
> CUR = 5
> LEN = 8
> MAGIC = 0x9381b50
> MG_VIRTUAL = &PL_vtbl_utf8
> MG_TYPE = PERL_MAGIC_utf8(w)
> MG_LEN = 4
> Pat:
> SV = PVMG(0x92fb0a0) at 0x9318fc0
> REFCNT = 1
> FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
> IV = 0
> NV = 0
> PV = 0x937ba20 "\303\204"\0 [UTF8 "\x{c4}"]
> CUR = 2
> LEN = 3
> MAGIC = 0x93a2218
> MG_VIRTUAL = &PL_vtbl_utf8
> MG_TYPE = PERL_MAGIC_utf8(w)
> MG_LEN = 1
> Matching REx "\x{c4}" against "HHH\x{e4}"
> Matching stclass "EXACTF <\\x{e4}>" against "HHH\x{e4}"
> Setting an EVAL scope, savestack=172
> 3 <HHH> <\x{e4}> | 1: EXACTF <\x{e4}>
> 5 <HHH\x{e4}> <> | 3: END
> Match successful!
> yes yes 2 yes
> Freeing REx: `\x{c4}'
>
>
>
> Regards,
> Christoph
>
--
perl -Mre=debug -e "/just|another|perl|hacker/"