
Re: multi-char re fold problem

From: karl williamson
Date: December 27, 2008 13:37
Subject: Re: multi-char re fold problem
Message ID: 49569FF1.1070902@khwilliamson.com

demerphq wrote:
> 2008/12/27 karl williamson <public@khwilliamson.com>:
>> Rafael Garcia-Suarez wrote:
>>> 2008/12/26 karl williamson <public@khwilliamson.com>:
>>>> Attached is a patch for this.  The problem is that in this subroutine p
>>>> may or may not be in utf8, and the flag do_utf8 indicates which.  The code
>>>> calls various functions passing both p and do_utf8, and these work.  But
>>>> to_utf8_fold() expects its argument to always be in utf8, and this caused
>>>> the problem.  Also, the av's are stored as utf8, so the memEQ would not
>>>> work correctly on a non-utf8 p even though no error message would be
>>>> generated.
>>>>
>>>> The patch creates a copy of p in utf8, if necessary, and uses that even
>>>> when calling the functions that accept the do_utf8 flag, as they create
>>>> temporaries, convert to utf8, and then throw the conversion away.  It is
>>>> more efficient to do the conversion once in the caller and pass that to
>>>> each routine.
>>>>
>>>> I'm not sure what to do about a test case.
>>>>
>>>> "\xc0" =~ qr/[\x{1f4}\xc0]/;
>>>>
>>>> doesn't show the problem, but
>>>>
>>>> use Test::More tests => 1;
>>>> like("\xc0", qr/[\x{1f4}\xc0]/i, 'get malformed utf8');
>>>>
>>>> does.  And it looks like none of the existing re tests use Test.
>>> Then there is probably a problem in Test::More itself ?
>>>
>>> (Is there a bug number for this?)
>>>
>>> I've tested the patch, but I would feel more comfortable with a test
>>> case. (or with a comment from Yves)
>>>
>>>
>> No bug number.  Should I create one?
>>
>> I suspect that it isn't a bug in Test::More, but that it calls things
>> somehow differently, which is kind of scary in itself, since it perturbs
>> the environment.  Maybe a certain class of tests shouldn't be done using
>> Test.  I don't know.
>>
>> If we don't hear from Yves in the meantime, I'll look tomorrow to see how to
>> reproduce it without using Test.
> 
> Is this a problem with casefolding unicode characters in a charclass?
> 
> I have to admit that on reading this I don't have much to add. And my
> windows box is offline these days due to a hardware failure so if I'm
> going to debug it I'll have to learn gdb finally. Which could take a
> while :-)
> 
> Yves
> 
> 
> 
> 
I don't understand much gdb either, and find it mostly non-intuitive
(and I was once an expert on debuggers), but I know enough of it to
get somewhere.

I was hoping this patch would fix the multi-char fold problem too, but 
it didn't.  And they're not related, I believe, as p would already have 
to be in utf8 in order to get a multi-char fold, and so my new utf8_p is 
equal to p, and therefore doesn't add anything in this case.
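
The convert-once idea behind utf8_p can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the patch itself: latin1_to_utf8 and matches_stored are hypothetical names, the 64-byte scratch buffer is an arbitrary limit, and none of Perl's real helper functions are used. It shows why a raw "\xc0" byte cannot be compared directly against a utf8-stored entry, and why nothing changes when p is already utf8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not Perl's API): encode n Latin-1 bytes as UTF-8.
 * 'out' must have room for 2*n bytes.  Returns the number of bytes written. */
static size_t latin1_to_utf8(const unsigned char *in, size_t n,
                             unsigned char *out)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[j++] = in[i];                              /* ASCII: 1 byte */
        } else {
            out[j++] = (unsigned char)(0xC0 | (in[i] >> 6));   /* lead byte */
            out[j++] = (unsigned char)(0x80 | (in[i] & 0x3F)); /* continuation */
        }
    }
    return j;
}

/* Hypothetical stand-in for the comparison step: 'stored' is always UTF-8.
 * If p is not UTF-8, convert it once into a local buffer and compare that;
 * if it already is UTF-8, utf8_p is simply p.  Returns 1 on a match. */
static int matches_stored(const unsigned char *p, size_t len, int is_utf8,
                          const unsigned char *stored, size_t stored_len)
{
    unsigned char buf[64];                 /* arbitrary limit for the sketch */
    const unsigned char *utf8_p = p;
    size_t utf8_len = len;

    if (!is_utf8) {
        if (len > sizeof buf / 2)
            return 0;                      /* too long for this sketch */
        utf8_len = latin1_to_utf8(p, len, buf);
        utf8_p = buf;
    }
    return utf8_len == stored_len && memcmp(utf8_p, stored, stored_len) == 0;
}
```

With the single byte 0xC0 as input and the two bytes 0xC3 0x80 as the stored utf8 form, matches_stored succeeds only when the input is correctly flagged as non-utf8, so that the conversion actually happens.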

I did look at the routine somewhat for the multi-char problem, and I 
used gdb on the expression "ʼN" =~ qr/[ʼn]/i.  What may look like a lower 
case n here is in fact U+0149, LATIN SMALL LETTER N PRECEDED BY APOSTROPHE.

I find that the dearth of comments in this file keeps me from grasping
the forest, and I am only grokking the trees.  S_reginclass is being
called here with lenp set to NULL, so the code that looks at the
multicharacter folds can't be executed (line 5774).  When I hacked
around that, it still wouldn't match, because p hasn't been folded, so the
memEQ is testing "ʼN" against "ʼn", and hence fails.  I don't grok the
forest enough to know what to do about all that.
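
The memEQ failure described above can be reproduced in isolation with a small C sketch. It makes several assumptions: the stored fold of U+0149 is taken to be U+02BC followed by "n" (bytes CA BC 6E in utf8), the input "ʼN" is U+02BC followed by "N" (bytes CA BC 4E), and ascii_fold is a hypothetical stand-in that lowercases ASCII letters only, which is far simpler than a real Unicode case fold. The function names are invented for illustration and are not taken from regexec.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified fold: lowercases ASCII letters only and copies
 * every other byte unchanged.  A real Unicode fold does much more. */
static void ascii_fold(const unsigned char *in, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (in[i] >= 'A' && in[i] <= 'Z')
                     ? (unsigned char)(in[i] + ('a' - 'A'))
                     : in[i];
}

/* Compare the input as-is against the stored fold: the failing case. */
static int eq_unfolded(const unsigned char *p, size_t len,
                       const unsigned char *fold, size_t fold_len)
{
    return len == fold_len && memcmp(p, fold, len) == 0;
}

/* Fold the input first, then compare against the stored fold. */
static int eq_folded(const unsigned char *p, size_t len,
                     const unsigned char *fold, size_t fold_len)
{
    unsigned char buf[64];                 /* arbitrary limit for the sketch */
    if (len > sizeof buf)
        return 0;
    ascii_fold(p, len, buf);
    return len == fold_len && memcmp(buf, fold, len) == 0;
}
```

Called with the bytes CA BC 4E as input and CA BC 6E as the stored fold, eq_unfolded reports no match while eq_folded reports a match, which is the difference between comparing p directly and comparing a folded copy of p.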
