Re: Analysis of problems with mixed encoding case insensitive matches in regex engine.

From:
demerphq
Date:
April 24, 2007 07:46
Subject:
Re: Analysis of problems with mixed encoding case insensitive matches in regex engine.
Message ID:
9b18b3110704240746u461e4bdcl208ef7d7f9c5ef64@mail.gmail.com
On 4/24/07, demerphq <demerphq@gmail.com> wrote:
> The problem is that the optimiser thinks that /\xDF/i under unicode is
> really 'ss' and therefore that the minimum length string that can
> match is 2. Which obviously causes problems matching a latin-1 \xDF
> which is only one byte. Amusingly another bug in the regex engine
> allows this to work out ok when the string is unicode. utf8 \xDF is
> two bytes long, and the regex engine has some issues with the
> distinction between "byte length" and "codepoint length", so it sees
> the two bytes of the single codepoint as being sufficient length, and
> then uses unicode folding to convert the strings \xDF to 'ss' and
> everything works out. But this is a fluke; I'm positive that there are
> other fold case scenarios where we can't rely on this bug saving the
> day. If the fold case version was longer (in bytes) than the utf8
> version of the original it would not work out.
[...]
> At this point the only solution I can think of is to disable minlen
> checks when a character is encountered that folds to a multi-character
> string.

Well, it looks like I have a better solution. I've created a new regop,
FOLDCHAR, that will be used to handle the three problematic codepoints
properly. This way the regex engine doesn't see them as normal text, so
the optimiser can do the right thing and everything works out
properly.
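
To make the length mismatch concrete, here is a small sketch in Python (not Perl, and not the regex engine's actual logic). It uses Python's own full case folding via str.casefold, and the helper names fold_lengths and passes_byte_minlen are made up for illustration. The toy check compares a byte count against the fold's character count, which is the kind of confusion described in the quoted text. U+0390 is included only as an example of a fold that is longer than the utf8 form of the original.

```python
def fold_lengths(ch):
    """Collect length facts about a character and its full case fold."""
    folded = ch.casefold()  # str.casefold applies full Unicode case folding
    return {
        "fold": folded,
        "fold_chars": len(folded),
        "fold_utf8_bytes": len(folded.encode("utf-8")),
        "orig_utf8_bytes": len(ch.encode("utf-8")),
    }


def passes_byte_minlen(subject_bytes, minlen):
    """Toy minimum-length check that counts bytes, not codepoints.

    This is an assumption made for illustration, not the engine's real code.
    """
    return len(subject_bytes) >= minlen


# \xDF folds to 'ss', so a naive minimum length is 2.
sharp_s = fold_lengths("\xdf")
minlen = sharp_s["fold_chars"]

# The latin-1 form is one byte and fails; the utf8 form is two bytes and
# passes only because bytes are being counted instead of codepoints.
latin1_ok = passes_byte_minlen("\xdf".encode("latin-1"), minlen)
utf8_ok = passes_byte_minlen("\xdf".encode("utf-8"), minlen)

# U+0390 folds to three codepoints but is only two bytes in utf8,
# so the same accident does not rescue it.
iota = fold_lengths("\u0390")
iota_ok = passes_byte_minlen("\u0390".encode("utf-8"), iota["fold_chars"])

print(sharp_s, latin1_ok, utf8_ok)
print(iota, iota_ok)
```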

Sigh, so much trouble for one character. (The other two are just bonus material)

It's actually possible to detect codepoints that will have this problem,
so it's probably smart to put something in mktables that will detect
and warn if any new ones come up. Or we can just do it by hand when
updating the Unicode data files.
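
As a rough idea of what such a check could look like, here is a sketch in Python rather than in mktables itself. It assumes "this problem" means "folds to more than one character", which is broader than the three codepoints mentioned above, and it relies on whatever Unicode version the Python build ships with. The function names and the known set are hypothetical.

```python
import sys


def multi_char_folds(limit=sys.maxunicode + 1):
    """Map each codepoint below `limit` whose full case fold is longer
    than one character to that fold."""
    found = {}
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters, skip them
        folded = chr(cp).casefold()
        if len(folded) > 1:
            found[cp] = folded
    return found


def unexpected_folds(known, limit=sys.maxunicode + 1):
    """Return the sorted codepoints with multi-character folds that are
    not in the `known` set, i.e. the ones worth warning about."""
    return sorted(cp for cp in multi_char_folds(limit) if cp not in known)


# Hypothetical usage: pretend only \xDF is handled so far, and scan a
# small range to keep the run short.
for cp in unexpected_folds({0xDF}, limit=0x500):
    print("U+%04X folds to %d characters" % (cp, len(chr(cp).casefold())))
```

A check like this would be run each time the Unicode data files are updated, with the known set holding the codepoints already given special handling.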

Patch is attached.

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
