develooper Front page | perl.perl5.porters | Postings from April 2007

Re: Analysis of problems with mixed encoding case insensitive matches in regex engine.

Thread Previous | Thread Next
Gerard Goossen
April 24, 2007 06:40
Re: Analysis of problems with mixed encoding case insensitive matches in regex engine.
Message ID:
On Tue, Apr 24, 2007 at 11:37:52AM +0200, demerphq wrote:
> [...]
> The problem is that the optimiser thinks that /\xDF/i under unicode is
> really 'ss' and therefore that the minimum length string that can
> match is 2. Which obviously cases problems matching a latin-1 \xDF
> which is only one byte. Amusingly another bug in the regex engine
> allows this to work out ok when the string is unicode. utf8 \xDF is
> two bytes long, and the regex engine has some issues with the
> distinction between "byte length" and "codepoint length", so it sees
> the two bytes of the single codepoint as being sufficient length, and
> then uses unicode folding to convert the strings \xDF to 'ss' and
> everything works out. But this is fluke, im positive that there are
> other fold case scenarios where we cant rely on this bug saving the
> day. If the fold case version was longer (in bytes) than the utf8
> version of the original it would not work out.

I encountered the same problem with kurila, a bit stronger because there
are lengths are in bytes. Thus even single characters would
fold to multiple values. And indeed that are many places where the
minlen is used, and the minlen has to be correct, otherwise you get very
unpredictable results.

> This probably doesnt show up on too many peoples radars as most times
> you would be matching against a string that is quite a bit longer than
> the pattern. But for cases like the above there is definitely a bug.
> At this point the only solution I can think of is to disable minlen
> checks when a character is encountered that folds to a multi-character
> string.

You also have to include the characters which are in the same folding
class/group (can't remember the terminology) as the multi-character.
i.e. include 's' and 'S', because "\xDF"=~/ss/i should also match.
In kurila I didn't bother about that and just always disabled minlen
checking with case folding.

Also iirc there are some fixed length assumptions about EXACTF (sorry
can't remember where), so just decreasing minlen might not work and
break more things than it would solve.
> Thats a pretty big hammer for such a case, but its about the best i
> can think of.
> Other ideas anyone?

If we want to get performance out of the engine, I think the most interest
approach would be to "unfold" the string at compile time, something
/e/i  -> /(e|E)/
/foo/i -> /(f|F)(o|O)(o|O)/
/ss/i -> /(((s|S)(s|S))|\xDF)/
/\xDF/i -> /ss/i -> /(((s|S)(s|S))|\xDF)/ 

Then let trie, study, etc do all the optimalization.

> cheers,
> Yves
> ps: Actually I have to say the minlen/startclass optimisations are
> pretty crufty and are clearly not properly unicode aware.  There is a
> serious need to completely rewrite study_chunk(), probably as several
> routines so that sanity can be restored. But thats a big project, one
> that would probably be sufficiently large that it would need to be
> funded by TPF, assuming somebody had time to do it at all.

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About