perl.perl5.porters | Postings from April 2007

Analysis of problems with mixed encoding case insensitive matches in regex engine.

April 24, 2007 02:38
Hi all,

As all the regular readers of this list know, perl uses two string
encodings internally. One is essentially latin1, and the other is utf8
encoded unicode.

Unicode has an interesting feature that some people might not be
familiar with: the way it stipulates that an implementer handle
case-insensitive matches includes the requirement that in some
situations a single char can match several chars. The typical example
of this is \xDF, aka LATIN SMALL LETTER SHARP S, which when case folded
ends up as 'ss'.

The two encodings have different semantics: under latin1, \xDF case
insensitively matches only \xDF. Under utf8/unicode it matches 'ss', as
I already said.
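
One rough way to see the two sets of semantics from ordinary Perl code, rather than from inside the regex engine, is to case-fold the same character in both internal encodings. This is only a sketch. It assumes a perl new enough to have fc() (5.16 or later, so newer than the perl discussed here), and it deliberately leaves the unicode_strings feature off. The helper name fold_codepoints is made up for the example.

```perl
use strict;
use warnings;
use feature 'fc';    # assumption: perl 5.16+; unicode_strings is NOT enabled

# Case-fold a string and return the code points of the result as hex.
sub fold_codepoints {
    my ($char) = @_;
    return map { sprintf '%02X', ord } split //, fc($char);
}

my $native   = "\xDF";        # stored internally as a single latin1 byte
my $upgraded = "\xDF";
utf8::upgrade($upgraded);     # same character, stored internally as utf8

printf "native:   %s\n", join ' ', fold_codepoints($native);
printf "upgraded: %s\n", join ' ', fold_codepoints($upgraded);
```

With the native string the fold should come back as DF (unchanged), and with the upgraded string as 73 73, i.e. 'ss'.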

Overall this isn't a problem. If the pattern or string is utf8/unicode
then perl uses unicode semantics and things work out pretty much as
one might expect (given one knows about the two encodings and their
differing semantics).

Now it turns out that there is a bug in the regex engine optimiser
related to \xDF, in that the behaviour of case-insensitive matches
involving it (such as /\xDF/i, discussed further down) is not very
predictable. Depending on whether the pattern or the string is utf8,
the pattern will match differently. One would assume that unicode
semantics would be obeyed when either the string or pattern was
unicode, and that latin1 semantics (for lack of a better term) would be
followed only when neither was unicode.

Thus it would seem reasonable to expect that "ss" matches \xDF case
insensitively only when one or the other or both were unicode, and
that \xDF would match \xDF insensitively always. Except it doesn't. The
problem turns out to be minlen checking, and would apparently affect
ALL case-insensitive unicode matches where the fold-case version of a
codepoint is a multi-codepoint sequence.
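
The expectation described above can be probed with a small script that tries all four combinations of internal encoding for the string and the pattern. This is a sketch, with probe() being a made-up helper, and it deliberately asserts no particular outcome, since the results depend on which perl runs it.

```perl
use strict;
use warnings;

# Match $string against the literal $pattern, case-insensitively, in all
# four combinations of internal encoding. Returns a hashref of 1/0 results.
# Assumes $pattern contains no regex metacharacters.
sub probe {
    my ($string, $pattern) = @_;
    my %result;
    for my $str_enc ('latin1', 'utf8') {
        for my $pat_enc ('latin1', 'utf8') {
            my ($s, $p) = ($string, $pattern);    # fresh copies each time
            utf8::upgrade($s) if $str_enc eq 'utf8';
            utf8::upgrade($p) if $pat_enc eq 'utf8';
            $result{"string=$str_enc pattern=$pat_enc"} = ($s =~ /$p/i) ? 1 : 0;
        }
    }
    return \%result;
}

# Report how \xDF and 'ss' fare against a /\xDF/i style pattern.
for my $case (['\xDF', "\xDF"], ['ss', "ss"]) {
    my ($name, $string) = @$case;
    my $result = probe($string, "\xDF");
    for my $key (sort keys %$result) {
        printf "%-5s %s: %s\n", $name, $key,
            $result->{$key} ? 'match' : 'no match';
    }
}
```

If the expectation holds, 'ss' should report 'no match' only in the latin1/latin1 row, and \xDF should report 'match' in every row.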

The problem is that the optimiser thinks that /\xDF/i under unicode is
really 'ss' and therefore that the minimum length string that can
match is 2. Which obviously causes problems matching a latin-1 \xDF,
which is only one byte. Amusingly, another bug in the regex engine
allows this to work out ok when the string is unicode. utf8 \xDF is
two bytes long, and the regex engine has some issues with the
distinction between "byte length" and "codepoint length", so it sees
the two bytes of the single codepoint as being sufficient length, and
then uses unicode folding to convert the string's \xDF to 'ss' and
everything works out. But this is a fluke; I'm positive that there are
other fold case scenarios where we can't rely on this bug saving the
day. If the fold case version was longer (in bytes) than the utf8
version of the original it would not work out.
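
The byte arithmetic behind that fluke can be checked from Perl. The sketch below is a toy model only, not the engine's actual code: toy_minlen_ok() simply compares a target's byte count with the utf8 byte count of the folded pattern character, which is how the paragraph above describes the effect. It assumes perl 5.16+ for fc(), and picks U+0390 (whose full case fold is the three code points U+03B9 U+0308 U+0301) as an illustration of a fold that is longer in bytes than the original.

```perl
use strict;
use warnings;
use feature 'fc';          # assumption: perl 5.16+
use Encode qw(encode);

# Number of bytes the string occupies when encoded as utf8.
sub utf8_bytes {
    my ($str) = @_;
    return length(encode('UTF-8', $str));
}

# Full unicode case fold, regardless of how the caller's string is stored.
sub full_fold {
    my ($str) = @_;
    utf8::upgrade($str);
    return fc($str);
}

# Toy check: is a target of $target_bytes bytes "long enough" for a
# one-character pattern whose minlen was derived from its folded form?
sub toy_minlen_ok {
    my ($target_bytes, $pattern_char) = @_;
    return $target_bytes >= utf8_bytes(full_fold($pattern_char)) ? 1 : 0;
}

for my $char ("\xDF", "\x{0390}") {
    printf "U+%04X: %d utf8 byte(s), fold is %d utf8 byte(s)\n",
        ord($char), utf8_bytes($char), utf8_bytes(full_fold($char));
}
```

For \xDF the utf8 form and the fold are both 2 bytes, so a utf8 target squeaks through while a 1 byte latin1 target does not. For U+0390 the original is 2 bytes and the fold is 6, so the same accident would not help.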

This probably doesn't show up on too many people's radars, as most times
you would be matching against a string that is quite a bit longer than
the pattern. But for cases like the above there is definitely a bug.

At this point the only solution I can think of is to disable minlen
checks when a character is encountered that folds to a multi-character
sequence.

That's a pretty big hammer for such a case, but it's about the best I
can think of.
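
To make the proposal concrete, here is a toy model of it for a purely literal, case-insensitive pattern. It is not how study_chunk() works, and conservative_minlen() is a made-up name. It counts one per character, but gives up and returns 0 (meaning "no minlen check") as soon as it meets a character whose fold is longer than one character. It assumes perl 5.16+ for fc().

```perl
use strict;
use warnings;
use feature 'fc';    # assumption: perl 5.16+

# Minimum match length for a literal pattern under /i, or 0 if any
# character folds to a multi-character sequence (check disabled).
sub conservative_minlen {
    my ($literal) = @_;
    utf8::upgrade($literal);                      # force unicode folding rules
    my $minlen = 0;
    for my $char (split //, $literal) {
        return 0 if length(fc($char)) > 1;        # the big hammer
        $minlen++;
    }
    return $minlen;
}

printf "%-10s => minlen %d\n", 'perl',       conservative_minlen("perl");
printf "%-10s => minlen %d\n", 'stra\xDFe',  conservative_minlen("stra\xDFe");
```

The cost is visible in the second case: a five character literal loses its minlen entirely because of one \xDF.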

Other ideas anyone?

ps: Actually I have to say the minlen/startclass optimisations are
pretty crufty and are clearly not properly unicode aware.  There is a
serious need to completely rewrite study_chunk(), probably as several
routines so that sanity can be restored. But that's a big project, one
that would probably be sufficiently large that it would need to be
funded by TPF, assuming somebody had time to do it at all.

perl -Mre=debug -e "/just|another|perl|hacker/"
