PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent

karl williamson
December 9, 2009 11:12
I believe this resolves other bug reports, but haven't had time to look 
them up.

The patch is both attached, and available at:
branch: matching

This patch makes case-sensitive regex matching give the same results 
regardless of whether the string and/or pattern are in utf8, unless "use 
legacy 'unicode8bit'" is in effect, in which case it works as before.
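
For anyone who hasn't run into the problem, here is a minimal sketch of the utf8ness dependence being fixed. It assumes a perl running with its traditional default rules (no locale, no pragma or regex modifier that changes the semantics); has_word_char is just a helper name made up for the illustration.

```perl
use strict;
use warnings;

# U+00E9 stored two ways: as a single native byte, and upgraded to
# perl's internal utf8 form.  The two strings still compare equal.
my $native   = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);

# Hypothetical helper: 1 if the string contains a \w character, else 0.
sub has_word_char { my ($s) = @_; return $s =~ /\w/ ? 1 : 0 }

printf "strings equal:    %s\n", ($native eq $upgraded ? "yes" : "no");
printf "native   =~ /\\w/: %d\n", has_word_char($native);
printf "upgraded =~ /\\w/: %d\n", has_word_char($upgraded);
```

Under the traditional rules the same character gives different answers depending only on how the string happens to be stored.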

Since Yves is incommunicado, I took what he had done before Larry's veto 
and extended and modified it, adding an intermediate approach.  What 
that means is that anything that looks like [[:xxx:]] will match only in 
the ASCII range, or in the current locale, if set.  I never heard any 
controversy about that part of the proposal, and it makes sense to me 
that a Posix construct should act the way the Posix definition says to.
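
A rough way to see the ASCII-only behaviour for a Posix class, without this patch applied, is the /a regex modifier that exists in perl 5.14 and later.  This is only a stand-in chosen for illustration: /a also restricts \d, \s and \w, which the patch does not do, and posix_alpha is a made-up helper name.

```perl
use strict;
use warnings;

# The same non-ASCII letter in both internal representations.
my $native   = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);

# Hypothetical helper.  With /a, [[:alpha:]] matches only ASCII letters,
# so the internal representation of the target cannot change the result.
sub posix_alpha { my ($s) = @_; return $s =~ /[[:alpha:]]/a ? 1 : 0 }

printf "ASCII 'e':    %d\n", posix_alpha("e");
printf "native E9:    %d\n", posix_alpha($native);
printf "upgraded E9:  %d\n", posix_alpha($upgraded);
```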

\d, \s, and \w (hence \b) and their complements act as before, except 
that when 8-bit unicode mode is on, they also match appropriately in the 
128-255 range.

This solves the utf8ness problem, as the Posix classes never match 
outside their locale or ASCII, so utf8ness doesn't matter; and the 
others match the same whether utf8 or not.
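
To illustrate what "match the same whether utf8 or not" looks like for \s, \w and \d, here is a sketch that uses the /u modifier from perl 5.14 and later as a stand-in for the 8-bit unicode mode.  That substitution is my assumption for the sake of a runnable example; it is not how the patch switches the mode on.

```perl
use strict;
use warnings;

# Return the native and the utf8-upgraded form of one code point.
sub both_forms {
    my ($codepoint) = @_;
    my $native   = chr $codepoint;
    my $upgraded = chr $codepoint;
    utf8::upgrade($upgraded);
    return ($native, $upgraded);
}

# With unicode semantics forced, both forms give the same answer.
for my $form (both_forms(0xA0)) {      # no-break space
    printf "A0 =~ /\\s/u: %d\n", ($form =~ /\s/u ? 1 : 0);
}
for my $form (both_forms(0xE9)) {      # e with acute accent
    printf "E9 =~ /\\w/u: %d\n", ($form =~ /\w/u ? 1 : 0);
}

# \d has nothing to match in 128-255, even with unicode semantics.
my $digits_in_upper_range = grep { chr($_) =~ /\d/u } 128 .. 255;
printf "code points in 128-255 matching \\d: %d\n", $digits_in_upper_range;
```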

I was actually surprised at how little code was involved.  Making Posix 
always mean Posix simplified things quite a bit.  \d doesn't match 
anything in the 128-255 range, so it did not have to be touched. 
Essentially, all that had to be done was to create new regnodes for \s, 
\w, and \b (and their complements) that say to match using unicode 
semantics.  Everywhere their parallel nodes appear in the code, I added 
these nodes.  When compiling, regcomp checks whether 8-bit unicode 
semantics mode is on, and if so, uses the new node; if not, it uses the 
old node.  In 
execution, regexec uses the old definition when matching the old node, 
and the new semantics when the match is for the new node.  I split 
[[:word:]] from \w and [[:digit:]] from \d so that they would match 
using Posix semantics regardless of utf8ness.
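
The compile-time choice and run-time dispatch can be modelled in a few lines of Perl.  This is a toy model and not the C code in regcomp.c or regexec.c: the node names, the two lookup tables and both subs are hypothetical, and the /u and /a modifiers (perl 5.14 and later) are only used to express "unicode semantics" and "ASCII only" in the model.

```perl
use strict;
use warnings;

# Hypothetical node names, for illustration only; not those in the patch.
my %OLD_NODE = ('\s' => 'SPACE',   '\w' => 'ALNUM');
my %NEW_NODE = ('\s' => 'SPACE_U', '\w' => 'ALNUM_U');

# "Compile" step: choose the node once, based on the mode in effect.
sub compile_escape {
    my ($escape, $unicode8bit) = @_;
    return $unicode8bit ? $NEW_NODE{$escape} : $OLD_NODE{$escape};
}

# "Execute" step: the new nodes always use unicode semantics; the old
# nodes use them only when the target string is utf8.
sub node_matches {
    my ($node, $codepoint, $target_is_utf8) = @_;
    my $ch      = chr $codepoint;
    my $unicode = ($node =~ /_U\z/) || $target_is_utf8;
    my $hit;
    if ($node =~ /\ASPACE/) {
        $hit = $unicode ? $ch =~ /\s/u : $ch =~ /\s/a;
    }
    else {
        $hit = $unicode ? $ch =~ /\w/u : $ch =~ /\w/a;
    }
    return $hit ? 1 : 0;
}

my $node = compile_escape('\s', 1);
printf "%s on A0, non-utf8 target: %d\n", $node, node_matches($node, 0xA0, 0);
```

The point of the model is that the mode is consulted once, at compile time, and the matcher only has to look at which node it was handed.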

But that is basically it.

Several .t files depended on the legacy behaviors to test edge cases for 
utf8ness.  I added a 'use legacy' to those.

Also, several text processing modules can't deal with \s matching a 
no-break space.  I spent too much time trying to learn them to decide if 
this is a bug or not, finding the one or two lines in each that were at 
fault.  It is a bug if the text can be utf8, which would automatically 
cause the \s to suddenly match the no-break space.  But I wasn't sure 
which ones are claimed to transparently handle utf8.  So, I added a 'use 
legacy' to the modules, which gives the same behavior as in the past.
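
Here is the kind of thing that bites those modules, sketched under the traditional default rules (so without this patch's new behavior); the sample text and the count_fields helper are made up for the illustration.

```perl
use strict;
use warnings;

# A no-break space (A0) between "10" and "km", a plain space before "away".
my $native   = "10\xA0km away";
my $upgraded = $native;
utf8::upgrade($upgraded);

# Hypothetical helper: number of fields when splitting on whitespace.
sub count_fields {
    my ($text) = @_;
    my @fields = split /\s+/, $text;
    return scalar @fields;
}

printf "fields, native string:   %d\n", count_fields($native);
printf "fields, upgraded string: %d\n", count_fields($upgraded);
```

The native string splits only at the plain space, while the upgraded copy of the same text also splits at the no-break space, so a module that gets utf8 input sees its text break in a new place.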

Several TODOs were accomplished and removed from some regex .t files.

I took advantage of changing regcomp.c to add a croak when the re has 
gone insane; I've had it in my development version for some time.  It 
seems to happen when there are too many /\N{...}/ calls in a program.

