Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent

karl williamson
December 9, 2009 20:39
demerphq wrote:
> 2009/12/9 karl williamson <>:
>> I believe this resolves other bug reports, but haven't had time to look them
>> up.
>> The patch is both attached, and available at:
>> git://
>> branch: matching
>> This patch makes case-sensitive regex matching give the same results
>> regardless of whether the string and/or pattern are in utf8, unless "use
>> legacy 'unicode8bit'" is in effect, in which case it works as before.
>> Since Yves is incommunicado,
> I was given a heads-up about this mail, but I've not had any time to respond
> yet. Sorry.
>> I took what he had done before Larry's veto and
>> extended and modified it, adding an intermediate way.  What that means is
>> that anything that looks like [[:xxx:]] will match only in the ASCII range,
>> or in the current locale, if set.  I never heard any controversy about that
>> part of the proposal, and it makes sense to me that a Posix construct should
>> act like the Posix definition says to.
> This is good IMO, it will allow us to close a number of open tickets.
>> \d, \s, and \w (hence \b) and their complements act as before, except that
>> when 8-bit unicode mode is on, they also match appropriately in the 128-255
>> range.
>> This solves the utf8ness problem, as the Posix classes never match outside their
>> locale or ascii, so utf8ness doesn't matter; and the others match the same
>> whether utf8 or not.
>> I was surprised at how little code was actually involved.  Making Posix
>> always mean Posix simplified things quite a bit.  \d doesn't match anything
>> in the 128-255 range, so it did not have to be touched. Essentially, all
>> that had to be done was to create new regnodes for \s, \w, and \b (and
>> complements) that say to match using unicode semantics.  Everywhere their
>> parallel nodes are in the code, I added these nodes.  When compiling,
>> regcomp checks for being in 8-bit unicode semantics mode, and if so, uses
>> the new node; if not it uses the old node.  In execution, regexec uses the
>> old definition when matching the old node, and the new semantics when the
>> match is for the new node.  I split [[:word:]] from \w and [[:digit:]] from
>> \d so that they would match using Posix semantics regardless of utf8ness.
>> But that is basically it.
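
As an illustration of the node split described above, here is a minimal sketch in C. It is not the code from the patch: the names node_kind, escape_kind, compile_escape and node_matches are hypothetical, the real regnode names in regcomp.c may differ, locale handling is left out, and only code points 0-255 are modelled. It assumes the legacy nodes match ASCII only on a non-utf8 target and use unicode semantics on a utf8 target, that the new nodes always use unicode semantics, and that the Posix node always stays in ASCII.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical node kinds, for illustration only. */
typedef enum {
    NODE_SPACE,       /* \s, legacy: result depends on utf8ness of target */
    NODE_SPACE_U,     /* \s, unicode semantics regardless of utf8ness     */
    NODE_WORD,        /* \w, legacy                                       */
    NODE_WORD_U,      /* \w, unicode semantics regardless of utf8ness     */
    NODE_POSIX_WORD   /* [[:word:]], ASCII only (locale not modelled)     */
} node_kind;

typedef enum { ESC_SPACE, ESC_WORD } escape_kind;

/* Compile time: pick the new node only when 8-bit unicode mode is on. */
node_kind compile_escape(escape_kind esc, bool unicode8bit)
{
    if (esc == ESC_SPACE)
        return unicode8bit ? NODE_SPACE_U : NODE_SPACE;
    return unicode8bit ? NODE_WORD_U : NODE_WORD;
}

/* Simplified ASCII whitespace set, enough for the illustration. */
static bool ascii_space(unsigned cp)
{
    return cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r' || cp == '\f';
}

static bool ascii_word(unsigned cp)
{
    return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') ||
           (cp >= '0' && cp <= '9') || cp == '_';
}

/* Unicode semantics restricted to 0-255: adds NEL (0x85) and NBSP (0xA0). */
static bool latin1_space(unsigned cp)
{
    return ascii_space(cp) || cp == 0x85 || cp == 0xA0;
}

/* Unicode semantics restricted to 0-255: adds the Latin-1 letters. */
static bool latin1_word(unsigned cp)
{
    if (ascii_word(cp))
        return true;
    if (cp == 0xAA || cp == 0xB5 || cp == 0xBA)
        return true;
    /* 0xD7 (multiplication sign) and 0xF7 (division sign) are not letters */
    return cp >= 0xC0 && cp <= 0xFF && cp != 0xD7 && cp != 0xF7;
}

/* Execution time: old nodes keep the old definition, new nodes the new one. */
bool node_matches(node_kind node, unsigned cp, bool target_is_utf8)
{
    assert(cp <= 0xFF);   /* the sketch only models the first 256 code points */
    switch (node) {
    case NODE_SPACE:      return target_is_utf8 ? latin1_space(cp) : ascii_space(cp);
    case NODE_SPACE_U:    return latin1_space(cp);
    case NODE_WORD:       return target_is_utf8 ? latin1_word(cp) : ascii_word(cp);
    case NODE_WORD_U:     return latin1_word(cp);
    case NODE_POSIX_WORD: return ascii_word(cp);
    }
    return false;
}
```

With these assumptions, node_matches(NODE_SPACE, 0xA0, ...) gives a different answer for a utf8 and a non-utf8 target, which is the legacy inconsistency, while NODE_SPACE_U and NODE_POSIX_WORD each give one answer whatever the utf8ness.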
>> Several .t files depended on the legacy behaviors to test edge cases for
>> utf8ness.  I added a 'use legacy' to those.
>> Also, several text processing modules can't deal with \s matching a no-break
>> space.  I spent too much time trying to learn them to decide if this is a
>> bug or not, finding the one or two lines in each that were at fault.  It is
>> a bug if the text can be utf8, which would automatically cause the \s to
>> suddenly match the no-break space.  But I wasn't sure which ones are claimed
>> to transparently handle utf8.  So, I added a 'use legacy' to the modules,
>> which gives the same behavior as in the past.
>> Several TODOs were accomplished and removed from some regex .t files.
>> I took advantage of changing regcomp.c to add a croak when the re has gone
>> insane; I've had it in my development version for some time.  It seems to
>> happen when there are too many /\N{...}/ calls in a program.
> I had a quick review of the patch and what you have done.
> I have two minor objections, but I don't think they need be seen as roadblocks.
> First, the problem of qr// raises its head. You construct a pattern in
> one context with your new pragma in effect, and then embed it in
> another pattern somewhere else and the magicness of the pattern is
> lost. This is the same problem as with use locale, and personally
> something I think breaks the general modern model of patterns. However
> it is better than nothing and modifiers can be leveraged on top of
> your patch so that is fine IMO.

I'm not sure I follow this.  I think what you're saying is that the 
original pattern is decompiled or thrown away and then recompiled under 
the new scheme?
> Second, and really this is just another facet of the original problem
> is that people now need to modify existing code to preserve the
> existing semantics. If this was controlled by a modifier then this
> wouldn't be necessary, as we would just make the default modifier behave
> as in 5.8.x. Also, if really necessary, we could bifurcate the POSIX
> stuff into multiple opcodes (old/new behaviour) and resolve any
> objections to fixing the POSIX opcodes.
One should be able to change the default modifier, I would hope.
> However my opinion is this is a really good step forward and should be
> applied to blead.
> cheers,
> Yves

