Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent

December 9, 2009 12:00
2009/12/9 karl williamson <>:
> I believe this resolves other bug reports, but haven't had time to look them
> up.
> The patch is both attached, and available at:
> git://
> branch: matching
> This patch makes case-sensitive regex matching give the same results
> regardless of whether the string and/or pattern are in utf8, unless "use
> legacy 'unicode8bit'" is in effect, in which case it works as before.
> Since Yves is incommunicado,

I was given a heads-up about this mail, but I have not had any time to
respond yet. Sorry.

> I took what he had done before Larry's veto and
> extended and modified it, adding an intermediate way.  What that means is
> that anything that looks like [[:xxx:]] will match only in the ASCII range,
> or in the current locale, if set.  I never heard any controversy about that
> part of the proposal, and it makes sense to me that a Posix construct should
> act like the Posix definition says to.

This is good IMO, it will allow us to close a number of open tickets.
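
To make the ASCII-only behaviour concrete, here is a minimal sketch. It
does not use the patch itself: it assumes a perl of 5.14 or later, where
the /a modifier restricts the POSIX classes to the ASCII range, which is
close to what is described above. The variable names are made up for
the example.

```perl
use strict;
use warnings;

# "\xE9" is U+00E9 (e with acute): a letter, but outside the ASCII range.
my $e_acute = "\xE9";

# Under /a, [[:alpha:]] accepts ASCII letters only.
my $ascii_letter  = ("e"      =~ /\A[[:alpha:]]\z/a) ? 1 : 0;
my $latin1_letter = ($e_acute =~ /\A[[:alpha:]]\z/a) ? 1 : 0;

# Storing the same character as UTF-8 internally does not change the answer.
utf8::upgrade($e_acute);
my $latin1_letter_utf8 = ($e_acute =~ /\A[[:alpha:]]\z/a) ? 1 : 0;

printf "ascii=%d latin1=%d latin1_utf8=%d\n",
    $ascii_letter, $latin1_letter, $latin1_letter_utf8;
```

The point is that the last two results agree, so utf8ness stops mattering
for the POSIX classes.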

> \d, \s, and \w (hence \b) and their complements act as before, except that
> when 8-bit unicode mode is on, they also match appropriately in the 128-255
> range.
> This solves the utf8ness problem, as the Posix classes never match outside
> their locale or ascii, so utf8ness doesn't matter; and the others match the
> same whether utf8 or not.
> I was surprised at actually how little code was involved.  Making Posix
> always mean Posix simplified things quite a bit.  \d doesn't match anything
> in the 128-255 range, so it did not have to be touched. Essentially, all
> that had to be done was to create new regnodes for \s, \w, and \b (and
> complements) that say to match using unicode semantics.  Everywhere their
> parallel nodes are in the code, I added these nodes.  When compiling,
> regcomp checks for being in 8-bit unicode semantics mode, and if so, uses
> the new node; if not it uses the old node.  In execution, regexec uses the
> old definition when matching the old node, and the new semantics when the
> match is for the new node.  I split [[:word:]] from \w and [[:digit:]] from
> \d so that they would match using Posix semantics regardless of utf8ness.
> But that is basically it.
> Several .t files depended on the legacy behaviors to test edge cases for
> utf8ness.  I added a 'use legacy' to those.
> Also, several text processing modules can't deal with \s matching a no-break
> space.  I spent too much time trying to learn them to decide if this is a
> bug or not, finding the one or two lines in each that were at fault.  It is
> a bug if the text can be utf8, which would automatically cause the \s to
> suddenly match the no-break space.  But I wasn't sure which ones are claimed
> to transparently handle utf8.  So, I added a 'use legacy' to the modules,
> which gives the same behavior as in the past.
> Several TODOs were accomplished and removed from some regex .t files.
> I took advantage of changing regcomp.c to add a croak when the re has gone
> insane; I've had it in my development version for some time.  It seems to
> happen when there are too many /\N{...}/ calls in a program.
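
For anyone who has not run into the utf8ness problem described above,
here is a minimal sketch of it. It assumes a perl of 5.14 or later run
without "use utf8", "use locale" or the unicode_strings feature, so that
a pattern with no modifier gets the legacy behaviour. The /u modifier
stands in for the "8-bit unicode mode" of the patch; that mapping is my
assumption, not something the patch provides. The helper matches() is
made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $str matches $re, else 0.
sub matches {
    my ($str, $re) = @_;
    return $str =~ $re ? 1 : 0;
}

# U+00A0 NO-BREAK SPACE, once stored as a native byte, once as UTF-8.
my $nbsp_native = "\xA0";
my $nbsp_utf8   = "\xA0";
utf8::upgrade($nbsp_utf8);

# Legacy semantics: the answer depends on the internal representation.
my $legacy_native = matches($nbsp_native, qr/\s/);
my $legacy_utf8   = matches($nbsp_utf8,   qr/\s/);

# Unicode semantics (/u): the same answer for both representations.
my $unicode_native = matches($nbsp_native, qr/\s/u);
my $unicode_utf8   = matches($nbsp_utf8,   qr/\s/u);

# \d has nothing to match in 128-255, even with Unicode semantics.
my $digits_in_upper_half = grep { chr($_) =~ /\d/u } 128 .. 255;

print "legacy:  native=$legacy_native utf8=$legacy_utf8\n";
print "unicode: native=$unicode_native utf8=$unicode_utf8\n";
print "digits in 128-255: $digits_in_upper_half\n";
```

The first line of output is the bug; the second is the behaviour the new
nodes are meant to give. This is also why modules that assumed \s would
never match a no-break space are affected.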

I had a quick review of the patch and what you have done.

I have two minor objections, but I don't think they need to be seen as roadblocks.

First, the problem of qr// raises its head. You construct a pattern in
one context with your new pragma in effect, then embed it in another
pattern somewhere else, and the magicness of the pattern is lost. This
is the same problem as with use locale, and personally something I
think breaks the general modern model of patterns. However, it is
better than nothing, and modifiers can be layered on top of your patch,
so that is fine IMO.
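
A minimal sketch of what I mean by modifiers, assuming a perl of 5.14 or
later where /a and /u exist (they are not part of this patch). A modifier
is stored in the compiled qr// object, so it survives being embedded in
an outer pattern that was compiled with different settings. The variable
names are made up for the example.

```perl
use strict;
use warnings;

# Each pattern carries its own semantics as a modifier.
my $word_ascii   = qr/\w/a;   # ASCII-only \w
my $word_unicode = qr/\w/u;   # Unicode \w

my $e_acute = "\xE9";         # U+00E9, stored as a native byte

# Embedded in an outer /u pattern, the inner /a still applies.
my $ascii_in_unicode = ($e_acute =~ /\A$word_ascii\z/u) ? 1 : 0;

# Embedded in an outer pattern with no modifier, the inner /u still applies.
my $unicode_in_plain = ($e_acute =~ /\A$word_unicode\z/) ? 1 : 0;

print "ascii_in_unicode=$ascii_in_unicode unicode_in_plain=$unicode_in_plain\n";
```

With a lexically scoped pragma alone, the inner pattern has nothing of
its own to carry along, which is the objection above.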

Second, and really this is just another facet of the original problem,
people now need to modify existing code to preserve the existing
semantics. If this were controlled by a modifier then this wouldn't be
necessary, as we would just make the default modifier behave as in
5.8.x. Also, if really necessary, we could bifurcate the POSIX stuff
into multiple opcodes (old/new behaviour) and resolve any objections to
fixing the POSIX opcodes.

However, my opinion is that this is a really good step forward and
should be applied to blead.


perl -Mre=debug -e "/just|another|perl|hacker/"
