Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
From: karl williamson
Date: December 9, 2009 20:39
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Message ID: 4B207B64.8050202@khwilliamson.com

demerphq wrote:
> 2009/12/9 karl williamson <public@khwilliamson.com>:
>> I believe this resolves other bug reports, but haven't had time to look them
>> up.
>>
>> The patch is both attached, and available at:
>> git://github.com/khwilliamson/perl.git
>> branch: matching
>>
>> This patch makes case-sensitive regex matching give the same results
>> regardless of whether the string and/or pattern are in utf8, unless "use
>> legacy 'unicode8bit'" is in effect, in which case it works as before.
>>
>> Since Yves is incommunicado,
>
> I was heads-uped about this mail, but I've not had any time to respond
> yet. Sorry.
>
>> I took what he had done before Larry's veto and
>> extended and modified it, adding an intermediate way. What that means is
>> that anything that looks like [[:xxx:]] will match only in the ASCII range,
>> or in the current locale, if set. I never heard any controversy about that
>> part of the proposal, and it makes sense to me that a Posix construct should
>> act like the Posix definition says to.
>
> This is good IMO, it will allow us to close a number of open tickets.
>
>> \d, \s, and \w (hence \b) and their complements act as before, except that
>> when 8-bit unicode mode is on, they also match appropriately in the 128-255
>> range.
>>
>> This solves the utf8ness problem, as the Posix classes never match outside
>> their locale or ASCII, so utf8ness doesn't matter; and the others match the
>> same whether utf8 or not.
>>
>> I was actually surprised at how little code was involved. Making Posix
>> always mean Posix simplified things quite a bit. \d doesn't match anything
>> in the 128-255 range, so it did not have to be touched. Essentially, all
>> that had to be done was to create new regnodes for \s, \w, and \b (and
>> complements) that say to match using unicode semantics. Everywhere their
>> parallel nodes are in the code, I added these nodes. When compiling,
>> regcomp checks for being in 8-bit unicode semantics mode, and if so, uses
>> the new node; if not it uses the old node. In execution, regexec uses the
>> old definition when matching the old node, and the new semantics when the
>> match is for the new node. I split [[:word:]] from \w and [[:digit:]] from
>> \d so that they would match using Posix semantics regardless of utf8ness.
>>
>> But that is basically it.
>>
>> Several .t files depended on the legacy behaviors to test edge cases for
>> utf8ness. I added a 'use legacy' to those.
>>
>> Also, several text processing modules can't deal with \s matching a no-break
>> space. I spent too much time trying to learn them to decide if this is a
>> bug or not, finding the one or two lines in each that were at fault. It is
>> a bug if the text can be utf8, which would automatically cause the \s to
>> suddenly match the no-break space. But I wasn't sure which ones are claimed
>> to transparently handle utf8. So, I added a 'use legacy' to the modules,
>> which gives the same behavior as in the past.
>>
>> Several TODOs were accomplished and removed from some regex .t files.
>>
>> I took advantage of changing regcomp.c to add a croak when the re has gone
>> insane; I've had it in my development version for some time. It seems to
>> happen when there are too many /\N{...}/ calls in a program.
>>
>
> I had a quick review of the patch and what you have done.
>
> I have two minor objections, but I don't think they need be seen as roadblocks.
>
> First, the problem of qr// raises its head. You construct a pattern
> in one context with your new pragma in effect, and then embed it in
> another pattern somewhere else and the magicness of the pattern is
> lost. This is the same problem as with use locale, and personally
> something I think breaks the general modern model of patterns. However
> it is better than nothing and modifiers can be leveraged on top of
> your patch so that is fine IMO.
I'm not sure I follow this. I think what you're saying is that the
original pattern is decompiled or thrown away and then recompiled under
the new scheme?
>
> Second, and really this is just another facet of the original problem,
> people now need to modify existing code to preserve the
> existing semantics. If this were controlled by a modifier then this
> wouldn't be necessary, as we would just make the default modifier behave
> as in 5.8.x. Also, if really necessary we could bifurcate the POSIX
> stuff into multiple opcodes (old/new behaviour) and resolve any
> objections to fixing the POSIX opcodes.
One should be able to change the default modifier, I would hope.
>
> However my opinion is this is a really good step forward and should be
> applied to blead.
>
> cheers,
> Yves
>
>
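
The node-selection scheme described in the quoted patch notes (regcomp picks an old or a new node depending on whether 8-bit unicode mode is on, and regexec dispatches on the node type) can be modelled roughly as below. This is only a toy sketch in Python, not the code in regcomp.c or regexec.c: the names Node.SPACE_U, compile_space, match_space and match_posix_space are hypothetical, it covers only \s and [[:space:]] for characters below 256, and it assumes no locale is in effect.

```python
from enum import Enum

# \s in the ASCII range as it stood at the time (no vertical tab).
ASCII_SPACE = set(" \t\n\r\f")
# Code points in 128-255 that Unicode treats as white space: NEL and NO-BREAK SPACE.
LATIN1_SPACE = {"\x85", "\xa0"}


class Node(Enum):
    SPACE = "legacy \\s"      # old node, legacy semantics
    SPACE_U = "unicode \\s"   # hypothetical name for the new node


def compile_space(unicode8bit: bool) -> Node:
    """Compile-time choice: use the new node only when 8-bit unicode mode is on."""
    return Node.SPACE_U if unicode8bit else Node.SPACE


def match_space(node: Node, ch: str, is_utf8: bool) -> bool:
    """Run-time dispatch on the node type, for one character below 256."""
    if ch in ASCII_SPACE:
        return True
    if node is Node.SPACE_U:
        return ch in LATIN1_SPACE          # same answer whatever the encoding
    return is_utf8 and ch in LATIN1_SPACE  # legacy: answer depends on utf8ness


def match_posix_space(ch: str) -> bool:
    """[[:space:]] with no locale in the sketch: ASCII only (includes \\v)."""
    return ch in ASCII_SPACE | {"\x0b"}


if __name__ == "__main__":
    nbsp = "\xa0"
    for mode in (False, True):
        node = compile_space(mode)
        for is_utf8 in (False, True):
            print(node.name, "utf8" if is_utf8 else "bytes",
                  match_space(node, nbsp, is_utf8))
    print("posix", match_posix_space(nbsp))
```

In this model the legacy node gives different answers for a no-break space depending on the string's encoding, while the new node and the Posix class each give a single answer, which is the property the patch notes describe.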