Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
From: jesse
Date: December 9, 2009 14:22
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Message ID: 20091209222154.GB22371@bestpractical.com
Karl,
Thank you very much for all your work on this. I'll admit that the patch
is a bit more extensive than I'd anticipated.
At a basic procedural level, any module that's inside cpan/ really needs
to be patched upstream in the relevant CPAN distribution and then pulled
into blead as they're released to CPAN.
That highlights my base concern here -- As of right now, if those
changes were pushed into the CPAN distributions, they wouldn't run on
any Perl release before 5.11.2.
That issue is fixable if we release to CPAN a version of "feature.pm"
designed for older releases of Perl, but it gets increasingly hard to keep
"legacy" directives synced between the differing realities of different
Perls' concepts of "legacy".
Talking to Nicholas about my concerns, he suggested that many of these
problems would go away if legacy directives always defaulted to enabled.
I know that a number of folks are eager to jettison historical designs
that are now considered to have been mistakes, but intentionally
breaking backwards-compatibility by inverting default behavior isn't
the right thing for us to be doing.
Would you be comfortable with flopping the 'unicode8bit' legacy default
such that users who want the new semantics would use something like:
"no legacy 'unicode8bit'"
or "use feature 'unicode8bit';"
or "use feature ':5.12';"
Thanks,
Jesse
On Wed, Dec 09, 2009 at 12:11:00PM -0700, karl williamson wrote:
> I believe this resolves other bug reports, but haven't had time to look
> them up.
>
> The patch is both attached, and available at:
> git://github.com/khwilliamson/perl.git
> branch: matching
>
> This patch makes case-sensitive regex matching give the same results
> regardless of whether the string and/or pattern are in utf8, unless "use
> legacy 'unicode8bit'" is in effect, in which case it works as before.
>
> Since Yves is incommunicado, I took what he had done before Larry's veto
> and extended and modified it, adding an intermediate way. What that
> means is that anything that looks like [[:xxx:]] will match only in the
> ASCII range, or in the current locale, if set. I never heard any
> controversy about that part of the proposal, and it makes sense to me
> that a Posix construct should act like the Posix definition says to.
>
> \d, \s, and \w (hence \b) and their complements act as before, except
> that when 8-bit unicode mode is on, they also match appropriately in the
> 128-255 range.
>
> This solves the utf8ness problem, as the Posix classes never match outside
> their locale or ASCII, so utf8ness doesn't matter; and the others match the
> same whether utf8 or not.
>
> I was surprised at how little code was actually involved. Making Posix
> always mean Posix simplified things quite a bit. \d doesn't match
> anything in the 128-255 range, so it did not have to be touched.
> Essentially, all that had to be done was to create new regnodes for \s,
> \w, and \b (and complements) that say to match using unicode semantics.
> Everywhere their parallel nodes are in the code, I added these nodes.
> When compiling, regcomp checks for being in 8-bit unicode semantics
> mode, and if so, uses the new node; if not it uses the old node. In
> execution, regexec uses the old definition when matching the old node,
> and the new semantics when the match is for the new node. I split
> [[:word:]] from \w and [[:digit:]] from \d so that they would match
> using Posix semantics regardless of utf8ness.
>
> But that is basically it.
>
> Several .t files depended on the legacy behaviors to test edge cases for
> utf8ness. I added a 'use legacy' to those.
>
> Also, several text processing modules can't deal with \s matching a
> no-break space. I spent too much time trying to learn them to decide if
> this is a bug or not, finding the one or two lines in each that were at
> fault. It is a bug if the text can be utf8, which would automatically
> cause the \s to suddenly match the no-break space. But I wasn't sure
> which ones are claimed to transparently handle utf8. So, I added a 'use
> legacy' to the modules, which gives the same behavior as in the past.
>
> Several TODOs were accomplished and removed from some regex .t files.
>
> I took advantage of changing regcomp.c to add a croak when the re has
> gone insane; I've had it in my development version for some time. It
> seems to happen when there are too many /\N{...}/ calls in a program.
> From 65f96077a5c64ea2ebaa200194782540c112fd8d Mon Sep 17 00:00:00 2001
> From: Karl Williamson <khw@khw-desktop.(none)>
> Date: Wed, 9 Dec 2009 11:25:36 -0700
> Subject: [PATCH] regex case-sensitive match utf8ness independent
>
> ---
> cpan/Pod-Simple/lib/Pod/Simple/BlackBox.pm | 4 +-
> cpan/Test-Harness/lib/TAP/Parser/YAMLish/Reader.pm | 1 +
> cpan/podlators/lib/Pod/Text.pm | 1 +
> cpan/podlators/lib/Pod/Text/Color.pm | 1 +
> cpan/podlators/lib/Pod/Text/Overstrike.pm | 1 +
> cpan/podlators/lib/Pod/Text/Termcap.pm | 1 +
> dist/Storable/t/downgrade.t | 6 +-
> ext/POSIX/t/time.t | 1 +
> handy.h | 13 +
> lib/legacy.t | 127 +++++++++-
> regcomp.c | 269 +++++++++++++++-----
> regcomp.h | 53 +++--
> regcomp.sym | 19 +-
> regexec.c | 176 ++++++++++----
> regnodes.h | 54 +++-
> t/op/sysio.t | 6 +-
> t/re/pat_special_cc.t | 1 +
> t/re/re_tests | 2 +-
> t/re/reg_posixcc.t | 32 ++--
> 19 files changed, 604 insertions(+), 164 deletions(-)