
[perl #124256] Regex loop for \K in lookbehind

From: Karl Williamson via RT
Date: August 6, 2016 23:44
Subject: [perl #124256] Regex loop for \K in lookbehind
Message ID: rt-4.0.24-14742-1470527076-769.124256-15-0@perl.org

On Thu Dec 10 03:37:21 2015, demerphq wrote:
> While I have not tested it the patch looks fine to me. I have to admit
> to feeling stupid reading it actually. It didn't occur to me to simply
> forbid \K inside of lookaround, and it is a simple solution which
> leaves the option of allowing it in the future if we can figure out
> what it should mean and how it should work.  So ++ to you Tony.
> 
> Yves
> 
> On 28 May 2015 at 08:05, Tony Cook via RT <perlbug-followup@perl.org> wrote:
> > On Tue May 26 22:03:58 2015, tonyc wrote:
> >> On Tue Apr 28 14:27:34 2015, demerphq wrote:
> >> > On 28 April 2015 at 22:28, Karl Williamson <public@khwilliamson.com> wrote:
> >> > > I haven't ever looked at the \K code; I was hoping someone else
> >> > > would look at this.
> >> >
> >> > Ok, then I will try to find time to address this.
> >>
> >> What's the intended behaviour of \K in a look-(ahead|behind)?
> >>
> >> My first thought was that it should be ignored, something like:
> >
> > Which was broken in at least one way.
> >
> > Attached is a "better" patch, depending on how much I've managed
> > to mess up the regexp engine ;)
> >
> > Tony
> >
> > ---
> > via perlbug:  queue: perl5 status: open
> > https://rt.perl.org/Ticket/Display.html?id=124256
> >
> > > From 40f4dee6393db72c33d89117b6e092e4e349366c Mon Sep 17 00:00:00 2001
> > > From: Tony Cook <tony@develop-help.com>
> > > Date: Thu, 28 May 2015 16:03:50 +1000
> > > Subject: [PATCH] prevent \K working in lookahead/behind assertions (and warn)
> > >
> > > this is probably incorrect, since I'm clueless about the regexp engine
> >
> > It may be that in_lookbehind and in_lookahead can be combined, since
> > in_lookbehind appears to be only used to maintain its own value.
> > ---
> > pod/perldiag.pod       |    5 +++++
> > > regcomp.c              |   51 +++++++++++++++++++++++++++++++++++++-----------
> > t/lib/warnings/regcomp |   12 ++++++++++++
> > t/re/pat_advanced.t    |   14 +++++++++++++
> > 4 files changed, 71 insertions(+), 11 deletions(-)
> >
> > diff --git a/pod/perldiag.pod b/pod/perldiag.pod
> > index 93ae13b..fb00745 100644
> > --- a/pod/perldiag.pod
> > +++ b/pod/perldiag.pod
> > @@ -6556,6 +6556,11 @@ about the /d modifier.
> >  (W misc) You have a \E in a double-quotish string without a C<\U>,
> >  C<\L> or C<\Q> preceding it.
> >
> > > +=item Useless use of \K in lookbehind/lookahead in regex; marked by S<<-- HERE> in m/%s/
> > > +
> > > +(W regexp) Your regular expression used C<\K> in a lookahead or
> > > +lookbehind assertion, where it has no effect.
> > > +
> > >  =item Useless use of greediness modifier '%c' in regex; marked by S<<-- HERE> in m/%s/
> >
> > (W regexp) You specified something like these:
> > diff --git a/regcomp.c b/regcomp.c
> > index 712c8ed7..945778d 100644
> > --- a/regcomp.c
> > +++ b/regcomp.c
> > @@ -177,6 +177,7 @@ struct RExC_state_t {
> >                                            through */
> >     U32         study_chunk_recursed_bytes;  /* bytes in bitmap */
> >     I32                in_lookbehind;
> > +    I32                in_lookahead;
> >     I32                contains_locale;
> >     I32                contains_i;
> >     I32                override_recoding;
> > @@ -255,6 +256,7 @@ struct RExC_state_t {
> > #define RExC_study_chunk_recursed_bytes  \
> >                                    (pRExC_state-
> > >study_chunk_recursed_bytes)
> > #define RExC_in_lookbehind     (pRExC_state->in_lookbehind)
> > +#define RExC_in_lookahead      (pRExC_state->in_lookahead)
> > #define RExC_contains_locale   (pRExC_state->contains_locale)
> > #define RExC_contains_i (pRExC_state->contains_i)
> > #define RExC_override_recoding (pRExC_state->override_recoding)
> > @@ -6633,6 +6635,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp,
> > int pat_count,
> >     RExC_seen = 0;
> >     RExC_maxlen = 0;
> >     RExC_in_lookbehind = 0;
> > +    RExC_in_lookahead = 0;
> >     RExC_seen_zerolen = *exp == '^' ? -1 : 0;
> >     RExC_extralen = 0;
> >     RExC_override_recoding = 0;
> > @@ -9782,6 +9785,12 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32
> > paren, I32 *flagp,U32 depth)
> >
> > *flagp = 0;                                /* Tentatively. */
> >
> > +    if (RExC_in_lookbehind) {
> > +       RExC_in_lookbehind++;
> > +    }
> > +    if (RExC_in_lookahead) {
> > +        RExC_in_lookahead++;
> > +    }
> >
> > /* Make an OPEN node, if parenthesized. */
> > if (paren) {
> > @@ -10055,9 +10064,11 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32
> > paren, I32 *flagp,U32 depth)
> >             RExC_seen |= REG_LOOKBEHIND_SEEN;
> >            RExC_in_lookbehind++;
> >            RExC_parse++;
> > -                /* FALLTHROUGH */
> > +                RExC_seen_zerolen++;
> > +                break;
> >        case '=':           /* (?=...) */
> > -               RExC_seen_zerolen++;
> > +               RExC_in_lookahead++;
> > +                RExC_seen_zerolen++;
> >             break;
> >        case '!':           /* (?!...) */
> >            RExC_seen_zerolen++;
> > @@ -10685,6 +10696,9 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32
> > paren, I32 *flagp,U32 depth)
> > if (RExC_in_lookbehind) {
> >    RExC_in_lookbehind--;
> > }
> > +    if (RExC_in_lookahead) {
> > +        RExC_in_lookahead--;
> > +    }
> > if (after_freeze > RExC_npar)
> >     RExC_npar = after_freeze;
> > return(ret);
> > @@ -11787,15 +11801,30 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state,
> > I32 *flagp, U32 depth)
> >        *flagp |= SIMPLE;
> >        goto finish_meta_pat;
> >    case 'K':
> > -           RExC_seen_zerolen++;
> > -           ret = reg_node(pRExC_state, KEEPS);
> > -           *flagp |= SIMPLE;
> > > -           /* XXX:dmq : disabling in-place substitution seems to
> > > -            * be necessary here to avoid cases of memory corruption, as
> > > -            * with: C<$_="x" x 80; s/x\K/y/> -- rgs
> > > -            */
> > -            RExC_seen |= REG_LOOKBEHIND_SEEN;
> > -           goto finish_meta_pat;
> > +            if (!RExC_in_lookbehind && !RExC_in_lookahead) {
> > +                RExC_seen_zerolen++;
> > +                ret = reg_node(pRExC_state, KEEPS);
> > +                *flagp |= SIMPLE;
> > > +                /* XXX:dmq : disabling in-place substitution seems to
> > > +                 * be necessary here to avoid cases of memory corruption, as
> > > +                 * with: C<$_="x" x 80; s/x\K/y/> -- rgs
> > > +                 */
> > +                RExC_seen |= REG_LOOKBEHIND_SEEN;
> > +            }
> > +            else {
> > +                if (PASS2) {
> > +                    /* adjust offset so <-- points at the K */
> > +                    ++RExC_parse;
> > > +                    ckWARNreg(RExC_parse, "Useless use of \\K in lookbehind/lookahead");
> > +                    --RExC_parse;
> > +                }
> > > +                /* originally I did goto tryagain here, but that failed
> > > +                 * with an Internal urp when a ) immediately followed the \K.
> > > +                 * So return something, even if it's NOTHING.
> > > +                 */
> > +                ret = reg_node(pRExC_state, NOTHING);
> > +            }
> > +            goto finish_meta_pat;
> >    case 'Z':
> >        ret = reg_node(pRExC_state, SEOL);
> >        *flagp |= SIMPLE;
> > diff --git a/t/lib/warnings/regcomp b/t/lib/warnings/regcomp
> > index b9943a0..9d569cd 100644
> > --- a/t/lib/warnings/regcomp
> > +++ b/t/lib/warnings/regcomp
> > @@ -36,3 +36,15 @@ $a = qr/[\c,]/;
> > EXPECT
> > "\c," is more clearly written simply as "l" at - line 9.
> > "\c," is more clearly written simply as "l" at - line 10.
> > +########
> > +# regcomp.c - \K in assertion
> > +use warnings;
> > +$x = "aaaa";
> > +$x =~ /(?<=\Ka)/;
> > +$x =~ /(?=a\Ka)aa/;
> > +no warnings 'regexp';
> > +$x =~ /(?<=\Ka)/;
> > +$x =~ /(?=a\Ka)aa/;
> > +EXPECT
> > > +Useless use of \K in lookbehind/lookahead in regex; marked by <-- HERE in m/(?<=\K <-- HERE a)/ at - line 4.
> > > +Useless use of \K in lookbehind/lookahead in regex; marked by <-- HERE in m/(?=a\K <-- HERE a)aa/ at - line 5.
> > diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
> > index 891bb66..d8d5823 100644
> > --- a/t/re/pat_advanced.t
> > +++ b/t/re/pat_advanced.t
> > @@ -1498,6 +1498,20 @@ sub run_tests {
> >     $x = "abcde";
> >     $x =~ s/(.)\K/$1/g;
> >     is($x, "aabbccddee", $message);
> > +
> > +        no warnings 'regexp';
> > +        $x = "aaaa";
> > +        $x =~ /(?<=\Ka)/;
> > +        is($&, "", "\\K in lookbehind meaningless");
> > +
> > +        $x =~ /(?<=(a)\K)/;
> > +        is($&, "", "\\K in lookbehind meaningless (with nesting)");
> > +
> > +        $x =~ /(?=a\Ka)aa/;
> > +        is($&, "aa", "\\K in lookahead meaningless");
> > +
> > +        $x =~ /(?=(a)\Ka)aa/;
> > +        is($&, "aa", "\\K in lookahead meaningless (nesting)");
> > }
> >
> > {
> > --
> > 1.7.10.4
> >
> >

Tony, it looks to me like you can apply this patch, possibly with a watchdog-timer test added as well.
-- 
Karl Williamson

---
via perlbug:  queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=124256
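
For readers following the quoted patch: it relies on counters that stay above zero for as long as the parser is inside a lookaround, nested groups included, so that \K can be detected there. That idea can be sketched in isolation. The following is a toy C scanner, not code from regcomp.c. The function name count_keep_in_lookaround is hypothetical, it uses a single counter where the patch keeps separate in_lookbehind and in_lookahead fields, it treats all four lookaround forms alike (the quoted hunk appears to set in_lookahead only for (?=...)), and it ignores character classes, comments and all other regex syntax.

```c
#include <stddef.h>
#include <string.h>

/* Toy illustration only: counts occurrences of \K that sit inside a
 * lookahead or lookbehind group. This is an assumption-laden sketch of
 * the depth-counter idea, not Perl's actual pattern parser. */
static int count_keep_in_lookaround(const char *pat)
{
    int in_look = 0;   /* > 0 while inside any lookaround group */
    int flagged = 0;   /* number of \K seen while in_look > 0 */
    size_t len = strlen(pat);
    size_t i;

    for (i = 0; i < len; i++) {
        char c = pat[i];

        if (c == '\\') {
            if (i + 1 < len) {
                if (pat[i + 1] == 'K' && in_look > 0)
                    flagged++;
                i++;   /* skip the escaped character */
            }
        }
        else if (c == '(') {
            int opens_look =
                strncmp(pat + i, "(?=", 3) == 0 ||
                strncmp(pat + i, "(?!", 3) == 0 ||
                strncmp(pat + i, "(?<=", 4) == 0 ||
                strncmp(pat + i, "(?<!", 4) == 0;

            /* Entering a lookaround, or any group nested in one,
             * deepens the counter by exactly one. */
            if (opens_look || in_look > 0)
                in_look++;
        }
        else if (c == ')') {
            /* Leaving a group that was counted on the way in. */
            if (in_look > 0)
                in_look--;
        }
    }
    return flagged;
}
```

With this sketch, the patterns from the patch's tests, such as (?<=\Ka) and (?<=(a)\K), are each flagged once, while a \K outside any assertion, as in (.)\K, is not.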


