[perl.git] branch blead updated. v5.31.4-268-g830b3eb245

From: Karl Williamson
Date: September 29, 2019 17:46
Subject: [perl.git] branch blead updated. v5.31.4-268-g830b3eb245
Message ID: E1iEdHD-0005mN-4F@git.dc.perl.space

In perl.git, the branch blead has been updated

<https://perl5.git.perl.org/perl.git/commitdiff/830b3eb245d5dbcf095fbd4b5d59764c697c20df?hp=0db1c5b08ab4711aa04177a4549a29f2e83123b6>

- Log -----------------------------------------------------------------
commit 830b3eb245d5dbcf095fbd4b5d59764c697c20df
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Sep 28 14:01:41 2019 -0600

    perl.h: Silence warning when compiled with C++
    
    This silences a warning that the pragma it surrounds is not valid on
    C++.  We don't need to know that, and it clutters the compilation
    output.

commit 5cd61b66283b55e639490151d4e730a840ab13d5
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Sep 28 11:58:59 2019 -0600

    regex: Add LEXACT_ONLY8 node type
    
    This is like LEXACT, but it is known that only strings encoded in UTF-8
    will match it, so we don't even have to try if that condition isn't met.
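
The early exit described above can be sketched as follows.  This is only an illustration under stated assumptions: `struct sketch_node` and `sketch_match` are hypothetical names, and the layout is not perl's real regnode.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified node: NOT perl's real regnode layout. */
struct sketch_node {
    bool        only_utf8;  /* true for an *_ONLY8-style node */
    size_t      len;        /* length of str in bytes */
    const char *str;        /* literal bytes to match */
};

/* Return true if the node's literal matches at the start of target. */
static bool
sketch_match(const struct sketch_node *node,
             const char *target, size_t target_len, bool target_is_utf8)
{
    /* An *_ONLY8-style node can only match a UTF-8 encoded target, so
     * fail immediately without comparing any bytes. */
    if (node->only_utf8 && !target_is_utf8)
        return false;

    if (target_len < node->len)
        return false;

    return memcmp(target, node->str, node->len) == 0;
}
```

The point of the separate node type is that the encoding test is a single flag check, made before any of the (possibly very long) string is examined.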

commit 3363c7035ff1df0c3ffeae0cd18bb86cc39d62e4
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Sep 26 21:38:46 2019 -0600

    regex: Create and handle LEXACT nodes
    
    See the previous commit for info on these.
    
    I am not changing trie code to recognize these at this time.

commit ae06e581c6e9944620eed4980fe89a3749886ed0
Author: Karl Williamson <khw@cpan.org>
Date:   Wed Sep 25 10:12:32 2019 -0600

    Add regnode LEXACT, for long strings
    
    This commit adds a new regnode for strings that don't fit in a regular
    one, and adds a structure for that regnode to use.  Actually using them
    is deferred to the next commit.
    
    This new regnode structure is needed because the previous structure only
    allows for an 8 bit length field, 255 max bytes.  This commit puts the
    length instead in a new field, the same place single-argument regnodes
    put their argument.  Hence this long string is an extra 32 bits of
    overhead, but at no string length is this node ever bigger than the
    combination of the smaller nodes it replaces.
    
    I also considered simply combining the original 8 bit length field
    (which is now unused) with the first byte of the string field to get a
    16 bit length, and have the actual string be offset by 1.  But I
    rejected that because it would mean the string would usually not be
    aligned, slowing down memory accesses.
    
    This new LEXACT regnode can hold as much as 1024 regular EXACT ones,
    using 4K fewer overhead bytes to do so.  That means it can handle
    strings containing about 262000 bytes.  The comments give ideas for expanding
    that should it become necessary or desirable.
    
    Besides the space advantage, any hardware acceleration in memcmp
    can be done in much bigger chunks, and otherwise the memcmp inner loop
    (often written in assembly) will run many more times in a row, and our
    outer loop that calls it, correspondingly fewer.
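
The size arithmetic in this message can be checked with a small sketch.  The two structs below are hypothetical stand-ins for the layouts the message describes (an 8 bit length in the header versus a 32 bit length in the argument slot); the field names are illustrative and are not copied from regcomp.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a regular string node. */
struct sketch_exact {
    uint8_t  str_len;       /* 8 bit length: at most 255 bytes */
    uint8_t  type;
    uint16_t next_off;
    char     string[1];     /* string starts right after the header */
};

/* Hypothetical stand-in for a long string node. */
struct sketch_lexact {
    uint8_t  flags;         /* the old length slot, now unused */
    uint8_t  type;
    uint16_t next_off;
    uint32_t str_len;       /* 32 bit length in the argument slot */
    char     string[1];
};

enum {
    SKETCH_EXACT_OVERHEAD  = offsetof(struct sketch_exact, string),
    SKETCH_LEXACT_OVERHEAD = offsetof(struct sketch_lexact, string),
    SKETCH_EXACT_MAX_LEN   = 255
};

/* Bytes of string that n completely full regular nodes can hold. */
static size_t
sketch_exact_capacity(size_t n)
{
    return n * SKETCH_EXACT_MAX_LEN;
}

/* Header bytes saved when one long node replaces n regular ones.
 * Only meaningful for n >= 2, the smallest case a long node replaces. */
static size_t
sketch_overhead_saved(size_t n)
{
    return n * SKETCH_EXACT_OVERHEAD - SKETCH_LEXACT_OVERHEAD;
}
```

Under these assumptions, 1024 full regular nodes hold 1024 * 255 = 261120 bytes and carry 4096 header bytes, against 8 for the single long node, which is where the "4K fewer overhead bytes" figure comes from.  With two regular nodes the header cost is equal (8 bytes either way), consistent with the statement that the long node is never bigger than what it replaces.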

commit 3ae8ec479bc65ef004bd856d90b82106186771d9
Author: Karl Williamson <khw@cpan.org>
Date:   Sun Sep 22 16:12:07 2019 -0600

    regcomp.c: Change handling of filled EXACT nodes
    
    This changes the detection mechanism to check just before writing to see
    if it would be out of bounds, and if so, instead break out of the loop,
    and go close out the node.  Prior to this commit space for a worst-case
    scenario was reserved, and we didn't start a new character if we were in
    that danger zone.  This left nodes less fully packed than they could
    have been.
    
    Thus this improves the packing of nodes, especially under /i, from the
    previous mechanism.  But more importantly, it sets things up so that we
    can potentially increase the node size as we go along.
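
A minimal sketch of the check-just-before-writing idea, assuming a hypothetical fixed-capacity buffer (`struct sketch_buf`) in place of a real node; none of these names exist in regcomp.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-capacity buffer standing in for a node's string. */
struct sketch_buf {
    char   data[255];
    size_t len;
};

/* Append one already-encoded character of char_len bytes.  The bounds
 * check uses the character's actual size and happens just before the
 * write; no worst-case space is held in reserve.  Returns false, and
 * writes nothing, if the character would not fit. */
static bool
sketch_append(struct sketch_buf *buf, const char *ch, size_t char_len)
{
    if (char_len > sizeof buf->data - buf->len)
        return false;

    memcpy(buf->data + buf->len, ch, char_len);
    buf->len += char_len;
    return true;
}

/* Pack single-byte characters until the input ends or the buffer is
 * full.  Returns how many input bytes were consumed. */
static size_t
sketch_fill(struct sketch_buf *buf, const char *input, size_t input_len)
{
    size_t consumed = 0;

    while (consumed < input_len) {
        if (!sketch_append(buf, input + consumed, 1))
            break;      /* full: stop and let the caller close it out */
        consumed++;
    }
    return consumed;
}
```

Because the test is made per character, the buffer can be filled to its last byte; a scheme that reserves room for the largest possible expansion has to stop that many bytes early every time.
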
    
    This also changes the handling of avoiding splitting a multi-character
    fold across nodes under /i.  For example, take the sequence 'ffi'.  We
    wouldn't want to end a node with 'ff', when the first character in the
    next node is an 'i', as U+FB03 folds to that sequence, and the code that
    does pattern matching can't currently match across node boundaries.
    Previously we backed off filling the node until the final character
    wasn't one that could potentially cause such a break.  That is, we didn't
    look at the next character and see if it was an 'i' (or some other
    potential multi-char fold.)  Now we do look at that next
    character(s), and only back off if this actually would split a real
    multi-char fold.
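
The look-ahead test can be illustrated with a simplified sketch.  It assumes the string is already folded and contains only ASCII, and it uses a small illustrative subset of multi-character fold sequences; the real code consults the full Unicode data, and `sketch_folds` and `sketch_splits_fold` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative subset of multi-character fold results made up solely of
 * ASCII letters; the full Unicode list is larger. */
static const char *const sketch_folds[] = {
    "ffi", "ffl", "ff", "fi", "fl", "ss", "st"
};

/* Return true if ending the first node after s[0..split) would separate
 * the characters of one of the sequences above, that is, if part of the
 * sequence would land on each side of the cut. */
static bool
sketch_splits_fold(const char *s, size_t len, size_t split)
{
    size_t i;

    for (i = 0; i < sizeof sketch_folds / sizeof sketch_folds[0]; i++) {
        const char  *fold = sketch_folds[i];
        const size_t flen = strlen(fold);
        size_t       k;     /* bytes of the fold to the left of the cut */

        for (k = 1; k < flen; k++) {
            if (k > split || split - k + flen > len)
                continue;   /* sequence would not fit at this position */
            if (memcmp(s + split - k, fold, flen) == 0)
                return true;
        }
    }
    return false;
}
```

With the input "xxffi", a cut after "xxff" is rejected because the following 'i' completes 'ffi'.  With "xxffa" the same cut is accepted, since 'ff' sits wholly in the first node and 'a' starts nothing.  The earlier approach would have backed off in both cases merely because the node ended in 'f'.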

commit c45abc0a05f632031d992cdd210e7d08b1e71cf2
Author: Karl Williamson <khw@cpan.org>
Date:   Sun Sep 22 15:26:03 2019 -0600

    regcomp.h: Add comments

commit 741c97a294f71b1272425b04db64d5ae4fca312f
Author: Karl Williamson <khw@cpan.org>
Date:   Sun Sep 22 15:25:23 2019 -0600

    regcomp.h: Remove obsolete macro
    
    This is no longer used.

commit 48503d568f6f366a95e7f5b48535b0cd6600eaca
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Sep 21 14:34:20 2019 -0600

    regcomp.c: Rename three variables
    
    One of the variables was misnamed; its new name, upper_fill, indicates
    that the node has to be left not completely filled.  Comments will be
    added in a later commit.
    
    The other two are renamed in preparation for future changes to more
    accurately describe their new purposes.

commit ab9b57e2cec82267ea64f3853c4d4cf119bb4c7c
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Sep 21 13:31:37 2019 -0600

    regcomp.c: White-space only, comments
    
    Outdent a block that was doubly indented.  Change some other white space
    and fix grammar in a comment.

commit c7fd9c6721948aad186aefa3d0365b0158d17cbb
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Sep 21 13:24:33 2019 -0600

    regcomp: Use new set macro to store a value
    
    This is in preparation for a later commit, in which the current
    mechanism will no longer be a legal lhs.

-----------------------------------------------------------------------

Summary of changes:
 perl.h              |   2 +
 pod/perldebguts.pod |   7 +
 regcomp.c           | 525 +++++++++++++++++++++++++++++++++++-----------------
 regcomp.h           |  57 +++++-
 regcomp.sym         |   5 +
 regexec.c           |  84 ++++++---
 regnodes.h          | 282 ++++++++++++++--------------
 t/re/pat.t          |  80 +++++++-
 8 files changed, 705 insertions(+), 337 deletions(-)

diff --git a/perl.h b/perl.h
index b1ab81dd3b..05dbe0e785 100644
--- a/perl.h
+++ b/perl.h
@@ -7483,6 +7483,7 @@ START_EXTERN_C
  */
 
 /* The quadmath literals are anon structs which -Wc++-compat doesn't like. */
+GCC_DIAG_IGNORE_DECL(-Wpragmas);
 GCC_DIAG_IGNORE_DECL(-Wc++-compat);
 
 #  ifdef USE_QUADMATH
@@ -7553,6 +7554,7 @@ INFNAN_NV_U8_DECL PL_nan = { 0.0/0.0 }; /* keep last */
 #    endif
 #  endif
 
+GCC_DIAG_RESTORE_DECL;
 GCC_DIAG_RESTORE_DECL;
 
 #else
diff --git a/pod/perldebguts.pod b/pod/perldebguts.pod
index 1e23b84af4..4142cf7d35 100644
--- a/pod/perldebguts.pod
+++ b/pod/perldebguts.pod
@@ -660,6 +660,11 @@ will be lost.
 
  EXACT            str        Match this string (flags field is the
                              length).
+
+ # In a long string node, the U32 argument is the length, and is
+ # immediately followed by the string.
+ LEXACT           len:str 1  Match this long string (preceded by length;
+                             flags unused).
  EXACTL           str        Like EXACT, but /l is in effect (used so
                              locale-related warnings can be checked
                              for).
@@ -687,6 +692,8 @@ will be lost.
 
  EXACT_ONLY8      str        Like EXACT, but only UTF-8 encoded targets
                              can match
+ LEXACT_ONLY8     len:str 1  Like LEXACT, but only UTF-8 encoded targets
+                             can match
  EXACTFU_ONLY8    str        Like EXACTFU, but only UTF-8 encoded
                              targets can match
 
diff --git a/regcomp.c b/regcomp.c
index e74f4d8fab..e8e4efb3d5 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -3551,9 +3551,9 @@ S_make_trie(pTHX_ RExC_state_t *pRExC_state, regnode *startbranch,
                     if ( state==1 ) {
                         OP( convert ) = nodetype;
                         str=STRING(convert);
-                        STR_LEN(convert)=0;
+                        setSTR_LEN(convert, 0);
                     }
-                    STR_LEN(convert) += len;
+                    setSTR_LEN(convert, STR_LEN(convert) + len);
                     while (len--)
                         *str++ = *ch++;
 		} else {
@@ -3993,8 +3993,9 @@ S_construct_ahocorasick_from_trie(pTHX_ RExC_state_t *pRExC_state, regnode *sour
  *      using /iaa matching will be doing so almost entirely with ASCII
  *      strings, so this should rarely be encountered in practice */
 
-#define JOIN_EXACT(scan,min_subtract,unfolded_multi_char, flags) \
-    if (PL_regkind[OP(scan)] == EXACT) \
+#define JOIN_EXACT(scan,min_subtract,unfolded_multi_char, flags)    \
+    if (PL_regkind[OP(scan)] == EXACT && OP(scan) != LEXACT         \
+                                      && OP(scan) != LEXACT_ONLY8)  \
         join_exact(pRExC_state,(scan),(min_subtract),unfolded_multi_char, (flags), NULL, depth+1)
 
 STATIC U32
@@ -4160,7 +4161,7 @@ S_join_exact(pTHX_ RExC_state_t *pRExC_state, regnode *scan,
             merged++;
 
             NEXT_OFF(scan) += NEXT_OFF(n);
-            STR_LEN(scan) += STR_LEN(n);
+            setSTR_LEN(scan, STR_LEN(scan) + STR_LEN(n));
             next = n + NODE_SZ_STR(n);
             /* Now we can overwrite *n : */
             Move(STRING(n), STRING(scan) + oldl, STR_LEN(n), char);
@@ -5197,7 +5198,9 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 	    }
 	}
 	else if (   OP(scan) == EXACT
+                 || OP(scan) == LEXACT
                  || OP(scan) == EXACT_ONLY8
+                 || OP(scan) == LEXACT_ONLY8
                  || OP(scan) == EXACTL)
         {
 	    SSize_t l = STR_LEN(scan);
@@ -5319,7 +5322,9 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 		if (flags & (SCF_DO_SUBSTR | SCF_DO_STCLASS)) {
 		    next = NEXTOPER(scan);
 		    if (   OP(next) == EXACT
+                        || OP(next) == LEXACT
                         || OP(next) == EXACT_ONLY8
+                        || OP(next) == LEXACT_ONLY8
                         || OP(next) == EXACTL
                         || (flags & SCF_DO_STCLASS))
                     {
@@ -7978,7 +7983,9 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
         /* Ignore EXACT as we deal with it later. */
 	if (PL_regkind[OP(first)] == EXACT) {
 	    if (   OP(first) == EXACT
+	        || OP(first) == LEXACT
                 || OP(first) == EXACT_ONLY8
+                || OP(first) == LEXACT_ONLY8
                 || OP(first) == EXACTL)
             {
 		NOOP;	/* Empty, get anchored substr later. */
@@ -8324,7 +8331,9 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
                  && nop == END)
             RExC_rx->extflags |= RXf_WHITE;
         else if ( RExC_rx->extflags & RXf_SPLIT
-                  && (fop == EXACT || fop == EXACT_ONLY8 || fop == EXACTL)
+                  && (   fop == EXACT || fop == LEXACT
+                      || fop == EXACT_ONLY8 || fop == LEXACT_ONLY8
+                      || fop == EXACTL)
                   && STR_LEN(first) == 1
                   && *(STRING(first)) == ' '
                   && nop == END )
@@ -13922,13 +13931,14 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 	    UV ender = 0;
 	    char *p;
 	    char *s;
-
-/* This allows us to fill a node with just enough spare so that if the final
- * character folds, its expansion is guaranteed to fit */
-#define MAX_NODE_STRING_SIZE (255-UTF8_MAXBYTES_CASE)
-
 	    char *s0;
-	    U8 upper_parse = MAX_NODE_STRING_SIZE;
+            U32 max_string_len = 255;
+
+            /* We may have to reparse the node, artificially stopping filling
+             * it early, based on info gleaned in the first parse.  This
+             * variable gives where we stop.  Make it above the normal stopping
+             * place first time through. */
+	    U32 upper_fill = max_string_len + 1;
 
             /* We start out as an EXACT node, even if under /i, until we find a
              * character which is in a fold.  The algorithm now segregates into
@@ -13944,7 +13954,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
             /* Assume the node will be fully used; the excess is given back at
              * the end.  We can't make any other length assumptions, as a byte
              * input sequence could shrink down. */
-            Ptrdiff_t initial_size = STR_SZ(256);
+            Ptrdiff_t current_string_nodes = STR_SZ(max_string_len);
 
             bool next_is_quantifier;
             char * oldp = NULL;
@@ -13975,10 +13985,15 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
             /* So is the MICRO SIGN */
             bool has_micro_sign = FALSE;
 
+            /* Set when we fill up the current node and there is still more
+             * text to process */
+            bool overflowed;
+
             /* Allocate an EXACT node.  The node_type may change below to
              * another EXACTish node, but since the size of the node doesn't
              * change, it works */
-            ret = regnode_guts(pRExC_state, node_type, initial_size, "exact");
+            ret = regnode_guts(pRExC_state, node_type, current_string_nodes,
+                                                                    "exact");
             FILL_NODE(ret, node_type);
             RExC_emit++;
 
@@ -13988,6 +14003,12 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
 	  reparse:
 
+            p = RExC_parse;
+            len = 0;
+            s = s0;
+
+          continue_parse:
+
             /* This breaks under rare circumstances.  If folding, we do not
              * want to split a node at a character that is a non-final in a
              * multi-char fold, as an input string could just happen to want to
@@ -14002,12 +14023,14 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                    || UTF8_IS_INVARIANT(UCHARAT(RExC_parse))
                    || UTF8_IS_START(UCHARAT(RExC_parse)));
 
+            overflowed = FALSE;
+
             /* Here, we have a literal character.  Find the maximal string of
              * them in the input that we can fit into a single EXACTish node.
              * We quit at the first non-literal or when the node gets full, or
              * under /i the categorization of folding/non-folding character
              * changes */
-	    for (p = RExC_parse; len < upper_parse && p < RExC_end; ) {
+            while (p < RExC_end && len < upper_fill) {
 
                 /* In most cases each iteration adds one byte to the output.
                  * The exceptions override this */
@@ -14345,20 +14368,29 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                 /* Ready to add 'ender' to the node */
 
                 if (! FOLD) {  /* The simple case, just append the literal */
+                  not_fold_common:
 
-                      not_fold_common:
-                        if (UVCHR_IS_INVARIANT(ender) || ! UTF) {
-                            *(s++) = (char) ender;
-                        }
-                        else {
-                            U8 * new_s = uvchr_to_utf8((U8*)s, ender);
-                            added_len = (char *) new_s - s;
-                            s = (char *) new_s;
+                    /* Don't output if it would overflow */
+                    if (UNLIKELY(len > max_string_len - ((UTF)
+                                                         ? UVCHR_SKIP(ender)
+                                                         : 1)))
+                    {
+                        overflowed = TRUE;
+                        break;
+                    }
 
-                            if (ender > 255)  {
-                                requires_utf8_target = TRUE;
-                            }
+                    if (UVCHR_IS_INVARIANT(ender) || ! UTF) {
+                        *(s++) = (char) ender;
+                    }
+                    else {
+                        U8 * new_s = uvchr_to_utf8((U8*)s, ender);
+                        added_len = (char *) new_s - s;
+                        s = (char *) new_s;
+
+                        if (ender > 255)  {
+                            requires_utf8_target = TRUE;
                         }
+                    }
                 }
                 else if (LOC && is_PROBLEMATIC_LOCALE_FOLD_cp(ender)) {
 
@@ -14424,20 +14456,33 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
                     if (UTF) {  /* Use the folded value */
                         if (UVCHR_IS_INVARIANT(ender)) {
+                            if (UNLIKELY(len + 1 > max_string_len)) {
+                                overflowed = TRUE;
+                                break;
+                            }
+
                             *(s)++ = (U8) toFOLD(ender);
                         }
                         else {
-                            ender = _to_uni_fold_flags(
+                            U8 temp[UTF8_MAXBYTES_CASE+1];
+
+                            UV folded = _to_uni_fold_flags(
                                     ender,
-                                    (U8 *) s,
+                                    temp,
                                     &added_len,
                                     FOLD_FLAGS_FULL | ((ASCII_FOLD_RESTRICTED)
                                                     ? FOLD_FLAGS_NOMIX_ASCII
                                                     : 0));
+                            if (UNLIKELY(len + added_len > max_string_len)) {
+                                overflowed = TRUE;
+                                break;
+                            }
+
+                            Copy(temp, s, added_len, char);
                             s += added_len;
 
-                            if (   ender > 255
-                                && LIKELY(ender != GREEK_SMALL_LETTER_MU))
+                            if (   folded > 255
+                                && LIKELY(folded != GREEK_SMALL_LETTER_MU))
                             {
                                 /* U+B5 folds to the MU, so its possible for a
                                  * non-UTF-8 target to match it */
@@ -14489,9 +14534,16 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                             if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
                                 maybe_SIMPLE = 0;
                                 if (node_type == EXACTFU) {
+
+                                    if (UNLIKELY(len + 2 > max_string_len)) {
+                                        overflowed = TRUE;
+                                        break;
+                                    }
+
                                     *(s++) = 's';
 
-                                    /* Let the code below add in the extra 's' */
+                                    /* Let the code below add in the extra 's'
+                                     * */
                                     ender = 's';
                                     added_len = 2;
                                 }
@@ -14503,6 +14555,11 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                             has_micro_sign = TRUE;
                         }
 
+                        if (UNLIKELY(len + 1 > max_string_len)) {
+                            overflowed = TRUE;
+                            break;
+                        }
+
                         *(s++) = (DEPENDS_SEMANTICS)
                                  ? (char) toFOLD(ender)
 
@@ -14527,168 +14584,280 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
 	    } /* End of loop through literal characters */
 
-            /* Here we have either exhausted the input or ran out of room in
-             * the node.  (If we encountered a character that can't be in the
-             * node, transfer is made directly to <loopdone>, and so we
-             * wouldn't have fallen off the end of the loop.)  In the latter
-             * case, we artificially have to split the node into two, because
-             * we just don't have enough space to hold everything.  This
-             * creates a problem if the final character participates in a
-             * multi-character fold in the non-final position, as a match that
-             * should have occurred won't, due to the way nodes are matched,
-             * and our artificial boundary.  So back off until we find a non-
-             * problematic character -- one that isn't at the beginning or
-             * middle of such a fold.  (Either it doesn't participate in any
-             * folds, or appears only in the final position of all the folds it
-             * does participate in.)  A better solution with far fewer false
-             * positives, and that would fill the nodes more completely, would
-             * be to actually have available all the multi-character folds to
-             * test against, and to back-off only far enough to be sure that
-             * this node isn't ending with a partial one.  <upper_parse> is set
-             * further below (if we need to reparse the node) to include just
-             * up through that final non-problematic character that this code
-             * identifies, so when it is set to less than the full node, we can
-             * skip the rest of this */
-            if (FOLD && p < RExC_end && upper_parse == MAX_NODE_STRING_SIZE) {
-                PERL_UINT_FAST8_T backup_count = 0;
-
-                const STRLEN full_len = len;
-
-		assert(len >= MAX_NODE_STRING_SIZE);
-
-                /* Here, <s> points to just beyond where we have output the
-                 * final character of the node.  Look backwards through the
-                 * string until find a non- problematic character */
-
-		if (! UTF) {
-
-                    /* This has no multi-char folds to non-UTF characters */
-                    if (ASCII_FOLD_RESTRICTED) {
-                        goto loopdone;
-                    }
+            /* Here we have either exhausted the input or run out of room in
+             * the node.  If the former, we are done.  (If we encountered a
+             * character that can't be in the node, transfer is made directly
+             * to <loopdone>, and so we wouldn't have fallen off the end of the
+             * loop.)  */
+            if (LIKELY(! overflowed)) {
+                goto loopdone;
+            }
+
+            /* Here we have run out of room.  We can grow plain EXACT and
+             * LEXACT nodes.  If the pattern is gigantic enough, though,
+             * eventually we'll have to artificially chunk the pattern into
+             * multiple nodes. */
+            if (! LOC && (node_type == EXACT || node_type == LEXACT)) {
+                Size_t overhead = 1 + regarglen[OP(REGNODE_p(ret))];
+                Size_t overhead_expansion = 0;
+                char temp[256];
+                Size_t max_nodes_for_string;
+                Size_t achievable;
+                SSize_t delta;
+
+                /* Here we couldn't fit the final character in the current
+                 * node, so it will have to be reparsed, no matter what else we
+                 * do */
+                p = oldp;
+
+
+                /* If would have overflowed a regular EXACT node, switch
+                 * instead to an LEXACT.  The code below is structured so that
+                 * the actual growing code is common to changing from an EXACT
+                 * or just increasing the LEXACT size.  This means that we have
+                 * to save the string in the EXACT case before growing, and
+                 * then copy it afterwards to its new location */
+                if (node_type == EXACT) {
+                    overhead_expansion = regarglen[LEXACT] - regarglen[EXACT];
+                    RExC_emit += overhead_expansion;
+                    Copy(s0, temp, len, char);
+                }
+
+                /* Ready to grow.  If it was a plain EXACT, the string was
+                 * saved, and the first few bytes of it overwritten by adding
+                 * an argument field.  We assume, as we do elsewhere in this
+                 * file, that one byte of remaining input will translate into
+                 * one byte of output, and if that's too small, we grow again,
+                 * if too large the excess memory is freed at the end */
+
+                max_nodes_for_string = U16_MAX - overhead - overhead_expansion;
+                achievable = MIN(max_nodes_for_string,
+                                 current_string_nodes + STR_SZ(RExC_end - p));
+                delta = achievable - current_string_nodes;
+
+                /* If there is just no more room, go finish up this chunk of
+                 * the pattern. */
+                if (delta <= 0) {
+                    goto loopdone;
+                }
 
-                    while (--s >= s0 && IS_NON_FINAL_FOLD(*s)) {
-                        backup_count++;
-                    }
-                    len = s - s0 + 1;
-		}
-                else {
+                change_engine_size(pRExC_state, delta + overhead_expansion);
+                current_string_nodes += delta;
+                max_string_len
+                           = sizeof(struct regnode) * current_string_nodes;
+                upper_fill = max_string_len + 1;
 
-                    /* Point to the first byte of the final character */
-                    s = (char *) utf8_hop_back((U8 *) s, -1, (U8 *) s0);
+                /* If the length was small, we know this was originally an
+                 * EXACT node now converted to LEXACT, and the string has to be
+                 * restored.  Otherwise the string was untouched.  260 is just
+                 * a number safely above 255 so don't have to worry about
+                 * getting it precise */
+                if (len < 260) {
+                    node_type = LEXACT;
+                    FILL_NODE(ret, node_type);
+                    s0 = STRING(REGNODE_p(ret));
+                    Copy(temp, s0, len, char);
+                    s = s0 + len;
+                }
 
-                    while (s >= s0) {   /* Search backwards until find
-                                           a non-problematic char */
-                        if (UTF8_IS_INVARIANT(*s)) {
+                goto continue_parse;
+            }
+            else {
 
-                            /* There are no ascii characters that participate
-                             * in multi-char folds under /aa.  In EBCDIC, the
-                             * non-ascii invariants are all control characters,
-                             * so don't ever participate in any folds. */
-                            if (ASCII_FOLD_RESTRICTED
-                                || ! IS_NON_FINAL_FOLD(*s))
-                            {
-                                break;
-                            }
-                        }
-                        else if (UTF8_IS_DOWNGRADEABLE_START(*s)) {
-                            if (! IS_NON_FINAL_FOLD(EIGHT_BIT_UTF8_TO_NATIVE(
-                                                                  *s, *(s+1))))
-                            {
-                                break;
-                            }
+                /* Here is /i.  Running out of room creates a problem if we are
+                 * folding, and the split happens in the middle of a
+                 * multi-character fold, as a match that should have occurred,
+                 * won't, due to the way nodes are matched, and our artificial
+                 * boundary.  So back off until we aren't splitting such a
+                 * fold.  If there is no such place to back off to, we end up
+                 * taking the entire node as-is.  This can happen if the node
+                 * consists entirely of 'f' or entirely of 's' characters (or
+                 * things that fold to them) as 'ff' and 'ss' are
+                 * multi-character folds.
+                 *
+                 * At this point:
+                 *  oldp        points to the beginning in the input of the
+                 *              final character in the node.
+                 *  p           points to the beginning in the input of the
+                 *              next character in the input, the one that won't
+                 *              fit in the node.
+                 *
+                 * We aren't in the middle of a multi-char fold unless the
+                 * final character in the node can appear in a non-final
+                 * position in such a fold.  Very few characters actually
+                 * participate in multi-character folds, and fewer still can be
+                 * in the non-final position.  But it's complicated to know
+                 * here if that final character is folded or not, so skip this
+                 * check */
+
+                           /* Make sure enough space for final char of node,
+                            * first char of following node, and the fold of the
+                            * following char (so we don't have to worry about
+                            * that fold running off the end */
+                U8 foldbuf[UTF8_MAXBYTES_CASE * 5 + 1];
+                STRLEN fold_len;
+                UV folded;
+
+                assert(FOLD);
+
+                /* The Unicode standard says that multi character folds consist
+                 * of either two or three characters.  So we create a buffer
+                 * containing a window of three.  The first is the final
+                 * character in the node (folded), and then the two that begin
+                 * the following node.   But if the first character of the
+                 * following node can't be in a non-final fold position, there
+                 * is no need to look at its successor character.  The macros
+                 * used below to check for multi character folds require folded
+                 * inputs, so we have to fold these.  (The fold of p was likely
+                 * calculated in the loop above, but it hasn't beeen saved, and
+                 * khw thinks it would be too entangled to change to do so) */
+
+                if (UTF || LIKELY(UCHARAT(p) != MICRO_SIGN)) {
+                    folded = _to_uni_fold_flags(ender,
+                                                foldbuf,
+                                                &fold_len,
+                                                FOLD_FLAGS_FULL);
+                }
+                else {
+                    foldbuf[0] = folded = MICRO_SIGN;
+                    fold_len = 1;
+                }
+
+                /* Here, foldbuf contains the fold of the first character in
+                 * the next node.  We may also need the next one (if there is
+                 * one) to get our third, but if the first character folded to
+                 * more than one, those extra one(s) will serve as the third.
+                 * Also, we don't need a third unless the previous one can
+                 * appear in a non-final position in a fold */
+                if (  ((RExC_end - p) > ((UTF) ? UVCHR_SKIP(ender) : 1))
+                    && (fold_len == 1 || (   UTF
+                                          && UVCHR_SKIP(folded) == fold_len))
+                    &&  UNLIKELY(_invlist_contains_cp(PL_NonFinalFold, folded)))
+                {
+                    if (UTF) {
+                        STRLEN next_fold_len;
+
+                        toFOLD_utf8_safe((U8*) p + UTF8SKIP(p),
+                                         (U8*) RExC_end, foldbuf + fold_len,
+                                         &next_fold_len);
+                        fold_len += next_fold_len;
+                    }
+                    else {
+                        if (UNLIKELY(p[1] == LATIN_SMALL_LETTER_SHARP_S)) {
+                            foldbuf[fold_len] = 's';
                         }
-                        else if (! _invlist_contains_cp(
-                                        PL_NonFinalFold,
-                                        valid_utf8_to_uvchr((U8 *) s, NULL)))
-                        {
-                            break;
+                        else {
+                            foldbuf[fold_len] = toLOWER_L1(p[1]);
                         }
+                        fold_len++;
+                    }
+                }
 
-                        /* Here, the current character is problematic in that
-                         * it does occur in the non-final position of some
-                         * fold, so try the character before it, but have to
-                         * special case the very first byte in the string, so
-                         * we don't read outside the string */
-                        s = (s == s0) ? s -1 : (char *) utf8_hop((U8 *) s, -1);
-                        backup_count++;
-                    } /* End of loop backwards through the string */
-
-                    /* If there were only problematic characters in the string,
-                     * <s> will point to before s0, in which case the length
-                     * should be 0, otherwise include the length of the
-                     * non-problematic character just found */
-                    len = (s < s0) ? 0 : s - s0 + UTF8SKIP(s);
-		}
+                /* Here foldbuf contains the fold of p, and if appropriate
+                 * that of the character following p in the input. */
 
-                /* Here, have found the final character, if any, that is
-                 * non-problematic as far as ending the node without splitting
-                 * it across a potential multi-char fold.  <len> contains the
-                 * number of bytes in the node up-to and including that
-                 * character, or is 0 if there is no such character, meaning
-                 * the whole node contains only problematic characters.  In
-                 * this case, give up and just take the node as-is.  We can't
-                 * do any better */
-                if (len == 0) {
-                    len = full_len;
+                /* Search backwards until we find a place that doesn't split a
+                 * multi-char fold */
+                while (1) {
+                    STRLEN s_len;
+                    char s_fold_buf[UTF8_MAXBYTES_CASE];
+                    char * s_fold = s_fold_buf;
 
-                } else {
+                    if (s <= s0) {
 
-                    /* Here, the node does contain some characters that aren't
-                     * problematic.  If we didn't have to backup any, then the
-                     * final character in the node is non-problematic, and we
-                     * can take the node as-is */
-                    if (backup_count == 0) {
-                        goto loopdone;
+                        /* There's no safe place in the node to split.  Quit so
+                         * that we take the whole node */
+                        break;
                     }
-                    else if (backup_count == 1) {
 
-                        /* If the final character is problematic, but the
-                         * penultimate is not, back-off that last character to
-                         * later start a new node with it */
-                        p = oldp;
-                        goto loopdone;
+                    /* Backup 1 character.  The first time through this moves s
+                     * to point to the final character in the node */
+                    if (UTF) {
+                        s = (char *) utf8_hop_back((U8 *) s, -1, (U8 *) s0);
+                    }
+                    else {
+                        s--;
                     }
 
-                    /* Here, the final non-problematic character is earlier
-                     * in the input than the penultimate character.  What we do
-                     * is reparse from the beginning, going up only as far as
-                     * this final ok one, thus guaranteeing that the node ends
-                     * in an acceptable character.  The reason we reparse is
-                     * that we know how far in the character is, but we don't
-                     * know how to correlate its position with the input parse.
-                     * An alternate implementation would be to build that
-                     * correlation as we go along during the original parse,
-                     * but that would entail extra work for every node, whereas
-                     * this code gets executed only when the string is too
-                     * large for the node, and the final two characters are
-                     * problematic, an infrequent occurrence.  Yet another
-                     * possible strategy would be to save the tail of the
-                     * string, and the next time regatom is called, initialize
-                     * with that.  The problem with this is that unless you
-                     * back off one more character, you won't be guaranteed
-                     * regatom will get called again, unless regbranch,
-                     * regpiece ... are also changed.  If you do back off that
-                     * extra character, so that there is input guaranteed to
-                     * force calling regatom, you can't handle the case where
-                     * just the first character in the node is acceptable.  I
-                     * (khw) decided to try this method which doesn't have that
-                     * pitfall; if performance issues are found, we can do a
-                     * combination of the current approach plus that one */
-                    upper_parse = len;
-                    len = 0;
-                    s = s0;
-                    goto reparse;
+                    /* 's' may or may not be folded; so make sure it is, and
+                     * use just the final character in its fold (should there
+                     * be more than one) */
+                    if (UTF) {
+                        toFOLD_utf8_safe((U8*) s,
+                                         (U8*) s + UTF8SKIP(s),
+                                         (U8 *) s_fold_buf, &s_len);
+                        while (s_fold + UTF8SKIP(s_fold) < s_fold_buf + s_len)
+                        {
+                            s_fold += UTF8SKIP(s_fold);
+                        }
+                        s_len = UTF8SKIP(s_fold);
+                    }
+                    else {
+                        if (UNLIKELY(UCHARAT(s) == LATIN_SMALL_LETTER_SHARP_S))
+                        {
+                            s_fold_buf[0] = 's';
+                        }
+                        else {  /* This works for all other non-UTF-8 folds
+                                 */
+                            s_fold_buf[0] = toLOWER_L1(UCHARAT(s));
+                        }
+                        s_len = 1;
+                    }
+
+                    /* Unshift this character to the beginning of the buffer.
+                     * Trailing characters that are no longer needed are
+                     * overwritten. */
+                    Move(foldbuf, foldbuf + s_len, sizeof(foldbuf) - s_len, U8);
+                    Copy(s_fold, foldbuf, s_len, U8);
+
+                    /* If this isn't a multi-character fold, we have found a
+                     * splittable place.  If this is the final character in the
+                     * node, that means the node is valid as-is, and can quit.
+                     * Otherwise, we note how much we can fill the node before
+                     * coming to a non-splittable position, and go parse it
+                     * again, stopping there. This is done because we know
+                     * where in the output to stop, but we don't have a map to
+                     * where that is in the input.  One could be created, but
+                     * it seems like overkill for such a rare event as we are
+                     * dealing with here */
+                    if (UTF) {
+                        if (! is_MULTI_CHAR_FOLD_utf8_safe(foldbuf,
+                                                foldbuf + UTF8_MAXBYTES_CASE))
+                        {
+                            upper_fill = s + UTF8SKIP(s) - s0;
+                            if (LIKELY(upper_fill == 255)) {
+                                break;
+                            }
+                            goto reparse;
+                        }
+                    }
+                    else if (! is_MULTI_CHAR_FOLD_latin1_safe(foldbuf,
+                                                foldbuf + UTF8_MAXBYTES_CASE))
+                    {
+                        upper_fill = s + 1 - s0;
+                        if (LIKELY(upper_fill == 255)) {
+                            break;
+                        }
+                        goto reparse;
+                    }
                 }
+
+                /* Here the node consists entirely of non-final multi-char
+                 * folds.  (Likely it is all 'f's or all 's's.)  There's no
+                 * decent place to split it, so give up and just take the whole
+                 * thing */
+
 	    }   /* End of verifying node ends with an appropriate char */
 
+            p = oldp;
+
           loopdone:   /* Jumped to when encounters something that shouldn't be
                          in the node */
 
             /* Free up any over-allocated space; cast is to silence bogus
              * warning in MS VC */
             change_engine_size(pRExC_state,
-                                - (Ptrdiff_t) (initial_size - STR_SZ(len)));
+                        - (Ptrdiff_t) (current_string_nodes - STR_SZ(len)));
 
             /* I (khw) don't know if you can get here with zero length, but the
              * old code handled this situation by creating a zero-length EXACT
@@ -14707,7 +14876,13 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                     else if (requires_utf8_target) {
                         node_type = EXACT_ONLY8;
                     }
-                } else if (FOLD) {
+                }
+                else if (node_type == LEXACT) {
+                    if (requires_utf8_target) {
+                        node_type = LEXACT_ONLY8;
+                    }
+                }
+                else if (FOLD) {
                     if (    UNLIKELY(has_micro_sign || has_ss)
                         && (node_type == EXACTFU || (   node_type == EXACTF
                                                      && maybe_exactfu)))
@@ -14760,11 +14935,11 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                 }
 
                 OP(REGNODE_p(ret)) = node_type;
-                STR_LEN(REGNODE_p(ret)) = len;
+                setSTR_LEN(REGNODE_p(ret), len);
                 RExC_emit += STR_SZ(len);
 
                 /* If the node isn't a single character, it can't be SIMPLE */
-                if (len > (Size_t) ((UTF) ? UVCHR_SKIP(ender) : 1)) {
+                if (len > (Size_t) ((UTF) ? UTF8SKIP(STRING(REGNODE_p(ret))) : 1)) {
                     maybe_SIMPLE = 0;
                 }
 
@@ -18802,7 +18977,7 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth,
                     ret = regnode_guts(pRExC_state, op, len, "exact");
                     FILL_NODE(ret, op);
                     RExC_emit += 1 + STR_SZ(len);
-                    STR_LEN(REGNODE_p(ret)) = len;
+                    setSTR_LEN(REGNODE_p(ret), len);
                     if (len == 1) {
                         *STRING(REGNODE_p(ret)) = (U8) value;
                     }
@@ -19951,7 +20126,9 @@ S_regtail_study(pTHX_ RExC_state_t *pRExC_state, regnode_offset p,
 #endif
         if ( exact ) {
             switch (OP(REGNODE_p(scan))) {
+                case LEXACT:
                 case EXACT:
+                case LEXACT_ONLY8:
                 case EXACT_ONLY8:
                 case EXACTL:
                 case EXACTF:
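
The backward search added to S_regatom above can be hard to follow inside a
diff, so here is a minimal standalone sketch of the idea.  It is not the code
from the commit.  It assumes a non-UTF-8 string and treats "ss" (the fold of
the sharp s) as the only multi-character fold, whereas the real code uses a
three-character window and the is_MULTI_CHAR_FOLD_* macros.  The name
toy_split_point is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Toy model: return how many bytes of 's' may go into a node that holds at
 * most 'max' bytes, without the node boundary falling between two characters
 * that together could match a multi-character fold.  ASSUMPTION: the only
 * such fold is "ss"; the real regcomp.c check covers many more sequences. */
static size_t toy_split_point(const char *s, size_t len, size_t max)
{
    size_t fill;

    assert(len > max);  /* There is input left over after the node */

    /* Walk backwards from the fullest possible node */
    for (fill = max; fill > 0; fill--) {
        int before = tolower((unsigned char) s[fill - 1]);
        int after  = tolower((unsigned char) s[fill]);

        /* Splitting here separates s[fill-1] from s[fill]; that is safe
         * unless the pair forms the fold we must keep together */
        if (! (before == 's' && after == 's')) {
            return fill;
        }
    }

    /* No safe place to split: give up and take the whole node */
    return max;
}
```

When the returned value equals 'max' the node can be used as-is; a smaller
value corresponds to the case where the commit reparses with a lower
upper_fill.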
diff --git a/regcomp.h b/regcomp.h
index d9f2cbe63e..520e60e399 100644
--- a/regcomp.h
+++ b/regcomp.h
@@ -156,6 +156,14 @@ struct regnode_string {
     char string[1];
 };
 
+struct regnode_lstring { /* Constructed this way to keep the string aligned. */
+    U8	flags;
+    U8  type;
+    U16 next_off;
+    U32 str_len;    /* Only 16 bits allowed before would overflow 'next_off' */
+    char string[1];
+};
+
 /* Argument bearing node - workhorse, 
    arg1 is often for the data field */
 struct regnode_1 {
@@ -324,20 +332,57 @@ struct regnode_ssc {
 
 #undef OP
 #undef OPERAND
-#undef MASK
 #undef STRING
 
 #define	OP(p)		((p)->type)
 #define FLAGS(p)	((p)->flags)	/* Caution: Doesn't apply to all      \
 					   regnode types.  For some, it's the \
 					   character set of the regnode */
+#define	STR_LENs(p)	(__ASSERT_(OP(p) != LEXACT && OP(p) != LEXACT_ONLY8)  \
+                                    ((struct regnode_string *)p)->str_len)
+#define	STRINGs(p)	(__ASSERT_(OP(p) != LEXACT && OP(p) != LEXACT_ONLY8)  \
+                                    ((struct regnode_string *)p)->string)
+#define	OPERANDs(p)	STRINGs(p)
+
+/* Long strings.  Currently limited to length 18 bits, which handles a 262000
+ * byte string.  The limiting factor is the 16 bit 'next_off' field, which
+ * points to the next regnode, so the furthest away it can be is 2**16.  On
+ * most architectures, regnodes are 2**2 bytes long, so that yields 2**18
+ * bytes.  Should a longer string be desired, we could increase it to 26 bits
+ * fairly easily, by changing this node to have longj type which causes the ARG
+ * field to be used for the link to the next regnode (although code would have
+ * to be changed to account for this), and then use a combination of the flags
+ * and next_off fields for the length.  To get 34 bit length, also change the
+ * node to be an ARG2L, using the second 32 bit field for the length, and not
+ * using the flags nor next_off fields at all.  One could have an llstring node
+ * and even an lllstring type. */
+#define	STR_LENl(p)	(__ASSERT_(OP(p) == LEXACT || OP(p) == LEXACT_ONLY8)  \
+                                    (((struct regnode_lstring *)p)->str_len))
+#define	STRINGl(p)	(__ASSERT_(OP(p) == LEXACT || OP(p) == LEXACT_ONLY8)  \
+                                    (((struct regnode_lstring *)p)->string))
+#define	OPERANDl(p)	STRINGl(p)
+
+#define	STR_LEN(p)	((OP(p) == LEXACT || OP(p) == LEXACT_ONLY8)           \
+                                               ? STR_LENl(p) : STR_LENs(p))
+#define	STRING(p)	((OP(p) == LEXACT || OP(p) == LEXACT_ONLY8)           \
+                                               ? STRINGl(p)  : STRINGs(p))
 #define	OPERAND(p)	STRING(p)
 
-#define MASK(p)		((char*)OPERAND(p))
-#define	STR_LEN(p)	(((struct regnode_string *)p)->str_len)
-#define	STRING(p)	(((struct regnode_string *)p)->string)
+/* The number of (smallest) regnode equivalents that a string of length l bytes
+ * occupies */
 #define STR_SZ(l)	(((l) + sizeof(regnode) - 1) / sizeof(regnode))
-#define NODE_SZ_STR(p)	(STR_SZ(STR_LEN(p))+1)
+
+/* The number of (smallest) regnode equivalents that the EXACTISH node 'p'
+ * occupies */
+#define NODE_SZ_STR(p)	(STR_SZ(STR_LEN(p)) + 1 + regarglen[(p)->type])
+
+#define setSTR_LEN(p,v)                                                     \
+    STMT_START{                                                             \
+        if (OP(p) == LEXACT || OP(p) == LEXACT_ONLY8)                       \
+            ((struct regnode_lstring *)(p))->str_len = (v);                 \
+        else                                                                \
+            ((struct regnode_string *)(p))->str_len = (v);                  \
+    } STMT_END
 
 #undef NODE_ALIGN
 #undef ARG_LOC
@@ -716,6 +761,8 @@ struct regnode_ssc {
 #  define UCHARAT(p)	((int)*(p)&CHARMASK)
 #endif
 
+/* Number of regnode equivalents that 'guy' occupies beyond the size of the
+ * smallest regnode. */
 #define EXTRA_SIZE(guy) ((sizeof(guy)-1)/sizeof(struct regnode))
 
 #define REG_ZERO_LEN_SEEN                   0x00000001
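
To make the size arithmetic in the regcomp.h comments above concrete, here is
a small standalone sketch.  It assumes the common case of a 4-byte regnode;
the toy_* names are hypothetical stand-ins, not identifiers from perl.  The
rounding matches STR_SZ, and the extra argument slot for long nodes mirrors
the "len:str 1" entry in regcomp.sym.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the smallest regnode */
typedef struct {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;
} toy_regnode;

/* ASSUMPTION: 4-byte regnodes, as on most architectures */
_Static_assert(sizeof(toy_regnode) == 4, "sketch assumes 4-byte regnodes");

/* Same rounding as STR_SZ(l): regnode-sized slots needed for l bytes */
static size_t toy_str_sz(size_t l)
{
    return (l + sizeof(toy_regnode) - 1) / sizeof(toy_regnode);
}

/* Slots for a whole string node: the string, one header slot, and any
 * argument slots (0 for a short node, 1 for a long node whose 32-bit
 * length takes a slot of its own) */
static size_t toy_node_sz_str(size_t l, size_t arg_slots)
{
    return toy_str_sz(l) + 1 + arg_slots;
}

/* Furthest distance in bytes that a 16-bit next_off, counted in regnodes,
 * can reach */
static size_t toy_max_reach_bytes(void)
{
    return ((size_t) 1 << 16) * sizeof(toy_regnode);
}
```

With 4-byte regnodes, toy_max_reach_bytes() is 2**18 = 262144, which is where
the "18 bits" limit in the comment comes from.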
diff --git a/regcomp.sym b/regcomp.sym
index 8a2fb240f1..fd594dfdcd 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -117,6 +117,10 @@ BRANCH      BRANCH,     node 0 V  ; Match this alternative, or the next...
 # NOTE: the relative ordering of these types is important do not change it
 
 EXACT       EXACT,      str       ; Match this string (flags field is the length).
+
+#* In a long string node, the U32 argument is the length, and is
+#* immediately followed by the string.
+LEXACT      EXACT,  len:str 1; Match this long string (preceded by length; flags unused).
 EXACTL      EXACT,      str       ; Like EXACT, but /l is in effect (used so locale-related warnings can be checked for).
 EXACTF      EXACT,      str       ; Like EXACT, but match using /id rules; (string not UTF-8, not guaranteed to be folded).
 EXACTFL     EXACT,      str       ; Like EXACT, but match using /il rules; (string not likely to be folded).
@@ -137,6 +141,7 @@ EXACTFAA_NO_TRIE  EXACT, str	  ; Like EXACT, but match using /iaa rules (string
 
 
 EXACT_ONLY8 EXACT,      str       ; Like EXACT, but only UTF-8 encoded targets can match
+LEXACT_ONLY8 EXACT,  len:str 1    ; Like LEXACT, but only UTF-8 encoded targets can match
 EXACTFU_ONLY8 EXACT,    str       ; Like EXACTFU, but only UTF-8 encoded targets can match
 # One could add EXACTFAA8 and something that has the same effect for /l,
 # but these would be extremely uncommon
diff --git a/regexec.c b/regexec.c
index a6e5f87bee..db19a50d86 100644
--- a/regexec.c
+++ b/regexec.c
@@ -2298,8 +2298,8 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
          * first character.  c2 is its fold.  This logic will not work for
          * Unicode semantics and the german sharp ss, which hence should
          * not be compiled into a node that gets here. */
-        pat_string = STRING(c);
-        ln  = STR_LEN(c);	/* length to match in octets/bytes */
+        pat_string = STRINGs(c);
+        ln  = STR_LENs(c);	/* length to match in octets/bytes */
 
         /* We know that we have to match at least 'ln' bytes (which is the
          * same as characters, since not utf8).  If we have to match 3
@@ -2374,8 +2374,8 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
         /* If one of the operands is in utf8, we can't use the simpler folding
          * above, due to the fact that many different characters can have the
          * same fold, or portion of a fold, or different- length fold */
-        pat_string = STRING(c);
-        ln  = STR_LEN(c);	/* length to match in octets/bytes */
+        pat_string = STRINGs(c);
+        ln  = STR_LENs(c);	/* length to match in octets/bytes */
         pat_end = pat_string + ln;
         lnc = is_utf8_pat       /* length to match in characters */
                 ? utf8_length((U8 *) pat_string, (U8 *) pat_end)
@@ -4237,7 +4237,9 @@ S_setup_EXACTISH_ST_c1_c2(pTHX_ const regnode * const text_node, int *c1p,
     U8 folded[UTF8_MAX_FOLD_CHAR_EXPAND * UTF8_MAXBYTES_CASE + 1] = { '\0' };
 
     if (   OP(text_node) == EXACT
+        || OP(text_node) == LEXACT
         || OP(text_node) == EXACT_ONLY8
+        || OP(text_node) == LEXACT_ONLY8
         || OP(text_node) == EXACTL)
     {
 
@@ -4246,7 +4248,8 @@ S_setup_EXACTISH_ST_c1_c2(pTHX_ const regnode * const text_node, int *c1p,
          * copy the input to the output, avoiding finding the code point of
          * that character */
         if (!is_utf8_pat) {
-            assert(OP(text_node) != EXACT_ONLY8);
+            assert(   OP(text_node) != EXACT_ONLY8
+                   && OP(text_node) != LEXACT_ONLY8);
             c2 = c1 = *pat;
         }
         else if (utf8_target) {
@@ -4254,7 +4257,9 @@ S_setup_EXACTISH_ST_c1_c2(pTHX_ const regnode * const text_node, int *c1p,
             Copy(pat, c2_utf8, UTF8SKIP(pat), U8);
             utf8_has_been_setup = TRUE;
         }
-        else if (OP(text_node) == EXACT_ONLY8) {
+        else if (   OP(text_node) == EXACT_ONLY8
+                 || OP(text_node) == LEXACT_ONLY8)
+        {
             return FALSE;   /* Can only match UTF-8 target */
         }
         else {
@@ -4262,7 +4267,7 @@ S_setup_EXACTISH_ST_c1_c2(pTHX_ const regnode * const text_node, int *c1p,
         }
     }
     else { /* an EXACTFish node */
-        U8 *pat_end = pat + STR_LEN(text_node);
+        U8 *pat_end = pat + STR_LENs(text_node);
 
         /* An EXACTFL node has at least some characters unfolded, because what
          * they match is not known until now.  So, now is the time to fold
@@ -6274,6 +6279,20 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
         }
 #undef  ST
 
+	case LEXACT_ONLY8:
+            if (! utf8_target) {
+                sayNO;
+            }
+            /* FALLTHROUGH */
+
+	case LEXACT:
+        {
+	    char *s;
+
+	    s = STRINGl(scan);
+	    ln = STR_LENl(scan);
+            goto join_short_long_exact;
+
 	case EXACTL:             /*  /abc/l       */
             _CHECK_AND_WARN_PROBLEMATIC_LOCALE;
 
@@ -6292,11 +6311,13 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
                 sayNO;
             }
             /* FALLTHROUGH */
-	case EXACT: {            /*  /abc/        */
-	    char *s;
+
+	case EXACT:             /*  /abc/        */
           do_exact:
-	    s = STRING(scan);
-	    ln = STR_LEN(scan);
+	    s = STRINGs(scan);
+	    ln = STR_LENs(scan);
+
+          join_short_long_exact:
 	    if (utf8_target != is_utf8_pat) {
 		/* The target and the pattern have differing utf8ness. */
 		char *l = locinput;
@@ -6448,8 +6469,8 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 	    fold_utf8_flags = 0;
 
 	  do_exactf:
-	    s = STRING(scan);
-	    ln = STR_LEN(scan);
+	    s = STRINGs(scan);
+	    ln = STR_LENs(scan);
 
 	    if (   utf8_target
                 || is_utf8_pat
@@ -9363,6 +9384,22 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p,
 	else
 	    scan = this_eol;
 	break;
+
+    case LEXACT_ONLY8:
+        if (! utf8_target) {
+            break;
+        }
+        /* FALLTHROUGH */
+
+    case LEXACT:
+      {
+        U8 * string;
+        Size_t str_len;
+
+	string = (U8 *) STRINGl(p);
+        str_len = STR_LENl(p);
+        goto join_short_long_exact;
+
     case EXACTL:
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;
         if (utf8_target && UTF8_IS_ABOVE_LATIN1(*scan)) {
@@ -9377,9 +9414,13 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p,
         /* FALLTHROUGH */
     case EXACT:
       do_exact:
-        assert(STR_LEN(p) == reginfo->is_utf8_pat ? UTF8SKIP(STRING(p)) : 1);
+	string = (U8 *) STRINGs(p);
+        str_len = STR_LENs(p);
+
+      join_short_long_exact:
+        assert(str_len == reginfo->is_utf8_pat ? UTF8SKIP(string) : 1);
 
-	c = (U8)*STRING(p);
+	c = *string;
 
         /* Can use a simple find if the pattern char to match on is invariant
          * under UTF-8, or both target and pattern aren't UTF-8.  Note that we
@@ -9401,8 +9442,8 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p,
                  * string EQ */
                 while (hardcount < max
                        && scan < this_eol
-                       && (scan_char_len = UTF8SKIP(scan)) <= STR_LEN(p)
-                       && memEQ(scan, STRING(p), scan_char_len))
+                       && (scan_char_len = UTF8SKIP(scan)) <= str_len
+                       && memEQ(scan, string, scan_char_len))
                 {
                     scan += scan_char_len;
                     hardcount++;
@@ -9412,7 +9453,7 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p,
 
                 /* Target isn't utf8; convert the character in the UTF-8
                  * pattern to non-UTF8, and do a simple find */
-                c = EIGHT_BIT_UTF8_TO_NATIVE(c, *(STRING(p) + 1));
+                c = EIGHT_BIT_UTF8_TO_NATIVE(c, *(string + 1));
                 scan = (char *) find_span_end((U8 *) scan, (U8 *) this_eol, (U8) c);
             } /* else pattern char is above Latin1, can't possibly match the
                  non-UTF-8 target */
@@ -9436,6 +9477,7 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p,
 	    }
 	}
 	break;
+      }
 
     case EXACTFAA_NO_TRIE: /* This node only generated for non-utf8 patterns */
         assert(! reginfo->is_utf8_pat);
@@ -9486,7 +9528,7 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p,
         int c1, c2;
         U8 c1_utf8[UTF8_MAXBYTES+1], c2_utf8[UTF8_MAXBYTES+1];
 
-        assert(STR_LEN(p) == reginfo->is_utf8_pat ? UTF8SKIP(STRING(p)) : 1);
+        assert(STR_LENs(p) == reginfo->is_utf8_pat ? UTF8SKIP(STRINGs(p)) : 1);
 
         if (S_setup_EXACTISH_ST_c1_c2(aTHX_ p, &c1, c1_utf8, &c2, c2_utf8,
                                         reginfo))
@@ -9494,10 +9536,10 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p,
             if (c1 == CHRTEST_VOID) {
                 /* Use full Unicode fold matching */
                 char *tmpeol = loceol;
-                STRLEN pat_len = reginfo->is_utf8_pat ? UTF8SKIP(STRING(p)) : 1;
+                STRLEN pat_len = reginfo->is_utf8_pat ? UTF8SKIP(STRINGs(p)) : 1;
                 while (hardcount < max
                         && foldEQ_utf8_flags(scan, &tmpeol, 0, utf8_target,
-                                             STRING(p), NULL, pat_len,
+                                             STRINGs(p), NULL, pat_len,
                                              reginfo->is_utf8_pat, utf8_flags))
                 {
                     scan = tmpeol;
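
The regexec.c hunks above all follow one pattern: the *_ONLY8 node types fail
immediately on a non-UTF-8 target, and otherwise the long and short node types
join a single comparison path.  Below is a minimal standalone sketch of that
control flow.  It is only an illustration under simplifying assumptions: the
toy_* types are hypothetical, and it ignores the case where pattern and target
differ in UTF-8ness, which the real S_regmatch handles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the four non-folding string node types */
typedef enum {
    TOY_EXACT,
    TOY_EXACT_ONLY8,
    TOY_LEXACT,
    TOY_LEXACT_ONLY8
} toy_op;

typedef struct {
    toy_op      op;
    const char *str;    /* the literal to match */
    size_t      len;    /* its length in bytes */
} toy_node;

/* Return true if the node's string matches at the start of 'input' */
static bool toy_match_here(const toy_node *n, const char *input,
                           size_t input_len, bool utf8_target)
{
    assert(n != NULL);

    switch (n->op) {
      case TOY_LEXACT_ONLY8:
      case TOY_EXACT_ONLY8:
        if (! utf8_target) {
            return false;   /* Can only match a UTF-8 target */
        }
        break;

      case TOY_LEXACT:
      case TOY_EXACT:
        break;
    }

    /* Shared path for short and long nodes: a plain byte comparison */
    return input_len >= n->len && memcmp(input, n->str, n->len) == 0;
}
```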
diff --git a/regnodes.h b/regnodes.h
index a1929b823f..4174ad2544 100644
--- a/regnodes.h
+++ b/regnodes.h
@@ -6,8 +6,8 @@
 
 /* Regops and State definitions */
 
-#define REGNODE_MAX           	103
-#define REGMATCH_STATE_MAX    	143
+#define REGNODE_MAX           	105
+#define REGMATCH_STATE_MAX    	145
 
 #define	END                   	0	/* 0000 End of program. */
 #define	SUCCEED               	1	/* 0x01 Return from a subroutine, basically. */
@@ -49,72 +49,74 @@
 #define	CLUMP                 	35	/* 0x23 Match any extended grapheme cluster sequence */
 #define	BRANCH                	36	/* 0x24 Match this alternative, or the next... */
 #define	EXACT                 	37	/* 0x25 Match this string (flags field is the length). */
-#define	EXACTL                	38	/* 0x26 Like EXACT, but /l is in effect (used so locale-related warnings can be checked for). */
-#define	EXACTF                	39	/* 0x27 Like EXACT, but match using /id rules; (string not UTF-8, not guaranteed to be folded). */
-#define	EXACTFL               	40	/* 0x28 Like EXACT, but match using /il rules; (string not likely to be folded). */
-#define	EXACTFU               	41	/* 0x29 Like EXACT, but match using /iu rules; (string folded). */
-#define	EXACTFAA              	42	/* 0x2a Like EXACT, but match using /iaa rules; (string folded iff pattern is UTF8; folded length <= unfolded). */
-#define	EXACTFUP              	43	/* 0x2b Like EXACT, but match using /iu rules; (string not UTF-8, not guaranteed to be folded; and it is Problematic). */
-#define	EXACTFLU8             	44	/* 0x2c Like EXACTFU, but use /il, UTF-8, (string is folded, and everything in it is above 255. */
-#define	EXACTFAA_NO_TRIE      	45	/* 0x2d Like EXACT, but match using /iaa rules (string not UTF-8, not guaranteed to be folded, not currently trie-able). */
-#define	EXACT_ONLY8           	46	/* 0x2e Like EXACT, but only UTF-8 encoded targets can match */
-#define	EXACTFU_ONLY8         	47	/* 0x2f Like EXACTFU, but only UTF-8 encoded targets can match */
-#define	EXACTFU_S_EDGE        	48	/* 0x30 /di rules, but nothing in it precludes /ui, except begins and/or ends with [Ss]; (string not UTF-8; compile-time only). */
-#define	NOTHING               	49	/* 0x31 Match empty string. */
-#define	TAIL                  	50	/* 0x32 Match empty string. Can jump here from outside. */
-#define	STAR                  	51	/* 0x33 Match this (simple) thing 0 or more times. */
-#define	PLUS                  	52	/* 0x34 Match this (simple) thing 1 or more times. */
-#define	CURLY                 	53	/* 0x35 Match this simple thing {n,m} times. */
-#define	CURLYN                	54	/* 0x36 Capture next-after-this simple thing */
-#define	CURLYM                	55	/* 0x37 Capture this medium-complex thing {n,m} times. */
-#define	CURLYX                	56	/* 0x38 Match this complex thing {n,m} times. */
-#define	WHILEM                	57	/* 0x39 Do curly processing and see if rest matches. */
-#define	OPEN                  	58	/* 0x3a Mark this point in input as start of #n. */
-#define	CLOSE                 	59	/* 0x3b Close corresponding OPEN of #n. */
-#define	SROPEN                	60	/* 0x3c Same as OPEN, but for script run */
-#define	SRCLOSE               	61	/* 0x3d Close preceding SROPEN */
-#define	REF                   	62	/* 0x3e Match some already matched string */
-#define	REFF                  	63	/* 0x3f Match already matched string, using /di rules. */
-#define	REFFL                 	64	/* 0x40 Match already matched string, using /li rules. */
-#define	REFFU                 	65	/* 0x41 Match already matched string, usng /ui. */
-#define	REFFA                 	66	/* 0x42 Match already matched string, using /aai rules. */
-#define	REFN                  	67	/* 0x43 Match some already matched string */
-#define	REFFN                 	68	/* 0x44 Match already matched string, using /di rules. */
-#define	REFFLN                	69	/* 0x45 Match already matched string, using /li rules. */
-#define	REFFUN                	70	/* 0x46 Match already matched string, using /ui rules. */
-#define	REFFAN                	71	/* 0x47 Match already matched string, using /aai rules. */
-#define	LONGJMP               	72	/* 0x48 Jump far away. */
-#define	BRANCHJ               	73	/* 0x49 BRANCH with long offset. */
-#define	IFMATCH               	74	/* 0x4a Succeeds if the following matches; non-zero flags "f", next_off "o" means lookbehind assertion starting "f..(f-o)" characters before current */
-#define	UNLESSM               	75	/* 0x4b Fails if the following matches; non-zero flags "f", next_off "o" means lookbehind assertion starting "f..(f-o)" characters before current */
-#define	SUSPEND               	76	/* 0x4c "Independent" sub-RE. */
-#define	IFTHEN                	77	/* 0x4d Switch, should be preceded by switcher. */
-#define	GROUPP                	78	/* 0x4e Whether the group matched. */
-#define	EVAL                  	79	/* 0x4f Execute some Perl code. */
-#define	MINMOD                	80	/* 0x50 Next operator is not greedy. */
-#define	LOGICAL               	81	/* 0x51 Next opcode should set the flag only. */
-#define	RENUM                 	82	/* 0x52 Group with independently numbered parens. */
-#define	TRIE                  	83	/* 0x53 Match many EXACT(F[ALU]?)? at once. flags==type */
-#define	TRIEC                 	84	/* 0x54 Same as TRIE, but with embedded charclass data */
-#define	AHOCORASICK           	85	/* 0x55 Aho Corasick stclass. flags==type */
-#define	AHOCORASICKC          	86	/* 0x56 Same as AHOCORASICK, but with embedded charclass data */
-#define	GOSUB                 	87	/* 0x57 recurse to paren arg1 at (signed) ofs arg2 */
-#define	GROUPPN               	88	/* 0x58 Whether the group matched. */
-#define	INSUBP                	89	/* 0x59 Whether we are in a specific recurse. */
-#define	DEFINEP               	90	/* 0x5a Never execute directly. */
-#define	ENDLIKE               	91	/* 0x5b Used only for the type field of verbs */
-#define	OPFAIL                	92	/* 0x5c Same as (?!), but with verb arg */
-#define	ACCEPT                	93	/* 0x5d Accepts the current matched string, with verbar */
-#define	VERB                  	94	/* 0x5e Used only for the type field of verbs */
-#define	PRUNE                 	95	/* 0x5f Pattern fails at this startpoint if no-backtracking through this */
-#define	MARKPOINT             	96	/* 0x60 Push the current location for rollback by cut. */
-#define	SKIP                  	97	/* 0x61 On failure skip forward (to the mark) before retrying */
-#define	COMMIT                	98	/* 0x62 Pattern fails outright if backtracking through this */
-#define	CUTGROUP              	99	/* 0x63 On failure go to the next alternation in the group */
-#define	KEEPS                 	100	/* 0x64 $& begins here. */
-#define	LNBREAK               	101	/* 0x65 generic newline pattern */
-#define	OPTIMIZED             	102	/* 0x66 Placeholder for dump. */
-#define	PSEUDO                	103	/* 0x67 Pseudo opcode for internal use. */
+#define	LEXACT                	38	/* 0x26 Match this long string (preceded by length; flags unused). */
+#define	EXACTL                	39	/* 0x27 Like EXACT, but /l is in effect (used so locale-related warnings can be checked for). */
+#define	EXACTF                	40	/* 0x28 Like EXACT, but match using /id rules; (string not UTF-8, not guaranteed to be folded). */
+#define	EXACTFL               	41	/* 0x29 Like EXACT, but match using /il rules; (string not likely to be folded). */
+#define	EXACTFU               	42	/* 0x2a Like EXACT, but match using /iu rules; (string folded). */
+#define	EXACTFAA              	43	/* 0x2b Like EXACT, but match using /iaa rules; (string folded iff pattern is UTF8; folded length <= unfolded). */
+#define	EXACTFUP              	44	/* 0x2c Like EXACT, but match using /iu rules; (string not UTF-8, not guaranteed to be folded; and it is Problematic). */
+#define	EXACTFLU8             	45	/* 0x2d Like EXACTFU, but use /il, UTF-8, (string is folded, and everything in it is above 255. */
+#define	EXACTFAA_NO_TRIE      	46	/* 0x2e Like EXACT, but match using /iaa rules (string not UTF-8, not guaranteed to be folded, not currently trie-able). */
+#define	EXACT_ONLY8           	47	/* 0x2f Like EXACT, but only UTF-8 encoded targets can match */
+#define	LEXACT_ONLY8          	48	/* 0x30 Like LEXACT, but only UTF-8 encoded targets can match */
+#define	EXACTFU_ONLY8         	49	/* 0x31 Like EXACTFU, but only UTF-8 encoded targets can match */
+#define	EXACTFU_S_EDGE        	50	/* 0x32 /di rules, but nothing in it precludes /ui, except begins and/or ends with [Ss]; (string not UTF-8; compile-time only). */
+#define	NOTHING               	51	/* 0x33 Match empty string. */
+#define	TAIL                  	52	/* 0x34 Match empty string. Can jump here from outside. */
+#define	STAR                  	53	/* 0x35 Match this (simple) thing 0 or more times. */
+#define	PLUS                  	54	/* 0x36 Match this (simple) thing 1 or more times. */
+#define	CURLY                 	55	/* 0x37 Match this simple thing {n,m} times. */
+#define	CURLYN                	56	/* 0x38 Capture next-after-this simple thing */
+#define	CURLYM                	57	/* 0x39 Capture this medium-complex thing {n,m} times. */
+#define	CURLYX                	58	/* 0x3a Match this complex thing {n,m} times. */
+#define	WHILEM                	59	/* 0x3b Do curly processing and see if rest matches. */
+#define	OPEN                  	60	/* 0x3c Mark this point in input as start of #n. */
+#define	CLOSE                 	61	/* 0x3d Close corresponding OPEN of #n. */
+#define	SROPEN                	62	/* 0x3e Same as OPEN, but for script run */
+#define	SRCLOSE               	63	/* 0x3f Close preceding SROPEN */
+#define	REF                   	64	/* 0x40 Match some already matched string */
+#define	REFF                  	65	/* 0x41 Match already matched string, using /di rules. */
+#define	REFFL                 	66	/* 0x42 Match already matched string, using /li rules. */
+#define	REFFU                 	67	/* 0x43 Match already matched string, usng /ui. */
+#define	REFFA                 	68	/* 0x44 Match already matched string, using /aai rules. */
+#define	REFN                  	69	/* 0x45 Match some already matched string */
+#define	REFFN                 	70	/* 0x46 Match already matched string, using /di rules. */
+#define	REFFLN                	71	/* 0x47 Match already matched string, using /li rules. */
+#define	REFFUN                	72	/* 0x48 Match already matched string, using /ui rules. */
+#define	REFFAN                	73	/* 0x49 Match already matched string, using /aai rules. */
+#define	LONGJMP               	74	/* 0x4a Jump far away. */
+#define	BRANCHJ               	75	/* 0x4b BRANCH with long offset. */
+#define	IFMATCH               	76	/* 0x4c Succeeds if the following matches; non-zero flags "f", next_off "o" means lookbehind assertion starting "f..(f-o)" characters before current */
+#define	UNLESSM               	77	/* 0x4d Fails if the following matches; non-zero flags "f", next_off "o" means lookbehind assertion starting "f..(f-o)" characters before current */
+#define	SUSPEND               	78	/* 0x4e "Independent" sub-RE. */
+#define	IFTHEN                	79	/* 0x4f Switch, should be preceded by switcher. */
+#define	GROUPP                	80	/* 0x50 Whether the group matched. */
+#define	EVAL                  	81	/* 0x51 Execute some Perl code. */
+#define	MINMOD                	82	/* 0x52 Next operator is not greedy. */
+#define	LOGICAL               	83	/* 0x53 Next opcode should set the flag only. */
+#define	RENUM                 	84	/* 0x54 Group with independently numbered parens. */
+#define	TRIE                  	85	/* 0x55 Match many EXACT(F[ALU]?)? at once. flags==type */
+#define	TRIEC                 	86	/* 0x56 Same as TRIE, but with embedded charclass data */
+#define	AHOCORASICK           	87	/* 0x57 Aho Corasick stclass. flags==type */
+#define	AHOCORASICKC          	88	/* 0x58 Same as AHOCORASICK, but with embedded charclass data */
+#define	GOSUB                 	89	/* 0x59 recurse to paren arg1 at (signed) ofs arg2 */
+#define	GROUPPN               	90	/* 0x5a Whether the group matched. */
+#define	INSUBP                	91	/* 0x5b Whether we are in a specific recurse. */
+#define	DEFINEP               	92	/* 0x5c Never execute directly. */
+#define	ENDLIKE               	93	/* 0x5d Used only for the type field of verbs */
+#define	OPFAIL                	94	/* 0x5e Same as (?!), but with verb arg */
+#define	ACCEPT                	95	/* 0x5f Accepts the current matched string, with verbar */
+#define	VERB                  	96	/* 0x60 Used only for the type field of verbs */
+#define	PRUNE                 	97	/* 0x61 Pattern fails at this startpoint if no-backtracking through this */
+#define	MARKPOINT             	98	/* 0x62 Push the current location for rollback by cut. */
+#define	SKIP                  	99	/* 0x63 On failure skip forward (to the mark) before retrying */
+#define	COMMIT                	100	/* 0x64 Pattern fails outright if backtracking through this */
+#define	CUTGROUP              	101	/* 0x65 On failure go to the next alternation in the group */
+#define	KEEPS                 	102	/* 0x66 $& begins here. */
+#define	LNBREAK               	103	/* 0x67 generic newline pattern */
+#define	OPTIMIZED             	104	/* 0x68 Placeholder for dump. */
+#define	PSEUDO                	105	/* 0x69 Pseudo opcode for internal use. */
 	/* ------------ States ------------- */
 #define	TRIE_next             	(REGNODE_MAX + 1)	/* state for TRIE */
 #define	TRIE_next_fail        	(REGNODE_MAX + 2)	/* state for TRIE */
@@ -201,6 +203,7 @@ EXTCONST U8 PL_regkind[] = {
 	CLUMP,    	/* CLUMP                  */
 	BRANCH,   	/* BRANCH                 */
 	EXACT,    	/* EXACT                  */
+	EXACT,    	/* LEXACT                 */
 	EXACT,    	/* EXACTL                 */
 	EXACT,    	/* EXACTF                 */
 	EXACT,    	/* EXACTFL                */
@@ -210,6 +213,7 @@ EXTCONST U8 PL_regkind[] = {
 	EXACT,    	/* EXACTFLU8              */
 	EXACT,    	/* EXACTFAA_NO_TRIE       */
 	EXACT,    	/* EXACT_ONLY8            */
+	EXACT,    	/* LEXACT_ONLY8           */
 	EXACT,    	/* EXACTFU_ONLY8          */
 	EXACT,    	/* EXACTFU_S_EDGE         */
 	NOTHING,  	/* NOTHING                */
@@ -354,6 +358,7 @@ static const U8 regarglen[] = {
 	0,                                   	/* CLUMP        */
 	0,                                   	/* BRANCH       */
 	0,                                   	/* EXACT        */
+	EXTRA_SIZE(struct regnode_1),        	/* LEXACT       */
 	0,                                   	/* EXACTL       */
 	0,                                   	/* EXACTF       */
 	0,                                   	/* EXACTFL      */
@@ -363,6 +368,7 @@ static const U8 regarglen[] = {
 	0,                                   	/* EXACTFLU8    */
 	0,                                   	/* EXACTFAA_NO_TRIE */
 	0,                                   	/* EXACT_ONLY8  */
+	EXTRA_SIZE(struct regnode_1),        	/* LEXACT_ONLY8 */
 	0,                                   	/* EXACTFU_ONLY8 */
 	0,                                   	/* EXACTFU_S_EDGE */
 	0,                                   	/* NOTHING      */
@@ -463,6 +469,7 @@ static const char reg_off_by_arg[] = {
 	0,	/* CLUMP        */
 	0,	/* BRANCH       */
 	0,	/* EXACT        */
+	0,	/* LEXACT       */
 	0,	/* EXACTL       */
 	0,	/* EXACTF       */
 	0,	/* EXACTFL      */
@@ -472,6 +479,7 @@ static const char reg_off_by_arg[] = {
 	0,	/* EXACTFLU8    */
 	0,	/* EXACTFAA_NO_TRIE */
 	0,	/* EXACT_ONLY8  */
+	0,	/* LEXACT_ONLY8 */
 	0,	/* EXACTFU_ONLY8 */
 	0,	/* EXACTFU_S_EDGE */
 	0,	/* NOTHING      */
@@ -578,72 +586,74 @@ EXTCONST char * const PL_reg_name[] = {
 	"CLUMP",                 	/* 0x23 */
 	"BRANCH",                	/* 0x24 */
 	"EXACT",                 	/* 0x25 */
-	"EXACTL",                	/* 0x26 */
-	"EXACTF",                	/* 0x27 */
-	"EXACTFL",               	/* 0x28 */
-	"EXACTFU",               	/* 0x29 */
-	"EXACTFAA",              	/* 0x2a */
-	"EXACTFUP",              	/* 0x2b */
-	"EXACTFLU8",             	/* 0x2c */
-	"EXACTFAA_NO_TRIE",      	/* 0x2d */
-	"EXACT_ONLY8",           	/* 0x2e */
-	"EXACTFU_ONLY8",         	/* 0x2f */
-	"EXACTFU_S_EDGE",        	/* 0x30 */
-	"NOTHING",               	/* 0x31 */
-	"TAIL",                  	/* 0x32 */
-	"STAR",                  	/* 0x33 */
-	"PLUS",                  	/* 0x34 */
-	"CURLY",                 	/* 0x35 */
-	"CURLYN",                	/* 0x36 */
-	"CURLYM",                	/* 0x37 */
-	"CURLYX",                	/* 0x38 */
-	"WHILEM",                	/* 0x39 */
-	"OPEN",                  	/* 0x3a */
-	"CLOSE",                 	/* 0x3b */
-	"SROPEN",                	/* 0x3c */
-	"SRCLOSE",               	/* 0x3d */
-	"REF",                   	/* 0x3e */
-	"REFF",                  	/* 0x3f */
-	"REFFL",                 	/* 0x40 */
-	"REFFU",                 	/* 0x41 */
-	"REFFA",                 	/* 0x42 */
-	"REFN",                  	/* 0x43 */
-	"REFFN",                 	/* 0x44 */
-	"REFFLN",                	/* 0x45 */
-	"REFFUN",                	/* 0x46 */
-	"REFFAN",                	/* 0x47 */
-	"LONGJMP",               	/* 0x48 */
-	"BRANCHJ",               	/* 0x49 */
-	"IFMATCH",               	/* 0x4a */
-	"UNLESSM",               	/* 0x4b */
-	"SUSPEND",               	/* 0x4c */
-	"IFTHEN",                	/* 0x4d */
-	"GROUPP",                	/* 0x4e */
-	"EVAL",                  	/* 0x4f */
-	"MINMOD",                	/* 0x50 */
-	"LOGICAL",               	/* 0x51 */
-	"RENUM",                 	/* 0x52 */
-	"TRIE",                  	/* 0x53 */
-	"TRIEC",                 	/* 0x54 */
-	"AHOCORASICK",           	/* 0x55 */
-	"AHOCORASICKC",          	/* 0x56 */
-	"GOSUB",                 	/* 0x57 */
-	"GROUPPN",               	/* 0x58 */
-	"INSUBP",                	/* 0x59 */
-	"DEFINEP",               	/* 0x5a */
-	"ENDLIKE",               	/* 0x5b */
-	"OPFAIL",                	/* 0x5c */
-	"ACCEPT",                	/* 0x5d */
-	"VERB",                  	/* 0x5e */
-	"PRUNE",                 	/* 0x5f */
-	"MARKPOINT",             	/* 0x60 */
-	"SKIP",                  	/* 0x61 */
-	"COMMIT",                	/* 0x62 */
-	"CUTGROUP",              	/* 0x63 */
-	"KEEPS",                 	/* 0x64 */
-	"LNBREAK",               	/* 0x65 */
-	"OPTIMIZED",             	/* 0x66 */
-	"PSEUDO",                	/* 0x67 */
+	"LEXACT",                	/* 0x26 */
+	"EXACTL",                	/* 0x27 */
+	"EXACTF",                	/* 0x28 */
+	"EXACTFL",               	/* 0x29 */
+	"EXACTFU",               	/* 0x2a */
+	"EXACTFAA",              	/* 0x2b */
+	"EXACTFUP",              	/* 0x2c */
+	"EXACTFLU8",             	/* 0x2d */
+	"EXACTFAA_NO_TRIE",      	/* 0x2e */
+	"EXACT_ONLY8",           	/* 0x2f */
+	"LEXACT_ONLY8",          	/* 0x30 */
+	"EXACTFU_ONLY8",         	/* 0x31 */
+	"EXACTFU_S_EDGE",        	/* 0x32 */
+	"NOTHING",               	/* 0x33 */
+	"TAIL",                  	/* 0x34 */
+	"STAR",                  	/* 0x35 */
+	"PLUS",                  	/* 0x36 */
+	"CURLY",                 	/* 0x37 */
+	"CURLYN",                	/* 0x38 */
+	"CURLYM",                	/* 0x39 */
+	"CURLYX",                	/* 0x3a */
+	"WHILEM",                	/* 0x3b */
+	"OPEN",                  	/* 0x3c */
+	"CLOSE",                 	/* 0x3d */
+	"SROPEN",                	/* 0x3e */
+	"SRCLOSE",               	/* 0x3f */
+	"REF",                   	/* 0x40 */
+	"REFF",                  	/* 0x41 */
+	"REFFL",                 	/* 0x42 */
+	"REFFU",                 	/* 0x43 */
+	"REFFA",                 	/* 0x44 */
+	"REFN",                  	/* 0x45 */
+	"REFFN",                 	/* 0x46 */
+	"REFFLN",                	/* 0x47 */
+	"REFFUN",                	/* 0x48 */
+	"REFFAN",                	/* 0x49 */
+	"LONGJMP",               	/* 0x4a */
+	"BRANCHJ",               	/* 0x4b */
+	"IFMATCH",               	/* 0x4c */
+	"UNLESSM",               	/* 0x4d */
+	"SUSPEND",               	/* 0x4e */
+	"IFTHEN",                	/* 0x4f */
+	"GROUPP",                	/* 0x50 */
+	"EVAL",                  	/* 0x51 */
+	"MINMOD",                	/* 0x52 */
+	"LOGICAL",               	/* 0x53 */
+	"RENUM",                 	/* 0x54 */
+	"TRIE",                  	/* 0x55 */
+	"TRIEC",                 	/* 0x56 */
+	"AHOCORASICK",           	/* 0x57 */
+	"AHOCORASICKC",          	/* 0x58 */
+	"GOSUB",                 	/* 0x59 */
+	"GROUPPN",               	/* 0x5a */
+	"INSUBP",                	/* 0x5b */
+	"DEFINEP",               	/* 0x5c */
+	"ENDLIKE",               	/* 0x5d */
+	"OPFAIL",                	/* 0x5e */
+	"ACCEPT",                	/* 0x5f */
+	"VERB",                  	/* 0x60 */
+	"PRUNE",                 	/* 0x61 */
+	"MARKPOINT",             	/* 0x62 */
+	"SKIP",                  	/* 0x63 */
+	"COMMIT",                	/* 0x64 */
+	"CUTGROUP",              	/* 0x65 */
+	"KEEPS",                 	/* 0x66 */
+	"LNBREAK",               	/* 0x67 */
+	"OPTIMIZED",             	/* 0x68 */
+	"PSEUDO",                	/* 0x69 */
 	/* ------------ States ------------- */
 	"TRIE_next",             	/* REGNODE_MAX +0x01 */
 	"TRIE_next_fail",        	/* REGNODE_MAX +0x02 */
@@ -778,7 +788,7 @@ EXTCONST U8 PL_varies[] __attribute__deprecated__ = {
 EXTCONST U8 PL_varies_bitmask[];
 #else
 EXTCONST U8 PL_varies_bitmask[] = {
-    0x00, 0x00, 0x00, 0x00, 0x18, 0x00, 0xF8, 0xC3, 0xFF, 0x32, 0x00, 0x00, 0x00
+    0x00, 0x00, 0x00, 0x00, 0x18, 0x00, 0xE0, 0x0F, 0xFF, 0xCB, 0x00, 0x00, 0x00, 0x00
 };
 #endif /* DOINIT */
 
@@ -801,7 +811,7 @@ EXTCONST U8 PL_simple[] __attribute__deprecated__ = {
 EXTCONST U8 PL_simple_bitmask[];
 #else
 EXTCONST U8 PL_simple_bitmask[] = {
-    0x00, 0x00, 0xFF, 0xFF, 0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+    0x00, 0x00, 0xFF, 0xFF, 0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
 };
 #endif /* DOINIT */
 
diff --git a/t/re/pat.t b/t/re/pat.t
index 67ad0f41af..d6189b80b9 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -25,7 +25,7 @@ BEGIN {
 skip_all('no re module') unless defined &DynaLoader::boot_DynaLoader;
 skip_all_without_unicode_tables();
 
-plan tests => 864;  # Update this when adding/deleting tests.
+plan tests => 965;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1421,6 +1421,84 @@ EOP
         ok("\x{017F}\x{017F}" =~ qr/^[$sharp_s]?$/i, "[] to EXACTish optimization");
     }
 
+    {   # Test that it avoids spllitting a multi-char fold across nodes.
+        # These all fold to things that are like 'ss', which, if split across
+        # nodes could fail to match a single character that folds to the
+        # combination.
+        my $utf8_locale = find_utf8_ctype_locale();
+        for my $char('F', $sharp_s, "\x{FB00}") {
+            my $length = 260;    # Long enough to overflow an EXACTFish regnode
+            my $p = $char x $length;
+            my $s = ($char eq $sharp_s) ? 'ss' : 'ff';
+            $s = $s x $length;
+            for my $charset (qw(u d l aa)) {
+                for my $utf8 (0..1) {
+                  SKIP:
+                    for my $locale ('C', $utf8_locale) {
+                        skip "test skipped for non-C locales", 2
+                                    if $charset ne 'l'
+                                    && (! defined $locale || $locale ne 'C');
+                        if ($charset eq 'l') {
+                            if (! defined $locale) {
+                                skip "No UTF-8 locale", 2;
+                            }
+
+                            use POSIX;
+                            POSIX::setlocale(&LC_CTYPE, $locale);
+                        }
+
+                        my $pat = $p;
+                        utf8::upgrade($pat) if $utf8;
+                        my $should_pass =
+                            (    $charset eq 'u'
+                             || ($charset eq 'd' && $utf8)
+                             || ($charset eq 'd' && (   $char =~ /[[:ascii:]]/
+                                                     || ord $char > 255))
+                             || ($charset eq 'aa' && $char =~ /[[:ascii:]]/)
+                             || ($charset eq 'l' && $locale ne 'C')
+                             || ($charset eq 'l' && $char =~ /[[:ascii:]]/)
+                            );
+                        my $name = "(?i$charset), utf8=$utf8, locale=$locale,"
+                                 . " char=" . sprintf "%x", ord $char;
+                        no warnings 'locale';
+                        is (eval " '$s' =~ qr/(?i$charset)$pat/;",
+                            $should_pass, $name);
+                        fail "$name: $@" if $@;
+                        is (eval " 'a$s' =~ qr/(?i$charset)a$pat/;",
+                            $should_pass, "extra a, $name");
+                        fail "$name: $@" if $@;
+                    }
+                }
+            }
+        }
+    }
+
+    {
+        my $s = ("0123456789" x 26214) x 2; # Should fill 2 LEXACTS, plus
+                                            # small change
+        my $pattern_prefix = "use utf8; use re qw(Debug COMPILE)";
+        my $pattern = "$pattern_prefix; qr/$s/;";
+        my $result = fresh_perl($pattern);
+        if ($? != 0) {  # Re-run so as to display STDERR.
+            fail($pattern);
+            fresh_perl($pattern, { stderr => 0, verbose => 1 });
+        }
+        like($result, qr/Final program[^X]*\bLEXACT\b[^X]*\bLEXACT\b[^X]*\bEXACT\b[^X]*\bEND\b/s,
+             "Check that LEXACT nodes are generated");
+        like($s, qr/$s/, "Check that LEXACT nodes match");
+        like("a$s", qr/a$s/, "Previous test preceded by an 'a'");
+        substr($s, 260000, 1) = "\x{100}";
+        $pattern = "$pattern_prefix; qr/$s/;";
+        $result = fresh_perl($pattern, { 'wide_chars' => 1 } );
+        if ($? != 0) {  # Re-run so as to display STDERR.
+            fail($pattern);
+            fresh_perl($pattern, { stderr => 0, verbose => 1 });
+        }
+        like($result, qr/Final program[^X]*\bLEXACT_ONLY8\b[^X]*\bLEXACT\b[^X]*\bEXACT\b[^X]*\bEND\b/s,
+             "Check that an LEXACT_ONLY node is generated with a \\x{100}");
+        like($s, qr/$s/, "Check that LEXACT_ONLY8 nodes match");
+    }
+
     {
         for my $char (":", uni_to_native("\x{f7}"), "\x{2010}") {
             my $utf8_char = $char;
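
Note on the PL_varies_bitmask and PL_simple_bitmask hunks above: both arrays grow from 13 to 14 bytes. A minimal sketch of the likely reason, assuming the tables hold one bit per regnode opcode with opcode N stored in bit (N & 7) of byte (N >> 3). The highest opcode moves from 103 to 105, so 104 bits no longer suffice. The names bitmask_bytes and op_varies are hypothetical helpers for illustration, not part of perl; the opcode numbers and mask bytes are copied from the "+" lines of the diff.

```c
#include <stddef.h>

/* Opcode numbers copied from the "+" lines of the regnodes.h hunk. */
enum {
    OP_LEXACT = 38,
    OP_STAR   = 53,
    OP_WHILEM = 59,
    OP_REF    = 64,
    OP_PSEUDO = 105
};

/* New PL_varies_bitmask contents, copied from the diff (14 bytes). */
static const unsigned char varies_bitmask[] = {
    0x00, 0x00, 0x00, 0x00, 0x18, 0x00, 0xE0,
    0x0F, 0xFF, 0xCB, 0x00, 0x00, 0x00, 0x00
};

/* Bytes needed to hold one bit for each opcode in 0..max_op. */
static size_t bitmask_bytes(unsigned max_op)
{
    return ((size_t)max_op + 1 + 7) / 8;
}

/* Assumed lookup: returns 1 if the bit for opcode op is set, else 0.
 * Out-of-range opcodes are reported as 0. */
static int op_varies(unsigned op)
{
    if ((op >> 3) >= sizeof varies_bitmask)
        return 0;
    return (varies_bitmask[op >> 3] >> (op & 7)) & 1;
}
```

Under that assumption, bitmask_bytes(103) is 13 and bitmask_bytes(105) is 14, matching the old and new array lengths. STAR, WHILEM and REF have their bits set in the new mask, while the new LEXACT opcode does not.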

-- 
Perl5 Master Repository


