
[perl.git] branch blead updated. v5.27.8-148-g2ea7b253ec

From: Karl Williamson
Date: February 3, 2018 17:33
Subject: [perl.git] branch blead updated. v5.27.8-148-g2ea7b253ec
Message ID: E1ei1ga-0000NU-Hf@git.dc.perl.space

In perl.git, the branch blead has been updated

<https://perl5.git.perl.org/perl.git/commitdiff/2ea7b253ec46e8acd1ff2b09220c60eed34cd337?hp=9fc6ca9da4208eb58a2ad8169b81757082a52f85>

- Log -----------------------------------------------------------------
commit 2ea7b253ec46e8acd1ff2b09220c60eed34cd337
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Feb 3 10:29:33 2018 -0700

    regcomp.c: Clarify comment

commit 792b461851b63487dbcd63fc1948671db8f1dbe5
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Feb 3 10:25:31 2018 -0700

    regcomp.c: Pack EXACTish nodes more fully
    
    Prior to this commit, nodes that are to match a string exactly, or
    possibly case insensitively, used only half the potential space available
    (that being limited by the length field, which is a U8).  (The optimizer
    might later pack some together to make a larger node.)  Talking it over
    with Yves, we suspect that this is a relic of an earlier time.  It makes
    more sense to have longer nodes when possible to lower overhead in
    the matching engine.

-----------------------------------------------------------------------
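
The sizing argument in the log message above can be checked with a little arithmetic. The following is a minimal sketch, not code from regcomp.c: the `SKETCH_` names and the helper function are hypothetical, and the value 13 for the fold-expansion bound is an assumed placeholder (the real `UTF8_MAXBYTES_CASE` comes from perl's headers and may differ by platform). It assumes a node is filled up to the limit and that folding the final character may then add up to the expansion bound.

```c
#include <assert.h>

/* ASSUMPTION: placeholder for perl's UTF8_MAXBYTES_CASE; the real value is
 * defined in perl's headers and may differ by platform. */
#define SKETCH_UTF8_MAXBYTES_CASE 13

/* The node's length field is a U8, so 255 is the largest storable length. */
#define SKETCH_NODE_LEN_FIELD_MAX 255

#define SKETCH_OLD_MAX_NODE_STRING_SIZE 127
#define SKETCH_NEW_MAX_NODE_STRING_SIZE \
    (SKETCH_NODE_LEN_FIELD_MAX - SKETCH_UTF8_MAXBYTES_CASE)

/* Hypothetical helper: worst-case node length if the node is filled to
 * 'limit' bytes and folding the final character then adds up to
 * 'max_fold_bytes' more. */
static unsigned sketch_worst_case_len(unsigned limit, unsigned max_fold_bytes)
{
    return limit + max_fold_bytes;
}

/* Nonzero if the worst case still fits in the U8 length field. */
static int sketch_fits_in_node(unsigned limit, unsigned max_fold_bytes)
{
    return sketch_worst_case_len(limit, max_fold_bytes)
           <= SKETCH_NODE_LEN_FIELD_MAX;
}
```

Under the assumed bound, the new limit leaves exactly enough spare for one worst-case fold expansion, whereas the old limit of 127 left roughly half of the length field unused.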

Summary of changes:
 regcomp.c | 36 ++++++++++++++++--------------------
 1 file changed, 16 insertions(+), 20 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index 6dbfed52ab..8cfe6a1b38 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13287,8 +13287,12 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 	    UV ender = 0;
 	    char *p;
 	    char *s;
-#define MAX_NODE_STRING_SIZE 127
+
+/* This allows us to fill a node with just enough spare so that if the final
+ * character folds, its expansion is guaranteed to fit */
+#define MAX_NODE_STRING_SIZE (255-UTF8_MAXBYTES_CASE)
 	    char foldbuf[MAX_NODE_STRING_SIZE+UTF8_MAXBYTES_CASE+1];
+
 	    char *s0;
 	    U8 upper_parse = MAX_NODE_STRING_SIZE;
             U8 node_type = compute_EXACTish(pRExC_state);
@@ -13310,7 +13314,8 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
             /* If a folding node contains only code points that don't
              * participate in folds, it can be changed into an EXACT node,
-             * which allows the optimizer more things to look for */
+             * which allows the optimizer more things to look for, and is
+             * faster to match */
             bool maybe_exact;
 
 	    ret = reg_node(pRExC_state, node_type);
@@ -13332,24 +13337,15 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
              * use a pseudo regnode like 'EXACT_ORIG_FOLD' */
             maybe_exact = FOLD && PASS2;
 
-	    /* XXX The node can hold up to 255 bytes, yet this only goes to
-             * 127.  I (khw) do not know why.  Keeping it somewhat less than
-             * 255 allows us to not have to worry about overflow due to
-             * converting to utf8 and fold expansion, but that value is
-             * 255-UTF8_MAXBYTES_CASE.  join_exact() may join adjacent nodes
-             * split up by this limit into a single one using the real max of
-             * 255.  Even at 127, this breaks under rare circumstances.  If
-             * folding, we do not want to split a node at a character that is a
-             * non-final in a multi-char fold, as an input string could just
-             * happen to want to match across the node boundary.  The join
-             * would solve that problem if the join actually happens.  But a
-             * series of more than two nodes in a row each of 127 would cause
-             * the first join to succeed to get to 254, but then there wouldn't
-             * be room for the next one, which could at be one of those split
-             * multi-char folds.  I don't know of any fool-proof solution.  One
-             * could back off to end with only a code point that isn't such a
-             * non-final, but it is possible for there not to be any in the
-             * entire node. */
+            /* This breaks under rare circumstances.  If folding, we do not
+             * want to split a node at a character that is a non-final in a
+             * multi-char fold, as an input string could just happen to want to
+             * match across the node boundary.  The code at the end of the loop
+             * looks for this, and backs off until it finds not such a
+             * character, but it is possible (though extremely, extremely
+             * unlikely) for all characters in the node to be non-final fold
+             * ones, in which case we just leave the node fully filled, and
+             * hope that it doesn't match the string in just the wrong place */
 
             assert(   ! UTF     /* Is at the beginning of a character */
                    || UTF8_IS_INVARIANT(UCHARAT(RExC_parse))
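
The back-off behaviour described in the new comment in the last hunk can be illustrated with a simplified sketch. This is not the code from regcomp.c: the function names are hypothetical, the predicate is a stand-in that treats only ASCII 's' and 'f' as non-final characters of multi-char folds, and the buffer is assumed to hold single-byte characters only.

```c
#include <stdbool.h>
#include <stddef.h>

/* HYPOTHETICAL predicate: a stand-in for "this character can be a non-final
 * part of a multi-char fold".  Only ASCII 's' and 'f' are treated that way
 * here, purely for illustration. */
static bool sketch_is_nonfinal_fold_char(unsigned char c)
{
    return c == 's' || c == 'S' || c == 'f' || c == 'F';
}

/* Given a tentatively filled node buf[0..len), return how many leading bytes
 * to keep so that the node does not end in a non-final fold character.
 * If every character is such a character, keep the node fully filled, as the
 * comment in the diff describes. */
static size_t sketch_backoff_len(const unsigned char *buf, size_t len)
{
    size_t keep = len;

    /* Step backwards over trailing non-final fold characters. */
    while (keep > 0 && sketch_is_nonfinal_fold_char(buf[keep - 1])) {
        keep--;
    }

    /* Nothing safe to end on: leave the node as it was. */
    return keep > 0 ? keep : len;
}
```

For example, with the stand-in predicate a node holding "abcss" would be cut back to "abc", while a node holding only "ssss" would be left at its full length.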

-- 
Perl5 Master Repository


