develooper Front page | perl.perl5.porters | Postings from January 2019

Re: [perl #133756] //g flag on regex with UTF-8 source causes regexoptimiser to wrongly reject a match

Thread Next
From:
Nicholas Clark
Date:
January 10, 2019 09:41
Subject:
Re: [perl #133756] //g flag on regex with UTF-8 source causes regexoptimiser to wrongly reject a match
Message ID:
20190110090543.k2xysqttjwyujnr2@ceres.etla.org

On Wed, Jan 09, 2019 at 04:27:35PM -0800, Tony Cook via RT wrote:
> You're going to facepalm yourself.

Yes. :-)

> On Wed, 09 Jan 2019 06:11:06 -0800, nicholas wrote:
> > For the case where the eval'd source code contains Ā in a code comment
> > this
> > doesn't match. Where the code comment is ÿ it does:
> > 
> > nicholas@dromedary-001 perl6$ cat /tmp/rule.pl
> > use strict;
> > use warnings;
> > 
> > my $mr = "L\xFCften Kalt";
> > my $text = "L\xFCften Kalt";
> > 
> > # Culprit here is the //g flag:
> > for my $char ("\xFF", "\x100", "\xFF", "\x100") {
> 
> Add:
> 
>   pos($text) = 0;
> 
> here and they all pass.


Yes. I totally failed to figure out the proper test case. I now have a fresh
head, the office coffee machine bootstrapped, and...

On Wed, Jan 09, 2019 at 10:25:42AM -0700, Karl Williamson wrote:

> I looked at it a little more, and it seems weird that a comment would cause
> this variance in behavior.  OTOH, eval of UTF-8 is buggy, and that's why we
> have the unicode_eval feature.


OK, *here* is the test case. It's very sensitive to string length:

nc@I34 perl2$ cat /tmp/133756.pl
#!perl

my $mr = "Programme: Automatic plus, Baumwolle, Pflegeleicht, Synthetic, Finish Wolle, Express, Jeans, Schongl\344tten, L\374ften Warm, L\374ften Kalt";
my $text = $mr;

if (shift) {
    utf8::downgrade($mr);
} else {
    utf8::upgrade($mr);
}

# use Devel::Peek; Dump($mr);

my $got = eval "\$text =~ /$mr/i;";

if ($got) {
    print "Y\n";
    exit 1;
} elsif ($@) {
    print "\$\@: $@\n";
    die;
} else {
    print "n\n";
    exit 0;
}

__END__
nc@I34 perl2$ perl /tmp/133756.pl 1
Y
nc@I34 perl2$ perl /tmp/133756.pl 0
n



It's "fixed" in blead:


792b461851b63487dbcd63fc1948671db8f1dbe5 is the first bad commit
commit 792b461851b63487dbcd63fc1948671db8f1dbe5
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Feb 3 10:25:31 2018 -0700

    regcomp.c: Pack EXACTish nodes more fully

    Prior to this commit, nodes that are to match a string exactly, or
    possibly case insensitively used only half the potential space available
    (that being limited by the length field which is a U8).  (The optimizer
    might later pack some together to make a larger node.)  Talking it over
    with Yves, we suspect that this is a relic of an earlier time.  It makes
    more sense to have longer nodes when possible to lower overhead in
    the matching engine.

:100644 100644 6dbfed52ab7d78f0909fce42f000127a13295155 74c8eb2e21b622e7c40dd726933f457353bfa9ee M      regcomp.c
bisect run success
That took 491 seconds.


and that commit is just a refactoring - no bug fixes, no new tests:

commit 792b461851b63487dbcd63fc1948671db8f1dbe5
Author: Karl Williamson <khw@cpan.org>
Date:   Sat Feb 3 10:25:31 2018 -0700

    regcomp.c: Pack EXACTish nodes more fully
    
    Prior to this commit, nodes that are to match a string exactly, or
    possibly case insensitively used only half the potential space available
    (that being limited by the length field which is a U8).  (The optimizer
    might later pack some together to make a larger node.)  Talking it over
    with Yves, we suspect that this is a relic of an earlier time.  It makes
    more sense to have longer nodes when possible to lower overhead in
    the matching engine.
---
 regcomp.c | 33 ++++++++++++++-------------------
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index 6dbfed52ab..74c8eb2e21 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13287,8 +13287,12 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 	    UV ender = 0;
 	    char *p;
 	    char *s;
-#define MAX_NODE_STRING_SIZE 127
+
+/* This allows us to fill a node with just enough spare so that if the final
+ * character folds, its expansion is guaranteed to fit */
+#define MAX_NODE_STRING_SIZE (255-UTF8_MAXBYTES_CASE)
 	    char foldbuf[MAX_NODE_STRING_SIZE+UTF8_MAXBYTES_CASE+1];
+
 	    char *s0;
 	    U8 upper_parse = MAX_NODE_STRING_SIZE;
             U8 node_type = compute_EXACTish(pRExC_state);
@@ -13332,24 +13336,15 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
              * use a pseudo regnode like 'EXACT_ORIG_FOLD' */
             maybe_exact = FOLD && PASS2;
 
-	    /* XXX The node can hold up to 255 bytes, yet this only goes to
-             * 127.  I (khw) do not know why.  Keeping it somewhat less than
-             * 255 allows us to not have to worry about overflow due to
-             * converting to utf8 and fold expansion, but that value is
-             * 255-UTF8_MAXBYTES_CASE.  join_exact() may join adjacent nodes
-             * split up by this limit into a single one using the real max of
-             * 255.  Even at 127, this breaks under rare circumstances.  If
-             * folding, we do not want to split a node at a character that is a
-             * non-final in a multi-char fold, as an input string could just
-             * happen to want to match across the node boundary.  The join
-             * would solve that problem if the join actually happens.  But a
-             * series of more than two nodes in a row each of 127 would cause
-             * the first join to succeed to get to 254, but then there wouldn't
-             * be room for the next one, which could at be one of those split
-             * multi-char folds.  I don't know of any fool-proof solution.  One
-             * could back off to end with only a code point that isn't such a
-             * non-final, but it is possible for there not to be any in the
-             * entire node. */
+            /* This breaks under rare circumstances.  If folding, we do not
+             * want to split a node at a character that is a non-final in a
+             * multi-char fold, as an input string could just happen to want to
+             * match across the node boundary.  The code at the end of the loop
+             * looks for this, and backs off until it finds not such a
+             * character, but it is possible (though extremely, extremely
+             * unlikely) for all characters in the node to be non-final fold
+             * ones, in which case we just leave the node fully filled, and
+             * hope that it doesn't match the string in just the wrong place */
 
             assert(   ! UTF     /* Is at the beginning of a character */
                    || UTF8_IS_INVARIANT(UCHARAT(RExC_parse))




So I think that we still have a real live bug here, now just better hidden.
And sorry for the distraction of the poor test case.


Given that that commit just changes the value of MAX_NODE_STRING_SIZE
I figure that I *ought* to be able to tweak my test script to use a
different sized string and still expose a problem, but I'm failing to
make it (not) work.

The original problem was revealed as exactly 2 lines of difference in a pair
of 1.2G files, which are produced from something like 7M of generated code,
when I tested a refactoring of the how that code is generated.

So whatever it is, it's obscure.

Nicholas Clark

Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About