develooper Front page | perl.perl5.porters | Postings from January 2019

Re: [perl #133756] //g flag on regex with UTF-8 source causes regexoptimiser to wrongly reject a match

Thread Previous | Thread Next
From:
Karl Williamson
Date:
January 11, 2019 05:01
Subject:
Re: [perl #133756] //g flag on regex with UTF-8 source causes regexoptimiser to wrongly reject a match
Message ID:
14dc4ee8-2f71-cbf2-c167-ba75611b2b0f@khwilliamson.com
On 1/10/19 2:05 AM, Nicholas Clark wrote:
> 
> On Wed, Jan 09, 2019 at 04:27:35PM -0800, Tony Cook via RT wrote:
>> You're going to facepalm yourself.
> 
> Yes. :-)
> 
>> On Wed, 09 Jan 2019 06:11:06 -0800, nicholas wrote:
>>> For the case where the eval'd source code contains Ā in a code comment
>>> this
>>> doesn't match. Where the code comment is ÿ it does:
>>>
>>> nicholas@dromedary-001 perl6$ cat /tmp/rule.pl
>>> use strict;
>>> use warnings;
>>>
>>> my $mr = "L\xFCften Kalt";
>>> my $text = "L\xFCften Kalt";
>>>
>>> # Culprit here is the //g flag:
>>> for my $char ("\xFF", "\x100", "\xFF", "\x100") {
>>
>> Add:
>>
>>    pos($text) = 0;
>>
>> here and they all pass.
> 
> 
> Yes. I totally failed to figure out the proper test case. I now have a fresh
> head, the office coffee machine bootstrapped, and...
> 
> On Wed, Jan 09, 2019 at 10:25:42AM -0700, Karl Williamson wrote:
> 
>> I looked at it a little more, and it seems weird that a comment would cause
>> this variance in behavior.  OTOH, eval of UTF-8 is buggy, and that's why we
>> have the unicode_eval feature.
> 
> 
> OK, *here* is the test case. It's very sensitive to string length:
> 
> nc@I34 perl2$ cat /tmp/133756.pl
> #!perl
> 
> my $mr = "Programme: Automatic plus, Baumwolle, Pflegeleicht, Synthetic, Finish Wolle, Express, Jeans, Schongl\344tten, L\374ften Warm, L\374ften Kalt";
> my $text = $mr;
> 
> if (shift) {
>      utf8::downgrade($mr);
> } else {
>      utf8::upgrade($mr);
> }
> 
> # use Devel::Peek; Dump($mr);
> 
> my $got = eval "\$text =~ /$mr/i;";
> 
> if ($got) {
>      print "Y\n";
>      exit 1;
> } elsif ($@) {
>      print "\$\@: $@\n";
>      die;
> } else {
>      print "n\n";
>      exit 0;
> }
> 
> __END__
> nc@I34 perl2$ perl /tmp/133756.pl 1
> Y
> nc@I34 perl2$ perl /tmp/133756.pl 0
> n
> 
> 
> 
> It's "fixed" in blead:
> 
> 
> 792b461851b63487dbcd63fc1948671db8f1dbe5 is the first bad commit
> commit 792b461851b63487dbcd63fc1948671db8f1dbe5
> Author: Karl Williamson <khw@cpan.org>
> Date:   Sat Feb 3 10:25:31 2018 -0700
> 
>      regcomp.c: Pack EXACTish nodes more fully
> 
>      Prior to this commit, nodes that are to match a string exactly, or
>      possibly case insensitively used only half the potential space available
>      (that being limited by the length field which is a U8).  (The optimizer
>      might later pack some together to make a larger node.)  Talking it over
>      with Yves, we suspect that this is a relic of an earlier time.  It makes
>      more sense to have longer nodes when possible to lower overhead in
>      the matching engine.
> 
> :100644 100644 6dbfed52ab7d78f0909fce42f000127a13295155 74c8eb2e21b622e7c40dd726933f457353bfa9ee M      regcomp.c
> bisect run success
> That took 491 seconds.
> 
> 
> and that commit is just a refactoring - no bug fixes, no new tests:
> 
> commit 792b461851b63487dbcd63fc1948671db8f1dbe5
> Author: Karl Williamson <khw@cpan.org>
> Date:   Sat Feb 3 10:25:31 2018 -0700
> 
>      regcomp.c: Pack EXACTish nodes more fully
>      
>      Prior to this commit, nodes that are to match a string exactly, or
>      possibly case insensitively used only half the potential space available
>      (that being limited by the length field which is a U8).  (The optimizer
>      might later pack some together to make a larger node.)  Talking it over
>      with Yves, we suspect that this is a relic of an earlier time.  It makes
>      more sense to have longer nodes when possible to lower overhead in
>      the matching engine.
> ---
>   regcomp.c | 33 ++++++++++++++-------------------
>   1 file changed, 14 insertions(+), 19 deletions(-)
> 
> diff --git a/regcomp.c b/regcomp.c
> index 6dbfed52ab..74c8eb2e21 100644
> --- a/regcomp.c
> +++ b/regcomp.c
> @@ -13287,8 +13287,12 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
>   	    UV ender = 0;
>   	    char *p;
>   	    char *s;
> -#define MAX_NODE_STRING_SIZE 127
> +
> +/* This allows us to fill a node with just enough spare so that if the final
> + * character folds, its expansion is guaranteed to fit */
> +#define MAX_NODE_STRING_SIZE (255-UTF8_MAXBYTES_CASE)
>   	    char foldbuf[MAX_NODE_STRING_SIZE+UTF8_MAXBYTES_CASE+1];
> +
>   	    char *s0;
>   	    U8 upper_parse = MAX_NODE_STRING_SIZE;
>               U8 node_type = compute_EXACTish(pRExC_state);
> @@ -13332,24 +13336,15 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
>                * use a pseudo regnode like 'EXACT_ORIG_FOLD' */
>               maybe_exact = FOLD && PASS2;
>   
> -	    /* XXX The node can hold up to 255 bytes, yet this only goes to
> -             * 127.  I (khw) do not know why.  Keeping it somewhat less than
> -             * 255 allows us to not have to worry about overflow due to
> -             * converting to utf8 and fold expansion, but that value is
> -             * 255-UTF8_MAXBYTES_CASE.  join_exact() may join adjacent nodes
> -             * split up by this limit into a single one using the real max of
> -             * 255.  Even at 127, this breaks under rare circumstances.  If
> -             * folding, we do not want to split a node at a character that is a
> -             * non-final in a multi-char fold, as an input string could just
> -             * happen to want to match across the node boundary.  The join
> -             * would solve that problem if the join actually happens.  But a
> -             * series of more than two nodes in a row each of 127 would cause
> -             * the first join to succeed to get to 254, but then there wouldn't
> -             * be room for the next one, which could at be one of those split
> -             * multi-char folds.  I don't know of any fool-proof solution.  One
> -             * could back off to end with only a code point that isn't such a
> -             * non-final, but it is possible for there not to be any in the
> -             * entire node. */
> +            /* This breaks under rare circumstances.  If folding, we do not
> +             * want to split a node at a character that is a non-final in a
> +             * multi-char fold, as an input string could just happen to want to
> +             * match across the node boundary.  The code at the end of the loop
> +             * looks for this, and backs off until it finds not such a
> +             * character, but it is possible (though extremely, extremely
> +             * unlikely) for all characters in the node to be non-final fold
> +             * ones, in which case we just leave the node fully filled, and
> +             * hope that it doesn't match the string in just the wrong place */
>   
>               assert(   ! UTF     /* Is at the beginning of a character */
>                      || UTF8_IS_INVARIANT(UCHARAT(RExC_parse))
> 
> 
> 
> 
> So I think that we still have a real live bug here, now just better hidden.
> And sorry for the distraction of the poor test case.
> 
> 
> Given that that commit just changes the value of MAX_NODE_STRING_SIZE
> I figure that I *ought* to be able to tweak my test script to use a
> different sized string and still expose a problem, but I'm failing to
> make it (not) work.
> 
> The original problem was revealed as exactly 2 lines of difference in a pair
> of 1.2G files, which are produced from something like 7M of generated code,
> when I tested a refactoring of the how that code is generated.
> 
> So whatever it is, it's obscure.
> 
> Nicholas Clark
> 

Thank you for reporting this.

And your surmisal was correct.  There is a real bug, as hopefully 
explained by this commit message, now in blead:

 From 88c2ae302de11db0b66ff2269e2c4e9bf46efd69 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Thu, 10 Jan 2019 20:59:31 -0700
Subject: [PATCH 2/2] PATCH: [perl #133756] Failure to match properly

This was caused by a counting error.

An EXACTFish regnode has a finite length it can hold for the string
being matched.  If that length is exceeded, a 2nd node is used for the
next segment of the string, for as many regnodes as are needed.

A problem occurs if a regnode ends with one of the 22 characters in
Unicode 11 that occur in non-final positions of a multi-character fold.
The design of the pattern matching engine doesn't allow matches across
regnodes.  Consider, for example if a node ended in the letter 'f' and
the next node begins with the letter 'i'.  That sequence should match,
under /i, the ligature "fi" (U+FB01).  But it wouldn't because the
pattern splits them across nodes.  The solution I adopted was to forbid
a node to end with one of those 22 characters if there is another string
node that follows it.  This is not fool proof, for example, if the
entire node consisted of only these characters, one would have to split
it at some position (In that case, we just take as much of the string as
will fit.) But for real life applications, it is good enough.

What happens if a node ends with one of the 22, is that the node is
shortened so that those are instead placed at the beginning of the
following node.  When the code encounters this situation, it backs off
until it finds a character that isn't a non-final fold one, and closes
the node with that one.

A /i node is filled with the fold of the input, for several reasons.
The most obvious is that it saves time, you can skip folding the pattern
at runtime.  But there are reasons based on the design of the optimzer
as well, which I won't go into here, but are documented in regcomp.c.
When we back out the final characters in a node, we also have to back
out the corresponding unfolded characters in the input, so that those
can be (folded) into the following node.  Since the number of characters
in the fold may not be the same as unfolded, there is not an easily
discernable correspondence between the input and the folded output.
That means that generally, what has to be done is that the input is
reparsed from the beginning of the node, but the permitted length has
been shortened (we know precisely how much to shorten it to) so that it
will end with something other than the 22.  But, the code saves the
previous input character's position (for other reasons), so if we only
have to backup one character, we can just use that and not have to
reparse.

This bug was that the code thought a two character backup was really a
one character one, and did not reparse the node, creating an off-by-one
error, and a character was simply omitted in the pattern (that should
have started the following node).  And the input had two of the 22
characters adjacent to each other in just the right positions that the
node was split.  The bisect showed that when the node size was changed
the bug went away, at least for this particular input string.  But a
different, longer, string would have triggered the bug, and this commit
fixes that.

This bug is actually very unlikely to occur in most real world
applications.  That is because other changes in the regex compiler have
caused nodes to be split so that things that don't particpate in folds
at all are separated out into EXACT nodes.  (The reason for that is it
allows the optimizer things to grab on to under /i that it wouldn't
otherwise have known about.)  That means that anything like this string
would never cause the bug to happen because blanks and commas, etc.
would be in separate nodes, and so no node would ever get large enough
to fill the 238 available byte slots in a node (235 on EBCDIC).  Only a
long string without punctuation would trigger it.  I have artificially
constructed such a string in the tests added by this commit.

One of the 22 characters is 't', so long strings of DNA "ACTG" could
trigger this bug.  I find it somewhat amusing that this is something
like a DNA transcription error, which occurs in nature at very low
rates, but selection, it is believed, will make sure the error rate is
above zero.

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About