Re: unicode regex problem

From: hv
Date: September 13, 2003 07:16
Subject: Re: unicode regex problem
Message ID: 200309131421.h8DELYx25894@zen.crypt.org
hv@crypt.org wrote:
:Jarkko Hietaniemi <jhi@iki.fi> wrote:
::http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-09/msg00612.html
[...]
:This can be achieved by suppressing SIMPLE for non-invariants: I'm not
:entirely confident that the patch below does this correctly, since I
:find it difficult to follow the codepaths to guarantee that 'ender' is
:always the last character parsed.

Ok, I've taken a more careful look through and I'm rather more confident
that the original patch (applied as #21174) is correct.

There is an odd bit of code handling \x{...}:
    ender = grok_hex(p + 1, &numlen, &flags, NULL);
    if (ender > 0xff)
        RExC_utf8 = 1;
    /* numlen is generous */
    if (numlen + len >= 127) {
        p--;
        goto loopdone;
    }
I have no idea why numlen is looked at here, since it is the resulting
EXACT string that needs to fit in 127 chars, not the stretch of source
string contributing to it. Worse, after breaking out the parser resumes
at the wrong place:

perl -Dr -e '/\x61\x{0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061}/'
   1: EXACT <a>(3)
   3: CURLY {61,61}(7)
   5:   EXACT <x>(0)
   7: END(0)

This happens because the "p--; goto loopdone;" leaves the parser pointing
at the 'x', not the '\', so it is parsed as /\x61x{61}/. The patch below
fixes that by removing the check; doing so breaks no tests.
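
To make the pointer arithmetic concrete, here is a minimal sketch in C.
It is a toy model, not the regcomp.c code: resume_offset() and its
'buggy' flag are hypothetical names, and it assumes the parser has
advanced from the '\' past the 'x' to the '{' by the time the check runs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the early exit (hypothetical helper, not from regcomp.c).
 * 'backslash' is the offset of the '\' that starts a \x{...} escape.
 * Returns the offset at which parsing would resume after bailing out.
 */
static size_t resume_offset(const char *pat, size_t backslash, int buggy)
{
    const char *p = pat + backslash;

    /* Precondition: we really are looking at "\x{". */
    assert(p[0] == '\\' && p[1] == 'x' && p[2] == '{');

    ++p;                        /* now at 'x' */
    ++p;                        /* now at '{', where the check runs */

    if (buggy)
        p--;                    /* steps back one char only: lands on 'x' */
    else
        p = pat + backslash;    /* steps back to the '\' */

    return (size_t)(p - pat);
}
```

For the pattern \x61\x{0061} the second escape starts at offset 4; the
buggy variant resumes at offset 5 (the 'x'), so the rest of the pattern
is read as a literal 'x' followed by a {...} quantifier.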

Here's a more relevant example:
    my $tail = '\x{1234}';
    for (120 .. 130) {
        my $head = "x" x $_;
        if (eval qq{"$head$tail" =~ /$head$tail/}) {
            print "ok $_\n";
        } else {
            print "not ok $_\n";
        }
    }
.. which fails for 123..126 in all current versions of perl.
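
The 123..126 range follows from the arithmetic of the removed check. A
small sketch in C, assuming numlen is the number of hex digits between
the braces (4 for both \x{1234} and \x{0061}) and len is the number of
literal characters already collected; check_trips() and EXACT_LIMIT are
hypothetical names for illustration:

```c
#include <stddef.h>

/* The 127 from the removed check; the name is hypothetical. */
static const size_t EXACT_LIMIT = 127;

/*
 * Mirrors the removed condition "numlen + len >= 127".
 * Returns nonzero when the parser would bail out before the escape.
 */
static int check_trips(size_t len, size_t numlen)
{
    return numlen + len >= EXACT_LIMIT;
}
```

With numlen == 4 the condition is false for len 120..122 and true for
len 123..126, matching the failures seen. Heads of 127 or more
presumably pass because the literal run is split into a fresh node
before the escape is reached; that part is an inference, not something
the sketch demonstrates.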

I tried to find the patch that introduced the check, to see what it
was attempting to achieve, but while the p5p archive records 3 patches
nearby, the check itself predates May 2000. I suspect that when the
\x{...} syntax was originally introduced the author simply misgrokked
how to extend the existing \xhh code.

Hugo
--- regcomp.c.old	Tue Aug 26 07:35:32 2003
+++ regcomp.c	Sat Sep 13 14:23:47 2003
@@ -3162,11 +3162,6 @@
 				ender = grok_hex(p + 1, &numlen, &flags, NULL);
 				if (ender > 0xff)
 				    RExC_utf8 = 1;
-				/* numlen is generous */
-				if (numlen + len >= 127) {
-				    p--;
-				    goto loopdone;
-				}
 				p = e + 1;
 			    }
 			}
--- t/op/pat.t.old	Sat Sep 13 15:05:24 2003
+++ t/op/pat.t	Sat Sep 13 15:09:23 2003
@@ -6,7 +6,7 @@
 
 $| = 1;
 
-print "1..1033\n";
+print "1..1055\n";
 
 BEGIN {
     chdir 't' if -d 't';
@@ -3250,5 +3250,15 @@
     ok("\xc4\xc4\xc4" !~ /(\x{100}+?)/, "[perl #23769] don't match first byte of utf8 representation");
 }
 
-# last test 1033
+for (120 .. 130) {
+    my $head = 'x' x $_;
+    for my $tail ('\x{0061}', '\x{1234}') {
+	ok(
+	    eval qq{ "$head$tail" =~ /$head$tail/ },
+	    '\x{...} misparsed in regexp near 127 char EXACT limit'
+	);
+    }
+}
+
+# last test 1055
 
