[ID 20000731.001] regex optimizer problems with utf8 and (??{ ... })

Jeffrey Friedl
July 31, 2000 11:14
I think I've found a place where the regex optimizer is rejecting a match
that it shouldn't.

I would expect that the program:

     #!/usr/local/bin/perl -w
     use re 'debug';
     use strict;
     use utf8;

     $_ = "A \x{263a} B z C";

     if (m/A . B (??{ "z" }) C/) {
         print "match\n";
     } else {
         print "no match\n";
     }

would print that there was a match.

Here's what I'm getting (with the output piped through a filter that shows
each non-ASCII byte in hex, in the form {ff}):

    % utf8-5
    Compiling REx `A . B (??{ "z" }) C'
    size 11 first at 1
    synthetic stclass `ANYOF[A]'.
       1: EXACT <A >(3)
       3: ANYUTF8(4)
       4: EXACT < B >(6)
       6: LOGICAL[2](7)
       7: EVAL(9)
       9: EXACT < C>(11)
      11: END(0)
    anchored ` B ' at 3 floating ` C' at 6..2147483647 (checking anchored) stclass `ANYOF[A]' minlen 8 with eval 
    Guessing start of match, REx `A . B (??{ "z" }) C' against `A {e2}{98}{ba} B z C'...
    Found anchored substr ` B ' at offset 5...
    Found floating substr ` C' at offset 9...
    This position contradicts STCLASS...
    Trying anchored substr starting at offset 8...
    Did not find anchored substr ` B '...
    Match rejected by optimizer
    Freeing REx: `A . B (??{ "z" }) C'
    no match

The {e2}{98}{ba} is the proper UTF-8 encoding of the single smiley
character (U+263A).
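
The offsets in the trace can be cross-checked with a short script. This is
only a sketch: it assumes a perl new enough to ship the Encode module (5.8 or
later, so not the 5.6.0 reported here), and the variable names are purely
illustrative. It prints the UTF-8 bytes of U+263A and shows where ` B ' and
` C' fall when the string is counted in characters and when it is counted in
bytes. The compiled pattern lists the anchored substring at 3, while the trace
finds it at offset 5; that is consistent with one figure being a character
count and the other a byte count, although it does not by itself show where
the optimizer goes wrong.

```perl
#!/usr/local/bin/perl -w
use strict;
use Encode qw(encode);   # assumption: Encode is available (perl 5.8+)

# Same target string as in the report, as a character string.
my $chars = "A \x{263a} B z C";

# The same string as the UTF-8 bytes the regex engine sees internally.
my $bytes = encode('UTF-8', $chars);

# Render the encoded smiley in the {xx} notation used in the trace.
my $smiley_hex = join '',
    map { sprintf '{%02x}', ord } split //, encode('UTF-8', "\x{263a}");
print "U+263A as UTF-8: $smiley_hex\n";

# Where each fixed substring starts, in characters and in bytes.
for my $substr (' B ', ' C') {
    printf "'%s': char offset %d, byte offset %d\n",
        $substr, index($chars, $substr), index($bytes, $substr);
}

printf "length: %d chars, %d bytes\n", length($chars), length($bytes);
```

Run as is, this should report {e2}{98}{ba} for the smiley, character offsets
3 and 7 against byte offsets 5 and 9 for the two substrings, and a length of
9 characters against 11 bytes. The byte offsets 5 and 9 are the ones that
appear in the "Found ... substr" lines of the trace.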


Site configuration information for perl v5.6.0:

Configured by jfriedl at Sat Jul 29 20:09:33 PDT 2000.

Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
    osname=linux, osvers=2.2.15, archname=i686-linux
    uname='linux 2.2.16 #6 smp sun jul 23 11:26:16 pdt 2000 i686 unknown '
    config_args='-ds -e -A optimize=-g'
    hint=previous, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define 
    use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef
    cc='cc', optimize='-O2 -g', gccversion=pgcc-2.91.66 19990314 (egcs-1.1.2 release)
    cppflags='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    stdchar='char', d_stdstdio=define, usevfork=false
    intsize=4, longsize=4, ptrsize=4, doublesize=8
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt
    libc=/lib/, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

@INC for perl v5.6.0:

Environment for perl v5.6.0:
    LANG (unset)
    LANGUAGE (unset)
    LOGDIR (unset)
    PERL_BADLANG (unset)
