[ID 20020630.002] utf8 regex only matches 32k

Marc Lehmann
June 30, 2002 17:06

This is a bug report for perl from root@cerebro.laendle,
generated with the help of perlbug 1.33 running under perl v5.8.0.

[Please enter your report here]

I was trying to match strings of the form <quoting character> <hex
string>, but perl mysteriously fails to match strings longer than 32k when
the quoting character is > 255:

   $dx = "\x{1ff}";
   #$dx = "\x{ff}"; # endless loop

   for ($length = 32500; $length < 33000; $length++) {
      print "$length\n";
      $y = ("f") x $length;
      $y = "$dx$y";

      # match the quoting character followed by a run of f's
      $y =~ /$dx([f]*)/gcso or die;
      # nothing may be left over after the previous match
      $y !~ /\G(.{1,20})/gcs or die "internal error: trailing characters in pcode-string ($1)";
   }

This program generates strings of the form "$dx + many trailing f's". It
works fine for up to 32767 f's, but matches only the first 32767
characters when more f's follow. Changing the $dx character from
U+01FF to U+00FF creates an endless loop (and the program also runs many
times faster!).

Replacing the character class "[f]" by the single character "f" also
"fixes" this problem, so it might be character-class related.
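A side-by-side check of the two patterns might look like the following sketch. The helper name captured_length and the length of 33000 are assumptions chosen for illustration; the output depends on the perl being used, so none is shown here.

```perl
use strict;
use warnings;

# Hypothetical helper: length of the first capture group, or undef on no match.
sub captured_length {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? length($1) : undef;
}

my $quote  = "\x{1ff}";                   # quoting character above 255
my $count  = 33000;                       # assumed length, just past 32767
my $string = $quote . ("f" x $count);

my @cases = (
    [ 'character class [f]*', qr/$quote([f]*)/ ],
    [ 'single character f*',  qr/$quote(f*)/   ],
);

for my $case (@cases) {
    my ($label, $pattern) = @$case;
    my $got = captured_length($string, $pattern);
    printf "%-22s captured %s of %d f's\n",
        $label, defined $got ? $got : 'nothing', $count;
}
```

If the two lines report different counts, the character class is the likely trigger on that build.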

The problem is independent of the loop; I just wanted to verify that the
maximum size is indeed 32767.
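A loop-free version of the test can be sketched as below. It probes only the two lengths on either side of the suspected limit; the helper name matched_length is an assumption and not part of the original program, and the /g, /c and /o modifiers are dropped because a single match does not need them.

```perl
use strict;
use warnings;

# Hypothetical helper: how many f's the character class consumed,
# or undef if the pattern did not match at all.
sub matched_length {
    my ($quote, $count) = @_;
    my $string = $quote . ("f" x $count);
    return $string =~ /$quote([f]*)/ ? length($1) : undef;
}

# 32767 is the last length reported as working, 32768 the first failing one.
for my $count (32767, 32768) {
    my $got = matched_length("\x{1ff}", $count);
    printf "%d f's requested, %s matched\n",
        $count, defined $got ? $got : 'none';
}
```

On an affected perl the second line would be expected to show fewer f's matched than requested.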

It seems to me that a "use bytes" should work around this issue, but
"use bytes" makes the regex not match at all, which looks like another
(related?) bug to me.
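A different byte-level approach, which is not what was tried above, is to encode both the string and the quoting character to UTF-8 explicitly with the core Encode module and match on the resulting bytes. This is only a sketch: the helper name byte_level_length is an assumption, and whether it avoids the 32767 limit on the affected build has not been tested.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: match on UTF-8 encoded bytes instead of characters.
# Returns the number of f's captured, or undef if there was no match.
sub byte_level_length {
    my ($quote, $count) = @_;
    my $quote_bytes  = encode_utf8($quote);
    my $string_bytes = encode_utf8($quote . ("f" x $count));
    # \Q...\E keeps the encoded bytes from being read as regex syntax.
    return $string_bytes =~ /\Q$quote_bytes\E([f]*)/ ? length($1) : undef;
}

my $got = byte_level_length("\x{1ff}", 33000);
print defined $got ? "captured $got f's\n" : "no match\n";
```

Because both sides are plain byte strings here, the regex engine never sees a character above 255.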

[Please do not change anything below this line]
Site configuration information for perl v5.8.0:

Configured by root at Fri Jun 14 14:54:38 CEST 2002.

Summary of my perl5 (revision 5.0 version 8 subversion 0 patch 17236) configuration:
    osname=linux, osvers=2.4, archname=i686-linux
    uname='linux cerebro 2.4.18-pre8-ac3 #2 smp tue feb 5 17:35:23 cet 2002 i686 unknown '
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
    cc='gcc-2.95.4', ccflags ='-I/opt/include -D_GNU_SOURCE -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-Os -funroll-loops -mcpu=pentium -march=pentium -g',
    cppflags='-I/opt/include -D_GNU_SOURCE'
    ccversion='', gccversion='2.95.4 20010319 (prerelease)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc-2.95.4', ldflags =''
    libpth=/usr/lib /opt/lib
    libs=-lcrypt -ldl -lm -lc
    perllibs=-lcrypt -ldl -lm -lc
    libc=/lib/, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared'

Locally applied patches:

@INC for perl v5.8.0:

Environment for perl v5.8.0:
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PERL_BADLANG (unset)

