develooper Front page | perl.perl5.porters | Postings from June 2001

[ID 20010626.005] regex (?<name>...) capture-to-var paren, new$^N magic variable

From:
Jeffrey Friedl
Date:
June 26, 2001 13:18
Subject:
[ID 20010626.005] regex (?<name>...) capture-to-var paren, new$^N magic variable
Message ID:
200106262017.NAA09712@ventrue.corp.yahoo.com

This is a bug report for perl from jfriedl@yahoo.com,
generated with the help of perlbug 1.33 running under perl v5.7.1.

-----------------------------------------------------------------
[Please enter your report here]

SUMMARY:
    1) Gee, it'd be nice to support (?<name>...) "named capture" parens
       within regular expressions.

    2) I created a new magic variable $^N, similar to $+.

    3) If this is of interest, is $^N a good name?


Feeling envy for highly advanced :-) languages like Java and Visual Basic,
whose regex languages allows named captures a'la

     (?<myval>\d+)

(sets $myval to the digits captured), I thought I'd use regex overloading
to convert this syntaxt to

    (\d+)  (?{ $varname = $+ })

(capture to normal parens, then use $+ to access it and assign to the variable)


This works fine except for a few details, such as being able to have
capturing parens nested within, since $+ refers to the *hightest-numbered*
set of parens used so far, not the most-recently *closed* set of parens
used so far.

So, I created a new magic variable, $^N, that pretty much parallels $+
except it does indeed refer to the most-recently *closed* set of parens.

Now, converting to

    (\d+)  (?{ $varname = $^N })

works even with nesting.  This is very nice.

I can send the patches if wanted. But if wanted, what is a good variable name?
I picked $^N simply because I saw it was free.

For those interested, I've appended my package to allow this, and a short
test program.

There are still "issues" with my overloading -- the variables named are not
checked at runtime ('use strict' doesn't save you from referring to a
nonexistant variable), and the variables are not "protected", so even if
during a match a variable is set, it won't be unset if the match later
fails.

So, it really would be nice if named captures were officially supported. I
spent some hours digging through regcomp and regexec, and succeeded only in
killing massive amounts of neurons )-:, so I don't think I'll be able to
add it.

	Jeffrey

Here is a short test script:

---snip------------------------------------------------------------------
    use strict;
    use warnings;
    use Regex::SupportNamedCapture;

    my  $areacode; ## both lexical
    our $exchange; ## and global
    my  $number;   ## variables work fine.

    "My number is 408-555-1212." =~ m{
	\b
	(?<areacode>\d\d\d)
	-
	(?<exchange>\d\d\d)
	-
	(?<number>\d\d\d\d \b)
    }x;

    print "phone number is: ($areacode) $exchange-$number\n";


    my $fullnumber;

    "My number is 408-555-1212." =~ m{
	\b
	(?<fullnumber>
	   (?<areacode>\d\d\d)
	   -
	   (?<exchange>\d\d\d)
	   -
	   (?<number>\d\d\d\d \b)
	)
    }x;

    if ($fullnumber ne "$areacode-$exchange-$number")
    {
       print "Bummer, you don't have \$^N support: fullnumber is [$fullnumber]\n";
    }
---snip------------------------------------------------------------------





Here is the package:

---snip------------------------------------------------------------------
package Regex::SupportNamedCapture;

##
## This package allows regular expressions to have named captures, a'la
##
##      (?<varname>...)
## sets $varname to the result of what's matched by the /.../
##
## If your Perl supports the $^N "most-recently-closed-paren text" variable,
## the /.../ part may itself contain capturing parens. Otherwise, it
## shouldn't.
##
## jfriedl@yahoo.com
## 6/2001
##
use strict;
use warnings;
use re 'eval';
use overload;
sub import { overload::constant 'qr' => \&convert }

##
## Test to see if my proposed $^N is supported.
## Set $GutsResult to $^N if so, set to $+ if not.
##
my $GutsResult = do {
    no warnings;
    "1" =~ m/(1)/;
    if ($^N) {
	'$^N';
    } else {
	'$+';
    }
};

our $OpenParens; ## needed for matching nested parens

my $NestedParenGuts = qr{

    (?{ local $OpenParens = 0 }) ## counts the number of nested opens waiting to close

    (?>
     (?>
        ## stuff not parens, not escaped
 	 [^()\\]+

        ## escaped stuff
        | (?s: \\. )

        # another opening paren
        | \( (?{ $OpenParens++ })

        # a closing paren, if we're expecting any
        | (?(?{ $OpenParens }) (?{ $OpenParens-- }) \))
      )*
     )
}x;

##
## Mimics named capturing parens by converting something like
##        (?<varname>...)
## to
##        (?:  (...)  (?{ $varname = $^N })  )
##
##
sub convert
{
    my $re = shift;  ## regex to mangle

    $re =~ s{
        (?<! \\ )                 # an unescaped...
        \(\?                      #  "(?"
	  <                       #     '<'
  	     (\w+)                #      $1 - an identifier
	  >                       #     '>'
          ($NestedParenGuts)      #     $2 - regex guts 
        \)                        #  ')'
    }{
       my $id   =   '$' . $1;
       my $guts = convert($2);
       "(?:($guts)(?{ $id=$GutsResult }))";
    }exg;

    return $re;  ## return mangled regex
}

1;
---snip------------------------------------------------------------------









[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=wishlist
---
Site configuration information for perl v5.7.1:

Configured by jfriedl at Sun Jun 17 23:29:39 PDT 2001.

Summary of my perl5 (revision 5.0 version 7 subversion 17) configuration:
  Platform:
    osname=linux, osvers=2.4.5, archname=i686-linux
    uname='linux fummy.telocity.com 2.4.5 #3 smp mon jun 4 22:43:14 pdt 2001 i686 unknown '
    config_args='-e -s -O -D optimize=-O2 -g'
    hint=previous, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='cc', ccflags ='-Wall -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-Wall -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -Wall -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    ccversion='', gccversion='2.95.3 20010315 (release)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lc -lposix -lcrypt -lutil
    libc=/lib/libc-2.2.3.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    DEVEL10654

---
@INC for perl v5.7.1:
    /home/jfriedl/lib/perl
    /home/jfriedl/lib/perl/yahoo
    /usr/local/lib/perl5/5.7.1/i686-linux
    /usr/local/lib/perl5/5.7.1
    /usr/local/lib/perl5/site_perl/5.7.1/i686-linux
    /usr/local/lib/perl5/site_perl/5.7.1
    /usr/local/lib/perl5/site_perl/5.6.1/i686-linux
    /usr/local/lib/perl5/site_perl/5.6.1
    /usr/local/lib/perl5/site_perl/5.6.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.6.0
    /usr/local/lib/perl5/site_perl
    .

---
Environment for perl v5.7.1:
    HOME=/home/jfriedl
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/local/pgsql/lib:/home/jfriedl/src/rvplayer5.0
    LOGDIR (unset)
    PATH=/home/jfriedl/bin:/home/jfriedl/common/bin:.:/usr/local/pgsql/bin:/usr/local/bin:/usr/X11R6/bin:/bin:/usr/bin:/usr/sbin:/sbin:/home/jfriedl/src/rvplayer5.0:/usr/local/prod/bin:/usr/local/java/bin
    PERLLIB=/home/jfriedl/lib/perl:/home/jfriedl/lib/perl/yahoo
    PERL_BADLANG (unset)
    SHELL=/bin/tcsh




nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About