
Re: Bringing the regex compiler into the current millenium.

From:
Christian Millour
Date:
June 26, 2015 22:51
Subject:
Re: Bringing the regex compiler into the current millenium.
Message ID:
558DD752.1000004@abtela.com
On 26/06/2015 20:01, Christian Millour wrote:
> On 23/10/2014 21:54, demerphq wrote:
>> I added support for maxlen earlier this year as part of working toward
>> making $/ support regexes (pretty much the same use case you mention).
>> We now set flags to determine if the regex is potentially infinite
>> (RXf_UNBOUNDED_QUANTIFIER), or if not we calculate the maxlen. It should
>> be in Perl 5.19.9 and later. (maxlen is meaningless when
>> RXf_UNBOUNDED_QUANTIFIER is set).
>>
>> $ ./perl -Ilib -Mre=Debug,OPTIMISE,DUMP,FLAGS -e'/fo+o/'
>> Compiling REx "fo+o"
>> first:>  1: EXACT <f> (3) [ ]
>> Peep>  1: EXACT <f> (3) [ SCF_DO_SUBSTR SCF_DO_STCLASS_AND
>> SCF_DO_STCLASS SCF_WHILEM_VISITED_POS ]
>>    join>  1: EXACT <f> (3)
>> Peep>  3: PLUS (6) [ SCF_DO_SUBSTR SCF_WHILEM_VISITED_POS ]
>>    Peep>  4: EXACT <o> (0) [ SCF_DO_SUBSTR SCF_WHILEM_VISITED_POS ]
>>      join>  4: EXACT <o> (0)
>> Peep>  6: EXACT <o> (8) [ SCF_DO_SUBSTR SCF_WHILEM_VISITED_POS ]
>>    join>  6: EXACT <o> (8)
>> minlen: 3 r->minlen:0 maxlen:0
>> Final program:
>>     1: EXACT <f> (3)
>>     3: PLUS (6)
>>     4:   EXACT <o> (0)
>>     6: EXACT <o> (8)
>>     8: END (0)
>> anchored "fo" at 0 floating "oo" at 1..9223372036854775807 (checking
>> floating) minlen 3
>> r->extflags: UNBOUNDED_QUANTIFIER_SEEN USE_INTUIT_NOML USE_INTUIT_ML
>> r->intflags: [none-set]
>> Freeing REx: "fo+o"
>>
>> $ ./perl -Ilib -Mre=Debug,OPTIMISE,DUMP,FLAGS -e'/foo/'
>> Compiling REx "foo"
>> first:>  1: EXACT <foo> (3) [ ]
>> Peep>  1: EXACT <foo> (3) [ SCF_DO_SUBSTR SCF_DO_STCLASS_AND
>> SCF_DO_STCLASS SCF_WHILEM_VISITED_POS ]
>>    join>  1: EXACT <foo> (3)
>> minlen: 3 r->minlen:0 maxlen:3
>> Final program:
>>     1: EXACT <foo> (3)
>>     3: END (0)
>> anchored "foo" at 0 (checking anchored isall) minlen 3
>> r->extflags: CHECK_ALL USE_INTUIT_NOML USE_INTUIT_ML
>> r->intflags: [none-set]
>> Freeing REx: "foo"
>>
>>
>> (some of that output is specific to blead, the relevant parts are in
>> 5.20).
>>
>> cheers,
>> Yves
>>
>> --
>> perl -Mre=debug -e "/just|another|perl|hacker/"
>
> Hi,
>
> do you have any ETA wrt. "making $/ support regexes" which would be
> awesome ?
>
> In the meantime, is it OK to access a regexp's minlen and maxlen from
> perl code ? The only functions/macros documented in perlapi
> wrt. regular expressions are SvRX and SvRXOK.
>
> What I have in mind is something like
>
> ---8<----8<----8<----8<----8<----8<----8<----
> use 5.020;
> use strict;
> use warnings;
> package Regexp::Len;
>
> require Exporter;
> our @ISA = qw( Exporter );
> our @EXPORT;
> our @EXPORT_OK = qw(regexp_minlen regexp_maxlen);
> our %EXPORT_TAGS = (all => \@EXPORT_OK, default => \@EXPORT);
>
> use Inline C => << "EOC";
> int regexp_minlen (SV* re) {
>      if (!SvRXOK(re)) {
>      croak("not a regexp");
>      }
>      return RX_MINLEN(SvRX(re));
> }
>
> int regexp_maxlen (SV* re) {
>      if (!SvRXOK(re)) {
>      croak("not a regexp");
>      }
>      return ReANY(SvRX(re))->maxlen;
> }
> EOC
>      ;
> 1
> ---8<----8<----8<----8<----8<----8<----8<----

Whoops. That was too fast. maxlen is meaningless for potentially
infinite regexes, so regexp_maxlen must test for that:

---8<----8<----8<----8<----8<----8<----8<----
int regexp_maxlen (SV* re) {
     if (!SvRXOK(re)) {
         croak("not a regexp");
     }
     return (RX_EXTFLAGS(SvRX(re)) & RXf_UNBOUNDED_QUANTIFIER_SEEN)
         ? -1
         : ReANY(SvRX(re))->maxlen;
}
---8<----8<----8<----8<----8<----8<----8<----

I am using -1 to signal unbounded regexes, even though re::Debug dumps 
show maxlen:0 in that case (see line 3 below), because the empty regex 
is a legitimate and occasionally useful regex, with maxlen 0.
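
To make the convention concrete, here is a minimal standalone sketch of the same decision. The struct, the flag value and the function name are mock stand-ins invented for illustration; they are not Perl's real regexp structure or the real value of RXf_UNBOUNDED_QUANTIFIER_SEEN.

```c
/* Hypothetical stand-ins for illustration only: NOT Perl's real
 * regexp struct, and NOT the real flag value. */
#define MOCK_UNBOUNDED_SEEN 0x1u

typedef struct {
    unsigned int extflags; /* plays the role of RX_EXTFLAGS(rx) */
    long         maxlen;   /* plays the role of the maxlen field */
} mock_regexp;

/* Same decision as regexp_maxlen above: report -1 when the
 * unbounded flag is set, otherwise the stored maxlen. */
static long mock_regexp_maxlen(const mock_regexp *rx)
{
    return (rx->extflags & MOCK_UNBOUNDED_SEEN) ? -1 : rx->maxlen;
}
```

With this, a mock empty regex (flag clear, maxlen 0) reports 0 while a mock unbounded one (flag set, maxlen 0) reports -1, so the two cases no longer collide.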

$ perl -Ilib -MRegexp::Len=:all -MCapture::Tiny=capture_stderr -E'($d) = 
capture_stderr( sub { use re qw(Debug OPTIMISE DUMP FLAGS); $r = qr/$_/ 
} ) =~ /(maxlen:\d+)/ and say ++$i, " $r: ", regexp_minlen($r), "..", 
regexp_maxlen($r), "  (re::Debug says: $d)" for @ARGV' '' 'a?b' 'a*b' 
'(?=bar)' '(?!bar)' 'foo(?=bar)' 'foo(?!bar)' 'foo(?:bar)'
1 (?^u:): 0..0  (re::Debug says: maxlen:0)
2 (?^u:a?b): 1..2  (re::Debug says: maxlen:2)
3 (?^u:a*b): 1..-1  (re::Debug says: maxlen:0)
4 (?^u:(?=bar)): 0..3  (re::Debug says: maxlen:3)
5 (?^u:(?!bar)): 0..3  (re::Debug says: maxlen:3)
6 (?^u:foo(?=bar)): 3..3  (re::Debug says: maxlen:3)
7 (?^u:foo(?!bar)): 3..3  (re::Debug says: maxlen:3)
8 (?^u:foo(?:bar)): 6..6  (re::Debug says: maxlen:6)
Freeing REx: "foo(?:bar)"
$

Lines 6 and 7 seem to point to an oversight in the computation. Given
the results of lines 4 and 5, and the definition given for maxlen in
regcomp.c (mininum possible number of chars in string to match), I would
expect lines 6 and 7 to show a maxlen of 6.

>
> which works just fine right now (strawberry perl portable 5.22)
> $ perl -Ilib -MRegexp::Len=:all -E'$r = qr/a{1,2}b{3,4}/; say "$r: ",
> regexp_minlen($r), "..", regexp_maxlen($r)'
> (?^u:a{1,2}b{3,4}): 4..6
> $
>
> Two questions though :
> 1) is there a reason why RX_MAXLEN is not #define'd in regexp.h, on the
> model of RX_MINLEN :
> #define RX_MINLEN(prog)        (ReANY(prog)->minlen)
> 2) how bad is it to use undocumented APIs as above ?

I understand that "undocumented" may mean "work in progress" and thus
incomplete. My main concern is with the stability of the interface.
Direct access to the maxlen field looks definitely iffy to me. What I'd
like to know is how 'public' and stable RX_MINLEN (and a possible future
RX_MAXLEN), RX_EXTFLAGS and RXf_UNBOUNDED_QUANTIFIER_SEEN are supposed
to be...

>
>
> TIA,
>
> --Christian
>
>