Anomalies in parsing regex quantifiers

Karl Williamson
February 11, 2020 15:58
I have been looking at the code in regcomp.c in regpiece() that deals 
with quantifiers.

After reordering things so that gotos don't cause it to jump back and 
forth, some anomalies became clear.  I also found some potential easy 
optimizations.

I would expect that the results of parsing {1,} would be the same as 
'+', and they both do generate the PLUS regnode, but the flags passed to 
the higher level aren't set the same.  This is true of the other 
shortcuts '*' and '?' as well.
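
As an illustration of the equivalence expected here, the sketch below compares each brace form with its shortcut.  It uses Python's re module purely as a stand-in engine, which is an assumption: it does not exercise regcomp.c, and it says nothing about the flags regpiece() passes upward.  The EQUIVALENTS table and the helper names are hypothetical.

```python
import re

# Hypothetical table: brace quantifier -> the shortcut it should behave like.
EQUIVALENTS = {"{0,}": "*", "{1,}": "+", "{0,1}": "?"}

def spans(quantifier, text):
    """Return the span of every match of 'a' + quantifier in text."""
    return [m.span() for m in re.finditer("a" + quantifier, text)]

def shortcuts_agree(samples=("", "a", "baab", "aaab", "xyz")):
    """True if every brace form matches exactly like its shortcut."""
    return all(
        spans(brace, text) == spans(shortcut, text)
        for brace, shortcut in EQUIVALENTS.items()
        for text in samples
    )

print(shortcuts_agree())
```

This only checks observable matching; the inconsistency described above is in the compile-time flags, which a matching test like this cannot see.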

I then tried to figure out what the consequences of those differences 
are.  Two of the flags, WORST and SPSTART, do not appear ever to be 
looked at.  Should we remove them, or dig to find out how they used to 
be used, or might they come back again, in which case we should set 
them consistently?

regpiece assumes that any quantifier whose upper limit is non-zero 
causes the construct to not match the null string, and sets HASWIDTH. 
That simply isn't true when quantifying a zero-width assertion.  I 
didn't look at what the optimizer does with that, but when I change 
that, a higher-level warning is emitted:

  "Quantifier unexpected on zero-length expression "

Now to the optimizations:  I believe the quantifier {1,1} can simply be 
optimized out.  There are occurrences in our test suite of this; I 
believe from Abigail.  And I can see machine-generated or interpolated 
code ending up with this.  So we don't need to create a loop that gets 
executed precisely once.  But {1,1}+ also has to be considered, and 
that's easy to do.
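
One way to picture the {1,1} case is as a pattern-level rewrite.  The sketch below is a hypothetical model, not the regpiece() change itself: simplify_exact_once is an invented name, and the possessive branch relies on the equivalence documented in perlre, where PAT{min,max}+ behaves like (?>PAT{min,max}).  The possessive form cannot simply be dropped, because it still forbids backtracking into the atom.

```python
import re

def simplify_exact_once(atom, suffix=""):
    """Rewrite atom{1,1}<suffix> without a counted loop (hypothetical helper).

    suffix is "" (greedy), "?" (lazy) or "+" (possessive).
    """
    if suffix == "+":
        # Possessive: keep the no-backtracking behavior via an atomic group.
        return "(?>" + atom + ")"
    if suffix in ("", "?"):
        # Exactly one iteration either way, so the quantifier adds nothing.
        return atom
    raise ValueError("unknown quantifier suffix: " + repr(suffix))

def same_matches(pattern_a, pattern_b, text):
    """True if both patterns produce identical match spans on text."""
    spans_a = [m.span() for m in re.finditer(pattern_a, text)]
    spans_b = [m.span() for m in re.finditer(pattern_b, text)]
    return spans_a == spans_b

atom = "(?:a|ab)"
print(simplify_exact_once(atom))        # plain atom
print(simplify_exact_once(atom, "+"))   # atom wrapped in an atomic group
print(same_matches(atom + "{1,1}c", simplify_exact_once(atom) + "c", "abc ac"))
```

Only the non-possessive rewrite is checked against an engine here, since atomic groups are not available in every Python version.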

Generally, in {m,m}? the ? is a no-op and can be omitted.
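
A quick way to see why the '?' does nothing on a fixed count: with {m,m} there is only one permitted repetition count, so lazy and greedy have nothing to choose between.  The sketch below again uses Python's re module as a stand-in for illustration (an assumption, not a test of Perl's engine); split_run is a hypothetical helper.

```python
import re

def split_run(quantifier, text="aaaa"):
    """Capture what 'a' + quantifier takes, and what a trailing a* is left with."""
    return re.match("(a" + quantifier + ")(a*)", text).groups()

# Fixed count: the lazy and greedy forms both take exactly two characters.
print(split_run("{2,2}"), split_run("{2,2}?"))

# A real range, for contrast: greedy takes as much as allowed, lazy as little.
print(split_run("{1,3}"), split_run("{1,3}?"))
```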
