
Re: [perl #123820] documentation error in perlrecharclass

From: demerphq
Date: February 14, 2015 05:13
Subject: Re: [perl #123820] documentation error in perlrecharclass
Message ID: CANgJU+VJXsFGKvop6Mb+EKJTB_ZcFp38Xb4XgKz0qJLFbO47kA@mail.gmail.com

On 14 February 2015 at 07:58, James E Keenan via RT
<perlbug-followup@perl.org> wrote:
> On Fri Feb 13 11:28:10 2015, saint.snit@gmail.com wrote:
>>
>> This is a bug report for perl from saint.snit@gmail.com,
>> generated with the help of perlbug 1.39 running under perl 5.18.2.
>>
>>
>> -----------------------------------------------------------------
>> The "perlrecharclass" documentation -- both that shipped with perl
>> 5.18.2
>> and that appearing at http://perldoc.perl.org/perlrecharclass.html --
>> contains an error.
>>
>> It claims that the regular expression /[[]]/ "contains a character
>> class containing just ], and the character class is followed by a ]".
>> This does not appear to be an accurate description of this regular
>> expression: the leading character class appears to contain just [.
>>
>
> I believe the analysis is correct.
>
> Here is the way the documentation appears in perl-5.10.1 (some whitespace trimmed):
>
> #####
> "[]"  =~ /[[]]/      #  Match, the pattern contains a character class
>                      #  containing just ], and the character class is
>                      #  followed by a ].
> #####
>

It looks like this is a typo. It should say "containing just [".


> Let's stipulate that the final ']' is outside the character class.  Then I ought to be able to rewrite the pattern to capture the contents of the character class, like so:
>
[snip]
> This suggests that the character class holds a single open-bracket '[' -- not a single close-bracket ']'.  This in turn suggests that the documentation is indeed wrong.
>

Interesting approach. For future reference the way I would analyse it
is as follows:

$ perl -Mre=debug -e'/[[]]/'
Compiling REx "[[]]"
Final program:
   1: EXACT <[]> (5)
   5: END (0)
anchored "[]" at 0 (checking anchored isall) minlen 2
Freeing REx: "[[]]"

Which shows that the original pattern is exactly equivalent to
m/ \[ \] /x (using /x mode for legibility)

Meaning it can't be what the documentation says.
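
A quick behavioural cross-check, as a sketch only: the capture groups
and the sample strings below are my own additions for illustration and
are not from the original report. Group 1 wraps the character class and
group 2 wraps the trailing bracket, and the loop compares /[[]]/ against
the escaped form /\[\]/ on a few inputs.

```perl
use strict;
use warnings;

# Group 1 wraps the character class, group 2 wraps the trailing bracket.
my ($class, $tail) = "[]" =~ /^([[])(])$/;

print "class matched: $class\n";
print "tail matched:  $tail\n";

# Compare the original pattern with the escaped literal form.
my $differ = 0;
for my $str ("[]", "]]", "[[", "x[]y") {
    my $orig    = $str =~ /[[]]/ ? 1 : 0;
    my $escaped = $str =~ /\[\]/ ? 1 : 0;
    $differ++ if $orig != $escaped;
    printf "%-5s  original=%d  escaped=%d\n", $str, $orig, $escaped;
}
print "patterns disagreed on $differ sample(s)\n";
```

If the class really held ']', the "]]" sample would be expected to match
and "[]" would not.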

And you can drill deeper and see exactly what happens like this (added
comments by me starting with #)

$ perl -Mre=Debug,COMPILE -e'/[[]]/'
Assembling pattern from 1 elements
Compiling REx "[[]]"
Starting first pass (sizing)
 >[[]]<         |   1|  reg
                |    |    brnc
                |    |      piec
                |    |        atom
 >[]]<          |    |          clas

#At this point we have consumed the first open square bracket as the
beginning of a char class.

 >]<            |   3|      piec

#At this point we have consumed the second open square bracket as an
element of the char-class, and also the first close square bracket, as
the close of the char-class definition, and we have one more close
square bracket left to parse.

                |    |        atom

#Which we parse as an "atom", in this case a literal.

Required size 5 nodes
Starting second pass (creation)
 >[[]]<         |   1|  reg
                |    |    brnc
                |    |      piec
                |    |        atom
 >[]]<          |    |          clas
 >]<            |   3|      piec
                |    |        atom
 ><             |   5|      tail~ EXACT <[> (1) -> EXACT

#Here we can see that the charclass containing a single item has been
converted into the literal item (EXACT)

                |   6|  lsbr~ tying lastbr EXACT <[> (1) to ender END (5) offset 4
                |    |    tail~ EXACT <[> (1)
                |    |        ~ EXACT <]> (3) -> END
first:>  1: EXACT <[> (3)
first at 1
Peep>  1: EXACT <[> (3)
  join>  1: EXACT <[> (3)
  merg>  3: EXACT <]> (5)
  finl>  1: EXACT <[]> (5)

#And here we can see that the two EXACT nodes, one containing '[' and
the other containing ']' are joined together into a single EXACT node
which contains '[]'

minlen: 2 r->minlen:0
Final program:
   1: EXACT <[]> (5)
   3: OPTIMIZED (2 nodes)

#This "OPTIMIZED" node is the remainder of the second EXACT that was
left over after merging.

   5: END (0)
anchored "[]" at 0 (checking anchored isall) minlen 2
r->extflags: CHECK_ALL USE_INTUIT_NOML USE_INTUIT_ML
Freeing REx: "[[]]"

#And this says that to match, the string must be at least 2 chars
long, it must contain the string '[]', and the internals need not
execute the regex engine at all and will simply use FBM matching instead.
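
To illustrate what that last note implies for behaviour, here is a
minimal sketch comparing the pattern with a plain substring search. It
uses index() purely as a stand-in for "look for the fixed string '[]'";
it is not meant to show how the engine's FBM search is implemented, and
the sample strings are made up.

```perl
use strict;
use warnings;

my @samples = ("[]", "a[]b", "[", "]", "][", "[x]", "");
my $mismatches = 0;

for my $str (@samples) {
    # Result of the regex from the bug report.
    my $by_regex = $str =~ /[[]]/ ? 1 : 0;
    # Result of a fixed-string search for the two characters "[]".
    my $by_index = index($str, "[]") >= 0 ? 1 : 0;

    $mismatches++ if $by_regex != $by_index;
    printf "%-6s regex=%d index=%d\n", "'$str'", $by_regex, $by_index;
}

print "mismatches: $mismatches\n";
```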


> If others agree, I will patch pod/perlrecharclass.pod.

I agree the analysis is sound, and I /think/ the original
documentation was just a typo, but that does not rule out that this is
a subtle regression and that older perls did actually parse as
documented. So to be sure it would be good to test this on 5.8.x. If
it also reduces down to EXACT <[]> then we are good to go. If not then
this is a regression. My money is on it NOT being a regression (if I
were a betting man anyway).
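
For anyone with an old perl to hand, a version-independent check could
look something like the sketch below. It does not rely on the
-Mre=debug output format, only on which strings match; the expectation
table is my assumption about what "parses as '[' followed by ']'"
should mean in practice.

```perl
use strict;
use warnings;

# If the class holds '[' (as observed), only strings containing "[]" match.
# If the class held ']' (as documented), "]]" would match instead.
my %expect = ("[]" => 1, "]]" => 0, "[[" => 0);

my $ok = 1;
for my $str (sort keys %expect) {
    my $got = $str =~ /[[]]/ ? 1 : 0;
    $ok = 0 if $got != $expect{$str};
    printf "%s: got %d, expected %d\n", $str, $got, $expect{$str};
}

printf "perl %s: %s\n", $], $ok ? "class holds '['" : "differs from expectation";
```

A "differs" result on 5.8.x would point towards a regression rather
than a typo.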

Yves
