
[perl #130648] regcomp.c:6195: void S_pat_upgrade_to_utf8(RExC_state_t *const, char **, STRLEN *, int): Assertion `*(d - 1) == ')'' failed

Father Chrysostomos via RT
January 30, 2017 22:40
On Mon, 30 Jan 2017 08:42:51 -0800, davem wrote:
> On Sun, Jan 29, 2017 at 08:17:33AM -0800, Hugo van der Sanden via RT wrote:
> > On Thu, 26 Jan 2017 02:19:19 -0800, randir wrote:
> > > While fuzzing perl v5.25.9-35-g32207c637b built with afl and run
> > > under libdislocator, I found the following 16-bytes program
> > > 
> > > hexdump -C 0042
> > > 00000000  6d 27 5c 34 30 30 28 3f  7b 3c 3c 7d 29 0a 0a 27
> > > |m'\400(?{<<})..'|
> > > 00000010
> > > 
> > > to cause an assertion failure.
> > 
> > We're hitting S_pat_upgrade_to_utf8() with a code block of
> > "(?{<<})\n\n". My initial suspicion is that that's fine, and the
> > assumption that the last char of such a code block must be ')' is wrong,
> > but I don't know.
> Hmmm... the assertion is correct, the toker is very wrong.
> When compile-time code is seen in a pattern, the code is parsed, so that
> for
>     /abc(?{...})def/
> the toker returns this sequence of tokens:
>     FUNC, '(', const("abc"), 'DO', '{', ...., '}', '(?{...})', 'def', ')'
> As well as the individual parsed tokens for the code block, the text of
> the code block is returned afterwards as a separate const op, which is
> used by re_op_compile() to reconstruct the original text of the regex
> (in case a regex is ever stringified).
> The problem with
>     m{\x{100}(?{<<EOF})
>     x
>     EOF
>     }
> is that the stringification of the code block is being returned by yylex()
> as
>     "(?{<<EOF})\nx\nEOF"
> rather than what I'd expect:
>     "(?{\"x\n\"})"
> (or similar).
> But to a certain extent it depends on how heredocs are supposed to operate
> within regex codeblocks, and how such regexes are supposed to stringify.
> I think FC did a lot of fixups in this area recently.

I fixed up the deparsing of code blocks, by actually deparsing the code inside the regexp, instead of just stringifying it.

Prior to that, I did many fix-ups in the parsing of here-docs, but I don’t recall doing anything specific to (?{...}) blocks; in fact, I think it predated your rewrite of those blocks.

> This is all too horrible to contemplate at the moment.

What’s funny is that the length of the string that is supposed to represent the stringification of the code block amounts to the length of the code block plus the length of the trailing here-doc.  But the code that gets used is a string of that length taken indiscriminately from the source code, beginning at the start of the code block.

In other words,


produces the token PV("(?{<<EOF})123456")

because the here-doc is 6 characters long ("x\nEOF\n" or maybe "\nx\nEOF"--I don’t know which).

So I can get past the assertion by putting a parenthesis at the right spot:

print qr{\x{100}((?{<<EOF})12345)
}, "\n"

This gives me


which is completely wrong.
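
To make the slicing behaviour described above concrete, here is a small standalone C model. It is only a sketch and is not taken from toke.c or regcomp.c: the function names (find_block_start, token_len, take_slice, slice_ends_with_paren) are hypothetical, and the lengths used (10 for "(?{<<EOF})", 6 for the here-doc) are simply the figures from this message. It copies "code block length plus here-doc length" bytes straight from the source, starting at the code block, and then applies the same kind of check as the failing assertion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset of the first "(?{" in the source text.
 * Assumes the source really does contain a code block. */
size_t find_block_start(const char *src)
{
    const char *p = strstr(src, "(?{");
    assert(p != NULL);
    return (size_t)(p - src);
}

/* Token length as described in the message: the length of the code
 * block plus the length of the trailing here-doc. */
size_t token_len(size_t block_len, size_t heredoc_len)
{
    return block_len + heredoc_len;
}

/* Model of the bug: copy `len` bytes of `src` beginning at `start`,
 * without looking at what those bytes actually are. */
void take_slice(const char *src, size_t start, size_t len,
                char *out, size_t out_size)
{
    assert(start + len <= strlen(src));
    assert(len < out_size);
    memcpy(out, src + start, len);
    out[len] = '\0';
}

/* Same shape as the failing check `*(d - 1) == ')'`: is the last
 * character of the slice a closing parenthesis? */
int slice_ends_with_paren(const char *slice)
{
    size_t n = strlen(slice);
    return n > 0 && slice[n - 1] == ')';
}
```

Under these assumptions, slicing 16 bytes from the source "m{\x{100}(?{<<EOF})\nx\nEOF\n}" yields "(?{<<EOF})\nx\nEOF", which ends in 'F', so the check fails. Slicing the same 16 bytes from "qr{\x{100}((?{<<EOF})12345)\n}" yields "(?{<<EOF})12345)", which happens to end in ')', so the check passes even though the text is not the code block.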

Traditionally the stringification of a regular expression with a here-doc body outside the pattern has not included the here-doc body.  It still behaves that way:

$ ./perl -lIlib -e 'print qr/(?{<<EOF})/' -eEOF

I think that is acceptable.  There is really no way to make it behave correctly when stringified and then recompiled as a regexp (which is generally true of code blocks, which may or may not work).

So can we do something similar with here-doc bodies inside the pattern?  (Actually, I thought we were already doing that.  Look for the ‘Paranoia’ comment in toke.c.  Why is that not working?)


Father Chrysostomos

via perlbug:  queue: perl5 status: open
