develooper Front page | perl.perl5.porters | Postings from December 2018

Re: regexec.c: LGTM analysis warns about two comparisons, but weactually need them

Thread Previous | Thread Next
December 6, 2018 09:07
Re: regexec.c: LGTM analysis warns about two comparisons, but weactually need them
Message ID:
James E Keenan <> wrote:
:Here is a case where the analysis of the Perl 5 source code 
:produces results that are at first plausible but ultimately incorrect.

I think the analysis is indeed correct, but it is relying on the stability
of things deliberately hidden behind macros. The question here is not
whether what it has spotted is correct today, but whether relying on that
is a safe thing to do.

The initial check on which it is relying is:

9763:    if (c < NUM_ANYOF_CODE_POINTS) {

.. checking against a macro defined in regcomp.h thus:

> /* This give the number of code points that can be in the bitmap of an ANYOF
>  * node.  The shift number must currently be one of: 8..12.  It can't be less
>  * than 8 (256) because some code relies on it being at least that.  Above 12
>  * (4096), and you start running into warnings that some data structure widths
>  * have been exceeded, though the test suite as of this writing still passes
>  * for up through 16, which is as high as anyone would ever want to go,
>  * encompassing all of the Unicode BMP, and thus including all the economically
>  * important world scripts.  At 12 most of them are: including Arabic,
>  * Cyrillic, Greek, Hebrew, Indian subcontinent, Latin, and Thai; but not Han,
>  * Japanese, nor Korean.  (The regarglen structure in regnodes.h is a U8, and
>  * the trie types TRIEC and AHOCORASICKC are larger than U8 for shift values
>  * below above 12.)  Be sure to benchmark before changing, as larger sizes do
>  * significantly slow down the test suite */
> #define NUM_ANYOF_CODE_POINTS   (1 << 8)

So for the current definition, it is entirely true that code guarded by
that check knows c is less than 256, but the contract implied by those
comments are that things should continue to work correctly even if that
definition changes (within the described range).

Now in the supposedly redundant checks, we're comparing against a literal 256,
and it isn't clear why - it is doing so in the first case just before
an ANYOF table lookup, suggesting a table-size guard that you'd think
would use the NUM_ANYOF_CODE_POINTS macro; in the second case it is
just _after_ such a lookup, which is somewhat crazier, but probably
explained by the '(U8) c' cast in the guarded block (though if that
were the reason, the literal 256 should more correctly/usefully be

If we're lucky, Karl might understand the intent of those tests and
be able to advise. It is certainly possible that one or both is

:By inserting debugging statements (as suggested by xenu++) and then 
:running the test suite, I confirmed that, at least within the test suite 
:the value of 'c' ranges between 0 and 255.
:I then created a branch (jkeenan/lgtm-regexec-c; 
:in which I simply removed those two comparisons.  Notwithstanding the 
:alleged superfluousness of the two comparisons, I got these failures:

Do those test failures go away for you when you remove the debugging
statements? I think it's those causing your problem rather than the
code change.

But for the reasons above even if the tests were to pass we should not
make this change without understanding the purpose of those checks.


Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About