develooper Front page | perl.perl5.porters | Postings from December 2018

Re: regexec.c: LGTM analysis warns about two comparisons, but weactually need them

Thread Previous | Thread Next
James E Keenan
December 6, 2018 13:15
Re: regexec.c: LGTM analysis warns about two comparisons, but weactually need them
Message ID:
On 12/6/18 3:04 AM, wrote:
> James E Keenan <> wrote:
> :Here is a case where the analysis of the Perl 5 source code
> :produces results that are at first plausible but ultimately incorrect.
> I think the analysis is indeed correct, but it is relying on the stability
> of things deliberately hidden behind macros. The question here is not
> whether what it has spotted is correct today, but whether relying on that
> is a safe thing to do.
> The initial check on which it is relying is:
> 9763:    if (c < NUM_ANYOF_CODE_POINTS) {
> .. checking against a macro defined in regcomp.h thus:
>> /* This give the number of code points that can be in the bitmap of an ANYOF
>>   * node.  The shift number must currently be one of: 8..12.  It can't be less
>>   * than 8 (256) because some code relies on it being at least that.  Above 12
>>   * (4096), and you start running into warnings that some data structure widths
>>   * have been exceeded, though the test suite as of this writing still passes
>>   * for up through 16, which is as high as anyone would ever want to go,
>>   * encompassing all of the Unicode BMP, and thus including all the economically
>>   * important world scripts.  At 12 most of them are: including Arabic,
>>   * Cyrillic, Greek, Hebrew, Indian subcontinent, Latin, and Thai; but not Han,
>>   * Japanese, nor Korean.  (The regarglen structure in regnodes.h is a U8, and
>>   * the trie types TRIEC and AHOCORASICKC are larger than U8 for shift values
>>   * below above 12.)  Be sure to benchmark before changing, as larger sizes do
>>   * significantly slow down the test suite */
>> #define NUM_ANYOF_CODE_POINTS   (1 << 8)
> So for the current definition, it is entirely true that code guarded by
> that check knows c is less than 256, but the contract implied by those
> comments are that things should continue to work correctly even if that
> definition changes (within the described range).
> Now in the supposedly redundant checks, we're comparing against a literal 256,
> and it isn't clear why - it is doing so in the first case just before
> an ANYOF table lookup, suggesting a table-size guard that you'd think
> would use the NUM_ANYOF_CODE_POINTS macro; in the second case it is
> just _after_ such a lookup, which is somewhat crazier, but probably
> explained by the '(U8) c' cast in the guarded block (though if that
> were the reason, the literal 256 should more correctly/usefully be
> U8_MAX).
> If we're lucky, Karl might understand the intent of those tests and
> be able to advise. It is certainly possible that one or both is
> redundant.
> :By inserting debugging statements (as suggested by xenu++) and then
> :running the test suite, I confirmed that, at least within the test suite
> :the value of 'c' ranges between 0 and 255.
> :
> :I then created a branch (jkeenan/lgtm-regexec-c;
> :
> :in which I simply removed those two comparisons.  Notwithstanding the
> :alleged superfluousness of the two comparisons, I got these failures:
> Do those test failures go away for you when you remove the debugging
> statements? I think it's those causing your problem rather than the
> code change.

No, the debugging statements were put into the code before I 
experimented with removal of the two comparisons.  They were then 
removed when I created the branch.  So they cannot be the cause of the 

> But for the reasons above even if the tests were to pass we should not
> make this change without understanding the purpose of those checks.
> Hugo

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About