develooper Front page | perl.perl5.porters | Postings from December 2018

Re: regexec.c: LGTM analysis warns about two comparisons, but weactually need them

Thread Previous | Thread Next
Tony Cook
December 6, 2018 09:28
Re: regexec.c: LGTM analysis warns about two comparisons, but weactually need them
Message ID:

On Thu, Dec 06, 2018 at 08:04:44AM +0000, wrote:
> James E Keenan <> wrote:
> :Here is a case where the analysis of the Perl 5 source code 
> :produces results that are at first plausible but ultimately incorrect.
> I think the analysis is indeed correct, but it is relying on the stability
> of things deliberately hidden behind macros. The question here is not
> whether what it has spotted is correct today, but whether relying on that
> is a safe thing to do.
> The initial check on which it is relying is:
> 9763:    if (c < NUM_ANYOF_CODE_POINTS) {
> .. checking against a macro defined in regcomp.h thus:
> > /* This give the number of code points that can be in the bitmap of an ANYOF
> >  * node.  The shift number must currently be one of: 8..12.  It can't be less
> >  * than 8 (256) because some code relies on it being at least that.  Above 12
> >  * (4096), and you start running into warnings that some data structure widths
> >  * have been exceeded, though the test suite as of this writing still passes
> >  * for up through 16, which is as high as anyone would ever want to go,
> >  * encompassing all of the Unicode BMP, and thus including all the economically
> >  * important world scripts.  At 12 most of them are: including Arabic,
> >  * Cyrillic, Greek, Hebrew, Indian subcontinent, Latin, and Thai; but not Han,
> >  * Japanese, nor Korean.  (The regarglen structure in regnodes.h is a U8, and
> >  * the trie types TRIEC and AHOCORASICKC are larger than U8 for shift values
> >  * below above 12.)  Be sure to benchmark before changing, as larger sizes do
> >  * significantly slow down the test suite */
> > #define NUM_ANYOF_CODE_POINTS   (1 << 8)
> So for the current definition, it is entirely true that code guarded by
> that check knows c is less than 256, but the contract implied by those
> comments are that things should continue to work correctly even if that
> definition changes (within the described range).
> Now in the supposedly redundant checks, we're comparing against a literal 256,
> and it isn't clear why - it is doing so in the first case just before
> an ANYOF table lookup, suggesting a table-size guard that you'd think
> would use the NUM_ANYOF_CODE_POINTS macro; in the second case it is
> just _after_ such a lookup, which is somewhat crazier, but probably
> explained by the '(U8) c' cast in the guarded block (though if that
> were the reason, the literal 256 should more correctly/usefully be
> U8_MAX).
> If we're lucky, Karl might understand the intent of those tests and
> be able to advise. It is certainly possible that one or both is
> redundant.

In both cases the code point is used where it's limited to 8 bits.

In the first case it's used as an index in PL_fold_locale[], which is
an array of 256 folded code points.

In the second case it's used as a parameter to Perl_isFOO_lc() for
which the character parameter is a U8.

It might be preferable to make the first c < 256 into a comparison
against a symbolic constant (so a possible local change could be
extending PL_fold_locale to 512 or more fold entries, though the type
would also need to change from unsigned char[] to unsigned short[].

The second might be better to compare against a cast value, eg.
something like c == (U8)c


Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About