develooper Front page | perl.perl5.porters | Postings from August 2010

Re: [perl #77414] bug report

Thread Previous | Thread Next
From:
demerphq
Date:
August 26, 2010 03:15
Subject:
Re: [perl #77414] bug report
Message ID:
AANLkTimJFJ=9auZVxUfYoXC1z97WtgOrVTL-19zco2jd@mail.gmail.com
On 25 August 2010 17:37, Abigail <abigail@abigail.be> wrote:
> On Tue, Aug 24, 2010 at 09:42:14AM -0700, Dave U . Random wrote:
>>
>> This is a bug report for perl from loomisk@trash-mail.com,
>> generated with the help of perlbug 1.39 running under perl 5.12.1.
>>
>> Hello everyone!
>>
>> Using the following code only lines 2 and 3 match:
>>
>> print '- ' =~ /\s*\p{Dash}/;    # Version 1, no match
>> print '- ' =~ /\s*\p{Dash}{1}/; # Version 2, match
>> print '- ' =~ /\s*-/;           # Version 3, match
>
>
> I've committed some test cases as commit c920e01.

Thanks a lot Abigail.

I have done some additional analysis on this problem and I have
concluded a fix is most definitely too large for me to accomplish
while on a laptop on vacation. So i post my analysis here while it is
fresh in my mind:

The bug seems to affect any pattern that starts with the form:

 /X*\p{Whatever}/

where X is any escape style charclass, thus \w, \d, \s, \p{...}, etc.
Also the \p{Whatever} may be in a charclass, or by itself, same bug.

As far as I can tell the problem reduces down to the fact that the
optimiser includes /near/ duplicates of the code used to parse
charclasses, but omits any unicode support.

This is the code from the optimiser:

               case SPACE:
                    if (flags & SCF_DO_STCLASS_AND) {
                        if (!(data->start_class->flags & ANYOF_LOCALE)) {
                            ANYOF_CLASS_CLEAR(data->start_class,ANYOF_NSPACE);
                            for (value = 0; value < 256; value++)
                                if (!isSPACE(value))

ANYOF_BITMAP_CLEAR(data->start_class, value);
                        }
                    }
                    else {
                        if (data->start_class->flags & ANYOF_LOCALE)
                            ANYOF_CLASS_SET(data->start_class,ANYOF_SPACE);
                        else {
                            for (value = 0; value < 256; value++)
                                if (isSPACE(value))
                                    ANYOF_BITMAP_SET(data->start_class, value);
                        }
                    }
                    break;

This is the equivalent code from the parser:

#define _C_C_T_(NAME,TEST,WORD)                         \
ANYOF_##NAME:                                           \
    if (LOC)                                            \
        ANYOF_CLASS_SET(ret, ANYOF_##NAME);             \
    else {                                              \
        for (value = 0; value < 256; value++)           \
            if (TEST)                                   \
                ANYOF_BITMAP_SET(ret, value);           \
    }                                                   \
    yesno = '+';                                        \
    what = WORD;                                        \
    break;                                              \
case ANYOF_N##NAME:                                     \
    if (LOC)                                            \
        ANYOF_CLASS_SET(ret, ANYOF_N##NAME);            \
    else {                                              \
        for (value = 0; value < 256; value++)           \
            if (!TEST)                                  \
                ANYOF_BITMAP_SET(ret, value);           \
    }                                                   \
    yesno = '!';                                        \
    what = WORD;                                        \
    break

Which is used in code like:

case _C_C_T_(SPACE, isSPACE(value), "PerlSpace");

And which is then "finalized" with code like this:

                if (what) {
                    /* Strings such as "+utf8::isWord\n" */
                    Perl_sv_catpvf(aTHX_ listsv, "%cutf8::Is%s\n", yesno, what);
                }

Notice the code in the optimiser entirely omits the yesno/what
setting, and includes NO support for utf8. I contend this is quite
wrong.

We have a huge amount of code duplication here, which in perl IMO
would be readily refactorable into parameterized routines, with C im
less sure how to best proceed to eliminate the duplication, especially
as some of the "functions" used such as isSpace(value) are in fact
macros.

Anyway, a simple solution here might be to simply disable startclass
optimizations in this case. A better solution would be to rework the
handling so the _C_C_T_ macros (stands for "char class test") so that
they can be reused in the optimiser.

Long term however i really think there must be an elegant way to
eliminate the vast amount of code duplication that we have in the
regex engine, which is currently, mostly for maintenance purposes,
hidden away behind macros.

Lastly, im inclined to say that we should raise the priority of this
ticket, I think it is somewhat serious. At the very least we should
make sure that this "optimisation" is disabled As Soon As Possible in
5.12 - its one thing to be slow, but being *wrong* is really bad.

Im actually inclined to say that this might be important enough to
delay the next 5.12.x release until it is fixed. (I think disabling it
should be relatively quick to do, i just dont have enough contiguous
time in my schedule to *promise* it myself - I will try, but....)

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About