Re: [perl #113824] Regexp error messages are not UTF8-clean

From: Brian Fraser
Date: August 30, 2013 16:27
Subject: Re: [perl #113824] Regexp error messages are not UTF8-clean
Message ID: CA+nL+nZPXJvPchS_bS+gKgiBRr0dqC0Jspz03oBOOaFkKdz5Lw@mail.gmail.com

On Tue, Jun 18, 2013 at 8:39 PM, Father Chrysostomos via RT <
perlbug-followup@perl.org> wrote:

> On Tue Jun 18 13:09:32 2013, demerphq wrote:
> > On 18 June 2013 21:47, Father Chrysostomos via RT
> > <perlbug-followup@perl.org> wrote:
> > > On Tue Jun 18 08:36:45 2013, demerphq wrote:
> > >> On 18 June 2013 15:29, Father Chrysostomos via RT
> > >> <perlbug-followup@perl.org> wrote:
> > >> > And this is what it would look like in practice;
> > >> >
> > >> > Perl_croak("Couldn't twiggle the twoggle in \"%"UTF8f"\"", is_utf8, s);
> > >> >
> > >> > UTF8f could take two arguments, the first being a boolean.  That
> > >> > would be the most useful way to implement it.
> > >>
> > >> Very nice idea! ++FC.
> > >
> > > We need the length, too.  Which order should they come?  is_utf8,
> > > len, str?
> >
> > Well that is the order I would expect for a sprintf() type format like
> > this.
>
> I have implemented that in commit 670610ebb, providing groundwork for
> fixing *this* bug.
>
> So I think this ticket should stay open until it is fixed, since it is
> clearly (to me at least) not a won’t-fix any more.
>

Thank you for the UTF8f format, Father C! It was all that was needed to
wrap up this ticket.
The linked branch[*] fixes the error messages from regcomp.c. However, it
introduces one incompatible change, which I've documented in the perldelta.
Copying that here:

---

The "Unknown switch condition" error message has changed slightly.
This error triggers when there is an unknown condition in a (?(foo))
conditional.  The error message used to read:

    Unknown switch condition (?(%s in regex;

But what %s could be was mostly up to luck.  For (?(foobar)), you
might have seen "fo" or "f".  For Unicode characters, you would
generally get a corrupted string.
The message was changed to read:

    Unknown switch condition (?(...)) in regex;

Additionally, the '<-- HERE' marker in the error will now point
to the correct spot in the regex.

---

I was forced to change the error because, at that point in parsing the
regex, we don't yet know whether the (?()) has paired parens. For example,
/(?(foobar/ would trigger the error too, so we can't know where the
condition ends without scanning further into the pattern. It's probably
possible to refactor the code so that the unpaired-parens error triggers
first, which would let us print the entire unknown condition, but that
seemed like too much work for something that would die anyway.
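
To illustrate the problem, here is a rough standalone sketch. It is not
code from regcomp.c or from the branch. The helper condition_length() is
a hypothetical name, and it assumes the condition ends at the first ')',
which ignores nested forms such as lookaround conditions. It only shows
that the extent of the condition is unknowable when the parens are
unpaired:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only (not part of regcomp.c).
 * 'start' is the offset just past the "(?(" that opens the condition.
 * Returns the length of the condition text if a closing ')' is found,
 * or -1 if the parens are unpaired and the end cannot be determined.
 * Simplifying assumption: the first ')' closes the condition. */
static long condition_length(const char *pat, size_t start)
{
    const char *cond;
    const char *close;

    assert(pat != NULL);
    assert(start <= strlen(pat));

    cond = pat + start;
    close = strchr(cond, ')');   /* scan ahead for the closing paren */
    return close ? (long)(close - cond) : -1;
}
```

With this sketch, condition_length("(?(foobar)a|b)", 3) gives 6, the
length of "foobar", while condition_length("(?(foobar", 3) gives -1:
there is nothing to print except a placeholder like "...".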

[*] https://github.com/Hugmeir/utf8mess/tree/regcomp_cleanliness
