
[perl #113824] Regexp error messages are not UTF8-clean

From: Father Chrysostomos via RT
Date: June 18, 2013 13:17
Subject: [perl #113824] Regexp error messages are not UTF8-clean
Message ID: rt-3.6.HEAD-2552-1371561435-851.113824-15-0@perl.org

On Tue Jun 18 02:54:32 2013, demerphq wrote:
> On 18 June 2013 03:28, Father Chrysostomos via RT
> <perlbug-followup@perl.org> wrote:
> > On Mon Jun 17 17:58:45 2013, jkeenan wrote:
> >> On Sun Jun 24 14:18:50 2012, webmasters@ctosonline.org wrote:
> >> > On a UTF-8 terminal:
> >> >
> >> > $ ./perl -Ilib -CS -e 'use utf8; /�+++/'
> >> > Nested quantifiers in regex; marked by <-- HERE in m/ü+++ <-- HERE /
> >> >    at -e line 1.
> >> >
> >> > ---
> >> > Flags:
> >> >     category=core
> >> >     severity=low
> >> > ---
> >> > Site configuration information for perl 5.17.0:
> >>
> >> This ticket has not received a response since filing more than a
> >> year ago.
> >>
> >> Could someone who understands what a "UTF-8 terminal" is take a
> >> look?
> >
> > A UTF-8 terminal is one that uses UTF-8 for its character set, so that
> > typing the character ā inputs the sequence c4 81 and likewise the
> > sequence c4 81 is displayed as ā.
> >
> > Now, RT has managed to screw it up completely, but the bug can be
> > demonstrated like this instead:
> >
> > $ ./perl -Ilib -CS -e '$c = chr 0x100; /$c+++/' 2>&1 | LANG=C less -U
> >
> > And less shows:
> >
> > Nested quantifiers in regex; marked by <-- HERE in m/<C3><84><C2><80>+++ <-- HERE / at -e line 1.
> >
> > I have -CS set, so the standard handles should output UTF-8.
> > c3 84 c2 80 is not the UTF-8 sequence for chr 0x100.
> >
> > Another way to demonstrate it:
> >
> > use Data::Dumper;
> > ++$Data::Dumper::Useqq;
> >
> > $c = chr 0x100;
> > print Dumper $c;
> > eval '/$c+++/';
> > print Dumper $@;
> > __END__
> > $VAR1 = "\x{100}";
> > $VAR1 = "Nested quantifiers in regex; marked by <-- HERE in
> > m/\304\200+++ <-- HERE / at (eval 3) line 1.\n";
> >
> > The \304\200 should be \x{100}.
> 
> This is a Perl API fail. I do not see how it can be fixed without
> grievous trauma. Apparently much of our internal error message
> handling code is not UTF8 safe.
> 
> See the code for vFAIL() in regcomp.c which calls Perl_croak() which
> calls vcroak().
> 
> The interface for Perl_croak() and friends does not support UTF8 at all.
> They accept only a char* pointer, and have no facility for a UTF8
> flag.
> 
> We could fix the direct problem by rewriting all the code in the regex
> engine which uses UTF8, but imo that is just a bandage. The real
> problem is our core APIs were never modernized to work properly with
> Unicode.
> 
> IMO, this ticket should be closed as a "won't fix", or merged with a
> ticket which relates to our internal error reporting APIs lacking
> proper Unicode support and fixed as part of resolving THAT ticket.
> 
> Also IMO, if we want to really fix this stuff we should just bite the
> bullet, deprecate ALL of the char * only internal APIs and switch to
> something that ALWAYS includes a utf8 flag. Across the board.

All our printf-style functions *do* support utf8, but just not as a
separate flag.  The input must be a HEK or SV.  Look for instances of
HEKf and SVf in the perl source.

Now, it has just occurred to me that it would not be at all hard to add
support for utf8 char*s.  We have pretty much infinite space in %-p
formats, so we can define %-4p to whatever we want.

The latter would not be necessary for fixing this bug, but it may be a
good thing to do anyway, and it might turn out to be the simplest way to
fix this bug.
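
For anyone puzzling over the bytes quoted above: c3 84 c2 80 is what you
get when the correct UTF-8 for chr 0x100 (c4 80) is treated as two
Latin-1 characters and encoded a second time.  A minimal standalone
sketch in C, assuming nothing from the perl source (utf8_encode and
utf8_upgrade_bytes are made-up names for illustration, not perl API
functions):

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (needs room for 4 bytes).
 * Returns the number of bytes written.  No validation of surrogates
 * or out-of-range values; this is only an illustration. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* The mistake: treat every input byte as a Latin-1 character and
 * encode it as UTF-8 again.  out needs room for 2 * len bytes.
 * Returns the number of bytes written. */
size_t utf8_upgrade_bytes(const unsigned char *in, size_t len,
                          unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += utf8_encode(in[i], out + n);
    return n;
}
```

Calling utf8_encode(0x100, buf) fills buf with c4 80 (the \304\200 in
the Dumper output), and passing those two bytes through
utf8_upgrade_bytes yields c3 84 c2 80, which is what less displays.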

-- 

Father Chrysostomos


---
via perlbug:  queue: perl5 status: open
https://rt.perl.org:443/rt3/Ticket/Display.html?id=113824
