
Re: [perl #113824] Regexp error messages are not UTF8-clean

From:
demerphq
Date:
June 18, 2013 09:53
Subject:
Re: [perl #113824] Regexp error messages are not UTF8-clean
Message ID:
CANgJU+UTzKyV9ktZwD=ViG2GOfw6VspniKpgJf3KtJvKfXW4fA@mail.gmail.com
On 18 June 2013 03:28, Father Chrysostomos via RT
<perlbug-followup@perl.org> wrote:
> On Mon Jun 17 17:58:45 2013, jkeenan wrote:
>> On Sun Jun 24 14:18:50 2012, webmasters@ctosonline.org wrote:
>> > On a UTF-8 terminal:
>> >
>> > $ ./perl -Ilib -CS -e 'use utf8; /ü+++/'
>> > Nested quantifiers in regex; marked by <-- HERE in m/ü+++ <-- HERE /
>> >    at -e line 1.
>> >
>> > ---
>> > Flags:
>> >     category=core
>> >     severity=low
>> > ---
>> > Site configuration information for perl 5.17.0:
>>
>> This ticket has not received a response since filing more than a year ago.
>>
>> Could someone who understands what a "UTF-8 terminal" is take a look?
>
> A UTF-8 terminal is one that uses UTF-8 for its character set, so that
> typing the character ā inputs the sequence c4 81 and likewise the
> sequence c4 81 is displayed as ā.
>
> Now, RT has managed to screw it up completely, but the bug can be
> demonstrated like this instead:
>
> $ ./perl -Ilib -CS -e '$c = chr 0x100; /$c+++/' 2>&1 | LANG=C less -U
>
> And less shows:
>
> Nested quantifiers in regex; marked by <-- HERE in m/<C3><84><C2><80>+++
> <-- HERE / at -e line 1.
>
> I have -CS set, so the standard handles should output UTF-8.  c3 84 c2
> 80 is not the UTF-8 sequence for chr 0x100.
>
> Another way to demonstrate it:
>
> use Data::Dumper;
> ++$Data::Dumper::Useqq;
>
> $c = chr 0x100;
> print Dumper $c;
> eval '/$c+++/';
> print Dumper $@;
> __END__
> $VAR1 = "\x{100}";
> $VAR1 = "Nested quantifiers in regex; marked by <-- HERE in
> m/\304\200+++ <-- HERE / at (eval 3) line 1.\n";
>
> The \304\200 should be \x{100}.

This is a Perl API failure. I do not see how it can be fixed without
grievous trauma. Apparently much of our internal error message
handling code is not UTF8 safe.

See the code for vFAIL() in regcomp.c which calls Perl_croak() which
calls vcroak().

The interface for Perl_croak() and friends does not support UTF8 at all.
These functions accept only a char* pointer, and have no facility for a
UTF8 flag.
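
To make the failure mode concrete, here is a minimal C sketch of what
happens when bytes that are already UTF8 travel through a char*-only
interface and are later upgraded as if they were Latin-1. This is not
Perl's actual code; encode_utf8() and upgrade_as_latin1() are names made
up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point below U+0800 as UTF-8.
 * Returns the number of bytes written (1 or 2). */
size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    assert(cp < 0x800); /* this sketch only handles 1- and 2-byte forms */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
}

/* Re-encode a buffer as if every byte were a Latin-1 character.
 * With no UTF8 flag to say "these bytes are already encoded", the
 * receiver has no way to know this step is wrong.
 * out must have room for 2 * len bytes. */
size_t upgrade_as_latin1(const unsigned char *in, size_t len,
                         unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += encode_utf8(in[i], out + n);
    return n;
}
```

Encoding U+0100 with encode_utf8() gives c4 80. Passing those two bytes
through upgrade_as_latin1() gives c3 84 c2 80, which is the sequence
less displayed in the quoted report.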

We could fix the direct problem by rewriting all the code in the regex
engine which uses UTF8, but imo that is just a bandage. The real
problem is that our core APIs were never modernized to work properly
with Unicode.

IMO, this ticket should be closed as a "won't fix", or merged with a
ticket which relates to our internal error reporting APIs lacking
proper Unicode support and fixed as part of resolving THAT ticket.

Also IMO, if we want to really fix this stuff we should just bite the
bullet, deprecate ALL of the char*-only internal APIs and switch to
something that ALWAYS includes a utf8 flag. Across the board.
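
As a rough illustration of what "always includes a utf8 flag" could
look like, here is a small C sketch. It is only a sketch under my own
assumptions: flagged_str and emit_utf8() are hypothetical names, not
existing or proposed Perl API, and the non-UTF8 case is assumed to be
Latin-1.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message argument: the bytes always travel together
 * with a flag saying how they are encoded. */
typedef struct {
    const unsigned char *bytes;
    size_t len;
    bool is_utf8; /* true: bytes are UTF-8; false: bytes are Latin-1 */
} flagged_str;

/* Write s to out as UTF-8 and return the number of bytes written.
 * Flagged input is copied unchanged; unflagged input is upgraded
 * exactly once. out must have room for 2 * s.len bytes. */
size_t emit_utf8(flagged_str s, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < s.len; i++) {
        unsigned char b = s.bytes[i];
        if (s.is_utf8 || b < 0x80) {
            out[n++] = b; /* already valid in the output encoding */
        } else {
            out[n++] = (unsigned char)(0xC0 | (b >> 6));
            out[n++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return n;
}
```

With the flag set, the bytes c4 80 come out as c4 80 instead of being
encoded a second time. With the flag clear, a Latin-1 byte such as fc
is upgraded once, to c3 bc.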

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
