
Re: [perl #38293] chr(65535) should be allowed in regexes

From:
SADAHIRO Tomoyuki
Date:
April 2, 2006 06:44
Subject:
Re: [perl #38293] chr(65535) should be allowed in regexes
Message ID:
20060402224657.B942.BQW10602@nifty.com

On Sat, 21 Jan 2006 18:39:26 +0100, Marc Lehmann <schmorp@schmorp.de> wrote

> On Sat, Jan 21, 2006 at 04:18:21AM -0800, Tels via RT <perlbug-followup@perl.org> wrote:
> > doesn't yield any results, maybe it is used as flags += 4; or something - 
> > though I doubt it.
> > 
> > Disabling the warnings just works around the bug that FFFF is not allowed 
> > and there seems to be no way to actually allow it.
> 
> Well, not allowing FFFF has some merit, too, but either all illegal
> codepoints should be disallowed or none at all.
> 
> The "Malformed UTF-8"... is also not quite a warning, as the resulting
> regex won't work (works neither in s/// nor in y///, and its not related
> to character constants):
> 
>    # perl -e '$c = chr 65535; $c=~s/$c//g; print $c'|xxd
>    Malformed UTF-8 character (character 0xffff) in regexp compilation at -e line 1.
>    Malformed UTF-8 character (character 0xffff) in regexp compilation at -e line 1.
>    0000000: efbf bf                                  ...

At least outside the scope of use warnings 'utf8', code points such
as U+FFFF should be allowed.
Perl-current allows s/\x{ffff}//g (escaped) to remove U+FFFF,
but allows neither tr/\x{ffff}//d nor s/${\chr(0xffff)}//g (interpolated
and parsed as a literal); that is inconsistent.

A patch is attached to this mail;
   filename: allowFFFF.patch.gz

In it I define a UTF8_ALLOW_DEFAULT macro in utf8.h, to make the choice
of flags for utf8n_to_uv(chr|uni) consistent.

#define UTF8_ALLOW_DEFAULT  (ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANYUV)
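
To illustrate how this macro picks the flags, here is a minimal standalone
sketch. It is not the code from the patch: the bit values are placeholders
rather than the ones in utf8.h, and ckWARN_UTF8(), accepts_ffff() and
default_accepts_ffff() are hypothetical stand-ins for ckWARN(WARN_UTF8) and
for the check a decoder such as utf8n_to_uvchr would make.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for illustration only; the real definitions
 * live in perl's utf8.h and may differ. */
#define UTF8_ALLOW_FFFF   0x0040u
#define UTF8_ALLOW_ANYUV  0x0069u   /* assumed to include UTF8_ALLOW_FFFF */

/* Hypothetical stand-in for ckWARN(WARN_UTF8): true while the current
 * scope has "use warnings 'utf8'" in effect. */
static bool utf8_warnings_on = true;
#define ckWARN_UTF8() (utf8_warnings_on)

/* Same shape as the macro in the mail: strict when utf8 warnings are
 * enabled, permissive about odd code points when they are not. */
#define UTF8_ALLOW_DEFAULT (ckWARN_UTF8() ? 0u : UTF8_ALLOW_ANYUV)

/* Toy check: would a decoder given these flags accept U+FFFF? */
static bool accepts_ffff(unsigned flags)
{
    return (flags & UTF8_ALLOW_FFFF) != 0;
}

/* Evaluate the default flags under a given warnings setting. */
static bool default_accepts_ffff(bool warnings_on)
{
    /* Sanity check on the placeholder layout assumed above. */
    assert((UTF8_ALLOW_ANYUV & UTF8_ALLOW_FFFF) != 0);
    utf8_warnings_on = warnings_on;
    return accepts_ffff(UTF8_ALLOW_DEFAULT);
}
```

Under these assumptions, default_accepts_ffff(false) is true and
default_accepts_ffff(true) is false, so every caller that passes
UTF8_ALLOW_DEFAULT makes the same choice for U+FFFF.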

cf. a report on which flags are used in perl-current:
  http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2006-01/msg00842.html

The reason why utf8n_to_uvchr in S_reginclass is passed
 (UTF8_ALLOW_DEFAULT & UTF8_ALLOW_ANYUV) instead of UTF8_ALLOW_DEFAULT
is that the problem of [perl #37836] should not come back if
UTF8_ALLOW_DEFAULT were ever changed to include UTF8_ALLOW_ANY instead
of UTF8_ALLOW_ANYUV.
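
The effect of that mask can be sketched as follows. This is only an
illustration with placeholder bit values (not the ones in utf8.h);
reginclass_flags() is a hypothetical helper, and UTF8_ALLOW_CONTINUATION
is used here merely as an example of a malformation flag assumed to lie
outside UTF8_ALLOW_ANYUV.

```c
#include <assert.h>

/* Placeholder bit values for illustration only; the real definitions
 * live in perl's utf8.h and may differ. */
#define UTF8_ALLOW_CONTINUATION 0x0002u  /* example malformation flag */
#define UTF8_ALLOW_ANYUV        0x0069u  /* odd code points only */
#define UTF8_ALLOW_ANY          0x00FFu  /* every allowance at once */

/* Hypothetical helper: the flags handed to the decoder in the
 * character-class matcher, i.e. the default masked by ANYUV. */
static unsigned reginclass_flags(unsigned allow_default)
{
    /* The placeholder ANYUV is assumed to be a subset of ANY. */
    assert((UTF8_ALLOW_ANYUV & ~UTF8_ALLOW_ANY) == 0u);
    return allow_default & UTF8_ALLOW_ANYUV;
}
```

With the current definition the mask changes nothing: a default of 0
stays 0 and a default of UTF8_ALLOW_ANYUV stays UTF8_ALLOW_ANYUV. If the
default were later widened to UTF8_ALLOW_ANY, the mask would still drop
the malformation bits before they reach the decoder.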

The patch also includes a test for #37836 as well as tests for
this problem, #38293.

Regards,
SADAHIRO Tomoyuki
