develooper Front page | perl.perl5.porters | Postings from September 2013

Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode, all \p{} matches fail; all \P{} matches succeed"

Karl Williamson
September 28, 2013 18:55
I have very little time now and for the next few weeks to devote to Perl.

On 09/16/2013 04:09 PM, Ricardo Signes wrote:
> * Karl Williamson <> [2013-09-14T15:08:17]
>> "Any matches all code points. This could also be captured with
>> [\x{0}-\x{10FFFF}].... In some regular expression languages, \p{Any}
>> may be expressed by a period, but that may exclude newline
>> characters."
>> Even a non-purist is expecting "Any" to match only up through
>> 10FFFF. That's why I was leaning against changing it.  Since I was wrong
>> about "All", that would further argue that we can change it, and
>> leave "Any" alone.
> If /./s matches \p{Any}, does that mean it excludes 0x11_0000?

It doesn't match \p{Any}.  See below.

> I think if \p{All} matches all values, then /./s should probably mean that.  To
> say, "I am a person who thinks about Unicode a lot," you may want to
> specifically say \p{Any}.
> Does that seem reasonable to you, too?

I don't understand what you're saying, but I think we agree.

/./s currently matches any code point representable on the platform
(currently [\x{0}-\x{UV_MAX}]; see footnote *).  I'm proposing changing
\p{All}, which is a Perl extension, to be the same as /./s.  I did not
find any actual uses of \p{All} on CPAN.
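
As a rough sketch of the distinction (an illustration only, not taken
from the perl source), the following compares /./s with \p{Any} on the
last Unicode code point and on the first one above it.  The hash name
%matched and its keys are made up for the example.  \p{All} is left out
on purpose, since its behaviour is the thing under discussion here.

```perl
use strict;
use warnings;
no warnings 'non_unicode';    # silence warnings about above-Unicode matches

my $max   = chr(0x10FFFF);    # highest Unicode code point
my $above = chr(0x110000);    # first code point past the Unicode range

# Record whether each construct matches each single-character string.
my %matched = (
    dot_s_max   => ($max   =~ /\A.\z/s       ? 1 : 0),
    dot_s_above => ($above =~ /\A.\z/s       ? 1 : 0),
    any_max     => ($max   =~ /\A\p{Any}\z/  ? 1 : 0),
    any_above   => ($above =~ /\A\p{Any}\z/  ? 1 : 0),
);

for my $case (sort keys %matched) {
    printf "%-12s %s\n", $case, $matched{$case} ? 'matches' : 'fails';
}
```

/./s accepts both strings, while \p{Any} accepts only the one at or
below 0x10FFFF.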

I think we should leave \p{Any} matching [\x00-\x{10FFFF}], as the
Unicode consortium says it should.  We could also add an extension,
\p{Unicode}, to be a synonym for \p{Any}.  And we could change the
documentation to tell people to use these two if they want stricter
Unicode semantics; I think that's what you were getting at.
>>> 1) All Unicode property matches fail.
>>>     (Which means that both \p{PROPERTY=false} and \p{PROPERTY=true} fail,
>>>      even though that seems a contradiction.)
>>> 2) The code points are treated as if they are unassigned Unicode code points.
>>>     (Exactly one of \p{PROPERTY=false} and \p{PROPERTY=true} will match.)
>>> And that your current favoured approach is to take the behaviour of (2),
>>> but warn if it differs from the behaviour of (1).
>> The above appears to me to be an accurate restatement of what I was
>> trying to say.
> I think this is a good solution.
>>> But what it doesn't (directly) offer is a way for a Unicode purist to treat
>>> as fatal any attempt to match an out-of-range code point.
>> Exactly. This proposal doesn't fully support the purist approach,
>> and that is problematic.
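
To make options (1) and (2) quoted above concrete, here is a small
probe that reports which of them the running perl exhibits for one
binary property.  ASCII_Hex_Digit is chosen only as a convenient
example, and the variable names are made up.  As far as I know, perls
from v5.20 on behave like (2) and earlier ones like (1), but the script
simply reports what it observes.

```perl
use strict;
use warnings;
no warnings 'non_unicode';    # we only care about the match results here

my $above = chr(0x110000);    # an above-Unicode code point

# Test the same code point against both values of a binary property.
my $is_true  = $above =~ /\p{ASCII_Hex_Digit=True}/  ? 1 : 0;
my $is_false = $above =~ /\p{ASCII_Hex_Digit=False}/ ? 1 : 0;

# Option (1): both fail.  Option (2): exactly one of them matches.
my $observed = ($is_true + $is_false == 1)
    ? 'option 2 (treated like an unassigned code point)'
    : 'option 1 (every \p{} match fails)';

print "=True:  ", ($is_true  ? 'matches' : 'fails'), "\n";
print "=False: ", ($is_false ? 'matches' : 'fails'), "\n";
print "This perl ($]) behaves like $observed\n";
```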

I was leaning towards more purist support, but the problem is that we
can't really guarantee to catch all such instances.  For example,
suppose a pattern contains a Unicode property, and we know the pattern
as a whole requires a string of at least N bytes to match.  If we are
presented with a shorter string, we currently just fail.  We don't want
to attempt the match anyway, merely to raise a warning, just in case
there is an above-Unicode code point that could be matched against the
Unicode property on the road to inevitable failure.
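
A sketch of that situation, assuming the minimum-length shortcut
applies to this pattern: the pattern below needs at least ten
characters, so a one-character subject should be rejected before any
\p{} test is made, and the warning never gets a chance to fire.  The
pattern, the variable names, and the __WARN__ handler are illustrative
only.  Whether the longer subject warns may depend on the perl version,
so the script prints the counts rather than assuming them.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be counted.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

my $above = chr(0x110000);    # one character, four bytes internally

# Subject is shorter than the pattern's minimum length.
my $short_matched  = ($above =~ /\p{Unassigned}{10}/) ? 1 : 0;
my $short_warnings = scalar @warnings;

# Subject is long enough, so the property really has to be tested.
@warnings = ();
my $long_matched  = (($above x 10) =~ /\p{Unassigned}{10}/) ? 1 : 0;
my $long_warnings = scalar @warnings;

print "short subject: matched=$short_matched warnings=$short_warnings\n";
print "long subject:  matched=$long_matched warnings=$long_warnings\n";
```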

Thus, I don't think we can efficiently and accurately implement an 
approach that allows the purist to always get a warning that they can 

> I recently advised someone:
>    If your code really really needs to make sure that its existing regexes
>    never, ever match non-ASCII characters, you first need to scan for \P{ASCII}
>    and then do the rest of your work, because Perl really wants to say text
>    operations imply Unicode.
> Perhaps it's "imply Unicode++" and similarly nervous users should also be
> scanning for \P{Unicode}.

It's a contradiction to be using Unicode properties and expecting to 
work only on ASCII.  The whole point of Unicode is to go beyond ASCII, 
and beyond single locales.
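
For what it's worth, the pre-scan described in the quoted advice could
look something like the sketch below.  The function names
first_non_ascii and first_non_unicode are made up for illustration, and
\P{Any} stands in for the proposed \P{Unicode}, which does not exist
yet.

```perl
use strict;
use warnings;
no warnings 'non_unicode';    # the scan itself should stay quiet

# Return "U+XXXX" for the first non-ASCII character, or undef if none.
sub first_non_ascii {
    my ($text) = @_;
    return $text =~ /(\P{ASCII})/ ? sprintf('U+%04X', ord $1) : undef;
}

# Return "U+XXXX" for the first above-Unicode code point, or undef if none.
sub first_non_unicode {
    my ($text) = @_;
    return $text =~ /(\P{Any})/ ? sprintf('U+%04X', ord $1) : undef;
}

for my $text ("plain text", "caf\xE9", "a" . chr(0x110000)) {
    my $ascii   = first_non_ascii($text)   // 'none';
    my $unicode = first_non_unicode($text) // 'none';
    print "first non-ASCII: $ascii, first non-Unicode: $unicode\n";
}
```

A caller would run such a scan first and refuse the input if either
function returns a value, before applying its real patterns.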

If the code uses /aa and never has utf8 strings, it shouldn't match any
non-Latin1.  But things like \W and [[:^alpha:]] will match non-ASCII
even so.  If someone wants to match only ASCII, they shouldn't be using
any Unicode property match at all, and they shouldn't be using any
complemented POSIX class either.
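
A short illustration of the complement problem, using a Latin-1 e-acute
(0xE9) as the subject.  The lookahead in the last two entries is just
one possible way to keep a complemented class inside ASCII, and the
hash name %r is made up for the example.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";    # non-ASCII, non-utf8 Latin-1 character

my %r = (
    # Under /aa, \w is ASCII-only, so it rejects e-acute ...
    word_aa       => ($e_acute =~ /\w/aa            ? 1 : 0),
    # ... which means the complements accept it.
    nonword_aa    => ($e_acute =~ /\W/aa            ? 1 : 0),
    nonalpha_aa   => ($e_acute =~ /[[:^alpha:]]/aa  ? 1 : 0),
    # Restricting the complement to ASCII with a lookahead.
    guarded_acute => ($e_acute =~ /(?=[\x00-\x7F])\W/aa ? 1 : 0),
    guarded_bang  => ('!'      =~ /(?=[\x00-\x7F])\W/aa ? 1 : 0),
);

printf "%-14s %d\n", $_, $r{$_} for sort keys %r;
```

Only the guarded form rejects the e-acute while still accepting an
ASCII non-word character such as "!".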

* There has been some thought to making IV_MAX be the largest code point 
we accept on a platform, given that many operations won't work for 
larger code points.
