
Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode, all \p{} matches fail; all \P{} matches succeed"

Karl Williamson
September 14, 2013 19:08
On 09/13/2013 02:37 AM, Nicholas Clark wrote:
> On Thu, Sep 12, 2013 at 10:35:18PM -0400, Ricardo Signes wrote:
>> * Karl Williamson [2013-09-11T22:57:24]
>>> I've done some more thinking about this, and am presenting here my
>>> current thoughts.
>> Thanks for this.  I think I am on board with you, for the most part.
>> Figuring out who the various I's are in DWIM is always a good exercise.
> Having re-read the thread several times, I think I have my head round the
> conflicting requirements and expectations. Thanks for taking the time
> to explain them.
>>> There are some flies in the ointment though.  I don't think \p{Any}
>>> and \p{All} should match anything but strictly Unicode code points.
>>> We already have a well established way to match all code points, and
>>> that is to use the dot ".".  But I'm open to arguments the other
>>> way.
>> A dot might work, but then you're thinking about /s again.  But more to the
>> point, I just wonder why you think those properties shouldn't match?  Do you
>> think users will be trying to exclude weird codepoints by using those?  I guess
>> what I want to know is:  who is the DWIM person who benefits from excluding
>> \x{FF_FFFF}?  Does he or she know that there may be trans-Unicode points
>> incoming and want to exclude them?  If so, why not fatalize warnings, or
>> also require Unicode explicitly?  Or are we protecting them from weird input?
> I thought the same about "." and the /s flag.
> To check - "All" and "Any" are Perl defined extensions, not Unicode
> consortium? And they have always been documented both as (a) synonyms
> (b) to match [\x{0000}-\x{10FFFF}] ?

I had forgotten about needing the /s flag on dot to make it equivalent.
And I was mistaken in my understanding of "Any" and "All", or else
Unicode has changed its text since I last studied this aspect of it (I
didn't bother to go digging).  "All" doesn't appear as a property there
(I thought it did), so it must be a Perl extension, and so I would
suggest that we change it to match every possible code point on the
platform, that is, decouple it from "Any".
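
To illustrate the point about dot and /s, here is a minimal sketch. The
variable names are illustrative and not from the thread, and it assumes
a perl (5.14 or later) where chr(0x110000) yields a single
above-Unicode character without complaint.

```perl
use strict;
use warnings;

my $newline   = "\n";
my $above_max = chr(0x110000);   # one past the Unicode maximum, U+10FFFF

# Without /s, dot refuses to match a newline.
my $dot_plain_nl = ($newline =~ /\A.\z/) ? 1 : 0;

# With /s, dot matches any single character, newline included.
my $dot_s_nl = ($newline =~ /\A.\z/s) ? 1 : 0;

# Dot is not a Unicode property, so it also matches a code point
# above U+10FFFF.
my $dot_s_above = ($above_max =~ /\A.\z/s) ? 1 : 0;

print "dot without /s matches newline: $dot_plain_nl\n";   # 0
print "dot with /s matches newline:    $dot_s_nl\n";       # 1
print "dot with /s matches 0x110000:   $dot_s_above\n";    # 1
```

So /./s is the existing spelling of "every code point on the platform",
which is what the proposal above would make \p{All} mean as well.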

And strictly, "Any" is a Perl extension, but that's only in the very
strictest sense.  It is defined in "Unicode Technical Standard #18:
Unicode Regular Expressions", which isn't strictly considered part of
the Standard proper.  Perl has nevertheless tried to follow its
"recommendations", which are phrased in the text as requirements.  It
is, in effect, a de facto standard.  It defines "Any" as (with slight
editing):

"Any matches all code points. This could also be captured with
[\x{0}-\x{10FFFF}].... In some regular expression languages, \p{Any} may 
be expressed by a period, but that may exclude newline characters."

Even a non-purist is expecting "Any" to match only up through 10FFFF.
That's why I am leaning against changing it.  Since I was wrong about
"All", that would further argue that we can change it, and leave "Any"
as it is.

>>> But I'm tempted to move somewhat more towards the Perlish side of
>>> things, and change the warning/error message so that it is raised
>>> only when the result would be different under a Perlish vs Unicodish
>>> regime.
>> I think this makes sense.  Maybe I want to think about it more, or see whether
>> somebody else has an objection. :)
> So, if I have the summary correct
> There are three (obvious) ways to treat all code points outside of Unicode's
> range \x{0000}-\x{10FFFF}
> 0) Just croak.
>     (A very purist Unicode approach.)
> 1) All Unicode property matches fail.
>     (Which means that both \p{PROPERTY=false} and \p{PROPERTY=true} fail,
>      even though that seems a contradiction.)
> 2) The code points are treated as if they are unassigned Unicode code points.
>     (Exactly one of \p{PROPERTY=false} and \p{PROPERTY=true} will match.)
> And that your current favoured approach is to take the behaviour of (2),
> but warn if it differs from the behaviour of (1).

The above appears to me to be an accurate restatement of what I was 
trying to say.
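
The options summarized above can be written out as a toy model. This is
only a sketch: match_above_unicode() and the policy names are made up
for illustration and are not perl internals, and it assumes a binary
property that is false for unassigned code points, which is the typical
case.

```perl
use strict;
use warnings;

# Toy model: how \p{PROPERTY=$wanted} could treat a code point above
# U+10FFFF under each option. Returns (matched, would_warn).
sub match_above_unicode {
    my ($policy, $wanted) = @_;    # $wanted is 'true' or 'false'

    # Option 1: every Unicode property match fails.
    my $all_fail = 0;

    # Option 2: act like an unassigned code point, for which the
    # property is assumed false, so only PROPERTY=false matches.
    my $as_unassigned = $wanted eq 'false' ? 1 : 0;

    die "code point is not Unicode\n" if $policy eq 'croak';        # option 0
    return ($all_fail, 0)             if $policy eq 'all_fail';     # option 1
    return ($as_unassigned, 0)        if $policy eq 'unassigned';   # option 2

    # Favoured approach: option 2's answer, flagged with a warning
    # whenever it differs from option 1's answer.
    return ($as_unassigned, $as_unassigned != $all_fail ? 1 : 0)
        if $policy eq 'favoured';

    die "unknown policy '$policy'\n";
}

for my $wanted (qw(true false)) {
    my ($matched, $warn) = match_above_unicode('favoured', $wanted);
    print "\\p{PROPERTY=$wanted}: matched=$matched warn=$warn\n";
}
# \p{PROPERTY=true}: matched=0 warn=0
# \p{PROPERTY=false}: matched=1 warn=1
```

In this model the warning fires only for \p{PROPERTY=false}, the one
case where the Perlish and Unicodish answers disagree.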

> Because
> 1) this means that it's still viable to use out-of-range code points for
>     "internal" purposes without generating so many warnings that they get
>     turned off
> 2) it permits a warning that is useful to leave on by default
> 3) the warning can be made fatal for strict(er) behaviour

I would add that the warning's sole purpose is to notify you that you
are applying a Unicode construct to non-Unicode things.  A similar
warning is generated if you apply uc(), lc(), etc. to a non-Unicode
code point.  If this is what the code is supposed to be doing, simply
turn off this warning category.  The point of making it a category of
its own was so that these are the only messages in it, and so turning
it off doesn't mean you might miss something you care about.
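
As a sketch of how that looks in code: the category in question is
'non_unicode', and the example below uses the uc() case mentioned above
because its warning does not depend on how the \p{} discussion is
resolved. The variable names are illustrative, and a perl of 5.14 or
later is assumed.

```perl
use strict;
use warnings;

my $above_max = chr(0x110000);   # above the Unicode maximum, U+10FFFF
my @caught;
local $SIG{__WARN__} = sub { push @caught, $_[0] };   # collect, don't print

my $upper;
{
    no warnings 'non_unicode';   # only this one category is switched off
    $upper = uc $above_max;
}
my $warned_when_off = @caught ? 1 : 0;

$upper = uc $above_max;          # the category is back on out here
my $warned_when_on = @caught ? 1 : 0;

# Stricter behaviour: promote the same category to a fatal error.
my $survived = eval {
    use warnings FATAL => 'non_unicode';
    $upper = uc $above_max;
    1;
};
my $died_under_fatal = $survived ? 0 : 1;

print "warned with category off: $warned_when_off\n";
print "warned with category on:  $warned_when_on\n";
print "died under FATAL:         $died_under_fatal\n";
```

The expectation is no warning inside the "no warnings" block, a warning
outside it, and an exception under FATAL. Because the setting is
lexical, it can be confined to the code that deliberately handles
above-Unicode code points.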

> But what it doesn't (directly) offer is a way for a Unicode purist to treat
> as fatal any attempt to match an out-of-range code point.

Exactly. This proposal doesn't fully support the purist approach, and 
that is problematic.

I want to emphasize that currently there is a potential security hole
with collating non-Unicode code points.  This will soon be closed, but
it shows the potential dangers of not following the Standard precisely.
Unicode makes mistakes, but generally, they have more expertise in
these matters and have thought things through more than we have.  Thus
it seems like the right thing to fully support a purist approach.
