Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode, all \p{} matches fail; all \P{} matches succeed"

Karl Williamson
December 31, 2013 16:04
On 11/27/2013 09:52 PM, Karl Williamson wrote:
> On 09/13/2013 02:37 AM, Nicholas Clark wrote:
>> On Thu, Sep 12, 2013 at 10:35:18PM -0400, Ricardo Signes wrote:
>>> * Karl Williamson [2013-09-11T22:57:24]
>>>> I've done some more thinking about this, and am presenting here my
>>>> current thoughts.
>>> Thanks for this.  I think I am on board with you, for the most part.
>>> Figuring out who the various I's are in DWIM is always a good exercise.
>> Having re-read the thread several times, I think I have my head round the
>> conflicting requirements and expectations. Thanks for taking the time
>> to explain them.
>>>> There are some flies in the ointment though.  I don't think \p{Any}
>>>> and \p{All} should match anything but strictly Unicode code points.
>>>> We already have a well established way to match all code points, and
>>>> that is to use the dot ".".  But I'm open to arguments the other
>>>> way.
>>> A dot might work, but then you're thinking about /s again.  But more to
>>> the point, I just wonder why you think those properties shouldn't match?
>>> Do you think users will be trying to exclude weird codepoints by using
>>> those?  I guess what I want to know is:  who is the DWIM person who
>>> benefits from excluding \x{FF_FFFF}?  Does he or she know that there may
>>> be trans-Unicode points incoming and want to exclude them?  If so, why
>>> not fatalize warnings, or also require Unicode explicitly?  Or are we
>>> protecting them from weird input?
>> I thought the same about "." and the /s flag.
>> To check - "All" and "Any" are Perl defined extensions, not Unicode
>> consortium? And they have always been documented both as (a) synonyms
>> (b) to match [\x{0000}-\x{10FFFF}] ?
>>>> But I'm tempted to move somewhat more towards the Perlish side of
>>>> things, and change the warning/error message so that it is raised
>>>> only when the result would be different under a Perlish vs Unicodish
>>>> regime.
>>> I think this makes sense.  Maybe I want to think about it more, or see
>>> whether somebody else has an objection. :)
>> So, if I have the summary correct:
>> There are three (obvious) ways to treat all code points outside of
>> Unicode's range \x{0000}-\x{10FFFF}:
>> 0) Just croak.
>>     (A very purist Unicode approach.)
>> 1) All Unicode property matches fail.
>>     (Which means that both \p{PROPERTY=false} and \p{PROPERTY=true} fail,
>>      even though that seems a contradiction.)
>> 2) The code points are treated as if they are unassigned Unicode code
>>     points.
>>     (Exactly one of \p{PROPERTY=false} and \p{PROPERTY=true} will match.)
>> And that your current favoured approach is to take the behaviour of (2),
>> but warn if it differs from the behaviour of (1).
>> Because
>> 1) this means that it's still viable to use out-of-range code points for
>>     "internal" purposes without generating so many warnings that they get
>>     turned off
>> 2) it permits a warning that is useful to leave on by default
>> 3) the warning can be made fatal for strict(er) behaviour
>> But what it doesn't (directly) offer is a way for a Unicode purist to
>> treat as fatal any attempt to match an out-of-range code point.
>> Nicholas Clark
> Having now implemented this, I have a couple of refinements to propose.
> The first addresses the Unicode purist.  Part of my concern has been
> that what's been available up to now didn't always raise the appropriate
> warnings.  This happens if the regex compiler optimizes the \p{}
> expression into something else.  This happens for example in
> \p{Line_Break=Line_Feed}.  This is optimized as if the backslash
> sequence \n had been in the pattern instead.  (Surprisingly many Unicode
> properties match a single code point, about 200 of them; all get
> optimized like the example.)  If one matches such a property against a
> non-Unicode code point, no warning is raised.  I have finally figured
> out a way to fix this without slowing down or complicating the execution
> code or requiring a new pragma.  I propose to check during regex
> compilation if the non-Unicode warnings are enabled and fatalized.  If
> so, the optimizations are skipped and the warnings will be enabled for
> all Unicode properties, not just the ones where the outcome is different
> than what a purist would expect.  Only a few lines of code need be added
> to regcomp.c to accomplish this.  A purist will get their desired
> behavior.  However, if the match was skipped entirely because the string
> is, say, too short, no warning would be raised.  But then, no match was
> attempted.  Note that the behavior is based on the lexical scope of the
> pattern compilation, rather than its ultimate execution.  This is
> already true of almost all aspects of regex execution.  A purist will
> get full purist behavior.  The remaining 99+% will get the
> already-agreed on changes that are more DWIM.  I'd rather not create a
> pragma to do this.

This is finally now in blead.
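
For anyone who wants to try the purist behavior, here is a minimal sketch.
It assumes a perl that includes the change described above.  The helper
name purist_match and the choice of \p{Lowercase} are only for
illustration.  The point to notice is that the pattern is compiled inside
the lexical scope in which the non_unicode category has been made fatal,
since that is the scope whose settings count.

```perl
use strict;
use warnings;

# Hypothetical helper: match one string against a Unicode property with the
# non_unicode warning category made fatal.  Returns (1, '') if the match was
# attempted without dying, or (0, $message) if the fatal warning fired.
sub purist_match {
    my ($string) = @_;

    # The pattern below is compiled in this lexical scope, so these are
    # the warning settings that apply to it.
    use warnings FATAL => 'non_unicode';

    my $ok = eval { my $matched = $string =~ /\p{Lowercase}/; 1 };
    return $ok ? (1, '') : (0, "$@");
}

my ($ok_in,  $err_in)  = purist_match("a");             # ordinary Unicode
my ($ok_out, $err_out) = purist_match(chr(0x110000));   # above Unicode

print "in range:     ", ($ok_in  ? "no exception" : "died: $err_in"),  "\n";
print "out of range: ", ($ok_out ? "no exception" : "died: $err_out"), "\n";
```

The first call should go through quietly, and the second should die with
the non-Unicode message.  Without the FATAL line, the expectation is that
the same match simply returns a result, with a warning only where the
outcome differs from what a purist would expect.
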
> The other refinement is one already alluded to earlier in this thread. I
> propose to only output the non-Unicode match warning at most once per
> pattern execution.  Thus backtracking would not cause the message to be
> output again and again; and if two properties were matched against
> above-Unicode code points during the same match instance, only the first
> would raise the warning.  If the pattern were matched in repeated
> executions in a loop, each time through could generate the warning.

I haven't done this yet.
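
To make the intent concrete, here is a toy model of the proposed
throttling, written in plain Perl rather than as regexec.c code.
Everything in it is made up for illustration: run_execution, the message
text, and the idea of passing in the list of code points that get tested
against a property (including repeats caused by backtracking).  It is not
how the engine is structured; it only shows the "at most once per
execution" rule.

```perl
use strict;
use warnings;

use constant MAX_UNICODE => 0x10FFFF;

# Hypothetical model of a single pattern execution.  Takes the code points
# tested against Unicode properties, in order, and returns the warning
# messages that execution would emit under the proposed rule.
sub run_execution {
    my (@code_points_tested) = @_;

    my $already_warned = 0;    # starts fresh for every execution
    my @messages;

    for my $cp (@code_points_tested) {
        next if $cp <= MAX_UNICODE;    # ordinary Unicode: nothing to say
        next if $already_warned;       # later hits in this execution stay silent

        push @messages,
            sprintf("non-Unicode code point 0x%X matched against a property", $cp);
        $already_warned = 1;
    }
    return @messages;
}

# One execution: backtracking retries 0x110000, then a second property
# sees 0x110001.  Only the first hit produces a message.
my @one = run_execution(0x110000, 0x110000, 0x41, 0x110001);

# The same pattern run three times in a loop: each execution may warn once.
my $total = 0;
for (1 .. 3) {
    my @messages = run_execution(0x110000, 0x110000);
    $total += @messages;
}

print scalar(@one), " message(s) in the single execution\n";
print "$total message(s) across three executions\n";
```
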

