perl.perl5.porters | Postings from November 2013

Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode, all \p{} matches fail; all \P{} matches succeed"

Karl Williamson
November 28, 2013 04:53
On 09/13/2013 02:37 AM, Nicholas Clark wrote:
> On Thu, Sep 12, 2013 at 10:35:18PM -0400, Ricardo Signes wrote:
>> * Karl Williamson <> [2013-09-11T22:57:24]
>>> I've done some more thinking about this, and am presenting here my
>>> current thoughts.
>> Thanks for this.  I think I am on board with you, for the most part.
>> Figuring out who the various I's are in DWIM is always a good exercise.
> Having re-read the thread several times, I think I have my head round the
> conflicting requirements and expectations. Thanks for taking the time
> to explain them.
>>> There are some flies in the ointment though.  I don't think \p{Any}
>>> and \p{All} should match anything but strictly Unicode code points.
>>> We already have a well established way to match all code points, and
>>> that is to use the dot ".".  But I'm open to arguments the other
>>> way.
>> A dot might work, but then you're thinking about /s again.  But more to the
>> point, I just wonder why you think those properties shouldn't match?  Do you
>> think users will be trying to exclude weird codepoints by using those?  I guess
>> what I want to know is:  who is the DWIM person who benefits from excluding
>> \x{FF_FFFF}?  Does he or she know that there may be trans-Unicode points
>> incoming and want to exclude them?  If so, why not fatalize warnings, or
>> also require Unicode explicitly?  Or are we protecting them from weird input?
> I thought the same about "." and the /s flag.
> To check - "All" and "Any" are Perl defined extensions, not Unicode
> consortium? And they have always been documented both as (a) synonyms
> (b) to match [\x{0000}-\x{10FFFF}] ?
>>> But I'm tempted to move somewhat more towards the Perlish side of
>>> things, and change the warning/error message so that it is raised
>>> only when the result would be different under a Perlish vs Unicodish
>>> regime.
>> I think this makes sense.  Maybe I want to think about it more, or see whether
>> somebody else has an objection. :)
> So, if I have the summary correct
> There are three (obvious) ways to treat all code points outside of Unicode's
> range \x{0000}-\x{10FFFF}
> 0) Just croak.
>     (A very purist Unicode approach.)
> 1) All Unicode property matches fail.
>     (Which means that both \p{PROPERTY=false} and \p{PROPERTY=true} fail,
>      even though that seems a contradiction.)
> 2) The code points are treated as if they are unassigned Unicode code points.
>     (Exactly one of \p{PROPERTY=false} and \p{PROPERTY=true} will match.)
> And that your current favoured approach is to take the behaviour of (2),
> but warn if it differs from the behaviour of (1).
> Because
> 1) this means that it's still viable to use out-of-range code points for
>     "internal" purposes without generating so many warnings that they get
>     turned off
> 2) it permits a warning that is useful to leave on by default
> 3) the warning can be made fatal for strict(er) behaviour
> But what it doesn't (directly) offer is a way for a Unicode purist to treat
> as fatal any attempt to match an out-of-range code point.
> Nicholas Clark
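
The "behaviour of (2), but warn if it differs from (1)" rule in the
summary above can be modelled in a few lines of Perl. This is only a
sketch of the decision, not the regex engine's code; the function names
and the $matches_unassigned flag are assumptions made up for
illustration.

```perl
use strict;
use warnings;

# Regime (1): for a non-Unicode code point every \p{} fails and every
# \P{} succeeds.  $negated is true for \P{...}, false for \p{...}.
sub result_regime_1 {
    my ($negated) = @_;
    return $negated ? 1 : 0;
}

# Regime (2): the code point is treated like an unassigned Unicode code
# point.  $matches_unassigned (hypothetical flag) says whether the
# property matches such a code point.
sub result_regime_2 {
    my ($matches_unassigned, $negated) = @_;
    my $hit = $matches_unassigned ? 1 : 0;
    return $negated ? 1 - $hit : $hit;
}

# Warn only when the two regimes would give different answers.
sub should_warn {
    my ($matches_unassigned, $negated) = @_;
    return result_regime_1($negated)
        != result_regime_2($matches_unassigned, $negated) ? 1 : 0;
}

# A property that matches unassigned code points (e.g. \p{Unassigned})
# differs between the regimes; one that does not (e.g. \p{Uppercase})
# gives the same answer under both.
printf "matches unassigned: warn=%d\n", should_warn(1, 0);
printf "does not match:     warn=%d\n", should_warn(0, 0);
```

In this model the warning depends only on whether the property matches
unassigned code points, not on whether \p{} or \P{} was written.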

Having now implemented this, I have a couple of refinements to propose.

The first addresses the Unicode purist.  Part of my concern has been
that what has been available up to now doesn't always raise the
appropriate warnings.  That happens when the regex compiler optimizes
the \p{} expression into something else.  For example,
\p{Line_Break=Line_Feed} is optimized as if the backslash sequence \n
had been in the pattern instead.  (Surprisingly many Unicode properties,
about 200 of them, match a single code point; all get optimized like
the example.)  If one matches such a property against a non-Unicode
code point, no warning is raised.

I have finally figured out a way to fix this without slowing down or
complicating the execution code or requiring a new pragma.  I propose
to check during regex compilation whether the non-Unicode warnings are
both enabled and fatalized.  If so, the optimizations are skipped and
the warnings are enabled for all Unicode properties, not just the ones
where the outcome differs from what a purist would expect.  Only a few
lines of code need be added to regcomp.c to accomplish this.

Two things to note.  If the match is skipped entirely because the
string is, say, too short, no warning is raised; but then, no match was
attempted.  And the behavior is based on the lexical scope of the
pattern's compilation rather than of its eventual execution, which is
already true of almost all aspects of regex execution.

A purist will get full purist behavior.  The remaining 99+% will get
the already-agreed-on changes that are more DWIM.  I'd rather not
create a pragma to do this.
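
As a rough illustration of how a purist might opt in, the sketch below
fatalizes the non_unicode warnings category in the scope where the
pattern is compiled and then matches from a scope where it is not
fatal.  The helper name purist_pattern is made up for the example.
Which outcome is printed depends on the perl in use, since this message
only proposes the behavior; nothing here should be read as a statement
of what any particular release does.

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern is compiled inside a lexical scope
# where non-Unicode warnings are fatal, as a purist would want.
sub purist_pattern {
    use warnings FATAL => 'non_unicode';
    return qr/\p{Line_Break=Line_Feed}/;
}

my $re            = purist_pattern();
my $above_unicode = chr(0x110000);    # first code point past U+10FFFF

# The match runs out here, where the warning is not fatal.  Under the
# proposal, the scope of compilation is what would count.
my $outcome = eval { $above_unicode =~ $re ? 'matched' : 'no match' };
$outcome = "died: $@" unless defined $outcome;

print "$outcome\n";
```

Wrapping the match in eval keeps a fatalized warning from ending the
program, so the result can be inspected either way.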

The other refinement is one already alluded to earlier in this thread. 
I propose to only output the non-Unicode match warning at most once per 
pattern execution.  Thus backtracking would not cause the message to be 
output again and again; and if two properties were matched against 
above-Unicode code points during the same match instance, only the first 
would raise the warning.  If the pattern were matched in repeated 
executions in a loop, each time through could generate the warning.
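
The once-per-execution rule can be modelled as a flag that is reset at
the start of each match.  This is a toy model rather than engine code:
run_match, its arguments, and the message text are placeholders
invented for the sketch.

```perl
use strict;
use warnings;

# Model of one pattern execution.  $code_points lists the code points
# that get tested against some Unicode property during that execution
# (repeats stand in for backtracking); warnings are pushed onto $sink.
sub run_match {
    my ($code_points, $sink) = @_;
    my $warned = 0;                    # reset for every execution
    for my $cp (@$code_points) {
        next if $cp <= 0x10FFFF;       # ordinary Unicode code point
        next if $warned;               # already warned in this execution
        # Placeholder wording, not the real diagnostic text.
        push @$sink, sprintf("non-Unicode code point 0x%X matched", $cp);
        $warned = 1;
    }
    return;
}

# One execution that sees above-Unicode code points several times
# warns once, and only for the first one.
my @single;
run_match([0x110000, 0x110000, 0x41, 0x110001], \@single);

# The same pattern executed three times in a loop may warn three times.
my @looped;
run_match([0x110000], \@looped) for 1 .. 3;

printf "single execution: %d warning(s)\n", scalar @single;
printf "three executions: %d warning(s)\n", scalar @looped;
```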
