perl.perl5.porters | Postings from August 2013

Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode, all \p{} matches fail; all \P{} matches succeed"

From: Karl Williamson
Date: August 29, 2013 19:00
Subject: Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode, all \p{} matches fail; all \P{} matches succeed"
Message ID: 521F9A4A.7080508@khwilliamson.com

On 08/29/2013 11:47 AM, Eric Brine wrote:
> On Mon, Aug 26, 2013 at 11:00 PM, Karl Williamson
> <public@khwilliamson.com> wrote:
>
>     The other option was to make \p{gc=unassigned} succeed for
>     non-Unicode code points.  But this isn't what Unicode says.  A
>     strict interpretation fails this because Unicode has never said that
>     a non-Unicode code point should be considered unassigned.  But I now
>     believe it is more DWIM to consider them so.
>
>
> If you're worried that someone might want to distinguish
> Unicode-but-unassigned from non-Unicode, then you could extend gc to
> include gc=NonUnicode. However, I suspect such a distinction is
> rarely needed, so it's probably better to include non-Unicode code
> points in gc=unassigned, and let those who want to distinguish
> unassigned code points from non-Unicode code points use (?[
> [\p{gc=unassigned}] - [\x{0}-\x{10FFFF}] ]) and (?[ [\p{gc=unassigned}] -
> [^\x{0}-\x{10FFFF}] ]). (Maybe provide \p{Unicode}?)
>
> In general, it's clear that non-Unicode code points should behave as a
> Unicode code point without the property. No more /\p{XXX}/ && /\P{XXX}/
> being true.
>

What to do then about \p{Any}?  Unicode explicitly says it should match
[\x{0}-\x{10FFFF}].  Do we leave it like that, or should it be a synonym
for dot?  I think the former.

If we leave it alone, what about \p{All}, which is supposed to be a 
synonym for \p{Any}, but whose name seems to indicate everything possible?

I've had some more insights since I posted things, and have recalled 
more as to why the message is raised.

First of all, it's highly arguable, and Unicode would strongly make this
argument, that one should not be attempting to use Unicode properties on
non-Unicode code points; hence it is appropriate to raise a warning or
even die (like division by zero does) when you violate that.

Second, the current behavior is explained by the simple intuitive 
statement expressed in the warning: "for non-Unicode code points, all 
\p{} fail; all \P{} succeed".

Third, the differences between the current behavior and the changed
behavior affect only a few of the property-value combinations that are
likely to occur.  \p{BINARY_PROPERTY=false}, for any binary property,
is an example of such a difference, but is unlikely ever to occur in
practice.

The most likely one to occur is \p{Unassigned} or its synonyms like
\p{gc=cn}, but there are others, such as \p{Unknown} (though I bet most
people would have to look up what this one matches).  We could therefore
change the warning so that it is raised only when a non-Unicode code
point is matched against a property that gives potentially
counter-intuitive results, cutting down the frequency of its occurrence
significantly.  I'm pretty confident that these would never get
optimized into something other than a regular property matching regnode,
so my concern about dealing with this possibility goes away.  If we
retained the current behavior, we would still have to decide whether
this message gets turned off after a certain number of them have been
output, in addition to the current ability to say
"no warnings 'non_unicode'".
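
To make the counter-intuitive case concrete, here is a toy model of the
two behaviors for \p{Unassigned} and \p{Assigned}.  It is written in
Python and is only a sketch under stated assumptions: it is not how the
Perl regex engine is implemented, the function names are made up for
illustration, and which code points count as Cn depends on the Unicode
version bundled with the Python in use.

```python
import unicodedata

MAX_UNICODE = 0x10FFFF  # highest code point that Unicode defines


def gc_is_cn(cp):
    # True if a Unicode code point has General_Category Cn (unassigned).
    return unicodedata.category(chr(cp)) == "Cn"


def matches_current(cp, want_unassigned, negated=False):
    # Hypothetical model of the rule stated in the warning: above the
    # Unicode range every \p{} fails and every \P{} succeeds.
    if cp > MAX_UNICODE:
        return negated
    return (gc_is_cn(cp) == want_unassigned) != negated


def matches_proposed(cp, want_unassigned, negated=False):
    # Hypothetical model of the proposal: code points above the Unicode
    # range are treated as if they were unassigned.
    is_cn = True if cp > MAX_UNICODE else gc_is_cn(cp)
    return (is_cn == want_unassigned) != negated


above = 0x110000  # first code point beyond Unicode

# Current rule: neither \p{Unassigned} nor \p{Assigned} matches.
print(matches_current(above, True), matches_current(above, False))

# Proposed rule: \p{Unassigned} matches, \p{Assigned} does not.
print(matches_proposed(above, True), matches_proposed(above, False))
```

Under the modeled current rule the two complementary properties both
fail for the same code point, which is the surprise; under the modeled
proposal exactly one of them matches.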

What I'm hearing, though, is more sentiment in favor of changing
\p{Unassigned} to extend beyond the Unicode range.  I'm fine with
that.  We have sufficient weasel words in the pods, and the warning is
raised even for properties where you get the expected results, so we
shouldn't have to have a deprecation cycle, etc.  The implications of
this on \p{Any} and \p{All} need to be addressed.  I believe I
understand the other implications of making this change.

But we aren't going to be adding new General_Categories.  What those are 
is guaranteed to be immutable by Unicode, and playing with them could 
cause algorithms not to be so easily transferred to Perl.  We shouldn't 
be getting on a slippery slope of disregarding Unicode's decisions, even 
if they sometimes, like us, don't think things out fully and find 
unintended consequences down the road.

