Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode,all \\p{} matches fail; all \\P{} matches succeed"

Karl Williamson
September 12, 2013 02:57
On 08/29/2013 01:00 PM, Karl Williamson wrote:
> On 08/29/2013 11:47 AM, Eric Brine wrote:
>> On Mon, Aug 26, 2013 at 11:00 PM, Karl Williamson wrote:
>>     The other option was to make \p{gc=unassigned} succeed for
>>     non-Unicode code points.  But this isn't what Unicode says.  A
>>     strict interpretation fails this because Unicode has never said that
>>     a non-Unicode code point should be considered unassigned.  But I now
>>     believe it is more DWIM to consider them so.
>> If you're worried that someone might want to distinguish
>> Unicode-but-unassigned from non-Unicode, then you could extend gc to
>> include gc=NonUnicode. However, I suspect such a distinction is
>> rarely needed, so it's probably better to include non-Unicode code
>> points in gc=unassigned, and let those who want to distinguish
>> unassigned code points from non-Unicode code points use (?[
>> [\p{gc=unassigned}] - [\x{0}-\x{10FFFF}] ]) and (?[ [\p{gc=unassigned}] -
>> [^\x{0}-\x{10FFFF}] ]). (Maybe provide \p{Unicode}?)
>> In general, it's clear that non-Unicode code points should behave as a
>> Unicode code point without the property. No more /\p{XXX}/ && /\P{XXX}/
>> being true.
> What to do then about \p{Any} ?  Unicode explicitly says it should match
> [0-\x{10FFFF}].  Do we leave it like that, or should it be a synonym for
> dot?  I think the former.
> If we leave it alone, what about \p{All}, which is supposed to be a
> synonym for \p{Any}, but whose name seems to indicate everything possible?
> I've had some more insights since I posted things, and have recalled
> more as to why the message is raised.
> First of all, it's highly arguable, and Unicode would strongly make this
> argument, that one should not be attempting to use Unicode properties on
> non-Unicode code points; hence it is appropriate to raise a warning or
> even die (like division by zero does) when you violate that.
> Second, the current behavior is explained by the simple intuitive
> statement expressed in the warning: "for non-Unicode code points, all
> \p{} fail; all \P{} succeed".
> Third, the differences in behavior between the current behavior and
> changing it, apply to only a few property-value combinations that are
> likely to occur.  \p{BINARY_PROPERTY=false} for all binary properties
> are examples of these differences, but are unlikely to ever occur in
> practice.
> The most likely one to occur is \p{Unassigned} or its synonyms like
> \p{gc=cn}, but there are others, such as \p{Unknown} (though I bet most
> people would have to look up what this one matches).  We could therefore
> change the warning message to be raised only when a non-Unicode code
> point is matched against a property that has the potentially
> counter-intuitive results, cutting down the frequency of its occurrence
> significantly.  I'm pretty confident that these would never get
> optimized into something other than a regular property matching regnode,
> so my concern about dealing with this possibility goes away.  If we
> retained the current behavior, we still would have to decide if this
> message gets turned off after a certain number of them being output,
> besides the current ability to say "no warnings 'non_unicode'".
> What I'm hearing though is more sentiment in favor of changing
> \p{Unassigned} to extend beyond the Unicode range.  I'm fine with
> that.  We have sufficient weasel words in the pods, and the warning
> raised even for properties where you get the expected results, that we
> shouldn't have to have a deprecation cycle, etc.  The implications of
> this on \p{Any} and \p{All} need to be addressed.  I believe I
> understand the other implications of making this change.
> But we aren't going to be adding new General_Categories.  What those are
> is guaranteed to be immutable by Unicode, and playing with them could
> cause algorithms not to be so easily transferred to Perl.  We shouldn't
> be getting on a slippery slope of disregarding Unicode's decisions, even
> if they sometimes, like us, don't think things out fully and find
> unintended consequences down the road.

I've done some more thinking about this, and am presenting here my 
current thoughts.

First, I'll briefly summarize an experience I had several decades ago. 
You can skip the rest of this paragraph if not interested, though I'm 
cutting out much of the details.  I designed and wrote the software for 
a product (using K&R C; cross-compiled for a 16-bit Unix machine with a 
40MB disk; not very much memory).  A portion of it worked in a 
particular way, which I'll call X.  I wrote it that way because it was DWIM.  The 
product was successful enough to warrant a second release, with expanded 
functionality to serve hotel uses.  At this point my company decided to 
become more rigorous, and involved someone to examine and dictate how it 
should really work.  He did not like X, and insisted behavior Y was the 
only proper thing to do.  As a result, I came to realize that some 
people were wired to prefer X, and some to prefer Y.  (I don't think he 
ever realized that the alternative to Y was in any way valid.)  Because 
of his position in the company versus mine, the next version of 
the product changed behavior to Y.  (I felt somewhat vindicated that 
this change led to our getting sued by a famous singer, then in the 
twilight of his career, staying at a hotel with behavior Y, but who 
expected X, and hence missed his gig.  But it didn't change anybody's 
mind: Y stayed.)

The bottom line of that experience is that I realized that DWIM can 
depend on who "I" is, and there are potentially multiple valid 
definitions of DWIM.  I believe that the topic of this thread is one of 
those situations where two different behaviors have substantial validity.

As I said in the message included above, people coming from primarily a 
Unicode background would be surprised at how we do things.  As evidence, 
a recent update to Unicode::Collate added this pod text:

     Perl seems to allow out-of-range values (greater than 0x10FFFF).

(It also turns out that collation for these has been a security issue in 
the Unicode 6 series, supposed to be fixed in the new 6.3 to be 
released, probably in October.)

Some such people think we shouldn't allow what we already do.  They 
would have us croak on any use of an above-Unicode code point, and not 
support such code points at all, even for non-Unicode operations.

On the other hand, some people coming from more of a Perl background 
want to allow more than we do.

And some people with a strong background in both Perl and Unicode are 
conflicted.  As evidence, Tom C privately wrote me that the solution to 
this "is a puzzle".

Each approach has validity.  I believe that the behavior should change 
to allow both to be accommodated as far as possible.

That means that the warning should be kept.  It is in the "non_unicode" 
subclass of utf8 warnings.  That means that someone who wants strict 
Unicode behavior can make fatal just these warnings.

It also means the behavior needs to change so that the above-Unicode 
code points generally are treated as unassigned Unicode code points. 
This accommodates a more Perlish behavior, and those people can just 
turn off that warning subclass.

So you get both options available to you.  Turn off the warning for 
Perlish; make it fatal for Unicodish.
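
A minimal sketch of those two modes, assuming a perl that has the 'non_unicode' warning subclass (5.14 or later, as far as I know). The sub names match_quietly and match_strictly are made up for illustration. The match results are printed rather than predicted, because whether \p{Unassigned} matches here, and whether the strict form dies, depends on the perl version and on how this thread gets resolved.

```perl
use strict;
use warnings;

# One past the last Unicode code point; built quietly in case chr() warns.
my $above = do { no warnings 'utf8'; chr(0x110000) };

# "Perlish": turn the subclass off and just take the match result.
sub match_quietly {
    my ($str) = @_;
    no warnings 'non_unicode';
    return $str =~ /\p{Unassigned}/ ? 1 : 0;
}

# "Unicodish": promote the subclass to an exception and trap it.
sub match_strictly {
    my ($str) = @_;
    use warnings FATAL => 'non_unicode';
    my $matched = eval { $str =~ /\p{Unassigned}/ ? 1 : 0 };
    return defined $matched ? $matched : "died: $@";
}

my @warnings;
{
    # Collect anything that still gets warned about during the quiet match.
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    print "quiet:  ", match_quietly($above), "\n";
}
print "warnings seen while quiet: ", scalar(@warnings), "\n";
print "strict: ", match_strictly($above), "\n";
```

Both settings are lexically scoped, so a module can choose one regime without affecting its callers.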

There are some flies in the ointment though.  I don't think \p{Any} and 
\p{All} should match anything but strictly Unicode code points.  We 
already have a well established way to match all code points, and that 
is to use the dot ".".  But I'm open to arguments the other way.
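
To make the distinction concrete, here is a small probe, offered as a sketch only. The dot (with /s) and an explicit range behave the same regardless of how this question is settled; what \p{Any} and \p{All} do with an above-Unicode code point is exactly the open question, so the script reports their results without assuming them.

```perl
use strict;
use warnings;
no warnings 'non_unicode';    # keep the probe quiet whatever the perl version

my $above = chr(0x110000);    # above the Unicode maximum of 0x10FFFF

# With /s the dot matches any single character, whatever its code point.
my $dot_matches = $above =~ /\A.\z/s ? 1 : 0;

# An explicit range says "strictly Unicode" without relying on \p{Any}.
my $in_unicode_range = $above =~ /\A[\x{0}-\x{10FFFF}]\z/ ? 1 : 0;

# What these two do here is the point under discussion; just report it.
my $any = $above =~ /\A\p{Any}\z/ ? 1 : 0;
my $all = $above =~ /\A\p{All}\z/ ? 1 : 0;

print "dot: $dot_matches  range: $in_unicode_range  Any: $any  All: $all\n";
```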

The other fly is that the warning message currently does not get 
generated if the regular expression construct for the property gets 
optimized into something else.  Surprisingly many (almost 200, including 
synonyms) Unicode properties match only a single character.  These all get 
optimized.  An example is \p{Line Break=CR}, which matches just a 
carriage return.
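
The single-character claim for that example can be checked by brute force. This sketch scans every Unicode code point and collects those that match; it uses the underscore spelling Line_Break=CR, and turns off utf8 warnings only because the scan passes through surrogates and noncharacters.

```perl
use strict;
use warnings;
no warnings 'utf8';    # the scan passes through surrogates and noncharacters

# Collect every Unicode code point whose Line_Break property value is CR.
my @matches;
for my $cp (0 .. 0x10FFFF) {
    push @matches, $cp if chr($cp) =~ /\p{Line_Break=CR}/;
}

printf "%d code point(s) match:%s\n",
    scalar(@matches), join('', map { sprintf ' U+%04X', $_ } @matches);
# prints "1 code point(s) match: U+000D"
```

The scan says nothing about how the regex compiler optimizes the property; it only confirms that the set being matched has one member.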

The code could be changed so that the message gets generated for all 
these, but there is added complexity, and it would slightly slow down 
regular expression matching in general, for what is an edge case.  But it 
may indeed be what we should do.

But I'm tempted to move somewhat more towards the Perlish side of 
things, and change the warning/error message so that it is raised only 
when the result would be different under a Perlish vs Unicodish regime. 
The reason this is moving away from a Unicodish interpretation is that 
you wouldn't get a warning that you can make fatal when using 
non-Unicode code points -- unless the answer is different than what you 
would get under strict Unicode rules.  Most real world uses of these 
give the same result under both; the principal exception has already 
been mentioned in this thread: \p{Unassigned}.  This would mean the 
Perlish people would get many fewer instances where the message is 
raised, and might never encounter it under normal uses of above-Unicode 
code points.  But again, it would mean that the Unicodish people 
wouldn't get a chance to catch all such uses.
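
Code that has to give the same answer under either regime can sidestep \p{Unassigned} for above-Unicode values by testing the range first. A minimal sketch, in which classify is a hypothetical helper and not anything Perl provides:

```perl
use strict;
use warnings;

# Hypothetical helper: classify a code point without depending on how
# \p{Unassigned} treats values above 0x10FFFF.
sub classify {
    my ($cp) = @_;
    return 'above-Unicode' if $cp > 0x10FFFF;    # decided by range alone
    no warnings 'utf8';                           # noncharacters are fine here
    return chr($cp) =~ /\p{Unassigned}/ ? 'unassigned' : 'assigned';
}

# U+0041 is a letter, U+FDD0 is a noncharacter (permanently gc=Cn),
# and 0x110000 is just past the Unicode range.
printf "U+%04X => %s\n", $_, classify($_) for 0x41, 0xFDD0, 0x110000;
```

Because the property is only ever consulted for code points within the Unicode range, neither the warning nor the choice of regime comes into play.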
