Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode, all \p{} matches fail; all \P{} matches succeed"

From: Karl Williamson
Date: August 27, 2013 03:01
Subject: Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode, all \p{} matches fail; all \P{} matches succeed"
Message ID: 521C1664.2080405@khwilliamson.com

On 08/25/2013 12:30 AM, Aristotle Pagaltzis wrote:
> * Karl Williamson <public@khwilliamson.com> [2013-08-25 07:20]:
>> Thoughts?
>
> What other reasonable choice was there in the first place? Isn’t it kind
> of self-evident that codepoints that are not part of Unicode do not have
> Unicode properties? In short, is the warning here warning the programmer
> of something at all unexpected? It seems to me it isn’t, and so it also
> seems to me that mentioning in the documentation that this is indeed the
> chosen semantics should be perfectly sufficient.
>
> So assuming I haven’t overlooked any facts, I’ll agree that the warning
> should go.
>

It is coming back to me why I felt compelled to add this warning, as the 
perldiag description points out some counterintuitive cases.

First some background.  Unicode defines only properties of the form 
\p{PROPERTY=VALUE}.  The single forms like \p{PROPERTY} are all Perl 
extensions (though sanctioned in some of Unicode's texts) that mean the 
same as \p{PROPERTY=true}.  Thus the single forms are only valid for 
binary properties.

Most, but not all, Unicode properties are binary, and the values of all
properties are undefined on non-Unicode code points.  When I was
fixing how Perl handles Unicode properties, I chose to do a strict
interpretation of Unicode's rules.  In retrospect, I think that
decision leads to non-DWIM behavior and should be revisited, which I'm
starting to do in this email.  I made the decision when I was pretty
much first starting out, and did not have the confidence to say Unicode
could possibly be wrong.  Now that I know that they are very fallible,
and I have a much better understanding of the issues involved, I can say
that a very strict interpretation isn't absolutely necessary, contrary
to what I thought before.

I'll use a common non-binary property for illustration: 
"General_Category", often abbreviated to "gc".  The gc can be an 
uppercase letter, or a decimal digit, or a control, or a few other 
things.  But what the gc is for a non-Unicode code point is not defined 
by Unicode, or rather, is more-or-less explicitly undefined.  But in 
regex matching, Perl only has two states, "matches" and "doesn't match". 
  We don't have an undefined state, nor does it make sense to have it 
there.  What I chose to do is to make \p{gc=foo} fail for all "foo" when 
matching a non-Unicode code point.  The other option was to make 
\p{gc=unassigned} succeed for non-Unicode code points.  But this isn't 
what Unicode says.  A strict interpretation fails this because Unicode 
has never said that a non-Unicode code point should be considered 
unassigned.  But I now believe it is more DWIM to consider them so.
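
To make the two options concrete, here is a minimal sketch that models
them outside of Perl.  It is written in Python only because its
unicodedata module exposes the General_Category directly; the function
names and the "lenient" flag are made up for illustration and are not
Perl's implementation.

```python
import unicodedata

MAX_UNICODE = 0x10FFFF  # highest code point that Unicode defines


def gc_of(cp):
    # Return the General_Category abbreviation (e.g. "Lu", "Nd", "Cn"),
    # or None for a code point above Unicode, where it is undefined.
    if cp > MAX_UNICODE:
        return None
    return unicodedata.category(chr(cp))


def matches_gc(cp, value, lenient=False):
    # Model of \p{gc=value} applied to code point cp.
    #   lenient=False: strict reading, every gc test fails above Unicode.
    #   lenient=True:  above-Unicode code points count as Unassigned ("Cn").
    gc = gc_of(cp)
    if gc is None:
        return lenient and value == "Cn"
    return gc == value
```

Under the strict reading matches_gc(0x110000, "Cn") is false; with
lenient=True it is true, while every other gc value still fails.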

All Unicode properties have a default fall-back value for code points 
not explicitly having some other value.  For the gc property, it is 
"Unassigned"; for the Script property, it is "Unknown"; for the Block 
property, it is "No Block"; for the Uppercase_Mapping property, it is 
the code point itself; etc.  Perl could change to make the fall-back 
value be what happens for non-Unicode code points.  This, I believe, is 
more DWIM.
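
The fall-back idea can be sketched as a lookup that returns the
property's default whenever a code point has no explicit value.  This is
a rough Python model, not Perl internals: the DEFAULTS table only
repeats the values named above, and the "explicit" dictionary is a
hypothetical stand-in for the real Unicode data files.

```python
# Default values as listed in this message (illustrative subset).
DEFAULTS = {
    "General_Category": "Unassigned",
    "Script": "Unknown",
    "Block": "No Block",
}


def property_value(prop, cp, explicit):
    # explicit maps (property name, code point) to an assigned value.
    if (prop, cp) in explicit:
        return explicit[(prop, cp)]
    # Uppercase_Mapping defaults to the code point itself.
    if prop == "Uppercase_Mapping":
        return cp
    # Everything else falls back to the per-property default, which is
    # what a non-Unicode code point would get under this proposal.
    return DEFAULTS[prop]


# Hypothetical sample data: only two explicit assignments.
explicit = {
    ("Script", 0x41): "Latin",
    ("Uppercase_Mapping", 0x61): 0x41,
}
```

With this data, property_value("Script", 0x110000, explicit) gives
"Unknown" and property_value("Uppercase_Mapping", 0x110000, explicit)
gives 0x110000 back unchanged.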

The reason I didn't do this, besides wanting to follow Unicode very
strictly, is that there is a complication.  Consider the Perl extension
\p{Unassigned}, which is the same as \p{gc=Unassigned}.  Currently these
match 864_348 code points.  If we changed the decision I made, these
would now match billions of code points.

So it isn't clear-cut what the decision should be.  A more glaring
example of the current scheme's non-DWIM behavior, though, is that
\p{PROPERTY=false} and \p{PROPERTY=true} both evaluate to "false" (that
is, both fail to match) for all non-Unicode code points.
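
A small model of the current strict scheme shows the oddity for a binary
property.  This is a Python sketch with made-up helper names; None
stands for "undefined", which is what a non-Unicode code point has for
every property.

```python
def p(value_at_cp, wanted):
    # Model of \p{PROPERTY=wanted}: an undefined value never matches.
    if value_at_cp is None:
        return False
    return value_at_cp == wanted


def P(value_at_cp, wanted):
    # Model of \P{PROPERTY=wanted}: the plain complement of p().
    return not p(value_at_cp, wanted)


undefined = None  # property value of a non-Unicode code point

# Both "=true" and "=false" fail, and both complements succeed.
strict_results = {
    "p_true": p(undefined, True),
    "p_false": p(undefined, False),
    "P_true": P(undefined, True),
    "P_false": P(undefined, False),
}
```

For an ordinary Unicode code point exactly one of p(v, True) and
p(v, False) holds; for a non-Unicode code point neither does.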

The text in perldiag now points out very briefly what I've described in 
much more depth above.

Neither way of doing things is perfect.  Before, I thought that the way
I implemented it was the least bad way; now I'm not so sure.
