On 08/25/2013 12:30 AM, Aristotle Pagaltzis wrote:
> * Karl Williamson <public@khwilliamson.com> [2013-08-25 07:20]:
>> Thoughts?
>
> What other reasonable choice was there in the first place? Isn’t it kind
> of self-evident that codepoints that are not part of Unicode do not have
> Unicode properties? In short, is the warning here warning the programmer
> of something at all unexpected? It seems to me it isn’t, and so it also
> seems to me that mentioning in the documentation that this is indeed the
> chosen semantics should be perfectly sufficient.
>
> So assuming I haven’t overlooked any facts, I’ll agree that the warning
> should go.
>

It is coming back to me why I felt compelled to add this warning, as the
perldiag description points out some counterintuitive cases.

First some background. Unicode defines only properties of the form
\p{PROPERTY=VALUE}. The single forms like \p{PROPERTY} are all Perl
extensions (though sanctioned in some of Unicode's texts) that mean the
same as \p{PROPERTY=true}. Thus the single forms are only valid for
binary properties. Most, but not all, Unicode properties are binary, and
the values for all properties are all undefined on non-Unicode code
points.

When I was fixing how Perl handles Unicode properties, I chose to do a
strict interpretation of Unicode's rules. In retrospect, I'm thinking
that decision leads to non-DWIM, and should be revisited, which I'm
starting in this email. I made the decision when I was pretty much first
starting out, and did not have the confidence to say Unicode could
possibly be wrong. Now that I know that they are very fallible, and I
have a much better understanding of the issues involved, I can say that
a very strict interpretation isn't absolutely necessary, unlike what I
thought before.

I'll use a common non-binary property for illustration:
"General_Category", often abbreviated to "gc". The gc can be an
uppercase letter, or a decimal digit, or a control, or a few other
things.
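
To make the two spellings concrete, here is a minimal sketch that sticks
to ordinary Unicode code points. The sample characters are arbitrary
choices for illustration, and it assumes the Uppercase, gc=Lu and gc=Nd
property names as listed in perluniprops.

```perl
use strict;
use warnings;

# \p{Uppercase} is the single form; \p{Uppercase=True} is the
# PROPERTY=VALUE form. Both are checked against the same characters.
for my $char ("A", "a", "5") {
    my $single = $char =~ /\p{Uppercase}/      ? 1 : 0;
    my $full   = $char =~ /\p{Uppercase=True}/ ? 1 : 0;
    printf "U+%04X  single=%d  full=%d\n", ord($char), $single, $full;
}

# General_Category is not binary, so its value is spelled out here.
print "A is gc=Lu\n" if "A" =~ /\p{gc=Lu}/;    # Lu = Uppercase_Letter
print "5 is gc=Nd\n" if "5" =~ /\p{gc=Nd}/;    # Nd = Decimal_Number
```

For each character the "single" and "full" columns should agree, since
the two forms are meant to be synonyms.
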
But what the gc is for a non-Unicode code point is not defined by
Unicode, or rather, is more-or-less explicitly undefined. But in regex
matching, Perl only has two states, "matches" and "doesn't match". We
don't have an undefined state, nor does it make sense to have it there.
What I chose to do is to make \p{gc=foo} fail for all "foo" when
matching a non-Unicode code point. The other option was to make
\p{gc=unassigned} succeed for non-Unicode code points. But this isn't
what Unicode says. A strict interpretation fails this because Unicode
has never said that a non-Unicode code point should be considered
unassigned. But I now believe it is more DWIM to consider them so.

All Unicode properties have a default fall-back value for code points
not explicitly having some other value. For the gc property, it is
"Unassigned"; for the Script property, it is "Unknown"; for the Block
property, it is "No Block"; for the Uppercase_Mapping property, it is
the code point itself; etc. Perl could change to make the fall-back
value be what happens for non-Unicode code points. This, I believe, is
more DWIM.

The reason I didn't do this, besides wanting to follow Unicode very
strictly, is that there is a complication. Consider the Perl extension
\p{Unassigned}, which is the same as \p{gc=Unassigned}. Currently these
match 864_348 code points. If we changed the decision I made, these
would now match billions of code points. So it isn't clear-cut what the
decision should be.

A more glaring example of the non-DWIM in the current scheme, though, is
that \p{PROPERTY=false} and \p{PROPERTY=true} both evaluate to "false"
for all non-Unicode code points.

The text in perldiag now points out very briefly what I've described in
much more depth above. The situation isn't perfect for either way we do
things. Before, I thought that the way I implemented it was the least
bad way; now I'm not so sure.
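
For anyone who wants to see which behavior their own perl has, here is a
small probe. It is only a sketch: the code point 0x110000 and the four
property spellings are my choices for illustration, and no expected
output is given because the results depend on which of the two semantics
the running perl implements.

```perl
use strict;
use warnings;
no warnings 'utf8';    # silence warnings about the above-Unicode code point

my $above = chr(0x110000);    # first code point past the Unicode maximum

# Each property form discussed above, precompiled for the probe.
my %probe = (
    'gc=Unassigned'   => qr/\p{gc=Unassigned}/,
    'Unassigned'      => qr/\p{Unassigned}/,
    'Uppercase=True'  => qr/\p{Uppercase=True}/,
    'Uppercase=False' => qr/\p{Uppercase=False}/,
);

for my $name (sort keys %probe) {
    printf "%-16s %s\n", $name,
        $above =~ $probe{$name} ? "matches" : "fails";
}
```

Under the strict scheme all four lines should say "fails", including
both the true and the false form of Uppercase. Under the fall-back
scheme the two Unassigned forms and Uppercase=False would match instead.
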