On 08/29/2013 11:47 AM, Eric Brine wrote:
> On Mon, Aug 26, 2013 at 11:00 PM, Karl Williamson
> <public@khwilliamson.com> wrote:
>
> > The other option was to make \p{gc=unassigned} succeed for
> > non-Unicode code points. But this isn't what Unicode says. A
> > strict interpretation fails this because Unicode has never said that
> > a non-Unicode code point should be considered unassigned. But I now
> > believe it is more DWIM to consider them so.
>
> If you're worried that someone might want to distinguish
> Unicode-but-unassigned from non-Unicode, then you could extend gc to
> include gc=NonUnicode. However, I suspect such a distinction is
> rarely needed, so it's probably better to include non-Unicode code
> points in gc=unassigned, and let those who want to distinguish
> unassigned code points from non-Unicode code points use
> (?[ [\p{gc=unassigned}] - [\x{0}-\x{10FFFF}] ]) and
> (?[ [\p{gc=unassigned}] - [^\x{0}-\x{10FFFF}] ]).
> (Maybe provide \p{Unicode}?)
>
> In general, it's clear that non-Unicode code points should behave as a
> Unicode code point without the property. No more /\p{XXX}/ && /\P{XXX}/
> being true.
>

What to do then about \p{Any}? Unicode explicitly says it should match
[\x{0}-\x{10FFFF}]. Do we leave it like that, or should it be a synonym
for dot? I think the former. If we leave it alone, what about \p{All},
which is supposed to be a synonym for \p{Any}, but whose name seems to
indicate everything possible?

I've had some more insights since I posted, and have recalled more
about why the message is raised.

First of all, it's highly arguable, and Unicode would strongly make this
argument, that one should not be attempting to use Unicode properties on
non-Unicode code points; hence it is appropriate to raise a warning or
even die (like division by zero does) when you violate that.
Second, the current behavior is explained by the simple, intuitive
statement expressed in the warning: "for non-Unicode code points, all
\p{} fail; all \P{} succeed". Third, the differences between the current
behavior and the changed behavior apply to only a few property-value
combinations that are likely to occur. \p{BINARY_PROPERTY=false} for all
binary properties are examples of these differences, but are unlikely to
ever occur in practice. The most likely one to occur is \p{Unassigned}
or its synonyms like \p{gc=cn}, but there are others, such as
\p{Unknown} (though I bet most people would have to look up what this
one matches).

We could therefore change the warning message to be raised only when a
non-Unicode code point is matched against a property that has the
potentially counter-intuitive results, cutting down the frequency of its
occurrence significantly. I'm pretty confident that these would never
get optimized into something other than a regular property-matching
regnode, so my concern about dealing with this possibility goes away.

If we retained the current behavior, we would still have to decide
whether this message gets turned off after a certain number of them have
been output, besides the current ability to say
"no warnings 'non_unicode'".

What I'm hearing, though, is more sentiment in favor of changing
\p{Unassigned} to extend beyond the Unicode range. I'm fine with that.
We have sufficient weasel words in the pods, and the warning is raised
even for properties where you get the expected results, so we shouldn't
have to have a deprecation cycle, etc. The implications of this on
\p{Any} and \p{All} need to be addressed. I believe I understand the
other implications of making this change.

But we aren't going to be adding new General_Categories. What those are
is guaranteed to be immutable by Unicode, and playing with them could
cause algorithms not to be so easily transferred to Perl.
We shouldn't be getting on a slippery slope of disregarding Unicode's
decisions, even if they sometimes, like us, don't think things out fully
and find unintended consequences down the road.
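
To make the two policies discussed above concrete, here is a minimal
sketch of them as a small model. It is written in Python purely for
illustration and is not how Perl implements property matching. The
names general_category, match_p and match_P, and the policy labels
"current" and "proposed", are hypothetical. The model covers only
properties that can be expressed as a set of General_Category values,
and it deliberately leaves out \p{Any} and \p{All}, since the thread
leaves their treatment open.

```python
import unicodedata

MAX_UNICODE = 0x10FFFF  # highest code point defined by Unicode


def general_category(cp, policy):
    # Inside the Unicode range, use the real General_Category.
    if cp <= MAX_UNICODE:
        return unicodedata.category(chr(cp))
    # Above the range (assumption of this model):
    #   "proposed": treat the code point as unassigned (Cn)
    #   "current":  the code point has no property values at all
    return "Cn" if policy == "proposed" else None


def match_p(cp, categories, policy):
    # Models \p{gc=...}; `categories` is a set of gc values, e.g. {"Cn"}.
    gc = general_category(cp, policy)
    return gc is not None and gc in categories


def match_P(cp, categories, policy):
    # Models \P{gc=...} as the complement of \p{gc=...}.
    return not match_p(cp, categories, policy)


if __name__ == "__main__":
    # 'A', the last Unicode code point, and the first one above Unicode.
    for cp in (0x41, 0x10FFFF, 0x110000):
        for policy in ("current", "proposed"):
            print(hex(cp), policy,
                  match_p(cp, {"Cn"}, policy),
                  match_P(cp, {"Cn"}, policy))
```

In this model the two policies disagree for an above-Unicode code point
only when the requested set of categories includes Cn, which mirrors the
point that few property-value combinations are affected. Under
"current", every match_p call on such a code point fails and every
match_P call succeeds, as the warning text describes.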