On 09/13/2013 02:37 AM, Nicholas Clark wrote:
> On Thu, Sep 12, 2013 at 10:35:18PM -0400, Ricardo Signes wrote:
>> * Karl Williamson <public@khwilliamson.com> [2013-09-11T22:57:24]
>>> I've done some more thinking about this, and am presenting here my
>>> current thoughts.
>>
>> Thanks for this.  I think I am on board with you, for the most part.
>>
>> Figuring out who the various I's are in DWIM is always a good exercise.
>
> Having re-read the thread several times, I think I have my head round the
> conflicting requirements and expectations. Thanks for taking the time
> to explain them.
>
>>> There are some flies in the ointment though.  I don't think \p{Any}
>>> and \p{All} should match anything but strictly Unicode code points.
>>> We already have a well established way to match all code points, and
>>> that is to use the dot ".".  But I'm open to arguments the other
>>> way.
>>
>> A dot might work, but then you're thinking about /s again.  But more to the
>> point, I just wonder why you think those properties shouldn't match?  Do you
>> think users will be trying to exclude weird codepoints by using those?  I guess
>> what I want to know is:  who is the DWIM person who benefits from excluding
>> \x{FF_FFFF}?  Does he or she know that there may be trans-Unicode points
>> incoming and want to exclude them?  If so, why not fatalize warnings, or
>> also require Unicode explicitly?  Or are we protecting them from weird input?
>
> I thought the same about "." and the /s flag.
>
> To check - "All" and "Any" are Perl defined extensions, not Unicode
> consortium? And they have always been documented both as (a) synonyms
> (b) to match [\x{0000}-\x{10FFFF}] ?
>
>>> But I'm tempted to move somewhat more towards the Perlish side of
>>> things, and change the warning/error message so that it is raised
>>> only when the result would be different under a Perlish vs Unicodish
>>> regime.
>>
>> I think this makes sense.
>> Maybe I want to think about it more, or see whether
>> somebody else has an objection. :)
>
> So, if I have the summary correct
>
> There are three (obvious) ways to treat all code points outside of Unicode's
> range \x{0000}-\x{10FFFF}
>
> 0) Just croak.
>    (A very purist Unicode approach.)
> 1) All Unicode property matches fail.
>    (Which means that both \p{PROPERTY=false} and \p{PROPERTY=true} fail,
>    even though that seems a contradiction.)
> 2) The code points are treated as if they are unassigned Unicode code points.
>    (Exactly one of \p{PROPERTY=false} and \p{PROPERTY=true} will match.)
>
> And that your current favoured approach is to take the behaviour of (2),
> but warn if it differs from the behaviour of (1).
>
> Because
>
> 1) this means that it's still viable to use out-of-range code points for
>    "internal" purposes without generating so many warnings that they get
>    turned off
> 2) it permits a warning that is useful to leave on by default
> 3) the warning can be made fatal for strict(er) behaviour
>
>
> But what it doesn't (directly) offer is a way for a Unicode purist to treat
> as fatal any attempt to match an out-of-range code point.
>
> Nicholas Clark
>

Having now implemented this, I have a couple of refinements to propose.

The first addresses the Unicode purist.  Part of my concern has been
that what's been available up to now didn't always raise the
appropriate warnings.  They are missed whenever the regex compiler
optimizes the \p{} expression into something else.  For example,
\p{Line_Break=Line_Feed} is optimized as if the backslash sequence \n
had been in the pattern instead.  (Surprisingly many Unicode
properties match a single code point, about 200 of them; all get
optimized like the example.)  If one matches such a property against a
non-Unicode code point, no warning is raised.  I have finally figured
out a way to fix this without slowing down or complicating the
execution code, and without requiring a new pragma.
I propose to check during regex compilation whether the non-Unicode
warnings are both enabled and fatalized.  If so, the optimizations are
skipped and the warning is raised for all Unicode properties, not
just the ones where the outcome is different from what a purist would
expect.  Only a few lines of code need be added to regcomp.c to
accomplish this.  A purist will get their desired behavior.  However,
if the match is skipped entirely because the string is, say, too
short, no warning is raised.  But then, no match was attempted.

Note that the behavior is based on the lexical scope of the pattern's
compilation, rather than that of its ultimate execution.  This is
already true of almost all aspects of regex execution.

A purist will get full purist behavior.  The remaining 99+% will get
the already-agreed-on changes that are more DWIM.  I'd rather not
create a pragma to do this.

The other refinement is one already alluded to earlier in this thread.
I propose to output the non-Unicode match warning at most once per
pattern execution.  Thus backtracking would not cause the message to
be output again and again; and if two properties were matched against
above-Unicode code points during the same match instance, only the
first would raise the warning.  If the pattern were matched in
repeated executions in a loop, each time through could generate the
warning.
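
To make the purist case concrete, here is a minimal sketch of how it
might look from the user's side.  It assumes the first refinement is
adopted as described and a perl that has the non_unicode warnings
category; purist_match is a name made up purely for illustration.
With the refinement in place I would expect the second call to report
"died"; a perl without it may well report "no match" instead, because
the optimized property never raises the warning.

```perl
use strict;
use warnings;

# purist_match is a hypothetical helper, not part of any module.
# It reports how a match against a single-code-point property ends when
# non_unicode warnings are fatal in the scope where the pattern is compiled.
sub purist_match {
    my ($string) = @_;

    # Lexically scoped: covers the compilation of the pattern below.
    use warnings FATAL => 'non_unicode';

    # eval yields undef if a fatalized warning makes the match die.
    my $matched = eval { $string =~ /\p{Line_Break=Line_Feed}/ ? 1 : 0 };

    return 'died' unless defined $matched;
    return $matched ? 'matched' : 'no match';
}

my $above_unicode = "\x{110000}";    # one past U+10FFFF, so not Unicode

print "line feed:     ", purist_match("\n"), "\n";
print "above Unicode: ", purist_match($above_unicode), "\n";
```

Because the FATAL setting sits in the sub where the pattern is
compiled, callers get the purist behavior regardless of their own
warning settings, which is the lexical-scope point made above.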