On 11/27/2013 09:52 PM, Karl Williamson wrote:
> On 09/13/2013 02:37 AM, Nicholas Clark wrote:
>> On Thu, Sep 12, 2013 at 10:35:18PM -0400, Ricardo Signes wrote:
>>> * Karl Williamson <public@khwilliamson.com> [2013-09-11T22:57:24]
>>>> I've done some more thinking about this, and am presenting here my
>>>> current thoughts.
>>>
>>> Thanks for this. I think I am on board with you, for the most part.
>>>
>>> Figuring out who the various I's are in DWIM is always a good exercise.
>>
>> Having re-read the thread several times, I think I have my head round the
>> conflicting requirements and expectations. Thanks for taking the time
>> to explain them.
>>
>>>> There are some flies in the ointment though. I don't think \p{Any}
>>>> and \p{All} should match anything but strictly Unicode code points.
>>>> We already have a well established way to match all code points, and
>>>> that is to use the dot ".". But I'm open to arguments the other
>>>> way.
>>>
>>> A dot might work, but then you're thinking about /s again. But more
>>> to the point, I just wonder why you think those properties shouldn't
>>> match? Do you think users will be trying to exclude weird codepoints
>>> by using those? I guess what I want to know is: who is the DWIM person
>>> who benefits from excluding \x{FF_FFFF}? Does he or she know that
>>> there may be trans-Unicode points incoming and want to exclude them?
>>> If so, why not fatalize warnings, or also require Unicode explicitly?
>>> Or are we protecting them from weird input?
>>
>> I thought the same about "." and the /s flag.
>>
>> To check - "All" and "Any" are Perl defined extensions, not Unicode
>> consortium? And they have always been documented both as (a) synonyms
>> (b) to match [\x{0000}-\x{10FFFF}] ?
>>
>>>> But I'm tempted to move somewhat more towards the Perlish side of
>>>> things, and change the warning/error message so that it is raised
>>>> only when the result would be different under a Perlish vs Unicodish
>>>> regime.
>>>
>>> I think this makes sense. Maybe I want to think about it more, or see
>>> whether somebody else has an objection. :)
>>
>> So, if I have the summary correct
>>
>> There are three (obvious) ways to treat all code points outside of
>> Unicode's range \x{0000}-\x{10FFFF}
>>
>> 0) Just croak.
>>    (A very purist Unicode approach.)
>> 1) All Unicode property matches fail.
>>    (Which means that both \p{PROPERTY=false} and \p{PROPERTY=true} fail,
>>    even though that seems a contradiction.)
>> 2) The code points are treated as if they are unassigned Unicode code
>>    points.
>>    (Exactly one of \p{PROPERTY=false} and \p{PROPERTY=true} will match.)
>>
>> And that your current favoured approach is to take the behaviour of (2),
>> but warn if it differs from the behaviour of (1).
>>
>> Because
>>
>> 1) this means that it's still viable to use out-of-range code points for
>>    "internal" purposes without generating so many warnings that they get
>>    turned off
>> 2) it permits a warning that is useful to leave on by default
>> 3) the warning can be made fatal for strict(er) behaviour
>>
>> But what it doesn't (directly) offer is a way for a Unicode purist to
>> treat as fatal any attempt to match an out-of-range code point.
>>
>> Nicholas Clark
>>
>
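To make the distinction above concrete, here is a minimal sketch. How
\p{Any} behaves against an above-Unicode code point, and whether it warns,
is exactly what is being settled in this thread and varies by perl version,
so treat this as an illustration rather than a description of any one
release:

    use strict;
    use warnings;

    # Build an above-Unicode code point without tripping warnings here;
    # 0x110000 is the first code point past U+10FFFF.
    my $above;
    { no warnings 'non_unicode'; $above = chr 0x110000; }

    # "." (with /s) is the established way to match absolutely any code
    # point, including above-Unicode ones:
    print "dot matches\n" if $above =~ /./s;

    # \p{Any} is a Perl-defined extension documented as matching
    # [\x{0000}-\x{10FFFF}]; whether it matches $above, and whether doing
    # so warns, depends on which of the treatments above is in effect:
    print "\\p{Any} matches\n" if $above =~ /\p{Any}/;
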
> Having now implemented this, I have a couple of refinements to propose.
>
> The first addresses the Unicode purist. Part of my concern has been
> that what's been available up to now didn't always raise the
> appropriate warnings. That happens when the regex compiler optimizes
> the \p{} expression into something else; for example,
> \p{Line_Break=Line_Feed} is optimized as if the backslash sequence \n
> had been in the pattern instead. (A surprising number of Unicode
> properties, about 200 of them, match just a single code point; all get
> optimized like this example.) If one matches such a property against a
> non-Unicode code point, no warning is raised. I have finally figured
> out a way to fix this without slowing down or complicating the
> execution code or requiring a new pragma. I propose to check during
> regex compilation whether the non_unicode warnings are enabled and
> fatalized. If so, the optimizations are skipped and the warnings are
> enabled for all Unicode properties, not just the ones where the outcome
> differs from what a purist would expect. Only a few lines of code need
> be added to regcomp.c to accomplish this. A purist will get their
> desired behavior. However, if the match was skipped entirely because
> the string is, say, too short, no warning would be raised. But then,
> no match was attempted. Note that the behavior is based on the lexical
> scope of the pattern compilation, rather than its ultimate execution.
> This is already true of almost all aspects of regex execution. A purist
> will get full purist behavior. The remaining 99+% will get the
> already-agreed-on changes that are more DWIM. I'd rather not create a
> pragma to do this. This is now in blead.
>
> The other refinement is one already alluded to earlier in this thread.
> I propose to output the non-Unicode match warning at most once per
> pattern execution. Thus backtracking would not cause the message to be
> output again and again; and if two properties were matched against
> above-Unicode code points during the same match instance, only the
> first would raise the warning. If the pattern were matched in repeated
> executions in a loop, each time through could generate the warning.
> I haven't done this yet.
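
Assuming the change lands as described, the purist setup might look like
the sketch below. The non_unicode warnings category and the
compile-time-scope behaviour are as described above; the exact message and
outcome depend on the final implementation:

    use strict;
    use warnings;

    my $re;
    {
        # Compile the pattern in a scope where non_unicode warnings are
        # fatal; per the proposal, this skips the single-code-point
        # optimisation and enables the warning for all \p{} properties.
        use warnings FATAL => 'non_unicode';
        $re = qr/\p{Line_Break=Line_Feed}/;
    }

    my $above;
    { no warnings 'non_unicode'; $above = chr 0x110000; }

    # The warning behaviour follows the lexical scope of the pattern's
    # *compilation*, not of the match, so even with warnings disabled at
    # the match site the above-Unicode match should raise the fatalized
    # warning as an exception:
    no warnings;
    eval { $above =~ $re };
    print "match died: $@" if $@;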