On Thu, Sep 12, 2013 at 10:35:18PM -0400, Ricardo Signes wrote:
> * Karl Williamson <public@khwilliamson.com> [2013-09-11T22:57:24]
> > I've done some more thinking about this, and am presenting here my
> > current thoughts.
>
> Thanks for this.  I think I am on board with you, for the most part.
>
> Figuring out who the various I's are in DWIM is always a good exercise.

Having re-read the thread several times, I think I have my head round the
conflicting requirements and expectations. Thanks for taking the time to
explain them.

> > There are some flies in the ointment though.  I don't think \p{Any}
> > and \p{All} should match anything but strictly Unicode code points.
> > We already have a well established way to match all code points, and
> > that is to use the dot ".".  But I'm open to arguments the other
> > way.
>
> A dot might work, but then you're thinking about /s again.  But more to the
> point, I just wonder why you think those properties shouldn't match?  Do you
> think users will be trying to exclude weird codepoints by using those?  I guess
> what I want to know is:  who is the DWIM person who benefits from excluding
> \x{FF_FFFF}?  Does he or she know that there may be trans-Unicode points
> incoming and want to exclude them?  If so, why not fatalize warnings, or
> also require Unicode explicitly?  Or are we protecting them from weird input?

I thought the same about "." and the /s flag.

To check: "All" and "Any" are Perl-defined extensions, not something defined
by the Unicode Consortium? And they have always been documented both
(a) as synonyms and (b) as matching [\x{0000}-\x{10FFFF}]?

> > But I'm tempted to move somewhat more towards the Perlish side of
> > things, and change the warning/error message so that it is raised
> > only when the result would be different under a Perlish vs Unicodish
> > regime.
>
> I think this makes sense.  Maybe I want to think about it more, or see whether
> somebody else has an objection.
> :)

So, if I have the summary correct:

There are three (obvious) ways to treat all code points outside of Unicode's
range \x{0000}-\x{10FFFF}

0) Just croak.
   (A very purist Unicode approach.)
1) All Unicode property matches fail.
   (Which means that both \p{PROPERTY=false} and \p{PROPERTY=true} fail,
   even though that seems a contradiction.)
2) The code points are treated as if they are unassigned Unicode code points.
   (Exactly one of \p{PROPERTY=false} and \p{PROPERTY=true} will match.)

And that your current favoured approach is to take the behaviour of (2), but
warn if it differs from the behaviour of (1). Because

1) this means that it's still viable to use out-of-range code points for
   "internal" purposes without generating so many warnings that they get
   turned off
2) it permits a warning that is useful to leave on by default
3) the warning can be made fatal for strict(er) behaviour

But what it doesn't (directly) offer is a way for a Unicode purist to treat
as fatal any attempt to match an out-of-range code point.

Nicholas Clark
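
For anyone who finds the three policies easier to compare as code, here is a
toy model of them, plus the "behave like (2), warn when that differs from (1)"
variant. It is deliberately written in Python rather than Perl so that it
implies nothing about how the regex engine does or should implement this. The
property table, the function names and the NonUnicodeWarning class are all
made up for illustration, and "Upper" is only approximated by str.isupper().

```python
import warnings

MAX_UNICODE = 0x10FFFF


class NonUnicodeWarning(UserWarning):
    """Hypothetical warning category for above-Unicode code points."""


# Toy property table (an assumption, not real Unicode data):
# name -> (test for in-range code points, value on an unassigned code point)
PROPS = {
    "Upper": (lambda cp: chr(cp).isupper(), False),
}


def in_range_match(prop, wanted, cp):
    """Ordinary \\p{PROP=wanted} test for a code point within Unicode."""
    return PROPS[prop][0](cp) == wanted


def match_croak(prop, wanted, cp):
    """Policy 0: refuse to match an out-of-range code point at all."""
    if cp > MAX_UNICODE:
        raise ValueError(f"code point 0x{cp:X} is not Unicode")
    return in_range_match(prop, wanted, cp)


def match_all_fail(prop, wanted, cp):
    """Policy 1: every property match fails, whatever value is asked for."""
    if cp > MAX_UNICODE:
        return False
    return in_range_match(prop, wanted, cp)


def match_as_unassigned(prop, wanted, cp):
    """Policy 2: treat the code point like an unassigned Unicode one."""
    if cp > MAX_UNICODE:
        return PROPS[prop][1] == wanted
    return in_range_match(prop, wanted, cp)


def match_favoured(prop, wanted, cp):
    """Policy 2's answer, with a warning only where policy 1 disagrees."""
    result = match_as_unassigned(prop, wanted, cp)
    if result != match_all_fail(prop, wanted, cp):
        warnings.warn(
            f"0x{cp:X} is not Unicode; \\p{{{prop}={wanted}}} result "
            "differs between the two regimes",
            NonUnicodeWarning,
            stacklevel=2,
        )
    return result
```

In this model, point 3 above corresponds to
warnings.simplefilter("error", NonUnicodeWarning). That turns the cases where
(1) and (2) disagree into exceptions, but a match that gives the same answer
under both still passes silently, which is the gap described in the final
paragraph: only match_croak rejects every out-of-range code point.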