* Karl Williamson <public@khwilliamson.com> [2013-09-14T15:08:17]
> "Any matches all code points. This could also be captured with
> [\x{0}-\x{10FFFF}].... In some regular expression languages, \p{Any}
> may be expressed by a period, but that may exclude newline
> characters."
>
> Even a non-purist is expecting "Any" to match only up through
> 10FFFF. That's why I'm leaning against changing it. Since I was wrong
> about "All", that would further argue that we can change it, and
> leave "Any" alone.

If /./s matches \p{Any}, does that mean it excludes 0x11_0000?  I think if
\p{All} matches all values, then /./s should probably mean that.  To say, "I am
a person who thinks about Unicode a lot," you may want to specifically say
\p{Any}.

Does that seem reasonable to you, too?

> >1) All Unicode property matches fail.
> > (Which means that both \p{PROPERTY=false} and \p{PROPERTY=true} fail,
> > even though that seems a contradiction.)
> >2) The code points are treated as if they are unassigned Unicode code points.
> > (Exactly one of \p{PROPERTY=false} and \p{PROPERTY=true} will match.)
> >
> >And that your current favoured approach is to take the behaviour of (2),
> >but warn if it differs from the behaviour of (1).
>
> The above appears to me to be an accurate restatement of what I was
> trying to say.

I think this is a good solution.

> >But what it doesn't (directly) offer is a way for a Unicode purist to treat
> >as fatal any attempt to match an out-of-range code point.
>
> Exactly. This proposal doesn't fully support the purist approach,
> and that is problematic.

I recently advised someone:  If your code really really needs to make sure that
its existing regexes never, ever match non-ASCII characters, you first need to
scan for \P{ASCII} and then do the rest of your work, because Perl really wants
to say text operations imply Unicode.

Perhaps it's "imply Unicode++" and similarly nervous users should also be
scanning for \P{Unicode}.
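
A minimal sketch of that pre-scan, under some assumptions: precheck() is a hypothetical helper name, not anything in core, and because \P{Unicode} is only being floated here, the second test uses the explicit range that the quoted text gives as equivalent to \p{Any}. It is meant to illustrate the "scan first, then do the rest of your work" idea, not to settle what the properties should mean.

```perl
use strict;
use warnings;
no warnings 'non_unicode';   # the demo below deliberately builds chr(0x110000)

# Hypothetical helper: returns the reasons a string should be rejected
# before any other regex work is done on it (empty list means "fine").
sub precheck {
    my ($text) = @_;
    my @problems;

    # Anything outside 0x00-0x7F, including above-Unicode code points.
    push @problems, 'non-ASCII' if $text =~ /\P{ASCII}/;

    # Anything above 0x10FFFF, spelled as a negated explicit range
    # rather than a property (assumption: stands in for \P{Unicode}).
    push @problems, 'above-Unicode' if $text =~ /[^\x{0}-\x{10FFFF}]/;

    return @problems;
}

for my $s ("plain text", "caf\x{e9}", chr(0x110000)) {
    my @why = precheck($s);
    printf "%d char(s): %s\n", length $s, @why ? "reject (@why)" : "ok";
}
```

The point of returning reasons rather than a boolean is that the caller can decide which of the two conditions is fatal for them; a purist would die on either.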
-- 
rjbs