develooper Front page | perl.perl5.porters | Postings from September 2013

Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode, all \p{} matches fail; all \P{} matches succeed"

From: Ricardo Signes
Date: September 16, 2013 22:09
Subject: Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode, all \p{} matches fail; all \P{} matches succeed"
Message ID: 20130916220926.GA17523@cancer.codesimply.com

* Karl Williamson <public@khwilliamson.com> [2013-09-14T15:08:17]
> "Any matches all code points. This could also be captured with
> [\x{0}-\x{10FFFF}].... In some regular expression languages, \p{Any}
> may be expressed by a period, but that may exclude newline
> characters."
> 
> Even a non-purist is expecting "Any" to match only up through
> 10FFFF. That's why I'm leaning against changing it.  Since I was wrong
> about "All", that would further argue that we can change it, and
> leave "Any" alone.

If /./s matches \p{Any}, does that mean it excludes 0x11_0000?

I think if \p{All} matches all values, then /./s should probably mean that.  To
say, "I am a person who thinks about Unicode a lot," you may want to
specifically say \p{Any}.

Does that seem reasonable to you, too?
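
To make the distinction concrete, here is a minimal sketch comparing /./s with the
explicit range [\x{0}-\x{10FFFF}] from the quoted text, applied to a code point just
above the Unicode maximum.  The variable names are only illustrative, and the
'non_unicode' warning category is assumed to be available (Perl 5.14 or later).

```perl
use strict;
use warnings;
no warnings 'non_unicode';   # we deliberately handle an above-Unicode code point

# A single character one past the last Unicode code point (0x10FFFF).
my $above = chr(0x11_0000);

# /./s matches any single character, including this one.
my $dot_matches = ($above =~ /\A.\z/s) ? 1 : 0;

# The explicit range stops at 0x10FFFF, so it does not match.
my $range_matches = ($above =~ /\A[\x{0}-\x{10FFFF}]\z/) ? 1 : 0;

print "dot: $dot_matches, range: $range_matches\n";
```

Whether \p{Any} or \p{All} behaves like the first or the second pattern is exactly
the question under discussion, so the sketch leaves both properties out.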

> >1) All Unicode property matches fail.
> >    (Which means that both \p{PROPERTY=false} and \p{PROPERTY=true} fail,
> >     even though that seems a contradiction.)
> >2) The code points are treated as if they are unassigned Unicode code points.
> >    (Exactly one of \p{PROPERTY=false} and \p{PROPERTY=true} will match.)
> >
> >And that your current favoured approach is to take the behaviour of (2),
> >but warn if it differs from the behaviour of (1).
> 
> The above appears to me to be an accurate restatement of what I was
> trying to say.

I think this is a good solution.

> >But what it doesn't (directly) offer is a way for a Unicode purist to treat
> >as fatal any attempt to match an out-of-range code point.
> 
> Exactly. This proposal doesn't fully support the purist approach,
> and that is problematic.

I recently advised someone:

  If your code really really needs to make sure that its existing regexes
  never, ever match non-ASCII characters, you first need to scan for \P{ASCII}
  and then do the rest of your work, because Perl really wants to say text
  operations imply Unicode.

Perhaps it's "imply Unicode++" and similarly nervous users should also be
scanning for \P{Unicode}.
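
A rough sketch of that kind of up-front scan, assuming two hypothetical helpers
(is_ascii_only and is_unicode_only are names made up for this example).  The range
check compares code points numerically rather than using \P{Unicode}, so that it
does not depend on how a given perl treats \p{} for above-Unicode code points.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains nothing outside ASCII.
sub is_ascii_only {
    my ($text) = @_;
    return $text !~ /\P{ASCII}/ ? 1 : 0;
}

# Hypothetical helper: true if every code point is at most 0x10FFFF.
# Done with ord() instead of a \p{} match to stay version-independent.
sub is_unicode_only {
    my ($text) = @_;
    for my $char (split //, $text) {
        return 0 if ord($char) > 0x10FFFF;
    }
    return 1;
}

# Scan first, then hand the string to the existing regexes.
my $input = "plain text";
die "non-ASCII input\n"   unless is_ascii_only($input);
die "non-Unicode input\n" unless is_unicode_only($input);
```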

-- 
rjbs