Front page | perl.perl5.porters |
Postings from September 2013
Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode,all \\p{} matches fail; all \\P{} matches succeed"
From:
Karl Williamson
Date:
September 14, 2013 19:08
Subject:
Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode,all \\p{} matches fail; all \\P{} matches succeed"
Message ID:
5234B421.9090100@khwilliamson.com
On 09/13/2013 02:37 AM, Nicholas Clark wrote:
> On Thu, Sep 12, 2013 at 10:35:18PM -0400, Ricardo Signes wrote:
>> * Karl Williamson <public@khwilliamson.com> [2013-09-11T22:57:24]
>>> I've done some more thinking about this, and am presenting here my
>>> current thoughts.
>>
>> Thanks for this. I think I am on board with you, for the most part.
>>
>> Figuring out who the various I's are in DWIM is always a good exercise.
>
> Having re-read the thread several times, I think I have my head round the
> conflicting requirements and expectations. Thanks for taking the time
> to explain them.
>
>>> There are some flies in the ointment though. I don't think \p{Any}
>>> and \p{All} should match anything but strictly Unicode code points.
>>> We already have a well established way to match all code points, and
>>> that is to use the dot ".". But I'm open to arguments the other
>>> way.
>>
>> A dot might work, but then you're thinking about /s again. But more to the
>> point, I just wonder why you think those properties shouldn't match? Do you
>> think users will be trying to exclude weird codepoints by using those? I guess
>> what I want to know is: who is the DWIM person who benefits from excluding
>> \x{FF_FFFF}? Does he or she know that there may be trans-Unicode points
>> incoming and want to exclude them? If so, why not fatalize warnings, or
>> also require Unicode explicitly? Or are we protecting them from weird input?
>
> I thought the same about "." and the /s flag.
>
> To check - "All" and "Any" are Perl defined extensions, not Unicode
> consortium? And they have always been documented both as (a) synonyms
> (b) to match [\x{0000}-\x{10FFFF}] ?
I had forgotten about needing the /s flag on dot to make it equivalent.
And I was mistaken in my understanding of "Any" and "All", or else
Unicode has changed its text since I last studied this aspect of it (I
didn't bother to go digging). "All" doesn't appear as a property there
(I thought it did), so it must be a Perl extension, and so I would
suggest that we change it to match every possible code point on the
platform; that is, decouple it from "Any".

Strictly, "Any" is also a Perl extension, but only in the very
strictest sense. It is defined in "Unicode Technical Standard #18:
Unicode Regular Expressions"

http://www.unicode.org/reports/tr18/

which isn't considered part of the Standard proper, but Perl has tried
to follow its "recommendations", which are phrased in the text as
requirements even though it isn't part of the Standard. It is, in
effect, a de facto standard. It defines "Any" as (with slight editing
changes):

"Any matches all code points. This could also be captured with
[\x{0}-\x{10FFFF}].... In some regular expression languages, \p{Any} may
be expressed by a period, but that may exclude newline characters."

Even a non-purist expects "Any" to match only up through 10FFFF.
That's why I'm leaning against changing it. Since I was wrong about
"All", that further argues that we can change it and leave "Any"
alone.
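
To make the dot versus explicit-range distinction concrete, here is a
minimal sketch. The helper matches() and the labels are made up for
illustration. It deliberately does not test \p{Any} or \p{All} against
a code point above 0x10FFFF, since that result is exactly what is under
discussion here and may differ between perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the whole string matches the pattern, else 0.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ /\A$pattern\z/ ? 1 : 0;
}

my $newline = "\n";
my $above   = chr(0x110000);    # first code point past Unicode's maximum

my @checks = (
    [ 'newline vs .',       $newline, qr/./                  ],  # no match
    [ 'newline vs . /s',    $newline, qr/./s                 ],  # match
    [ 'newline vs \p{Any}', $newline, qr/\p{Any}/            ],  # match
    [ 'newline vs range',   $newline, qr/[\x{0}-\x{10FFFF}]/ ],  # match
    [ '0x110000 vs . /s',   $above,   qr/./s                 ],  # match
    [ '0x110000 vs range',  $above,   qr/[\x{0}-\x{10FFFF}]/ ],  # no match
);

printf "%-22s %s\n", $_->[0],
    matches($_->[1], $_->[2]) ? 'match' : 'no match'
    for @checks;
```

The last two rows are the point: dot under /s accepts the above-Unicode
code point, while the range that UTS #18 gives for "Any" does not.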
>
>>> But I'm tempted to move somewhat more towards the Perlish side of
>>> things, and change the warning/error message so that it is raised
>>> only when the result would be different under a Perlish vs Unicodish
>>> regime.
>>
>> I think this makes sense. Maybe I want to think about it more, or see whether
>> somebody else has an objection. :)
>
> So, if I have the summary correct
>
> There are three (obvious) ways to treat all code points outside of Unicode's
> range \x{0000}-\x{10FFFF}
>
> 0) Just croak.
> (A very purist Unicode approach.)
> 1) All Unicode property matches fail.
> (Which means that both \p{PROPERTY=false} and \p{PROPERTY=true} fail,
> even though that seems a contradiction.)
> 2) The code points are treated as if they are unassigned Unicode code points.
> (Exactly one of \p{PROPERTY=false} and \p{PROPERTY=true} will match.)
>
> And that your current favoured approach is to take the behaviour of (2),
> but warn if it differs from the behaviour of (1).
The above appears to me to be an accurate restatement of what I was
trying to say.
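
As a rough illustration of that summary, the three treatments and the
"behave as (2), warn when (1) would differ" rule can be modelled in a
few lines. This is a toy model, not how the regex engine is
implemented. The sub names are invented, and it assumes a binary
PROPERTY whose value for unassigned code points is false.

```perl
use strict;
use warnings;

# Toy model only: the result of \p{PROPERTY=true} or \p{PROPERTY=false}
# against a code point above 0x10FFFF under each of the three treatments.
# Assumption: PROPERTY is binary and is false for unassigned code points.
sub match_above_unicode {
    my ($treatment, $wanted) = @_;    # $wanted: 1 for =true, 0 for =false
    die "code point is not Unicode\n" if $treatment == 0;   # (0) croak
    return 0 if $treatment == 1;                            # (1) all fail
    my $unassigned_value = 0;                               # (2) unassigned
    return $wanted == $unassigned_value ? 1 : 0;
}

# Proposed behaviour: answer as in (2), and flag a warning when (1) would
# have answered differently.
sub proposed {
    my ($wanted) = @_;
    my $result = match_above_unicode(2, $wanted);
    my $warn   = $result != match_above_unicode(1, $wanted) ? 1 : 0;
    return ($result, $warn);
}

for my $wanted (1, 0) {
    my ($result, $warn) = proposed($wanted);
    printf "\\p{PROPERTY=%s}: match=%d warn=%d\n",
        ($wanted ? 'true' : 'false'), $result, $warn;
}
# Prints:
#   \p{PROPERTY=true}: match=0 warn=0
#   \p{PROPERTY=false}: match=1 warn=1
```

Under these assumptions only the \p{PROPERTY=false} case warns, because
that is the only case where (1) and (2) disagree.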
>
> Because
>
> 1) this means that it's still viable to use out-of-range code points for
> "internal" purposes without generating so many warnings that they get
> turned off
> 2) it permits a warning that is useful to leave on by default
> 3) the warning can be made fatal for strict(er) behaviour
I would add that the warning's sole purpose is to notify you that you
are applying a Unicode construct to non-Unicode things. A similar
warning is generated if you apply uc(), lc(), etc. to a non-Unicode
code point. If this is what the code is supposed to be doing, simply
turn off this warning category. The point of making it a category of
its own was so that these are the only messages in it, and turning it
off doesn't mean you might miss something you care about.
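
A minimal sketch of turning the category off, or making it fatal,
using the uc() case mentioned above. The category name 'non_unicode' is
not spelled out in this message; it is taken from perllexwarn and
assumes perl 5.14 or later. The variable names are illustrative, and
the exact warning text may vary between versions.

```perl
use strict;
use warnings;

my @caught;
local $SIG{__WARN__} = sub { push @caught, $_[0] };

my $above = chr(0x110000);    # a code point beyond Unicode's maximum

# Default: uc() on a non-Unicode code point is expected to warn.
my $before  = @caught;
my $noisy   = uc($above);
my $default = @caught - $before;

# With the category turned off, the same operation should stay silent.
$before = @caught;
{
    no warnings 'non_unicode';
    my $quiet = uc($above);
}
my $silenced = @caught - $before;

# For strict(er) behaviour, make the category fatal in a lexical scope.
my $died = !eval {
    use warnings FATAL => 'non_unicode';
    my $strict = uc($above);
    1;
};

printf "default warnings: %d, with category off: %d, fatal died: %s\n",
    $default, $silenced, $died ? 'yes' : 'no';
```

Because the setting is lexical, it can be confined to the block that
deliberately handles above-Unicode code points.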
>
>
> But what it doesn't (directly) offer is a way for a Unicode purist to treat
> as fatal any attempt to match an out-of-range code point.
Exactly. This proposal doesn't fully support the purist approach, and
that is problematic.

I want to emphasize that currently there is a potential security hole
with collating non-Unicode code points. This will soon be closed, but
it shows the potential dangers of not following the Standard precisely.
Unicode makes mistakes, but generally, they have more expertise in
these matters and have thought things through more than we have. Thus
it seems like the right thing to fully support a purist approach.