Re: RFC: Adding \p{foo=/re/}

From: Deven T. Corzine
Date: February 5, 2019 23:48
Subject: Re: RFC: Adding \p{foo=/re/}
Message ID: CAFVdu0QRmaDKWmtBgxR_bVRbKnFdHk5Wig=BsCPz8=8ofP0egw@mail.gmail.com

On Tue, Feb 5, 2019 at 5:47 PM Karl Williamson <public@khwilliamson.com>
wrote:

> The Unicode Technical Standard #18 on regular expressions suggests that
> Unicode properties have what I'm calling a subpattern and they call
> wildcard properties
>
> http://www.unicode.org/reports/tr18/#Wildcard_Properties
>
> I am proposing to implement this in 5.30.  I already have a working
> prototype, which you can find in
>
> https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-core
>
> and play with.  Attached is a script that exercises it to create a
> pattern that matches IPV4 addresses in any language, and fails illegal
> ones.  Thus the script would work for Bengali or Thai  numbers.  The
> motivation for this came from Abigail.
>

Implementing this feature is a great idea.


> Certain things aren't clear to me about how it should behave.  Should
> the default be anchored (as currently) so that you have to begin and/or
> end with '.*' to unanchor it?  I think most uses will want it anchored
> as implied by the equals sign, but that's not how other patterns behave,
> and that inconsistency probably would be too confusing.  One thing that
> might emphasize that it isn't anchored is to make them write
>
> \p{foo=~/bar/}
>
> (requiring a tilde)
>
> Comments?
>

I think it would be best to use the exact syntax as shown in that Unicode
Technical Standard (and document the feature using that syntax), to be as
standards-compliant as possible.  That being said, I see nothing wrong with
allowing an _optional_ tilde as in "\p{foo=~/bar/}" for anyone who finds
that syntax more intuitive.

I'm curious why you say that the equals sign implies an anchored match?
I'm not seeing the connection there.  If anything, an equals sign alone
might be thought to signify assignment, but that's obviously inapplicable
here.  In my mind, "\p{foo=/bar/}" doesn't suggest an anchored pattern
because of the equals sign -- if anything, my inclination would be to
assume the pattern is _not_ anchored, because of the slashes around it.
But that's just my personal opinion/intuition about the semantics implied
by that syntax.

At any rate, the Unicode Technical Standard appears to have pretty clear
intentions on this question...

Consider the first example in the table.  The expression "\p{toNfd=/b/}" is
described as:

    Characters whose NFD form contains a "b" (U+0062) in the value

In my opinion, the word "contains" in the description above implies an
unanchored search.
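
To make the difference concrete, here is a rough emulation of the two
readings of "\p{toNfd=/b/}". This is only a sketch in Python, not Perl and
not the prototype's implementation: it assumes "toNfd" can be approximated
with unicodedata.normalize("NFD", ...), the helper name chars_matching_nfd
is made up for illustration, and it scans only code points below U+3000 to
keep the run short.

```python
import re
import unicodedata


def chars_matching_nfd(pattern, anchored, limit=0x3000):
    """Return the characters below `limit` whose NFD form matches `pattern`.

    anchored=True  -> the whole NFD string must match (re.fullmatch)
    anchored=False -> the pattern may match anywhere in it (re.search)
    """
    matcher = re.fullmatch if anchored else re.search
    return {
        chr(cp)
        for cp in range(limit)
        if matcher(pattern, unicodedata.normalize("NFD", chr(cp)))
    }


# "Contains" reading: the subpattern is not anchored.
contains_b = chars_matching_nfd("b", anchored=False)

# Anchored reading: the NFD form must be exactly "b".
exactly_b = chars_matching_nfd("b", anchored=True)

print(sorted(f"U+{ord(c):04X}" for c in contains_b))
print(sorted(f"U+{ord(c):04X}" for c in exactly_b))
```

Under the anchored reading only "b" itself survives, which makes the
property pointless for this example. The unanchored reading also picks up
precomposed letters such as U+1E03, whose NFD form is "b" followed by a
combining mark.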

More significantly, the second example is "\p{name=/^LATIN LETTER.*P$/}",
which is described as:

    Characters with names starting with "LATIN LETTER" and ending with "P"

Notice that this example shows explicit "^" and "$" anchors in the regular
expression, and uses "starting with" and "ending with" in the description
instead of "contains".

This seems unambiguous to me -- if the regular expression were anchored
by default, there would be no purpose in including explicit "^" and "$"
anchors in this example, since they would be redundant.  Also, if patterns
were anchored by default, the first example would have needed
"\p{toNfd=/.*b.*/}" to do a "contains" match.
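
The redundancy argument can be checked with a small emulation of the "name"
example. Again, this is only a sketch in Python under stated assumptions:
character names come from unicodedata.name(), the helper named_chars is
hypothetical, and only code points below U+3000 are scanned.

```python
import re
import unicodedata


def named_chars(limit=0x3000):
    """Yield (character, name) pairs for named code points below `limit`."""
    for cp in range(limit):
        name = unicodedata.name(chr(cp), "")
        if name:
            yield chr(cp), name


# Unanchored semantics with the explicit anchors from the example.
explicit = {c for c, n in named_chars()
            if re.search(r"^LATIN LETTER.*P$", n)}

# Anchored-by-default semantics: the same pattern with the anchors dropped.
implicit = {c for c, n in named_chars()
            if re.fullmatch(r"LATIN LETTER.*P", n)}

# Unanchored semantics without anchors: a plain "contains" match.
loose = {c for c, n in named_chars()
         if re.search(r"LATIN LETTER.*P", n)}

print(explicit == implicit)           # the anchors add nothing if implied
print(len(explicit), len(loose))      # the unanchored form can match more
```

If matching were anchored by default, the "^" and "$" in the example would
select exactly the same characters as the pattern without them, so the
standard would have had no reason to write them.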

I believe that the implied semantics in that Unicode Technical Standard are
clear -- the regular expressions should NOT be anchored unless explicit
anchors are used.

Also, as you said, it would be confusing and inconsistent to anchor them
anyhow.

Deven
