Re: RFC: Adding \p{foo=/re/}

From: Karl Williamson
Date: February 6, 2019 00:33
Subject: Re: RFC: Adding \p{foo=/re/}
Message ID: 0192311f-8d25-d315-71ac-33686da155b8@khwilliamson.com

On 2/5/19 4:59 PM, Tony Cook wrote:
> On Tue, Feb 05, 2019 at 03:47:18PM -0700, Karl Williamson wrote:
>> The Unicode Technical Standard #18 on regular expressions suggests that
>> Unicode properties have what I'm calling a subpattern and they call wildcard
>> properties
>>
>> http://www.unicode.org/reports/tr18/#Wildcard_Properties
>>
>> I am proposing to implement this in 5.30.  I already have a working
>> prototype, which you can find in
>>
>> https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-core
>>
>> and play with.  Attached is a script that exercises it to create a pattern
>> that matches IPV4 addresses in any language, and fails illegal ones.  Thus
>> the script would work for Bengali or Thai  numbers.  The motivation for this
>> came from Abigail.
>>
>> Certain things aren't clear to me about how it should behave.  Should the
>> default be anchored (as currently) so that you have to begin and/or end with
>> '.*' to unanchor it?  I think most uses will want it anchored as implied by
>> the equals sign, but that's not how other patterns behave, and that
>> inconsistency probably would be too confusing.  One thing that might
>> emphasize that it isn't anchored is to make them write
>>
>> \p{foo=~/bar/}
>>
>> (requiring a tilde)
>>
>> Comments?
> 
> Some of the examples in TR18 would fail if the regexp was anchored by
> default.
> 
> The cases that do need anchoring in the examples use anchoring syntax:
> 
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{name=/^LATIN%20LETTER.*P$/}
> 
> Tony
> 

Although it's called a technical standard, it's not actually a part of 
the Unicode Standard, and even though those clauses are written as if 
they are requirements, they're not.

This was made clear to me when we followed this document closely, and 
then Unicode made a contradictory rule in the actual Standard.  When I 
pointed this out, they seemed embarrassed, but said UTS 18 isn't a 
standard, and they removed the language from it, leaving us in the 
lurch.  There was a deprecation period for people who were using what 
we had furnished, before we fully supported the Standard again.

The lesson here is that Unicode doesn't always know best, and we need to 
exercise judgment in following them.  Various things from this document 
have been withdrawn as a result of my and others questioning them.  One 
I noticed again today is 2.1, where there used to be an RL2.1 
apparent requirement.  This document appears to have been written by a 
bunch of people sitting around and brainstorming what would be nice, but 
without an implementation to test things out on.

We already differ significantly from their syntaxes.  Our set notation 
is different; we don't have a \p{name=...} syntax, etc.

I knew that they thought the patterns weren't anchored, but my 
experience indicates we should do what we think is best in this regard, 
which may be the unanchored approach.  But I want to hear what people 
think from a perl-based view.
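
To make the two options concrete, here is a minimal sketch, in Python 
rather than Perl, that emulates a name subpattern by scanning character 
names.  It is not how the prototype works; the helper chars_with_name, 
its limit parameter, and the use of Python's unicodedata and re modules 
are all assumptions made for illustration.  Anchored behavior is modeled 
with re.fullmatch and unanchored behavior with re.search.

```python
import re
import unicodedata


def chars_with_name(pattern, anchored, limit=0x100):
    """Return the characters below `limit` whose Unicode name matches `pattern`.

    Hypothetical helper: it only emulates the idea of a name subpattern.
    anchored=True  -> the pattern must match the whole name (re.fullmatch)
    anchored=False -> the pattern may match anywhere in the name (re.search)
    """
    rx = re.compile(pattern)
    match = rx.fullmatch if anchored else rx.search
    found = []
    for cp in range(limit):
        # Unnamed code points (e.g. controls) yield the empty default.
        name = unicodedata.name(chr(cp), "")
        if name and match(name):
            found.append(chr(cp))
    return found


# The same subpattern under the two candidate defaults.
unanchored = chars_with_name(r"LETTER P", anchored=False)
anchored = chars_with_name(r"LETTER P", anchored=True)

# What the caller must write to get the other behavior in each case.
anchored_opened_up = chars_with_name(r".*LETTER P", anchored=True)
unanchored_pinned = chars_with_name(r"^LATIN .* LETTER P$", anchored=False)
```

With an anchored default, a caller who wants a substring match has to 
add '.*' at one or both ends; with an unanchored default, a caller who 
wants the whole name has to write ^ and $, as the TR18 example Tony 
cites does.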
