
Re: RFC: Adding \p{foo=/re/}

From:
Karl Williamson
Date:
February 15, 2019 17:52
Subject:
Re: RFC: Adding \p{foo=/re/}
Message ID:
2ebb4ea5-a076-57a8-eb22-6ee9298c04e8@khwilliamson.com
Top-posted: the link to the branch you can try out is

https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-core
On 2/15/19 10:25 AM, Karl Williamson wrote:
> On 2/11/19 6:13 PM, Deven T. Corzine wrote:
>> On Sat, Feb 9, 2019 at 12:01 PM Karl Williamson <public@khwilliamson.com> wrote:
>>
>>     I'm sorry for not being clear.  Deven is correct that his hypothetical
>>     implementation is what I have done.
>>
>>
>> That’s good to hear!  I was hoping it would be implemented in such a 
>> fashion.
>>
>>     This is a bolt-on feature to Perl's regexes.  It implements a
>>     portion of the wildcard feature of what UTS 18 asks for, using their
>>     syntax.  It is an apparent goal, as long listed in perlunicode, to
>>     do as much of UTS 18 as we can.
>>
>>
>> Using their syntax seems worthwhile even if we already deviate elsewhere.
>>
>> However, I must say that their last example in the table doesn't make 
>> sense to me:
>>
>>     Characters in the Letterlike symbol block with different toLowercase
>>     values:
>>           [\p{toLowercase≠@cp@} & \p{Block=Letterlike Symbols}]
>>
>>
>> This seems to imply some sort of boolean logic, which sounds good in 
>> principle, but this syntax seems bizarre to me.  I would expect each 
>> \p{...} expression to be independent, but if they want two different 
>> property matches to apply as a set intersection, I think one of these 
>> would be a more reasonable syntax:
>>
>>       \p{toLowercase≠@cp@ & Block=Letterlike Symbols}
>> or
>>       \p{{toLowercase≠@cp@} & {Block=Letterlike Symbols}}
> 
> Well this implementation is just a start and doesn't include this 
> fancier stuff, so we can defer deciding this until later.
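
The set intersection that the UTS 18 example asks for can be sketched outside the regex engine. The following is a minimal Python illustration, not anything from the Perl branch: it assumes the Letterlike Symbols block spans U+2100..U+214F and uses Python's `str.lower()` as a stand-in for the toLowercase mapping. The names `LETTERLIKE_SYMBOLS` and `lowercase_differs` are hypothetical.

```python
# Assumption: the Letterlike Symbols block is U+2100..U+214F.
LETTERLIKE_SYMBOLS = range(0x2100, 0x2150)


def lowercase_differs(cp):
    """True when the lowercase mapping of the code point is not itself.

    Stand-in for the UTS 18 test toLowercase != @cp@.
    """
    ch = chr(cp)
    return ch.lower() != ch


# Intersection of the two sets: in the block AND lowercase differs.
intersection = [cp for cp in LETTERLIKE_SYMBOLS if lowercase_differs(cp)]

for cp in intersection:
    print(f"U+{cp:04X} lowercases to {chr(cp).lower()!r}")
```

Whatever syntax is eventually chosen, the result is a plain set of code points, so it could be folded into a character class the same way as any other property match.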
>>
>> Also, on the topic of syntax, since these are meant to be used for 
>> sets of characters that can be used in a character class, I would 
>> suggest that these \p{...} expressions should also work *outside* 
>> square brackets, and imply [\p{...}] if the square brackets 
>> are omitted.  (Perhaps you already do this too?)
> 
> Yes, already.
>>
>>     And the implementation isn't efficient.
>>
>>     It is implemented by, during the compilation of a character class,
>>     interrupting that compilation, assembling an inner pattern, then
>>     compiling that and executing it to find all the code points it matches.
>>     That list is then added to whatever else is in the character class, the
>>     resumed.  There is no recursive execution.  But there is recursion in
>>     the sense, as I described, that a second pattern is compiled while in
>>     the middle of compiling an outer pattern.  I don't know if that is an
>>     issue or not.  The patterns do not share anything, no groups, etc.
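
The compile-time expansion described above can be sketched roughly as follows. This is a Python illustration of the idea only, not the C code in the Perl branch: the inner pattern is compiled on its own, run against the property value of every code point, and the matching code points are spliced into an ordinary character class for the outer pattern. The helpers `expand_property_wildcard`, `numeric_value`, and `build_class` are hypothetical, and the numeric value from Python's `unicodedata` stands in for a real property lookup.

```python
import re
import unicodedata


def numeric_value(ch):
    """Hypothetical property lookup: the numeric value as text, or None."""
    value = unicodedata.numeric(ch, None)
    if value is None:
        return None
    return str(int(value)) if value == int(value) else str(value)


def expand_property_wildcard(value_pattern, property_value,
                             code_points=range(0x110000)):
    """Return the code points whose property value matches value_pattern."""
    inner = re.compile(value_pattern)  # the separately compiled inner pattern
    matched = []
    for cp in code_points:
        value = property_value(chr(cp))
        if value is not None and inner.search(value):
            matched.append(cp)
    return matched  # the inner pattern is no longer needed after this


def build_class(code_points):
    """Turn the expanded list into an ordinary character class."""
    return "[" + "".join(re.escape(chr(cp)) for cp in code_points) + "]"


# Rough analogue of \p{nv=/\A[0-5]\z/}: characters whose numeric value is 0..5.
small_digits = expand_property_wildcard(r"\A[0-5]\Z", numeric_value)
outer = re.compile(build_class(small_digits) + "+")
print(bool(outer.fullmatch("0123")), bool(outer.fullmatch("6")))
```

The scan over every code point happens once, when the outer pattern is built; matching afterwards only consults the finished character class.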
>>
>>
>> As long as it's all compile-time, it's probably plenty efficient 
>> already.  Still, it might be worth keeping a cache of the 
>> \p{...} expressions used and the set of Unicode characters each 
>> generated, to avoid incurring the cost of generating the set if the 
>> same expressions are used over and over again.  The cache could be 
>> discarded at the end of the compilation phase, either for the one 
>> containing regex, or (perhaps better) after compiling the entire 
>> program.  Beyond that, I'm not sure what else could be done to 
>> optimize it much more.
> 
> I don't think the added complexity is worth it at this stage of 
> development without real numbers to indicate that it is.  And since 
> eliminating a full pass of the compilation had no discernible effect, I 
> doubt that a cache would either.
>>
>>     I've learned that a feature like this should be marked as experimental,
>>     so that it can be refined or even removed, and marking it as such lowers
>>     expectations as to its well-thought-outness and bug-free-ness.  It
>>     allows us to try things out and get feedback without having to say we
>>     think it is fully done.  The prototype is so marked.
>>
>>
>> Good idea, especially since a later version of the official Unicode 
>> standard could change this.
>>
>>     I've also learned that inefficiencies in compilation don't really
>>     matter.  I removed an entire pass of the regex compilation process, with
>>     extra mallocs being the price.  There did not seem to be a noticeable
>>     change in the speed of execution of our test suite!  This inefficient
>>     implementation (and I don't know another way to do it) won't be
>>     noticeable in the end, because it's only done at compilation.
>>
>>
>> I would agree with this.  You're calling this implementation 
>> inefficient, but I'm not sure that word applies if there isn't a 
>> substantially better way to do it.  Creating a fixed character set at 
>> compile time is the thing that will make this efficient at runtime, 
>> and as long as the cost at compile time is small, it's not likely to 
>> even be noticed.
>>
>>     I believe PCRE doesn't do this; I don't know about other engines.  But
>>     if no one does, I would think that us having a feature no one else does
>>     is a selling point.  If others do, we could perhaps learn from their
>>     syntax.  A quick google search didn't turn up anything obvious.
>>
>>
>> I doubt anyone else does it yet.  If Perl has it, perhaps PCRE would 
>> consider copying it later to try to maintain better compatibility with 
>> Perl, but they might not even bother.
>>
>>     If there are issues with various constructs, we can forbid those.  My
>>     implementation, for example, doesn't allow braces in the subpattern,
>>     and hence no construct that requires braces.  I think that's a reasonable
>>     initial restriction to make it easier to implement something that
>>     otherwise wouldn't get implemented.
>>
>>
>> It would be good to support balanced/escaped braces, but that can 
>> certainly be a second pass...
> 
> What I'm trying to do is give people the ability to do something, while 
> punting niceties that aren't essential in favor of easier development.
> 
>>
>>     If the UTS 18 syntax is misleading, what isn't?
>>
>>
>> I'm not even sure what you mean by this!
> 
> I meant, if the reader doesn't like the syntax, make a different proposal.
> 
> 
> In any event, I've pushed a new branch for people to play around with 
> that eliminates the anchoring, and allows for more delimiter characters 
> than the initial branch did.
>>
>> Deven
> 
