develooper Front page | perl.perl5.porters | Postings from February 2019

Re: RFC: Adding \p{foo=/re/}

Karl Williamson
February 15, 2019 17:25
On 2/11/19 6:13 PM, Deven T. Corzine wrote:
> On Sat, Feb 9, 2019 at 12:01 PM Karl Williamson wrote:
>     I'm sorry for not being clear.  Deven is correct that his hypothetical
>     implementation is what I have done.
> That’s good to hear!  I was hoping it would be implemented in such a 
> fashion.
>     This is a bolt-on feature to Perl's regexes.  It implements a
>     portion of the wildcard feature that UTS 18 asks for, using their
>     syntax.  It is an apparent goal, as long listed in perlunicode, to
>     do as much of UTS 18 as we can.
> Using their syntax seems worthwhile even if we already deviate elsewhere.
> However, I must say that their last example in the table doesn't make 
> sense to me:
>     Characters in the Letterlike symbol block with different toLowercase
>     values:
>           [\p{toLowercase≠@cp@} & \p{Block=Letterlike Symbols}]
> This seems to imply some sort of boolean logic, which sounds good in 
> principle, but this syntax seems bizarre to me.  I would expect each 
> \p{...} expression to be independent, but if they want two different 
> property matches to apply as a set intersection, I think one of these 
> would be a more reasonable syntax:
>       \p{toLowercase≠@cp@ & Block=Letterlike Symbols}
> or
>       \p{{toLowercase≠@cp@} & {Block=Letterlike Symbols}}

Well, this implementation is just a start and doesn't include this 
fancier stuff, so we can defer deciding this until later.
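
For readers trying to picture what the UTS 18 example is meant to select, here is a rough sketch in Python rather than in any regex syntax.  It is only an illustration of the set intersection being discussed, not of the prototype.  It assumes that Python's str.lower() is an acceptable stand-in for the toLowercase property and that the Letterlike Symbols block is U+2100..U+214F; the names LETTERLIKE and differs_when_lowercased are made up for the example.

```python
import unicodedata

# Assumption: the Letterlike Symbols block spans U+2100..U+214F.
LETTERLIKE = range(0x2100, 0x2150)


def differs_when_lowercased(cp: int) -> bool:
    """True if lowercasing the character yields something other than itself.

    str.lower() stands in for the toLowercase property here.
    """
    ch = chr(cp)
    return ch.lower() != ch


# First set: every code point whose lowercase form differs from itself.
lowercase_differs = {cp for cp in range(0x110000) if differs_when_lowercased(cp)}

# Second set: every code point in the block.
in_block = set(LETTERLIKE)

# The "&" in the UTS 18 example is an intersection of the two sets.
result = sorted(lowercase_differs & in_block)

for cp in result:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

Each \p{...} still describes its own independent set in this reading; the "&" only combines them afterwards, which is how the bracketed form in the table can be understood.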
> Also, on the topic of syntax, since these are meant to be used for sets 
> of characters that can be used in a character class, I would suggest 
> that these \p{...} expressions should also work *outside* square 
> brackets, and imply [\p{...}] if the square brackets are 
> omitted.  (Perhaps you already do this too?)

Yes, already.
>     And the implementation isn't efficient.
>     It is implemented by, during the compilation of a character class,
>     interrupting that compilation, assembling an inner pattern, then
>     compiling that and executing it to find all the code points it matches.
>     That list is then added to whatever else is in the character class, the
>     inner pattern's space is freed, and compilation of the outer pattern
>     resumed.  There is no recursive execution.  But there is recursion in
>     the sense, as I described, that a second pattern is compiled while in
>     the middle of compiling an outer pattern.  I don't know if that is an
>     issue or not.  The patterns do not share anything, no groups, etc.
> As long as it's all compile-time, it's probably plenty efficient enough 
> already.  Still, it might be worth keeping a cache of the \p{...} 
> expressions used and the set of Unicode characters each generated, to 
> avoid incurring the cost of generating the set if the same expressions 
> are used over and over again.  The cache could be discarded at the end 
> of the compilation phase, either for the one containing regex, or 
> (perhaps better) after compiling the entire program.  Beyond that, I'm 
> not sure what else could be done to optimize it much more.

I don't think the added complexity is worth it at this stage of 
development without real numbers to indicate that it is.  And since 
eliminating a full pass of the compilation had no discernible effect, I 
doubt that a cache would either.
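
To make the two ideas above concrete (enumerating the matching code points once, at compile time, and optionally caching the result per subpattern), here is a minimal sketch in Python.  It is not the code in the branch, which lives in Perl's regex compiler.  It assumes the property being matched is the character name, that the subpattern is applied unanchored, and that unnamed code points are skipped; the function name code_points_with_name_matching is hypothetical.

```python
import re
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)  # the cache: one entry per distinct subpattern
def code_points_with_name_matching(subpattern: str) -> frozenset:
    """Return every code point whose character name matches subpattern.

    The inner pattern is compiled and run here, once, over all code
    points.  Nothing is shared with any outer pattern; only the
    resulting set is handed back.
    """
    inner = re.compile(subpattern)
    matched = set()
    for cp in range(0x110000):
        name = unicodedata.name(chr(cp), None)
        if name is not None and inner.search(name):  # unanchored match
            matched.add(cp)
    return frozenset(matched)


# The first call pays the enumeration cost; a repeat with the same
# subpattern is answered from the cache.
smiling = code_points_with_name_matching(r"SMILING FACE")
again = code_points_with_name_matching(r"SMILING FACE")
print(len(smiling), again is smiling)
```

Whether the cache earns its keep is exactly the open question: it only helps when the same subpattern shows up more than once, and it has to be discarded at some point.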
>     I've learned that a feature like this should be marked as experimental,
>     so that it can be refined or even removed, and marking it as such
>     lowers expectations as to its well-thought-outness and
>     bug-free-ness.  It allows us to try things out and get feedback
>     without having to say we think it is fully done.  The prototype is
>     so marked.
> Good idea, especially since a later official Unicode standard could change.
>     I've also learned that inefficiencies in compilation don't really
>     matter.  I removed an entire pass of the regex compilation process,
>     with extra mallocs being the price.  There did not seem to be a
>     noticeable change in the speed of execution of our test suite!  This
>     inefficient implementation (and I don't know another way to do it)
>     won't be noticeable in the end, because it's only done at compilation.
> I would agree with this.  You're calling this implementation 
> inefficient, but I'm not sure that word applies if there isn't a 
> substantially better way to do it.  Creating a fixed character set at 
> compile time is the thing that will make this efficient at runtime, and 
> as long as the cost at compile time is small, it's not likely to even be 
> noticed.
>     I believe PCRE doesn't do this; I don't know about other engines.  But
>     if no one does, I would think that us having a feature no one else does
>     is a selling point.  If others do, we could perhaps learn from their
>     syntax.  A quick google search didn't turn up anything obvious.
> I doubt anyone else does it yet.  If Perl has it, perhaps PCRE would 
> consider copying it later to try to maintain better compatibility with 
> Perl, but they might not even bother.
>     If there are issues with various constructs, we can forbid those.  My
>     implementation, for example, doesn't allow braces in the subpattern,
>     and hence no construct that requires braces.  I think that's a
>     reasonable initial restriction to make it easier to implement
>     something that otherwise wouldn't get implemented.
> It would be good to support balanced/escaped braces, but that can 
> certainly be a second pass...

What I'm trying to do is give people the ability to do something, while 
punting niceties that aren't essential in favor of easier development.

>     If the UTS 18 syntax is misleading, what isn't?
> I'm not even sure what you mean by this!

I meant, if the reader doesn't like the syntax, make a different proposal.

In any event, I've pushed a new branch for people to play around with 
that eliminates the anchoring, and allows for more delimiter characters 
than the initial branch did.
> Deven
