
Re: RFC: Adding \p{foo=/re/}

From: Karl Williamson
Date: February 9, 2019 17:01
Subject: Re: RFC: Adding \p{foo=/re/}
Message ID: 03196501-7710-13a2-f444-8aed2ffda08d@khwilliamson.com

On 2/9/19 2:29 AM, Deven T. Corzine wrote:
> On Sat, Feb 9, 2019 at 4:16 AM Deven T. Corzine <deven@ties.org> wrote:
> 
>     Karl, can you enlighten us?  Are you recursing into a subpattern at
>     runtime? What do you think of the hypothetical approach I described?
> 
> 
> I just read Karl’s description again: “The way it's implemented is a 
> separate regex is compiled and executed
> during the compilation of the outer one.”
> 
> I didn’t notice the “and executed” part the first time.  That sounds 
> exactly like the hypothetical implementation that I described, actually...
> 
> Deven
> 

I'm sorry for not being clear.  Deven is correct that his hypothetical 
implementation is what I have done.

This is a bolt-on feature to Perl's regexes.  It implements a portion of 
the wildcard feature that UTS 18 asks for, using its syntax.  An 
apparent goal, long listed in perlunicode, is to implement as much of 
UTS 18 as we can.

And the implementation isn't efficient.

It works like this: while compiling a character class, the compiler 
interrupts that compilation, assembles an inner pattern, then compiles 
it and executes it to find all the code points it matches. 
That list is then added to whatever else is in the character class, the 
inner pattern's space is freed, and compilation of the outer pattern 
resumes.  There is no recursive execution.  But there is recursion in 
the sense, as I described, that a second pattern is compiled while in 
the middle of compiling an outer pattern.  I don't know if that is an 
issue or not.  The patterns do not share anything, no groups, etc.
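
The steps above can be sketched roughly as follows.  This is a Python 
stand-in, not the Perl implementation: the function names 
(expand_name_wildcard, to_ranges, compile_class) are hypothetical, and 
it assumes the property being matched is the Unicode Name, purely for 
illustration.  The inner pattern is compiled on its own, run once 
against every code point's property value, and only the resulting list 
of code points reaches the outer character class.

```python
import re
import sys
import unicodedata


def expand_name_wildcard(inner_pattern):
    """Return every code point whose Unicode name matches inner_pattern."""
    inner = re.compile(inner_pattern)  # the separately compiled inner pattern
    matched = []
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), "")  # "" when the code point has no name
        if name and inner.search(name):
            matched.append(cp)
    return matched  # 'inner' is dropped here; nothing is shared with the outer pattern


def to_ranges(code_points):
    """Collapse a sorted list of code points into inclusive (lo, hi) ranges."""
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def compile_class(literal_members, inner_pattern):
    """Build a character class from literal members plus the wildcard's matches."""
    parts = [re.escape(ch) for ch in literal_members]
    for lo, hi in to_ranges(expand_name_wildcard(inner_pattern)):
        item = re.escape(chr(lo))
        if hi > lo:
            item += "-" + re.escape(chr(hi))
        parts.append(item)
    if not parts:
        raise ValueError("empty character class")
    return re.compile("[" + "".join(parts) + "]")
```

For example, compile_class("x", r"^DIGIT (ONE|TWO)$") yields a class 
that matches "x", "1" and "2".  All of the cost of scanning the code 
points is paid when the class is built; matching against the finished 
class involves no inner pattern at all.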

I've learned that a feature like this should be marked as experimental, 
so that it can be refined or even removed, and marking it as such lowers 
expectations as to its well-thought-outness and bug-free-ness.  It 
allows us to try things out and get feedback without having to say we 
think it is fully done.  The prototype is so marked.

I've also learned that inefficiencies in compilation don't really 
matter.  I removed an entire pass of the regex compilation process, with 
extra mallocs being the price.  There did not seem to be a noticeable 
change in the speed of execution of our test suite!  This inefficient 
implementation (and I don't know another way to do it) won't be 
noticeable in the end, because it's only done at compilation.

I believe PCRE doesn't do this; I don't know about other engines.  But 
if no one does, I would think that our having a feature no one else has 
is a selling point.  If others do, we could perhaps learn from their 
syntax.  A quick Google search didn't turn up anything obvious.

If there are issues with various constructs, we can forbid those.  My 
implementation, for example, doesn't allow braces in the subpattern, and 
hence no construct that requires braces.  I think that's a reasonable 
initial restriction: it makes it easier to implement something that 
otherwise wouldn't get implemented.

If the UTS 18 syntax is misleading, what isn't?
