Re: RFC: Adding \p{foo=/re/}

From: Deven T. Corzine
Date: February 12, 2019 01:13
Subject: Re: RFC: Adding \p{foo=/re/}
Message ID: CAFVdu0RQvLEG4sQVhM2CqDUG+B0+5TEJcb-ef3ZsriURCnAZGQ@mail.gmail.com

On Sat, Feb 9, 2019 at 12:01 PM Karl Williamson <public@khwilliamson.com>
wrote:

> I'm sorry for not being clear.  Deven is correct that his hypothetical
> implementation is what I have done.
>

That’s good to hear!  I was hoping it would be implemented in such a
fashion.

> This is a bolt-on feature to the Perl's regexes.  It implements a
> portion of the wildcard feature of what UTS 18 asks for, using their
> syntax.  It is an apparent goal, as long listed in perlunicode, to do as
> much of UTS 18 as we can.


Using their syntax seems worthwhile even if we already deviate elsewhere.

However, I must say that their last example in the table doesn't make sense
to me:

> Characters in the Letterlike symbol block with different toLowercase values:
>      [\p{toLowercase≠@cp@} & \p{Block=Letterlike Symbols}]


This seems to imply some sort of boolean logic, which sounds good in
principle, but this syntax seems bizarre to me.  I would expect each
\p{...} expression to be independent, but if they want two different
property matches to apply as a set intersection, I think one of these
would be a more reasonable syntax:

     \p{toLowercase≠@cp@ & Block=Letterlike Symbols}
or
     \p{{toLowercase≠@cp@} & {Block=Letterlike Symbols}}
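Whatever syntax is chosen, the intended meaning is a plain set intersection over code points.  Here is a minimal Python sketch of those semantics only, not of any Perl syntax or implementation.  It assumes the Letterlike Symbols block is U+2100..U+214F and approximates toLowercase≠@cp@ with Python's str.lower(); the function names are hypothetical.

```python
# Sketch only: models the UTS 18 example as an intersection of two sets.
# Assumption: Block=Letterlike Symbols covers U+2100..U+214F.
LETTERLIKE_SYMBOLS = range(0x2100, 0x2150)


def lowercase_differs(cp):
    """Approximate toLowercase≠@cp@: lowercasing changes the character."""
    ch = chr(cp)
    return ch.lower() != ch


def letterlike_with_lowercase():
    """Code points matched by both property expressions."""
    changed = {cp for cp in range(0x110000) if lowercase_differs(cp)}
    return changed & set(LETTERLIKE_SYMBOLS)
```

Each property expression yields its own set and the "&" combines them, which is why I would expect the operator to live inside a single construct rather than between two independent \p{...} expressions.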

Also, on the topic of syntax, since these are meant to be used for sets of
characters that can be used in a character class, I would suggest that
these \p{...} expressions should also work *outside* square brackets,
implying [\p{...}] when the brackets are omitted.  (Perhaps you
already do this too?)

> And the implementation isn't efficient.
>
> It is implemented by, during the compilation of a character class,
> interrupting that compilation, assembling an inner pattern, then
> compiling that and executing it to find all the code points it matches.
> That list is then added to whatever else is in the character class, the
> inner pattern's space is freed, and compilation of the outer pattern
> resumed.  There is no recursive execution.  But there is recursion in
> the sense, as I described, that a second pattern is compiled while in
> the middle of compiling an outer pattern.  I don't know if that is an
> issue or not.  The patterns do not share anything, no groups, etc.


As long as it's all compile-time, it's probably efficient enough
already.  Still, it might be worth keeping a cache of the \p{...}
expressions used and the set of Unicode characters each generated, to avoid
incurring the cost of generating the set if the same expressions are used
over and over again.  The cache could be discarded at the end of the
compilation phase, either for the one containing regex, or (perhaps better)
after compiling the entire program.  Beyond that, I'm not sure what else
could be done to optimize it much more.
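To make the caching idea concrete, here is a rough Python sketch, not the actual Perl implementation.  It assumes the wildcard is matched against the Unicode Name property (via the standard unicodedata module) and uses functools.lru_cache as the cache; expand_name_wildcard and compile_class are hypothetical names.

```python
import re
import sys
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def expand_name_wildcard(inner_pattern):
    """Return the frozenset of code points whose Unicode name matches.

    The inner pattern is compiled and run on its own, once per distinct
    expression; repeated uses are answered from the cache.
    """
    inner = re.compile(inner_pattern)
    matched = set()
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        if name and inner.search(name):
            matched.add(cp)
    return frozenset(matched)


def compile_class(inner_pattern, extra_chars=""):
    """Add the wildcard's matches to the other members of the class."""
    return expand_name_wildcard(inner_pattern) | {ord(c) for c in extra_chars}
```

Calling expand_name_wildcard.cache_clear() once compilation is finished would correspond to discarding the cache at the end of the compilation phase.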


> I've learned that a feature like this should be marked as experimental,
> so that it can be refined or even removed, and marking it as such lowers
> expectations as to its well-thought-outness and bug-free-ness.  It
> allows us to try things out and get feedback without having to say we
> think it is fully done.  The prototype is so marked.
>

Good idea, especially since the official Unicode standard could change later.


> I've also learned that inefficiencies in compilation don't really
> matter.  I removed an entire pass of the regex compilation process, with
> extra mallocs being the price.  There did not seem to be a noticeable
> change in the speed of execution of our test suite!  This inefficient
> implementation (and I don't know another way to do it) won't be
> noticeable in the end, because it's only done at compilation.
>

I would agree with this.  You're calling this implementation inefficient,
but I'm not sure that word applies if there isn't a substantially better
way to do it.  Creating a fixed character set at compile time is the thing
that will make this efficient at runtime, and as long as the cost at
compile time is small, it's not likely to even be noticed.


> I believe PCRE doesn't do this; I don't know about other engines.  But
> if no one does, I would think that us having a feature no one else does
> is a selling point.  If others do, we could perhaps learn from their
> syntax.  A quick google search didn't turn up anything obvious.
>

I doubt anyone else does it yet.  If Perl has it, perhaps PCRE would
consider copying it later to try to maintain better compatibility with
Perl, but they might not even bother.


> If there are issues with various constructs, we can forbid those.  My
> implementation, for example, doesn't allow braces in the subpattern, and
> hence no construct that requires braces.  I think that's a reasonable
> initial restriction to make it easier to implement something, that
> otherwise wouldn't get implemented.
>

It would be good to support balanced/escaped braces, but that can certainly
be a second pass...
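For what it's worth, finding the closing brace while honoring balanced and escaped braces is a short scan.  The Python sketch below is only an illustration of that scan, not how Perl's regex parser works; find_closing_brace is a hypothetical helper, and it deliberately ignores braces inside bracketed character classes such as [{].

```python
def find_closing_brace(text, start):
    """Return the index of the '}' that closes the '{' at text[start].

    A backslash escapes the following character; unescaped braces nest.
    Raises ValueError if text[start] is not '{' or the brace never closes.
    """
    if text[start] != "{":
        raise ValueError("expected '{' at start")
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    raise ValueError("unbalanced braces")
```

With something like this, a subpattern containing a quantifier such as A{2,3} would no longer end the \p{...} construct early.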


> If the UTS 18 syntax is misleading, what isn't?
>

I'm not even sure what you mean by this!

Deven
