develooper Front page | perl.perl5.porters | Postings from February 2019

Re: RFC: Adding \p{foo=/re/}

From: Deven T. Corzine
Date: February 9, 2019 05:26
Subject: Re: RFC: Adding \p{foo=/re/}
Message ID: CAFVdu0R3TQLEKDN4R9WPO5+rfyvAEakjrRbmGtMZT4e_ohrmLA@mail.gmail.com

On Fri, Feb 8, 2019 at 11:56 PM demerphq <demerphq@gmail.com> wrote:

> Yes I do have concerns. I replied in detail in another email, but to
> summarize succinctly, there are many features in the regex engine, how
> does this new proposal interact with them? How do we ensure that using
> this feature does not result in quadratic performance when an
> equivalent pattern using a different feature set would be linear?
>

I saw your other email, but I think this proposal is a different kind of
feature, one that shouldn't need to behave like named recursion.

Quote from the UTS 18 link: "this feature allows the use of a regular
expression to pick out a set of characters based on whether the property
values match the regular expression."

If I understand correctly, any regex used in this mechanism would be matched
against the property values of Unicode characters, NOT against arbitrary
text.  Since the Unicode data is static, I see no reason why the property
regex shouldn't be compiled independently AND executed immediately, while
compiling the containing regex.  The result would then function as a
fixed, predefined class of Unicode characters, much like a POSIX
character class but specified in a more dynamic and flexible manner.  The
containing regex should also be able to include this property-based
class inside a normal bracketed character class.  Since the property regex
is executed entirely at compile time, there is no risk of making the
containing regex turn quadratic, nor should there be any interaction with
captures or anything else.

For example, from UTS 18 again, the property expression \p{toNfd=/b/} could
be compiled into [\x{0062}\x{1e03}\x{1e05}\x{1e07}], with exactly the same
runtime semantics and performance characteristics, and the property
expression \p{name=/^LATIN LETTER.*P$/} could similarly be compiled into
[\x{01aa}\x{0294}\x{0296}\x{1d18}], etc.
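
To make the idea concrete, here is a rough sketch in Python (not Perl, and
not a claim about how Perl's regex compiler works or would implement this).
It runs a property regex once over the Unicode data and turns the matching
code points into a fixed bracketed class.  The helper names
expand_property_regex, to_nfd, char_name and as_char_class are made up for
illustration, and the exact set produced depends on the Unicode version that
ships with the Python interpreter.

```python
import re
import sys
import unicodedata


def to_nfd(ch):
    """Hypothetical stand-in for the toNfd property: the NFD form of ch."""
    return unicodedata.normalize("NFD", ch)


def char_name(ch):
    """Hypothetical stand-in for the Name property ("" if ch is unnamed)."""
    return unicodedata.name(ch, "")


def expand_property_regex(prop_value, pattern, limit=sys.maxunicode + 1):
    """Return the sorted code points below `limit` whose property value
    matches `pattern`.  This is the one-time "compile time" step: the
    regex only ever sees property values, never the subject text."""
    rx = re.compile(pattern)
    matched = []
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        if rx.search(prop_value(chr(cp))):
            matched.append(cp)
    return matched


def as_char_class(code_points):
    """Render code points as a bracketed class such as [\\x{0062}\\x{1e03}]."""
    return "[" + "".join(f"\\x{{{cp:04x}}}" for cp in code_points) + "]"


# Scan only U+0000..U+1FFF here to keep the demonstration quick.
print(as_char_class(expand_property_regex(to_nfd, r"b", limit=0x2000)))
print(as_char_class(
    expand_property_regex(char_name, r"^LATIN LETTER.*P$", limit=0x2000)))
```

Once the class has been built, matching against it costs the same as
matching against any other fixed character class; the property regex itself
is never run against the subject string.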

If these property regular expressions were compiled and executed at compile
time like this, and turned into straightforward Unicode character classes
to use at runtime, wouldn't that avoid the concerns you mentioned in the
other email?

Deven
