develooper Front page | perl.perl5.porters | Postings from June 2012

RFC: Add set operations for regexes

From: Karl Williamson
Date: June 9, 2012 20:53
Subject: RFC: Add set operations for regexes
Message ID: 4FD41A21.6080006@khwilliamson.com

This is the beginning of a proposal to put set operations into the Perl 
core.  There is a CPAN module, Unicode::Regex::Set, but as its 
documentation states, 
"The syntax provided by this module is considerably incompatible with 
the standard Perl's regex syntax."

This is the last major area in which Perl does not conform to Unicode's 
Level 1 (the most basic) regular expression support (if you accept using 
the CPAN module Unicode::LineBreak to cover the other gap).

I propose starting with Yves' idea from the last time this was discussed 
on p5p, and that is to require the set operations to be wrapped within 
the currently illegal "(?[ ... ])".

The proposal is based somewhat on the Perl 6 syntax.  There would be 4 
binary operators:

& for intersection
| for union
- for subtraction
^ for the symmetric difference (the union minus the intersection; 
essentially an exclusive or)

And one unary operator:

! for negation
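
To make the intended meaning of the five operators concrete, here is a minimal sketch that models them on plain Perl hashes.  None of this is proposed syntax: the helper names (make_set, sym_diff and so on) are hypothetical, and the universe is restricted to ASCII purely as an assumption so that negation stays finite.

```perl
use strict;
use warnings;

# Assumption: an ASCII-only universe, so that negation is a finite set.
my @UNIVERSE = (0 .. 127);

# A set is a hash ref whose keys are code points (hypothetical helpers).
sub make_set     { my %s = map { $_ => 1 } @_; return \%s }
sub intersection { my ($x, $y) = @_; return make_set(grep {  $y->{$_} } keys %$x) }
sub union        { my ($x, $y) = @_; return make_set(keys %$x, keys %$y) }
sub subtraction  { my ($x, $y) = @_; return make_set(grep { !$y->{$_} } keys %$x) }
sub negation     { my ($x)     = @_; return make_set(grep { !$x->{$_} } @UNIVERSE) }

# Symmetric difference, exactly as described: the union minus the intersection.
sub sym_diff {
    my ($x, $y) = @_;
    return subtraction(union($x, $y), intersection($x, $y));
}

# Render a set as a string of its characters in code point order.
sub members {
    my ($s) = @_;
    return join '', map { chr($_) } sort { $a <=> $b } keys %$s;
}

my $lower = make_set(map { ord($_) } 'a' .. 'f');
my $vowel = make_set(map { ord($_) } qw(a e i o u));

print "&: ", members(intersection($lower, $vowel)), "\n";   # ae
print "|: ", members(union($lower, $vowel)),        "\n";   # abcdefiou
print "-: ", members(subtraction($lower, $vowel)),  "\n";   # bcdf
print "^: ", members(sym_diff($lower, $vowel)),     "\n";   # bcdfiou
print "!: ", scalar(keys %{ negation($lower) }), " code points\n";   # 122
```

A real implementation would of course work on inversion lists or similar rather than on one hash key per code point; the sketch only pins down what each operator is meant to return.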

Operands would be code points or sets of code points, specified by one of:

1) an escape sequence, like \n or \p{sc=greek} or \N{COLON}
2) a literal character preceded by a backslash, like \:, but only where 
no ambiguity is ever possible between this and an escape sequence
3) a bracketed character class, like [a-z0-5], perhaps modified slightly 
from the Perl 5 character class, as described below.  Note that this 
gives an alternative to the "|" operator for specifying union.

Parentheses could be used for grouping, but would not capture.

The /x modifier would be considered to be always on within this 
construct, and no unescaped literal characters would be allowed outside 
bracketed character classes.

The changes to bracketed character classes that I'm considering are:
1) Like Perl 6, allow white space between elements, thus requiring 
literal white space to be escaped
2) Forbid adjacent doubled characters, so that [abc&&def] would be 
prohibited
3) As a possible exception to 2), allow the Perl 6 construct of a doubled 
period to mean the same thing as a hyphen: a range.
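
For context on change 1), here is a small sketch of how existing Perl 5 treats white space under /x.  It shows that a blank inside a bracketed class is currently literal even with /x, which is the behavior the change would alter, and that the escaped form a literal blank would then require is already accepted today.  The variable names are only illustrative.

```perl
use strict;
use warnings;

# Today, /x ignores white space in the pattern, but not inside [...]:
my $space_in_class = ' '  =~ /^[a b]$/x  ? 1 : 0;   # blank inside [...] is literal
my $space_outside  = 'ab' =~ /^ a b $/x  ? 1 : 0;   # blanks outside [...] are ignored

# Under the proposed rule a literal blank would have to be escaped,
# a spelling that existing Perl 5 already accepts:
my $escaped_space  = ' '  =~ /^[a\ b]$/x ? 1 : 0;

print "in class: $space_in_class, outside: $space_outside, ",
      "escaped: $escaped_space\n";
```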

Perl 6 gives each of these operators the same precedence as the operator 
that uses the same symbol in normal constructs.  Thus ^ has the same 
precedence in Perl 6 as the exclusive-or operator, simply because they 
share the same symbol and have analogous meanings.

I was hoping not to have to deal with precedence, as it complicates 
parsing.  Perhaps there are tools that make this easy?
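
One technique that keeps precedence handling small is precedence climbing.  The following is only a sketch, written in Perl rather than in the C the regex compiler would need, and it rests on assumptions: the binding powers are copied from Perl 5's own operator table ("-" tighter than "&", which is tighter than "|" and "^"), sets are modeled as bitmasks over a hypothetical 8-element universe, and operands are bare names looked up in a hash.  The proposal itself does not settle any of these.

```perl
use strict;
use warnings;

# Assumed binding powers, mirroring Perl 5's arithmetic/bitwise operators.
my %PREC = ('|' => 1, '^' => 1, '&' => 2, '-' => 3);
my $ALL  = 0xFF;    # hypothetical universe: one bit per element

my %APPLY = (
    '&' => sub { $_[0] & $_[1] },
    '|' => sub { $_[0] | $_[1] },
    '^' => sub { $_[0] ^ $_[1] },
    '-' => sub { $_[0] & ~$_[1] & $ALL },
);

# Operand: "!" operand, a parenthesized expression, or a named set.
sub parse_operand {
    my ($tokens, $sets) = @_;
    my $tok = shift @$tokens;
    die "unexpected end of input\n" unless defined $tok;
    return ~parse_operand($tokens, $sets) & $ALL if $tok eq '!';
    if ($tok eq '(') {
        my $value = parse_expr($tokens, $sets, 1);
        my $close = shift @$tokens;
        die "missing ')'\n" unless defined $close && $close eq ')';
        return $value;
    }
    die "unknown set '$tok'\n" unless exists $sets->{$tok};
    return $sets->{$tok};
}

# Precedence climbing: consume operators that bind at least as tightly as $min.
sub parse_expr {
    my ($tokens, $sets, $min) = @_;
    my $left = parse_operand($tokens, $sets);
    while (@$tokens && exists $PREC{ $tokens->[0] }
                    && $PREC{ $tokens->[0] } >= $min) {
        my $op    = shift @$tokens;
        my $right = parse_expr($tokens, $sets, $PREC{$op} + 1);  # left-associative
        $left = $APPLY{$op}->($left, $right);
    }
    return $left;
}

# Tokenize (white space is skipped, as with /x) and evaluate.
sub evaluate {
    my ($text, $sets) = @_;
    my @tokens = $text =~ /([A-Za-z]+|[&|^!()-])/g;
    my $value  = parse_expr(\@tokens, $sets, 1);
    die "trailing input\n" if @tokens;
    return $value;
}

my %sets = (A => 0b00001111, B => 0b00111100, C => 0b11110000);
printf "%-12s => %08b\n", $_, evaluate($_, \%sets)
    for 'A | B & C', '(A | B) & C', 'A - B & C', '!A & B';
```

The whole precedence logic lives in the one loop in parse_expr, so choosing a different table (or a flat one with mandatory parentheses) only means editing %PREC.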

This construct would compile into the same atomic node type as a 
bracketed character class now gives.

Examples:

  (?[ \w & ( \p{Greek} | \p{Latin} ) ])+

Matches Latin and Greek word characters.  And

  (?[ \s - \ck ])

makes sure that vertical tab is not included in the white space match. 
I think \ck is obscure, though (it does mean the same thing in EBCDIC); 
I would prefer \N{VT} or \N{VERTICAL TAB}.
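
For comparison, both examples can be approximated in existing Perl 5 with lookarounds.  This is a sketch of a workaround, not of the proposed construct: unlike "(?[ ... ])", each pattern below is a multi-node sequence rather than a single character-class node, and the variable names are just for illustration.

```perl
use strict;
use warnings;

# Intersection: assert one set with a lookahead, then match the other.
my $greek_or_latin_word = qr/(?:(?=[\p{Greek}\p{Latin}])\w)+/;

# Subtraction: reject the unwanted code point with a negative lookahead.
my $space_but_not_vt = qr/(?!\x0B)\s/;    # \x0B is VERTICAL TAB, i.e. \ck

# GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER BETA, "z", then a digit.
# The digit is a word character but belongs to neither script.
my ($run) = "\x{3B1}\x{3B2}z9" =~ /\A($greek_or_latin_word)/;
print "matched ", length($run), " characters\n";    # 3

print "tab: ", ("\t"   =~ /\A$space_but_not_vt\z/ ? "match" : "no match"), "\n";
print "VT:  ", ("\x0B" =~ /\A$space_but_not_vt\z/ ? "match" : "no match"), "\n";
```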
