This is the beginnings of a proposal to put set operations in the Perl
core. There is a CPAN module, Unicode::Regex::Set, but as it states,
"The syntax provided by this module is considerably incompatible with
the standard Perl's regex syntax."
This is the last major area in which Perl does not conform to Unicode's
Level 1 (the most basic) regular expression support (if you accept using
the CPAN module Unicode::LineBreak to cover the other gap).
I propose starting with Yves' idea from the last time this was discussed
on p5p, and that is to require the set operations to be wrapped within
the currently illegal "(?[ ... ])".
The proposal is based somewhat on the Perl 6 syntax. There would be 4
binary operators:
& for intersection
| for union
- for subtraction
^ for the symmetric difference (the union minus the intersection--
essentially an exclusive or)
And one unary operator:
! for negation
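
To make the intended semantics concrete, here is a minimal sketch in Python rather than Perl, since the proposed syntax does not exist yet. It models a set of code points as a Python set of integers. The names UNIVERSE, word, digits and lower are made up for illustration, and the universe is cut down to ASCII only to keep the output small.

```python
# Model a "set of code points" as a Python set of ints.
# UNIVERSE is restricted to ASCII purely to keep the example small;
# the proposal itself is about all Unicode code points.
UNIVERSE = set(range(0x80))

word = {cp for cp in UNIVERSE if chr(cp).isalnum() or chr(cp) == "_"}
digits = {cp for cp in UNIVERSE if chr(cp).isdigit()}
lower = set(range(ord("a"), ord("z") + 1))

intersection = word & digits           # &  members of both
union = digits | lower                 # |  members of either
subtraction = word - digits            # -  in the left set but not the right
symmetric = (digits | lower) ^ word    # ^  in exactly one of the two
negation = UNIVERSE - word             # !  everything not in the set


def show(cps):
    """Render a set of code points as a sorted string."""
    return "".join(chr(cp) for cp in sorted(cps))


print(show(intersection))
print(show(symmetric))
```

The symmetric difference is the union minus the intersection, which is easy to check directly: `(a | b) - (a & b) == a ^ b` holds for any two Python sets.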
Operands would be code points or sets of code points, specified by one of:
1) an escape sequence, like \n or \p{sc=greek} or \N{COLON}
2) a literal character preceded by a backslash, like \:, but only where
no ambiguity between this and an escape sequence is ever possible
3) a bracketed character class, like [a-z0-5], perhaps modified slightly
from the Perl 5 character class, as described below. Note that this
gives an alternative to the "|" operator for specifying union.
Parentheses could be used for grouping, but would not capture.
The /x modifier would be considered to be always on within this
construct, and no literal characters would be allowed outside
bracketed character classes.
The changes to bracketed character classes that I'm considering are:
1) Like Perl 6, allow white space between elements, thus requiring
literal white space to be escaped
2) Forbid adjacent doubled characters, so that [abc&&def] would be
prohibited
3) As a possible exception to 2), allow the Perl 6 construct of a
doubled period to mean the same thing as a hyphen: a range.
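
The three changes can be sketched as a small parser for the body of a bracketed class. This is a Python illustration under stated assumptions, not a description of how Perl would implement it: parse_class is a hypothetical name, rule 2 is read as rejecting any unescaped character that is immediately repeated (other than ".."), and a leading or trailing range operator is treated as an error to keep the sketch simple.

```python
def parse_class(text):
    """Return the set of characters denoted by a class body like 'a-z 0..5'."""
    tokens = []  # list of (kind, value); kind is "char" or "range"
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1  # rule 1: unescaped white space is ignored
        elif ch == "\\":
            if i + 1 >= len(text):
                raise ValueError("trailing backslash")
            tokens.append(("char", text[i + 1]))  # escaped literal
            i += 2
        elif text.startswith("..", i):
            tokens.append(("range", ".."))  # rule 3: '..' means a range
            i += 2
        elif text[i + 1:i + 2] == ch:
            # rule 2 (as assumed here): no unescaped doubled characters
            raise ValueError(f"doubled character {ch!r}")
        elif ch == "-":
            tokens.append(("range", "-"))
            i += 1
        else:
            tokens.append(("char", ch))
            i += 1

    result = set()
    j = 0
    while j < len(tokens):
        kind, value = tokens[j]
        if kind == "range":
            raise ValueError("range operator without a left endpoint")
        if j + 1 < len(tokens) and tokens[j + 1][0] == "range":
            if j + 2 >= len(tokens) or tokens[j + 2][0] != "char":
                raise ValueError("range operator without a right endpoint")
            lo, hi = ord(value), ord(tokens[j + 2][1])
            if lo > hi:
                raise ValueError("range endpoints out of order")
            result.update(chr(cp) for cp in range(lo, hi + 1))
            j += 3
        else:
            result.add(value)
            j += 1
    return result
```

With this reading, `parse_class("a-z 0..5")` and `parse_class("a-z0-5")` give the same set, a literal space has to be written as backslash-space, and `parse_class("abc&&def")` raises ValueError.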
Perl 6 gives each of these operators the same precedence as the
operator that uses the same symbol in ordinary expressions. Thus ^
has the same precedence in Perl 6 as the exclusive-or operator, simply
because they share the same symbol and have analogous meanings.
I was hoping not to have to deal with precedence, as it complicates
parsing. Perhaps there are tools that make this easy?
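
One technique that keeps precedence handling small is precedence climbing, where the whole precedence scheme lives in one table. The following is a Python sketch only, evaluating expressions over named sets. The PRECEDENCE values are an assumption made for illustration (the proposal does not settle on any), operands are restricted to hypothetical names looked up in env, and all binary operators are treated as left-associative.

```python
import operator
import re

# Hypothetical precedence table: a higher number binds tighter.
# Editing this dict is all it takes to try a different scheme.
PRECEDENCE = {"&": 2, "|": 1, "^": 1, "-": 1}
APPLY = {"&": operator.and_, "|": operator.or_,
         "^": operator.xor, "-": operator.sub}
TOKEN = re.compile(r"\s*(?:([&|^!()-])|(\w+))")


def tokenize(text):
    text = text.rstrip()
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at offset {pos}")
        out.append(m.group(1) or m.group(2))
        pos = m.end()
    return out


def evaluate(text, env, universe):
    """Evaluate a set expression; env maps operand names to sets."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def primary():
        tok = advance()
        if tok == "!":                      # unary negation binds tightest
            return universe - primary()
        if tok == "(":                      # grouping, no capture
            value = expression(1)
            if advance() != ")":
                raise ValueError("missing ')'")
            return value
        if tok in env:
            return env[tok]
        raise ValueError(f"unexpected token {tok!r}")

    def expression(min_prec):
        left = primary()
        while peek() in PRECEDENCE and PRECEDENCE[peek()] >= min_prec:
            op = advance()
            # Requiring strictly higher precedence on the right-hand side
            # makes operators of equal precedence left-associative.
            right = expression(PRECEDENCE[op] + 1)
            left = APPLY[op](left, right)
        return left

    result = expression(1)
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result
```

For example, with `env = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 5}}`, the expression "A | B & C" groups as A | (B & C) under this table, while "(A | B) & C" has to be parenthesized explicitly.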
This construct would compile into the same atomic node type as a
bracketed character class now gives.
Examples:
(?[ \w & ( \p{Greek} | \p{Latin} ) ])+
Matches Latin and Greek word characters. And
(?[ \s - \ck ])
makes sure that vertical tab is not included in the white space match.
I think \ck is obscure, though (it is the same in EBCDIC as well);
I would prefer \N{VT} or \N{VERTICAL TAB}.
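
For comparison, set subtraction can already be emulated with a negative lookahead in engines that support one. The sketch below uses Python's re module rather than Perl, and covers only the second example, because Python's re has no \p{...} script properties. The name space_not_vt is made up for illustration.

```python
import re

# Emulates (?[ \s - \ck ]): one whitespace character that is not U+000B.
# The lookahead rejects vertical tab before \s gets a chance to match it.
space_not_vt = re.compile(r"(?:(?!\x0b)\s)")

sample = "a \t\x0b\nb"
print(space_not_vt.findall(sample))
```

The lookahead form works, but it hides the intent and does not compile to a single character-class node, which is part of what the proposed construct is meant to improve on.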