RFC: New regex modifier flags
From: karl williamson
August 3, 2010 12:23
Message ID: 4C586C7C.email@example.com
This is another go round in what to do about this. I hope this isn't
too long. I give extensive background, and then a place to vote your
preference at the end.
The background: It is proposed to add new modifiers to the syntax of
regular expressions, like the existing /msixopgcre... ones. These
modifiers would specify the character set semantics to use in
interpreting this regular expression. There are two existing possible
interpretations: one that applies when 'use locale' is in effect at the
time the regex is compiled, and one that applies when it isn't. Currently,
the regex compiles one way in the first situation and another way in the other.
The problem with the existing approach is that if the regex is
interpolated into another regex later, it is re-compiled under the
interpretation existing at the point of the interpolation; what it was
compiled under is lost. If compiled not under locale, but interpolated
under locale, it will suddenly have locale semantics; and vice versa.
The only reasonable way to preserve the semantics for later
interpolation is to change the stringification of the regex to include
what semantics it was originally compiled with.
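
As a rough illustration of why stringification is the natural place to carry this, here is a minimal sketch using an existing flag, /i, which already survives interpolation because it is embedded in the stringified form. The variable names are invented for the example, and the exact stringified text varies between Perl versions, so it is printed rather than assumed.

```perl
use strict;
use warnings;

my $inner     = qr/abc/i;      # compiled here, with /i in effect
my $as_string = "$inner";      # stringification carries the /i flag along
print "$as_string\n";

# Interpolation re-compiles the outer pattern from that string, so the
# inner part stays case-insensitive while 'def' remains case-sensitive.
my $outer = qr/^${inner}def$/;
print "ABCdef: ", ("ABCdef" =~ $outer ? "match" : "no match"), "\n";
print "ABCDEF: ", ("ABCDEF" =~ $outer ? "match" : "no match"), "\n";
```

The locale/non-locale choice has no such marker in the stringified form, which is why it is lost on interpolation.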
We have been living with that, I presume, since 'use locale' was
introduced. The issue has come to a head though, because of the
introduction of yet another semantics scheme: Unicode. Currently in
regexes, the characters whose ordinals are between 128 and 255 do not
always get the semantics Unicode assigns them. They do on EBCDIC
platforms, or (except for a number of bugs) if the pattern or target is
in utf8 or if the target is being compared against a Unicode property;
otherwise they don't. Thus the behavior of such characters on ASCII
platforms is dualistic. Sometimes they behave as Unicode; sometimes as
Posix. 'use feature "unicode_strings"' was created to tell Perl to
always use Unicode semantics under its scope. It currently works only
on the case-changing functions (uc(), et al.), but I have a patch
prepared that extends that to include regexes.
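
The dualistic behavior can be sketched as follows. This assumes an ASCII platform, with no 'use locale' and no 'unicode_strings' feature in the enclosing scope; the variable names are invented for the example. Under those assumptions the two strings compare equal, yet only the utf8-encoded one is expected to match \w, and uc() is expected to change the character only inside the 'unicode_strings' block.

```perl
use strict;
use warnings;

# Assumes an ASCII platform, with no 'use locale' and no
# 'unicode_strings' feature in the enclosing scope.
my $native   = "\xE9";        # e with acute accent, stored as a single byte
my $upgraded = "\xE9";
utf8::upgrade($upgraded);     # same character, now stored internally as utf8

my $native_is_word   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_is_word = $upgraded =~ /\w/ ? 1 : 0;
my $same_string      = $native eq $upgraded ? 1 : 0;

print "same string: $same_string\n";
print "native matches \\w: $native_is_word\n";
print "upgraded matches \\w: $upgraded_is_word\n";

my $uc_default = uc($native);
my $uc_unicode;
{
    use feature 'unicode_strings';   # Unicode semantics for uc() in this block
    $uc_unicode = uc($native);
}
printf "uc default: %02X, uc under unicode_strings: %02X\n",
    ord($uc_default), ord($uc_unicode);
```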
We don't want regexes compiled under Unicode semantics to lose that
information when interpolated in a scope where those don't apply; and
vice versa. This means the stringification of the regex requires
something that records this, as well as whether locale applies.
In fact all the semantic schemas are mutually exclusive: There are 3
cases: traditional (dualistic), locale, and unicode. I can see a 4th
coming along: native, which on ASCII machines would force a Posix
interpretation, one in which, for example, \d only matches [0-9]. I
haven't thought through the consequences of this on utf8 strings.
To accommodate the above, we need to change the stringification of
regexes. Since we're having to do that, it has been thought that we
might as well create new modifiers so that the semantics could be
specified explicitly on a regex, in addition to being implied by the
pragmas that are in effect. This would allow overriding the current
default for individual regexes.
The problem is what those modifiers should be. Perl currently allows a
keyword to come right after a regex, as in '/abc/lt 1', where 'lt' is the
string comparison operator rather than two flags. Code is being
added in 5.14 to deprecate that, but we are stuck with that until at
least 5.16. Thus the original implementation I did a few months ago
which used both 'l' and 't' might break existing programs. It was noted
then that these are the only modifiers that are mutually exclusive.
Even /ms are not mutually exclusive, but these are: you can have only a
single semantic interpretation in effect at a time. Several people said
that therefore these new modifiers should be distinguished from the
regular ones in some way.
There are two obvious ways to do that. One is to have each of these be an
argument to a master flag, like 'Cl', 'Cu', etc, where the C stands for
character set semantics (or one could use an 'S' instead). (I suppose
that the master flag need not be an alpha.) But this approach is a
break from tradition, and when reading such a regex it can be confusing
to have single and multi-char flags intermixed, like '/abc/mClip'.
There may be other things we haven't thought of. It turns out the
implementation is not hard.
The second way is to have single letter flags, but capitalize those
flags that are mutually exclusive. In /abc/LU, for example, the U would
override the L, and users would come to accept that the
capitalization meant this. The problem I see with this approach is that
it means that the only capital letters ever used would have to be for
the purpose of defining the character set semantics. I don't like tying
the hands of future porters.
Another possibility is to just not distinguish these mutually exclusive
options from regular options. They would all be lower case. The
standard Unix philosophy of "last option on the line overrides all
previous conflicting ones" would apply. If we chose this approach,
there are several sub-possibilities:
1) We choose option letters that don't conflict with any keyword
beginnings. The problem is that the obvious letters do.
2) Until 5.16 require that anyone using these options do so in the
(?:...) form, thus avoiding any ambiguities.
3) Implement them in such a way that until 5.16 any ambiguity is
resolved in favor of the interpretation that a keyword is meant; there
already is a deprecation message generated whenever such an
interpretation would happen. The implementation of this isn't difficult.
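
The "last option overrides" rule can be sketched as below. This is only an illustration, not the proposed implementation: the helper resolve_flags() and the letters l, u and t (taken from the examples in this message) are hypothetical, and the default is passed in to stand for whatever the pragmas in effect would imply.

```perl
use strict;
use warnings;

# Hypothetical letters for the three mutually exclusive semantics.
my %charset_for = (l => 'locale', u => 'unicode', t => 'traditional');

# Hypothetical helper: split a flag string into the winning character set
# semantics and the remaining ordinary flags.
sub resolve_flags {
    my ($flags, $default) = @_;
    my $charset = $default;        # used when no charset letter is given
    my $others  = '';
    for my $flag (split //, $flags) {
        if (exists $charset_for{$flag}) {
            $charset = $charset_for{$flag};   # a later letter replaces an earlier one
        }
        else {
            $others .= $flag;                 # ordinary flags simply accumulate
        }
    }
    return ($charset, $others);
}

my ($charset, $others) = resolve_flags('milu', 'traditional');
print "charset: $charset, other flags: $others\n";
```

With the flag string 'milu', the trailing u replaces the earlier l, and m and i pass through untouched.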
I think that's it. Here is your ballot (I briefly note what I think are the
up/downsides. Feel free to add things.)
1) Do nothing. Don't implement this, leave the Unicode bugs there.
2) Use two-letter options for the mutually exclusive ones. Extensible,
visually distinguishable, but /abc/Clip may be hard to read.
3) Use single capital letters for these options; distinguishable, not
extensible.
4) Use single letter lower case option letters, but choose non-mnemonic
ones to avoid ambiguities with existing keywords. They are not visibly
distinguished from options that aren't mutually exclusive.
5) Use single letter lower case option letters, but until 5.16 they are
only valid in the (?:) form. They are not visibly distinguished from
options that aren't mutually exclusive.
6) Use single letter lower case option letters implemented in such a way
that until 5.16 ambiguities are resolved to the existing
interpretations. The option letters are not visibly distinguished from
options that aren't mutually exclusive. After the deprecation cycle in
5.14, there won't be these ambiguities. My vote goes to this one.