develooper Front page | perl.perl5.porters | Postings from August 2010

RFC: New regex modifier flags

karl williamson
August 3, 2010 12:23
This is another go round in what to do about this.  I hope this isn't 
too long.  I give extensive background, and then a place to vote your 
preference at the end.

The background:  It is proposed to add new modifiers to the syntax of 
regular expressions, like the existing /msixopgcre... ones.  These 
modifiers would specify the character set semantics to use in 
interpreting this regular expression.  There are two existing possible 
interpretations: one for when 'use locale' is in effect at the time the 
regex is compiled, and the other for when it isn't.  Currently, the 
regex compiles one way in the first situation and another way in the second.

The problem with the existing approach is that if the regex is 
interpolated into another regex later, it is re-compiled under the 
interpretation existing at the point of the interpolation; what it was 
compiled under is lost.  If compiled not under locale, but interpolated 
under locale, it will suddenly have locale semantics; and vice versa. 
The only reasonable way to preserve the semantics for later 
interpolation is to change the stringification of the regex to include 
what semantics it was originally compiled with.
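The mechanism can be seen with a flag that stringification already preserves.  The following is a minimal sketch, not taken from the original message; it uses /i as a stand-in, and the variable names are made up for illustration.  The exact stringified form differs between perl versions, so no particular output is shown for it.

```perl
use strict;
use warnings;

# A regex compiled with /i; the flag is recorded in the compiled object.
my $inner = qr/abc/i;

# Stringifying a qr// object yields a pattern that carries its flags
# inside an inline-modifier group; the exact spelling depends on the
# perl version.
my $as_string = "$inner";

# Interpolation splices that string into the new pattern and recompiles
# it, so the inner /i survives even though the outer regex has no /i.
my $outer = qr/x${inner}y/;

my $inner_is_caseless = ("xABCy" =~ $outer) ? 1 : 0;
my $outer_is_caseless = ("XabcY" =~ $outer) ? 1 : 0;

print "stringified: $as_string\n";
print "inner part case-insensitive: $inner_is_caseless\n";   # 1
print "outer part case-insensitive: $outer_is_caseless\n";   # 0
```

The character set semantics have no such marker in the stringified form, which is why they are lost at the point of interpolation.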

We have been living with that, I presume, since 'use locale' was 
introduced.  The issue has come to a head though, because of the 
introduction of yet another semantics scheme: Unicode.  Currently in 
regexes, the characters whose ordinals are between 128 and 255 do not 
always get the semantics Unicode assigns them.  They do on EBCDIC 
platforms, or (except for a number of bugs) if the pattern or target is 
in utf8 or if the target is being compared against a Unicode property; 
otherwise they don't.  Thus the behavior of such characters on ASCII 
platforms is dualistic.  Sometimes they behave as Unicode; sometimes as 
Posix.  'use feature "unicode_strings"' was created to tell Perl to 
always use Unicode semantics under its scope.  It currently works only 
on the case-changing functions (uc(), et al.), but I have a patch 
prepared that extends that to include regexes.
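The dualistic behavior can be illustrated with a short sketch.  It assumes an ASCII platform, no 'use locale', and the traditional (non-unicode_strings) rules; the variable names are illustrative only.  The same one-character string matches \w or not depending solely on its internal representation.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make sure the traditional rules apply

# U+00E9 (e with acute accent), stored as a single native byte.
my $s = "\xE9";

my $before = ($s =~ /\w/) ? 1 : 0;

# Change only the internal representation to utf8; the string still
# holds the same single character.
utf8::upgrade($s);

my $after = ($s =~ /\w/) ? 1 : 0;

print "matches \\w as a native string: $before\n";   # 0 on ASCII platforms
print "matches \\w once upgraded:      $after\n";    # 1
```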

We don't want regexes compiled under Unicode semantics to lose that 
information when interpolated in a scope where those don't apply; and 
vice versa.  This means the stringification of the regex requires 
something that tells this, as well as if locale applies.

In fact all the semantic schemas are mutually exclusive:  There are 3 
cases: traditional (dualistic), locale, and unicode.  I can see a 4th 
coming along: native, which on ASCII machines would force a Posix 
interpretation, one in which, for example, \d only matches [0-9].  I 
haven't thought through the consequences of this on utf8 strings.
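As a small illustration of what a "native" interpretation would change, the sketch below shows that \d, applied to a utf8 string under the existing rules, matches digits beyond [0-9].  The variable names and the sample character are chosen for illustration only.

```perl
use strict;
use warnings;

my $ascii_digit  = "7";
my $arabic_digit = "\x{0663}";   # ARABIC-INDIC DIGIT THREE

# Under Unicode rules \d matches any decimal digit character.
my $d_ascii  = ($ascii_digit  =~ /^\d$/) ? 1 : 0;
my $d_arabic = ($arabic_digit =~ /^\d$/) ? 1 : 0;

# An explicit bracketed class matches only the ten ASCII digits, which
# is what \d would be restricted to under the hypothetical "native"
# interpretation.
my $c_ascii  = ($ascii_digit  =~ /^[0-9]$/) ? 1 : 0;
my $c_arabic = ($arabic_digit =~ /^[0-9]$/) ? 1 : 0;

print "\\d:    ascii=$d_ascii arabic-indic=$d_arabic\n";   # 1 and 1
print "[0-9]: ascii=$c_ascii arabic-indic=$c_arabic\n";    # 1 and 0
```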

To accommodate the above, we need to change the stringification of 
regexes.  Since we're having to do that, it has been thought that we 
might as well create new modifiers so that the semantics could be 
specified explicitly on a regex, besides being implied by the 
pragmas that are in effect.  This would allow overriding of the current 
default for individual regexes.

The problem is what those modifiers should be.  Perl currently allows a 
keyword to come right after a regex, like '/abc/lt 1'.  Code is being 
added in 5.14 to deprecate that, but we are stuck with that until at 
least 5.16.  Thus the original implementation I did a few months ago 
which used both 'l' and 't' might break existing programs.  It was noted 
then that these are the only modifiers that are mutually exclusive. 
Even /ms are not mutually exclusive, but these are: you can have only a 
single semantic interpretation in effect at a time.  Several people said 
that therefore these new modifiers should be distinguished from the 
regular ones in some way.

There are two obvious ways to do that.  One is to have these each be an 
argument to a master flag, like 'Cl', 'Cu', etc, where the C stands for 
character set semantics (or one could use an 'S' instead).  (I suppose 
that the master flag need not be an alpha.)  But this approach is a 
break from tradition, and when reading such a regex it can be confusing 
to have single and multi-char flags intermixed, like '/abc/mClip'. 
There may be other things we haven't thought of.  It turns out the 
implementation is not hard.
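To make the reading problem concrete, here is a rough sketch of how a flag string mixing single-letter flags with two-character 'C' flags might be split into tokens.  The function name split_flags is hypothetical, and this is not the implementation referred to above; it only shows how '/abc/mClip' would be read under this scheme.

```perl
use strict;
use warnings;

# Hypothetical helper: split a modifier string into tokens, treating
# 'C' plus the following lower case letter as one two-character flag.
sub split_flags {
    my ($flags) = @_;
    my @tokens = $flags =~ /(C[a-z]|[a-z])/g;
    return @tokens;
}

my @tokens = split_flags("mClip");
print join(" ", @tokens), "\n";   # m Cl i p
```

A reader scanning 'mClip' has to do the same grouping by eye, which is the readability concern.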

The second way is to have single letter flags, but capitalize those 
flags that are mutually exclusive.  In /abc/LU, for example, the U would 
override the L, and the users would come to accept that the 
capitalization meant this.  The problem I see with this approach is that 
it means that the only capital letters ever used would have to be for 
the purpose of defining the character set semantics.  I don't like tying 
the hands of future porters.

Another possibility is to just not distinguish these mutually exclusive 
options from regular options.  They would all be lower case.  The 
standard Unix philosophy of "last option on the line overrides all 
previous conflicting ones" would apply.  If we chose this approach, 
there are several sub-possibilities:

1) We choose option letters that don't conflict with any keyword 
beginnings.  The problem is that the obvious letters do.

2) Until 5.16 require that anyone using these options do so in the 
(?:...) form, thus avoiding any ambiguities.

3) Implement them in such a way that until 5.16 any ambiguity is 
resolved in favor of the interpretation that a keyword is meant; there 
already is a deprecation message generated whenever such an 
interpretation would happen.  The implementation of this isn't difficult.
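The "last option overrides" rule mentioned above can be sketched as follows.  This is only an illustration: the function name effective_charset is hypothetical, and the letters l, u and t are placeholders taken from the examples in this message, not a decided set.

```perl
use strict;
use warnings;

# Hypothetical helper: given a modifier string, return the character
# set flag that wins, i.e. the last one that appears, or undef if the
# string contains none.
sub effective_charset {
    my ($flags) = @_;
    my %is_charset = map { $_ => 1 } qw(l u t);   # placeholder letters
    my $chosen;
    for my $flag (split //, $flags) {
        $chosen = $flag if $is_charset{$flag};
    }
    return $chosen;
}

print effective_charset("mlu"), "\n";   # u
print effective_charset("uml"), "\n";   # l
print defined effective_charset("ms") ? "set" : "none", "\n";   # none
```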

I think that's it.  Here is your ballot (I briefly list what I think are the 
up/downsides.  Feel free to add things.)

1) Do nothing.  Don't implement this, leave the Unicode bugs there.

2) Use two-letter options for the mutually exclusive ones.  Extensible, 
visually distinguishable, but /abc/Clip may be hard to read.

3) Use single capital letters for these options; distinguishable, but not 
extensible, since capital letters could then only ever be used for 
character set semantics.

4) Use single letter lower case option letters, but choose non-mnemonic 
ones to avoid ambiguities with existing keywords.  They are not visibly 
distinguished from options that aren't mutually exclusive.

5) Use single letter lower case option letters, but until 5.16 they are 
only valid in the (?:) form.  They are not visibly distinguished from 
options that aren't mutually exclusive.

6) Use single letter lower case option letters implemented in such a way 
that until 5.16 ambiguities are resolved to the existing 
interpretations.  The option letters are not visibly distinguished from 
options that aren't mutually exclusive.  After the deprecation cycle in 
5.14, there won't be these ambiguities.  My vote goes to this one.
