perl.perl5.porters | Postings from August 2014

Re: RFC: long regex pattern modifiers

Karl Williamson
August 28, 2014 04:44
On 08/27/2014 05:41 AM, Tom Christiansen wrote:
> Karl Williamson <> wrote
>     on Tue, 26 Aug 2014 18:12:19 MDT:
>> I have mentioned in earlier posts about the upcoming need for going
>> beyond the single-char pattern modifiers /msixpodualgcer.  (Some
>> examples include being able to override /i definitions, for user-defined
>> Unicode private-use properties, for allowing one to globally say that \b
>> really should be \b{wb}, and others.)
>> I'm here proposing a syntax for doing this.  An example would be
>>    /(?mi{long-modifier}u: ... )/
> I have one main question, and a few random thoughts.
> My main question is:
>      Are these long-modifiers always, never, or sometimes
>      considered "modifiers or modifiers"?

I had to read this several times to grok it, as it contains a typo.  For 
those of you who don't understand it, the 'or' should be 'of'.

> Specifically, I wonder whether that is
>      (A) to be considered four modifiers,

I intended it to be (A).

>      (B) or is it three modifiers,
>   or (C) might it perhaps be either one of those depending on the modifier in question.
> Four modifiers:
>      m
>      i
>      {long-modifier}
>      u
> Three modifiers:
>      m
>      i{long-modifier}
>      u
> In other words, is the {long-modifier} bit something that can
> stand on its own, or can it only occur following a particular
> short modifier?
> The third possibility, C above, is that some of them might be
> a subtype of short modifiers acting adverbially on the short
> modifier itself, but others might be completely stand-alone.
> I could probably find points in favor of all three possible
> interpretations.
> The i flag seems especially um "affinitive" to the modifier-of-a-modifier
> way of looking at things.  It might even admit multiple simultaneous ones:
>      i{turkic}
>      i{uca=1}
>      i{turkic, uca=1}

My intent was to get a syntax that was currently illegal, and did the 
job at hand.  I can see the use-case for modifying modifiers, but what I 
was proposing wasn't that.  Better suggestions welcome, or we could use 
[] for this use-case instead of {} if and when we implement that use-case.
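
As a minimal sketch of the "currently illegal" point: the snippet below asks
perl to compile a pattern written in the proposed form and one written with
today's short flags.  The pattern text and the helper name compiles() are made
up for illustration, and it assumes a perl that has no long modifiers.

```perl
use strict;
use warnings;

# Hypothetical pattern text using the proposed long-modifier syntax.
my $proposed = '(?mi{long-modifier}u: foo )';

# Returns 1 if perl can compile the pattern text, 0 if compilation dies.
sub compiles {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    return defined $re ? 1 : 0;
}

print "short flags compile:    ", compiles('(?mi-sx: foo )'), "\n";
print "proposed form compiles: ", compiles($proposed), "\n";
```

Because the braces are rejected in that position today, giving them a meaning
should not change the behavior of any pattern that currently compiles.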
> Although now I wonder about how subtraction would work. Hm.
>      (?flags-flags: ... )
> But the s&m flags also might like adverbial modifiers of their own,
> something that makes them think not about \n but about \R instead.
> Then again, that might be a long-modifier that would reasonably
> apply to both s&m.

My mother always told me to stay away from s&m ;)

Some modifiers already are illegal after a minus.  Most of the new ones 
would be too.
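
To illustrate the existing restriction, here is a small sketch using only
short modifiers.  The helper name compile_error() is made up, and the exact
error text is left to perl rather than assumed here.

```perl
use strict;
use warnings;

# Returns the compile error for a pattern, or '' if it compiles cleanly.
sub compile_error {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    return defined $re ? '' : $@;
}

# /s and /m can be turned off after the minus; /a and /u cannot.
for my $pattern ('(?i-sm: foo)', '(?-a: foo)', '(?-u: foo)') {
    my $error = compile_error($pattern);
    printf "%-14s %s\n", $pattern, $error eq '' ? 'ok' : 'rejected';
}
```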
> I also see the point of having some of these be basically internal
> markings triggered by 'use re' pragma variants instead of being things
> that the user gets at.  So maybe something like
>      use re '/{linebreak=unicode}'
> Or some such to make all the s&m stuff treat any \R grapheme as previously
> it were treating (or not treating) \n.

Exactly.  That is my proposal, to make all the known long modifiers only 
be generated by a pragma.  That way, if something goes awry we can 
change or remove them without worrying about back compat.  After 
gaining field-experience, we could relax this.  Perhaps the pragma could 
even generate the modifier to look like


so that someone trying to bypass the pragma would certainly be 
forewarned of the inadvisability of doing so.
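
For comparison, the existing 'use re' flags mode already sets default short
modifiers for a lexical scope, which is the mechanism a long-modifier pragma
would presumably extend.  This sketch uses only the short /i flag; nothing in
it is the proposed long syntax, and the sub names are made up.

```perl
use strict;
use warnings;

# No pragma in scope: the match is case-sensitive.
sub plain_match {
    return "FOO" =~ /foo/ ? 1 : 0;
}

# 'use re "/i"' makes /i the default for patterns compiled in this block only.
sub scoped_match {
    use re '/i';
    return "FOO" =~ /foo/ ? 1 : 0;
}

print "without pragma: ", plain_match(),  "\n";
print "with pragma:    ", scoped_match(), "\n";
```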

> Sometimes I think I might prefer
>      (?break{word}: \bfoo\b)
> Over
>      \b{word}foo\b{word}

Yes, but that can wait until we gain experience.  For 5.22, I would 
propose that you'd have to say

  use re '/\b=wb'

or some such, to get the effect of changing \b behavior.  I think the 
Unicode definition will be preferable in general to the current one, so 
I would think almost all code that cares would want to only use it, and 
not have different ones scattered around, except rarely, so I don't see 
the use-case for specifying which break you want on a per-regex basis. 
(Also, I am now favoring 'wb' over 'word' because the former is an 
official unicode name and would be less likely to be misinterpreted as 
our current \b.)
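
A small sketch of why the Unicode definition tends to be preferable for
natural language, assuming a perl recent enough to accept \b{wb} inside a
pattern.  The sample string and the helper name has_isolated_t() are made up
for illustration.

```perl
use strict;
use warnings;
use 5.022;    # \b{wb} needs perl 5.22 or later

# Reports whether a lone "t" is delimited by boundaries on both sides,
# using either the traditional \b or the Unicode word-break \b{wb}.
sub has_isolated_t {
    my ($text, $unicode_rules) = @_;
    return $unicode_rules
        ? ($text =~ /\b{wb}t\b{wb}/ ? 1 : 0)
        : ($text =~ /\bt\b/         ? 1 : 0);
}

# Traditional \b sees a boundary at the apostrophe, so "t" looks like a word.
print "plain \\b:  ", has_isolated_t("don't", 0), "\n";
# \b{wb} keeps an apostrophe between letters inside the word.
print "\\b{wb}:    ", has_isolated_t("don't", 1), "\n";
```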

> I don't know whether this is where to sneak in tr18's level-3 tailoring
> bits.  Possibly so, possibly not.  I'm looking at things like their
> own examples of \T{locale_id}...\E, or \X{es-u-co-trad} or \b{w}.
>      If both tailored and default regular expressions are supported, then
>      a number of different mechanism are affected. There are two main
>      alternatives for control of tailored support:
>         * coarse-grained support: the whole regular expression (or the
>           whole script in which the regular expression occurs) can be
>           marked as being tailored.
>         * fine-grained support: any part of the regular expression can be
>           marked in some way as being tailored.
>      For example, fine-grained support could use some syntax such as the
>      following to indicate tailoring to a locale within a certain range.
>      Locale (or language) IDs should use the syntax from locale identifier
>      definition in [UTS35], Section 3. Identifiers . Note that the locale id
>      of "root" or "und" indicates the root locale, such as in the CLDR root
>      collation. \T{<locale_id>}..\E
>      ---
>      For example, an implementation could interpret \X{es-u-co-trad} as
>      matching a collation grapheme cluster for a traditional Spanish
>      ordering, or use a switch to change the meaning of \X during some
>      span of the regular expression.
>      ---
>      For example, an implementation could interpret \b{x:...} as matching the
>      word break positions according to the locale information in CLDR [UTS35]
>      (which are tailorings of word break positions in [UAX29]).
>      Thus it could interpret
> 	\b{w:und} or \b{w} as matching a root word break
> 	\b{w:ja} as matching a Japanese word break
> 	\b{l:ja} as matching a Japanese line break
>      Alternatively, it could use a switch to change the meaning of \b and \B
>      during some span of the regular expression.

I hadn't thought about tailoring, and am glad you are.  That's for the 
future, but if an extensible syntax can be found now, so much the better.
> More random thoughts...
> Here are the sorts of mnemonics I use in my own head when I use the
> one-letter modifiers (which are variously pattern flags, match-operator
> flags, or substitute-operator flags):
>     [Note that the s&m mods' interpretation were banged into my head
>      by Larry through some not inconsiderable effort on his part,
>      because they weren't really fitting into existing holes that well.]
>      m	multiline(d)
>      s	singleline(d)
>      i	insensitive
>      x	expand(ed)
>      p	preserve(d)
>      o   onetime
>      d	dual(istic)
>      u	unicode
>      a	ascii
>      l	locale
>      g	global
>      c	continue(d)
>      e	evaluat(ed)
>      r	return(ed)
> I've just noticed that grammatically, those are basically all "noun
> modifiers", whether as adjectives or attributive nouns or participial
> adjectively.  That is, they all fit into the <BLAH> slot in
>      This is a <BLAH> match.
> The reason this is interesting is that I seem to recall the perl6 folks
> calling these adverbs, not adjectives.    I guess you can just plop on
> an -ly for most of those to adverb them, which works fine with globally
> but somewhat (English-)dubiously for singlelinèdly.
> For the operator flags, I can now see those being adverbs, since it
> applies to the verbing (matching, substituting) operation.
>      while this string globally matches the pattern....
> But with the pattern-compilation flags, they modify the compiled
> pattern itself, not how the match operator uses that pattern to
> perform its duties.  But this has always been a (mild, minor) confusion,
> since the syntactic slot after
>      m/.../abcdefgȝhijklmnopqrstþuvwxyz
>      s/.../abcdefgȝhijklmnopqrstþuvwxyz
>      qr/../abcdefgȝhijklmnopqrstþuvwxyz
> accepts single letters, not caring "when" they apply.
> --tom
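
The pattern-flag versus operator-flag distinction can be seen with qr//, as in
this rough sketch (variable and sub names are made up): /i is compiled into
the pattern object, while /g belongs to the match operation that later uses it.

```perl
use strict;
use warnings;

# /i is a pattern flag: it is compiled into the qr// object itself.
my $number = qr/(\d+)/i;

# /g is an operator flag: it belongs to this particular match operation.
sub all_numbers {
    my ($text) = @_;
    my @found = $text =~ /$number/g;
    return @found;
}

# qr// performs no match, so perl rejects /g there at compile time.
# A string eval is used so the error is caught instead of aborting the script.
my $qr_accepts_g = eval(q{ my $re = qr/x/g; 1 }) ? 1 : 0;

print join(",", all_numbers("a1 b22 c333")), "\n";
print "qr//g accepted: $qr_accepts_g\n";
```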
