perl.perl5.porters | Postings from August 2014

Re: RFC: long regex pattern modifiers

From: Karl Williamson
Date: August 28, 2014 04:44
Subject: Re: RFC: long regex pattern modifiers
Message ID: 53FEB3A3.4030106@khwilliamson.com

On 08/27/2014 05:41 AM, Tom Christiansen wrote:
> Karl Williamson <public@khwilliamson.com> wrote
>     on Tue, 26 Aug 2014 18:12:19 MDT:
>
>> I have mentioned in earlier posts about the upcoming need for going
>> beyond the single-char pattern modifiers /msixpodualgcer.  (Some
>> examples include being able to override /i definitions, for user-defined
>> Unicode private-use properties, for allowing one to globally say that \b
>> really should be \b{wb}, and others.)
>
>> I'm here proposing a syntax for doing this.  An example would be
>>    /(?mi{long-modifier}u: ... )/
>
> I have one main question, and a few random thoughts.
>
> My main question is:
>
>      Are these long-modifiers always, never, or sometimes
>      considered "modifiers or modifiers"?

I had to read this several times to grok it, as it contains a typo.  For 
those of you who don't understand it, the 'or' should be 'of'.

>
> Specifically, I wonder whether that is
>      (A) to be considered four modifiers,

I intended it to be (A).

>      (B) or is it three modifiers,
>   or (C) might it perhaps be either one of those depending on the modifier in question.
>
> Four modifiers:
>
>      m
>      i
>      {long-modifier}
>      u
>
> Three modifiers:
>
>      m
>      i{long-modifier}
>      u
>
> In other words, is the {long-modifier} bit something that can
> stand on its own, or can it only occur following a particular
> short modifier?
>
> The third possibility, C above, is that some of them might be
> a subtype of short modifiers acting adverbially on the short
> modifier itself, but others might be completely stand-alone.
>
> I could probably find points in favor of all three possible
> interpretations.
>
> The i flag seems especially um "affinitive" to the modifier-of-a-modifier
> way of looking at things.  It might even admit multiple simultaneous ones:
>
>      i{turkic}
>      i{uca=1}
>      i{turkic, uca=1}

My intent was to get a syntax that was currently illegal, and did the 
job at hand.  I can see the use-case for modifying modifiers, but what I 
was proposing wasn't that.  Better suggestions welcome, or we could use 
[] for this use-case instead of {} if and when we implement that use-case.
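
To make reading (A) concrete, here is a minimal sketch of how the flag 
string might be split, with each {...} group counting as a modifier in 
its own right.  split_modifiers() is a hypothetical helper written only 
for illustration, not anything in perl.  The second half checks that the 
proposed syntax is rejected by the perl running it, which I'd expect of 
any perl that has no long modifiers.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of perl): split a flag string into
# individual modifiers, treating each {...} group as its own modifier.
sub split_modifiers {
    my ($flags) = @_;
    my @mods;
    pos($flags) = 0;
    while ($flags =~ / \G ( [a-z] | \{ [^{}]* \} ) /gcx) {
        push @mods, $1;
    }
    die "cannot parse flags at offset " . pos($flags) . "\n"
        if pos($flags) != length $flags;
    return @mods;
}

my @mods = split_modifiers('mi{long-modifier}u');
print scalar(@mods), " modifiers: @mods\n";

# Interpolate the pattern from a string so that a syntax error becomes a
# trappable run-time exception instead of aborting compilation.
my $proposed = '(?mi{long-modifier}u: x )';
my $re       = eval { qr/$proposed/ };
my $accepted = defined $re ? 1 : 0;
print "this perl accepts the proposed syntax: $accepted\n";
```

Under reading (B) the splitter would instead have to attach the {...} 
group to the preceding letter, giving three modifiers.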
>
> Although now I wonder about how subtraction would work. Hm.
>
>      (?flags-flags: ... )
>
> But the s&m flags also might like adverbial modifiers of their own,
> something that makes them think not about \n but about \R instead.
> Then again, that might be a long-modifier that would reasonably
> apply to both s&m.

My mother always told me to stay away from s&m ;)

Some modifiers already are illegal after a minus.  Most of the new ones 
would be too.
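
For anyone who hasn't run into this, a small sketch of the existing 
behavior: the character-set modifiers are refused after the minus, while 
ordinary ones such as i and s are fine.  compiles() is a hypothetical 
helper, and the list of patterns is only a sample.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a pattern held in a string compiles.
# Interpolating it makes a syntax error a run-time exception we can trap.
sub compiles {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    return defined $re ? 1 : 0;
}

for my $p ('(?i-s:foo)', '(?-i:foo)', '(?-u:foo)', '(?-a:foo)', '(?-l:foo)') {
    printf "%-12s %s\n", $p, compiles($p) ? 'compiles' : 'rejected';
}
```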
>
> I also see the point of having some of these be basically internal
> markings triggered by 'use re' pragma variants instead of being things
> that the user gets at.  So maybe something like
>
>      use re '/{linebreak=unicode}'
>
> Or some such to make all the s&m stuff treat any \R grapheme as previously
> it were treating (or not treating) \n.

Exactly.  That is my proposal, to make all the known long modifiers only 
be generated by a pragma.  That way, if something goes awry we can 
change or remove them without worrying about back compat.  After 
gaining field-experience, we could relax this.  Perhaps the pragma could 
even generate the modifier to look like

{experimental:linebreak=unicode}

so that someone trying to bypass the pragma would certainly be 
forewarned of the inadvisability of doing so.
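
The model I have in mind is the existing use re '/flags' form (perl 5.14 
and later), which already applies short modifiers lexically.  A minimal 
sketch of that existing mechanism follows; it uses only /i, and nothing 
in it is a long modifier.

```perl
use strict;
use warnings;

my ($inside, $outside);
{
    use re '/i';                        # /i is implied for regexes in this block
    $inside = "FOO" =~ /foo/ ? 1 : 0;
}
$outside = "FOO" =~ /foo/ ? 1 : 0;      # the pragma is out of scope here

print "inside=$inside outside=$outside\n";
```

A long modifier generated by a pragma would presumably be scoped the 
same way, but that part is an assumption about a design that hasn't been 
settled.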

>
> Sometimes I think I might prefer
>
>      (?break{word}: \bfoo\b)
>
> Over
>
>      \b{word}foo\b{word}

Yes, but that can wait until we gain experience.  For 5.22, I would 
propose that you'd have to say

  use re '/\b=wb'

or some such, to get the effect of changing \b behavior.  I think the 
Unicode definition will be preferable in general to the current one, so 
I would think almost all code that cares would want to only use it, and 
not have different ones scattered around, except rarely, so I don't see 
the use-case for specifying which break you want on a per-regex basis. 
(Also, I am now favoring 'wb' over 'word' because the former is an 
official Unicode name and would be less likely to be misinterpreted as 
our current \b.)
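
To show the kind of difference at stake, here is a small sketch using 
the per-occurrence spelling \b{wb}, which is available in perl 5.22 and 
later.  It is not the pragma form proposed above, only a way to compare 
the two definitions of a word boundary on one string.

```perl
use strict;
use warnings;
use v5.22;    # \b{wb} needs perl 5.22 or later

my $text = "can't";

# Classic \b: the apostrophe is \W, so a boundary is seen right after "can".
my $classic = $text =~ /can\b/ ? 1 : 0;

# Unicode word boundary: an apostrophe between letters stays inside the
# word, so no boundary is expected right after "can".
my $unicode = $text =~ /can\b{wb}/ ? 1 : 0;

print "classic=$classic unicode=$unicode\n";
```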

>
> I don't know whether this is where to sneak in tr18's level-3 tailoring
> bits.  Possibly so, possibly not.  I'm looking at things like their
> own examples of \T{locale_id}...\E, or \X{es-u-co-trad} or \b{w}.
>
>      If both tailored and default regular expressions are supported, then
>      a number of different mechanism are affected. There are two main
>      alternatives for control of tailored support:
>
>         * coarse-grained support: the whole regular expression (or the
>           whole script in which the regular expression occurs) can be
>           marked as being tailored.
>         * fine-grained support: any part of the regular expression can be
>           marked in some way as being tailored.
>
>      For example, fine-grained support could use some syntax such as the
>      following to indicate tailoring to a locale within a certain range.
>      Locale (or language) IDs should use the syntax from locale identifier
>      definition in [UTS35], Section 3. Identifiers . Note that the locale id
>      of "root" or "und" indicates the root locale, such as in the CLDR root
>      collation. \T{<locale_id>}..\E
>
>      ---
>
>      For example, an implementation could interpret \X{es-u-co-trad} as
>      matching a collation grapheme cluster for a traditional Spanish
>      ordering, or use a switch to change the meaning of \X during some
>      span of the regular expression.
>
>      ---
>
>      For example, an implementation could interpret \b{x:...} as matching the
>      word break positions according to the locale information in CLDR [UTS35]
>      (which are tailorings of word break positions in [UAX29]).
>
>      Thus it could interpret
>
> 	\b{w:und} or \b{w} as matching a root word break
> 	\b{w:ja} as matching a Japanese word break
> 	\b{l:ja} as matching a Japanese line break
>
>      Alternatively, it could use a switch to change the meaning of \b and \B
>      during some span of the regular expression.

I hadn't thought about tailoring, and am glad you are.  That's for the 
future, but if an extensible syntax can be found now, so much the better.
>
> More random thoughts...
>
> Here are the sorts of mnemonics I use in my own head when I use the
> one-letter modifiers (which are variously pattern flags, match-operator
> flags, or substitute-operator flags):
>
>     [Note that the s&m mods' interpretation were banged into my head
>      by Larry through some not inconsiderable effort on his part,
>      because they weren't really fitting into existing holes that well.]
>
>      m	multiline(d)
>      s	singleline(d)
>      i	insensitive
>      x	expand(ed)
>      p	preserve(d)
>      o   onetime
>      d	dual(istic)
>      u	unicode
>      a	ascii
>      l	locale
>      g	global
>      c	continue(d)
>      e	evaluat(ed)
>      r	return(ed)
>
> I've just noticed that grammatically, those are basically all "noun
> modifiers", whether as adjectives or attributive nouns or participial
> adjectively.  That is, they all fit into the <BLAH> slot in
>
>      This is a <BLAH> match.
>
> The reason this is interesting is that I seem to recall the perl6 folks
> calling these adverbs, not adjectives.    I guess you can just plop on
> an -ly for most of those to adverb them, which works fine with globally
> but somewhat (English-)dubiously for singlelinèdly.
>
> For the operator flags, I can now see those being adverbs, since it
> applies to the verbing (matching, substituting) operation.
>
>      while this string globally matches the pattern....
>
> But with the pattern-compilation flags, they modify the compiled
> pattern itself, not how the match operator uses that pattern to
> perform its duties.  But this has always been a (mild, minor) confusion,
> since the syntactic slot after
>
>      m/.../abcdefgȝhijklmnopqrstþuvwxyz
>      s/.../abcdefgȝhijklmnopqrstþuvwxyz
>      qr/../abcdefgȝhijklmnopqrstþuvwxyz
>
> accepts single letters, not caring "when" they apply.
>
> --tom
>

