Matching subpatterns in any order, conjunctions, negated matches

From: Joseph Brenner
Date: May 16, 2020 02:32
Subject: Matching subpatterns in any order, conjunctions, negated matches
Message ID: CAFfgvXWuF5CcJR_jnNQg-BTnSfk_b5+Uhw95XZCuOCCB2GM9Sg@mail.gmail.com

Regex engines by their nature care a lot about order, but I
occasionally want to relax that to match for multiple
multicharacter subpatterns where the order of them doesn't
matter.

Frequently the simplest thing to do is just to do multiple
matches.  Let's say you're looking for words that have a "qu", a
"th" and also, say, an "ea".  This works:

  my $DICT  = "/usr/share/dict/american-english";
  my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({/ea/});
  say @hits;
  # [bequeath bequeathed bequeathing bequeaths earthquake earthquake's
  #  earthquakes]


It could be useful to do it as a single match, though.  For
example, you might be using someone else's routine which takes a
single regex as an argument.  I've been known to write things like
this:

  my regex qu_th_ea   {  [ qu .*? th .*? ea ] |
                         [ qu .*? ea .*? th ] |
                         [ th .*? qu .*? ea ] |
                         [ th .*? ea .*? qu ] |
                         [ ea .*? th .*? qu ] |
                         [ ea .*? qu .*? th ]  };
  my @hits = $DICT.IO.open( :r ).lines.grep({/<qu_th_ea>/});

That works, but it gets unwieldy quickly if you need to scale up
the number of subpatterns.

Recently though, I noticed the "conjunctions" feature, and it
occurred to me that this could be a very neat way of handling
these things:

  my regex qu_th_ea { ^ [ .* qu .* & .* th .* & .* ea .* ] $ };

That's certainly much better, though unfortunately each element
of the conjunction needs to match the same substring (same start,
same length), so pretty frequently you're stuck with the visual
noise of bracketing subpatterns with pairs of .*
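
One way to sidestep the paired .* might be lookahead assertions in
place of the conjunction.  This is only a sketch, not something taken
from the conjunction feature itself: each <?before ...> checks for one
subpattern from the start of the string without consuming anything, so
the order of the assertions shouldn't matter.  The name has_qu_th_ea
and the sample words are made up for illustration.

```raku
# Sketch only: has_qu_th_ea is a hypothetical name.
# Each zero-width lookahead scans from the start of the string,
# so the three subpatterns may appear in any order.
my regex has_qu_th_ea { ^ <?before .* qu> <?before .* th> <?before .* ea> };

for <bequeath earthquake quoth theater> -> $word {
    say "$word: ", ?($word ~~ /<has_qu_th_ea>/);
}
```

Adding another subpattern is then one more assertion, rather than
another round of permutations.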

Where things get interesting is when you want a negated match of
one of the subpatterns.  One of the things I like about the first
approach using multiple chained greps is that it's easy to do a
reverse match.  What if you want words with "qu" and "th" but
want to *skip* ones with an "ea"?

  my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({!/ea/});
  # [Asquith discotheque discotheque's discotheques quoth]

To do that in one regex, it would be nice if there were some sort
of adverb to do a reverse match, say :not.  Then it would be
straightforward (NOTE: NON-WORKING CODE):

  my regex qu_th_ea { ^ [ .* qu .* & .* th .* &  [ :not .* ea .* ] ] $ };

But since there isn't an adverb like this, what else might we do?
The best idea I can come up with is this:

  my regex qu_th_ea { ^ [ .* qu .* & .* th .* &  [ <!after ea> . ]*  ] $ };

Here the third element of the conjunction should match only if
none of the characters it consumes comes right after an "ea".
There's an oddity here though, in that I think this can get
confused by things like an "ea" that *precedes* the conjunction.
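
One possible alternative, offered only as a sketch: drop the
conjunction and use lookaheads anchored at the start, with
<!before .* ea> for the negated part.  Because that lookahead scans
the whole string from ^, it should also reject a word that ends in
"ea", a case the <!after ea> version appears to let through, since no
character follows that final "ea".  The regex name and the made-up
word "quthea" are illustrative only.

```raku
# Sketch only: qu_th_not_ea is a hypothetical name,
# and "quthea" is a made-up word that ends in "ea".
my regex qu_th_not_ea { ^ <?before .* qu> <?before .* th> <!before .* ea> };

for <quoth Asquith bequeath quthea> -> $word {
    say "$word: ", ?($word ~~ /<qu_th_not_ea>/);
}
```

Note that the assertions only inspect the string; the regex itself
matches zero characters at the start, so this suits a yes/no test
like grep better than capturing the word.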

So, the question then is: is there a neater way to embed a
subpattern in a regex that does a negated match?
