perl.perl6.language.regex | Postings from December 2000

Re: Perl 5's "non-greedy" matching can be TOO greedy!

From: Uri Guttman
Date: December 15, 2000 11:40
Subject: Re: Perl 5's "non-greedy" matching can be TOO greedy!
Message ID: 200012151940.OAA08604@home.sysarch.com.

>>>>> "DTC" == Deven T Corzine <deven@ties.org> writes:


  DTC> The pattern in question is "b.*?d".  Obviously, this matches "b",
  DTC> followed by something else, followed by "d".  What "something
  DTC> else" should be is the issue at hand.  That portion of the regexp
  DTC> is just ".*?" -- the "." matches any character (except newlines,
  DTC> depending on the mode), the "*" modifies the "." to match "zero
  DTC> or more" of "any character", and the "?" modifies the ".*" to
  DTC> match "zero or more" of "any character", but "matching the
  DTC> minimum number of times possible".  Hence, the ".*?" can be
  DTC> summarized as "match anything, but keep the match as short as
  DTC> possible".

  DTC> TAKEN IN ISOLATION, the most natural interpretation for the
  DTC> programmer to make, at this high semantic level, is that it will
  DTC> match "bccccd" rather than "bbbbccccd", because the ".*?" is
  DTC> expected to match as little as possible (though as much as
  DTC> necessary), and "cccc" is as little as it can possibly match
  DTC> while allowing the entire regexp to match.  "bbbbccccd" is a
  DTC> longer match, and counterintuitive compared to the semantic
  DTC> description of what "b.*?d" should match.
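
for reference, a minimal sketch of what perl 5 actually does with that
pattern. the subject string "bbbbccccd" is an assumption (this message
only quotes the two candidate matches), and leftmost_nongreedy is a
made-up helper name.

```perl
use strict;
use warnings;

# Assumed subject string; the message only names the two candidate matches.
my $text = "bbbbccccd";

# Hypothetical helper: return the matched text and its start offset,
# or an empty list when the pattern does not match.
sub leftmost_nongreedy {
    my ($str) = @_;
    # $-[0] is the offset where the last successful match started.
    return $str =~ /(b.*?d)/ ? ($1, $-[0]) : ();
}

my ($match, $start) = leftmost_nongreedy($text);
print "matched '$match' at offset $start\n";
# prints: matched 'bbbbccccd' at offset 0
```

the engine commits to the first "b" it can start from (offset 0), and
only then does ".*?" keep the rest as short as possible.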

just add "leftmost" (or "first") to that semantic description and you
have your answer. that is how the current semantics are defined, and it
is how the implementation behaves. getting the semantics you want would
entail finding ALL possible matches and then scanning them for the
shortest one. there is no other way to define that behavior. how could
you find the true shortest match without finding all of the matches
first? the leftmost/first semantic eliminates that problem.
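
a rough sketch of what "find all the matches, then pick the shortest"
could look like in perl 5. shortest_match is a hypothetical helper, not
anything in the core, and the subject string is again assumed. it leans
on the fact that for a pattern as simple as b.*?d the non-greedy match
from a fixed start is the shortest one from that start, which is not
true of regexes in general.

```perl
use strict;
use warnings;

# Hypothetical helper: attempt the pattern at every start offset and
# keep the shortest hit. Ties go to the leftmost candidate.
sub shortest_match {
    my ($str, $re) = @_;
    my $best;
    for my $start (0 .. length($str) - 1) {
        pos($str) = $start;
        # \G pins the attempt to $start, so the engine cannot slide right.
        next unless $str =~ /\G($re)/g;
        $best = $1 if !defined $best || length($1) < length($best);
    }
    return $best;    # undef when nothing matched anywhere
}

print shortest_match("bbbbccccd", qr/b.*?d/), "\n";
# prints: bccccd
```

note that it makes one attempt per start offset instead of stopping at
the first success, which is the extra work the leftmost rule avoids.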

so please add those words and then restate your argument. you can't: the
result you want does not jibe with first or leftmost. it is a different
semantic. the current one is used for many reasons, including history,
simplicity of semantics and even (drat!) implementation and speed.

as for perl6 doing what you want, write a regex engine and plug it
in. but the default perl6 regexes should behave the same as perl5's, for
all the right reasons, including backwards compatibility.

uri

-- 
Uri Guttman  ---------  uri@sysarch.com  ----------  http://www.sysarch.com
SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting
The Perl Books Page  -----------  http://www.sysarch.com/cgi-bin/perl_books
The Best Search Engine on the Net  ----------  http://www.northernlight.com


