perl.perl6.language.regex http://www.nntp.perl.org/group/perl.perl6.language.regex/ ... Copyright 1998-2008 perl.org Fri, 09 May 2008 12:01:43 +0000 ask@perl.org Help required urgently !!!!!! by Manoj.Menon &quot;use matchpairs&quot; not working. Currently having perl 5.0 in my system. <br/>Please suggest.<br/>Manoj<br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2005/04/msg594.html Thu, 21 Apr 2005 09:54:29 +0000 Re: Exposing regexp engine & compiled regexp's by Branden <br/>Quoted from http://www.perl.com/pub/2000/09/ilya.html,<br/>an interview with Dr. Ilya Zakharevich:<br/>&gt;<br/>&gt; Q: Could you describe in more detail what additional text-<br/>&gt; handling primitives you would like to see included with Perl?<br/>&gt; What string munging operations are absent that really ought to<br/>&gt; be included in Perl&#39;s core?<br/>&gt;<br/>&gt; A: The problem: Perl&#39;s text-handling abilities do not scale<br/>&gt; well. This has two faces, both invisible as far as you confine<br/>&gt; yourselves to simple tasks only. The first face is not that<br/>&gt; Perl lacks some &quot;operations;&quot; it is not that some &quot;words&quot; are<br/>&gt; missing, whole &quot;word classes&quot; are not present. Imagine<br/>&gt; expressive power of a language without adjectives.<br/>&gt;<br/>&gt; In Perl text-handling equals string-handling. But there is<br/>&gt; more in a text than the sequence of characters. You see a text<br/>&gt; of a program - you can see boundaries of blocks, etc.; you see<br/>&gt; an English text, you can see word boundaries and sentence<br/>&gt; boundaries, etc. With the exception of the word boundaries,<br/>&gt; all these &quot;distinctive features&quot; become very hard to recognize<br/>&gt; by a &quot;local inspection of a sequence of characters near an<br/>&gt; offset&quot; - unless you agree to use a heuristic which works only<br/>&gt; time to time. 
But a lot of problems require recognition of the<br/>&gt; relative position of a substring w.r.t. these &quot;distinctive<br/>&gt; features&quot;.<br/>&gt;<br/>&gt; Remember those &quot;abstract algorithms&quot; books and lessons? You<br/>&gt; can solve the problems &quot;straightforwardly,&quot; or you can do it &quot;<br/>&gt; smartly.&quot; Typically, &quot;straightforward&quot; algorithms are easy to<br/>&gt; code, but they do not scale well. Smart algorithms start by an<br/>&gt; appropriate preprocessing step. You organize your data first.<br/>&gt; The particular ways to do this may be quite different: you<br/>&gt; sort the data, or keep an &quot;index&quot; of some kind &quot;into your data,&quot;<br/>&gt; you hash things appropriately, your balance some trees, and<br/>&gt; so on. The algorithms use the initial data together with such<br/>&gt; an &quot;index.&quot;<br/>&gt;<br/>&gt; Perl provides a few primitives to work with strings, which are<br/>&gt; quite enough to code any &quot;straightforward&quot; algorithm. What<br/>&gt; about &quot;smart&quot; ones? You need preprocessing. Typically, digging<br/>&gt; out the info is easy with Perl, but how would you store what<br/>&gt; you dug? The information should be kept &quot;off band,&quot; for example,<br/>&gt; in an array or hash of offsets into the string.<br/>&gt;<br/>&gt; Now modify the string a little bit, say, perform some s()()<br/>&gt; substitutions, or cut-and-paste with substr(). What happens<br/>&gt; with your &quot;off band&quot; information? It went out of sync. You<br/>&gt; need to update your annotating structures. Do not even think<br/>&gt; about doing s()()g, since you do not have enough info about<br/>&gt; the changes after the fact. 
You need to do your s()() one-by-<br/>&gt; one - but while s()()g is quite optimized, a series of s()()<br/>&gt; is not - and you get stuck again into the land of badly<br/>&gt; scaling algorithms.<br/>&gt;<br/>&gt; (Strictly speaking, for this particular example s()()eg could<br/>&gt; save you - as well as code-embedded-into-a-regular-expression,<br/>&gt; but this was only a simple illustration of why off-band data<br/>&gt; is not appropriate for many algorithms. Please be lenient with<br/>&gt; this example!)<br/>&gt;<br/>&gt; Even if no modification is done, using off-band data is very<br/>&gt; awkward: how to check what are the attributes of the character<br/>&gt; at offset 2001 when there are many different attributes, each<br/>&gt; marking a large subset of the string?<br/>&gt;<br/>&gt; That was the problem, and the solution supported by many text-<br/>&gt; processing systems is to have &quot;in-band annotations&quot;, which is<br/>&gt; recognized by the editing primitives, and easily queryable.<br/>&gt; Perl allows exactly one item of in-band data for strings: pos<br/>&gt; (), which is respected by regular expressions. But it is not<br/>&gt; preserved by string-editing operations, or even by $s1 = $s2!<br/>&gt;<br/>&gt; &quot;In-band&quot; data comes in several &quot;kinds&quot;. A particular &quot;kind&quot;<br/>&gt; describes:<br/>&gt;<br/>&gt; - how it behaves with respect to insertion or deletion of<br/>&gt; characters nearby;<br/>&gt; - can the &quot;same&quot; markup appear &quot;several times&quot;;<br/>&gt; - can the markup &quot;nest&quot; (like nested comments in some languages<br/>&gt; ); and<br/>&gt; - is there an internal structure of the markup (as in a loop,<br/>&gt; which may be<br/>&gt;<br/>&gt; [[LABEL DELIM0] KEYWORD [DELIM1 VAR1 SEP VAR2 ... 
DELIM2]<br/>&gt; [DELIM4 EXPR DELIM4] [DELIM5 BODY DELIM6]]<br/>&gt;<br/>&gt; - with some parts possibly missing, so the internal structure<br/>&gt; is a tree).<br/>&gt;<br/>&gt; Different answers lead to a zoo of intuitively different kinds<br/>&gt; of markup, each kind useful for some categories of problems.<br/>&gt; You can mark &quot;gaps between&quot; characters, or you can mark<br/>&gt; characters themselves. The markup may &quot;name&quot; a position (&quot;the<br/>&gt; first __END__ in a Perl program&quot;), or cover a subset of the<br/>&gt; string (&quot;show in red&quot;, &quot;is a link to this URL&quot;, or &quot;inside<br/>&gt; comment&quot;). Since the kind of the markup defines what happens<br/>&gt; when the string is modified, the system can support self-<br/>&gt; consistency of the markup &quot;automatically&quot; (in exceptionally<br/>&gt; complicated cases one may need to register a callback or two).<br/>&gt;<br/>&gt; The second face of problem is not with the expressive power of<br/>&gt; Perl, but with the implementation. Perl has a very rigid rule:<br/>&gt; a string must be stored in a consecutive sequence of bytes.<br/>&gt; Remove a character in the middle of the string, and all the<br/>&gt; chars after it (or before it) should be moved. As I said, s()()g<br/>&gt; has some optimizations which allow doing such movements &quot;in<br/>&gt; one pass&quot;, but what if your problem cannot be reduced to one<br/>&gt; pass of s()()g? Then each of the tiny modification you do one-<br/>&gt; at-a-time may require a huge relocation - or maybe even<br/>&gt; copying of the whole string. 
This is why a lot of algorithms<br/>&gt; for text manipulation require a &quot;split buffer&quot; implementation,<br/>&gt; when several chunks of the string may be stored (transparently!)<br/>&gt; at unrelated addresses.<br/>&gt;<br/>&gt; Such &quot;split-buffer&quot; strings may look incredibly hard to<br/>&gt; implement, as in &quot;all the innards of Perl should be changed&quot;,<br/>&gt; but it is not. Just store &quot;split strings&quot; similarly to tie()d<br/>&gt; data. The FETCH (actually, the low-level MAGIC-read method)<br/>&gt; would &quot;glue&quot; all the chunks into one - and would remove the<br/>&gt; MAGIC - before the actual read is performed; and now no part<br/>&gt; of Perl requires any change. Now four or five primitives for<br/>&gt; text-handling may be changed to recognize the associated tie()d<br/>&gt; structures - and act without gluing chunks together. We may<br/>&gt; even do it in arbitrarily small steps, one opcode at a time.<br/>&gt;<br/>&gt; Another important performance improvement needed for many<br/>&gt; algorithms would be the copy-on-write, when several variables<br/>&gt; may refer to the same buffer in memory, or different parts of<br/>&gt; the same buffer - with suitable semantic what to do when one<br/>&gt; of these variables is modified. (In fact the core of this is<br/>&gt; already implemented in one of my patches!) Together with other<br/>&gt; benefits, this would solve the performance problems of $&amp; and<br/>&gt; friends, as well as would make m/foo/; $&amp; = &#39;bar&#39;; equivalent<br/>&gt; to s/foo/bar/. Having copy-on-write substrings may be slightly<br/>&gt; more patch-intensive than copy-on-write strings, though. The<br/>&gt; complication: currently the buffers are required to be 0-<br/>&gt; terminated (so that they may be used with the system APIs). It<br/>&gt; is hard to make &#39;b&#39; as in substr(&#39;abc&#39;,1,1) refer to the same<br/>&gt; buffer (containing &quot;abc\0&quot;) as &#39;abc&#39;. 
The solution may be to<br/>&gt; remove this requirement, and have two low-level string access<br/>&gt; API, SvPV() and SvPVz(), so that SvPVz() may perform the<br/>&gt; actual copying (as in copy-on-write) and the appending of \0 -<br/>&gt; but only when needed!<br/>&gt;<br/>&gt; Without these - or similar - changes Perl would not scale well<br/>&gt; as a language for efficient text-processing. What is more, I<br/>&gt; believe that the changes above can remove most of the<br/>&gt; significant bottlenecks for the problems we have in text-<br/>&gt; processing of today. At least I know a lot of problems which<br/>&gt; would have feasible solutions given these changes.<br/>&gt;<br/>&gt; And I need not repeat that a handful of small extensions to<br/>&gt; the expressive power of the regular expression engine could<br/>&gt; radically extend the domain of its applicability. ;-)<br/>&gt;<br/><br/><br/><br/>That&#39;s exactly the kind of thing we can do by exposing the regexp engine&#39;s guts.<br/>We can implement tags in the string by using lists, just as Lisp would do<br/>it. If we have SVs that are actually independent of the implementation, we can<br/>create a ``tagged-string&#39;&#39; that&#39;s seen as a string by the script but<br/>internally implemented as a list. 
And, if we have access to the regexp<br/>engine&#39;s guts, we can implement matches and substitutions against those<br/>magic tagged strings.<br/><br/>The other thing about copy-on-write would be a piece of cake, using the same<br/>magic SVs for implementing strings that are substrings of other strings.<br/><br/>Branden.<br/><br/><br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg593.html Wed, 10 Jan 2001 12:53:28 +0000 Re: Exposing regexp engine & compiled regexp's by Damian Conway &gt; As Rick pointed out, there&#39;s no problem with overloading =~ for an<br/> &gt; object, in the same way it&#39;s done with `eq&#39;, and one object&#39;s<br/> &gt; function could return either an object or a closure (a sub<br/> &gt; reference), so that a module could even hide the details of whether<br/> &gt; it&#39;s using the object interface with the overloaded =~ or the new<br/> &gt; behaviour of =~ with a sub lvalue.<br/><br/>Good point. I think at this stage we&#39;re violently agreeing with each other. ;-)<br/><br/>Damian<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg592.html Tue, 09 Jan 2001 19:03:26 +0000 Re: Exposing regexp engine & compiled regexp's by Jarkko Hietaniemi On Tue, Jan 09, 2001 at 12:41:30AM -0500, James Mastros wrote:<br/>&gt; On Mon, Jan 08, 2001 at 05:02:17PM -0600, Jarkko Hietaniemi wrote:<br/>&gt; &gt; Wouldn&#39;t an incremental on-demand engine be much<br/>&gt; &gt; more flexible and optimizable (e.g. finding &#39;the fast path&#39; smells<br/>&gt; &gt; like input-driven LRU to me)?<br/>&gt;<br/>&gt; Umm, I&#39;m not certain that I&#39;m completely following here. It seems that in<br/>&gt; the vast majority of all cases, you&#39;d need to compile (or at the very least,<br/>&gt; parse) the entire regex. 
Also, you can get /vast/ efficiency gains by<br/>&gt; compiling a regex, so you can check the easy things first.<br/><br/>What I have in mind is not *that* different from the current way of<br/>things, the regex will get compiled, but not before it&#39;s first used,<br/>and not necessarily all at once. No, I don&#39;t have any formal theory,<br/>any actual code, and much less benchmarks to prove this, I&#39;m just<br/>waving my hands to keep warm.<br/><br/>-- <br/>$jhi++; # http://www.iki.fi/jhi/<br/> # There is this special biologist word we use for &#39;stable&#39;.<br/> # It is &#39;dead&#39;. -- Jack Cohen<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg591.html Tue, 09 Jan 2001 06:21:53 +0000 Re: Exposing regexp engine & compiled regexp's by Filipe Brandenburger Damian Conway wrote: <br/>&gt;I&#39;m well-known as a non-delving-into-the-guts type of guy. I don&#39;t have <br/> <br/> <br/>I totally agree with you that delving into the guts is the last thing we, <br/>the people that use perl as a tool, want to do! The fact is that, the less <br/>we know about the internals, the better it is. But for this to be possible, <br/>we need modules that provide the functionality we need without making us <br/>deal with the rawness of perl guts. And that&#39;s why I defend the exposing of <br/>the regexp subsystem interface (as all other subsystems), so that one can <br/>write (and we can use!!!) modules that can match regexps against file <br/>streams, we can have partial matches, approximate, etc... Having the guts <br/>interface opens up the door for everything that is doable with a regexp <br/>engine. <br/> <br/>As to the efficiency problem, I said it before and I&#39;m saying it again: my <br/>opinion is to put both approaches (overloaded =~ in object module and having <br/>=~ work for a sub). 
One of them is very flexible, and possibly can do things <br/>the other cannot, and the other is very efficient for the general case, the <br/>one that will be used 90% of the time. <br/> <br/>As Rick pointed out, there&#39;s no problem with overloading =~ for an object, <br/>in the same way it&#39;s done with `eq&#39;, and one object&#39;s function could return <br/>either an object or a closure (a sub reference), so that a module could even <br/>hide the details of whether it&#39;s using the object interface with the <br/>overloaded =~ or the new behaviour of =~ with a sub lvalue. <br/> <br/> <br/>Branden. <br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg590.html Tue, 09 Jan 2001 06:11:19 +0000 Re: Exposing regexp engine & compiled regexp's by James Mastros On Mon, Jan 08, 2001 at 05:02:17PM -0600, Jarkko Hietaniemi wrote:<br/>&gt; Wouldn&#39;t an incremental on-demand engine be much<br/>&gt; more flexible and optimizable (e.g. finding &#39;the fast path&#39; smells<br/>&gt; like input-driven LRU to me)?<br/>Umm, I&#39;m not certain that I&#39;m completely following here. It seems that in<br/>the vast majority of all cases, you&#39;d need to compile (or at the very least,<br/>parse) the entire regex. Also, you can get /vast/ efficiency gains by<br/>compiling a regex, so you can check the easy things first.<br/><br/> -=- James Mastros<br/>-- <br/>midendian: She never sleeps.<br/>mousetrout: But I do. 
I just regret it after I wake up.<br/>AIM: theorbtwo homepage: http://www.rtweb.net/theorb/<br/>ICBM: 40:04:15.100 N, 76:18:53.165 W<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg589.html Mon, 08 Jan 2001 21:42:08 +0000 Re: Exposing regexp engine & compiled regexp's by Damian Conway &gt; The only thing I remark is that I believe all of Perl should be the<br/> &gt; most exposed possible, so that unseen levels of introspection<br/> &gt; can be achieved. In that philosophy I wrote my idea about<br/> &gt; exposing the engine&#39;s guts.<br/><br/>I&#39;m well-known as a non-delving-into-the-guts type of guy. I don&#39;t have<br/>a problem with your proposal, since (as Rick points out) it can easily<br/>be viewed as a superset of mine, and made no harder to implement (assuming<br/>we eventually see method calls draw close to subroutine calls in<br/>execution speed).<br/><br/><br/> &gt; BTW, I didn&#39;t see any comments about my second thought, the one<br/> &gt; of inspecting compiled regexps. Did you like it? That also goes in<br/> &gt; the direction of exposing all the internals to the module writers...<br/><br/>I didn&#39;t feel qualified or entitled to comment on it since I *never*<br/>operate at the level of the perl guts; not even in modules -- like Switch --<br/>that may appear to do so.<br/><br/>Damian<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg588.html Mon, 08 Jan 2001 19:24:46 +0000 Re: Exposing regexp engine & compiled regexp's by branden Damian Conway wrote:<br/><br/>&gt; # As Branden proposes:<br/>&gt;<br/>&gt; package From_STDIN;<br/>&gt;<br/>&gt; sub new { bless $_[1], $_[0] }<br/>&gt;<br/>&gt; sub MORE_DATA { $_[0]-&gt;getn($_[1]) }<br/>&gt; sub ON_FAIL { $_[0]-&gt;pushback($_[1]) }<br/>&gt;<br/>&gt; use overload &quot;=~&quot; =&gt; 1;<br/>&gt;<br/>&gt; package main;<br/>&gt;<br/>&gt; From_STDIN-&gt;new($fh) =~ /pat/;<br/>&gt;<br/>&gt;<br/>&gt;Hmmmm. 
Potentially more flexible, but also much more ponderous.<br/><br/><br/><br/><br/>Sorry I didn&#39;t include code the first time, but actually my idea is about<br/>much more flexibility than having MORE_DATA and ON_FAIL methods<br/>in an object with overloaded ``=~&#39;&#39;. Actually, I think the whole interface of<br/>the regex engine should be exposed to Perl, so that someone could<br/>write an OO package with ``virtual methods&#39;&#39; MORE_DATA and ON_FAIL<br/>and manage the guts of the engine so that it behaves as expected.<br/><br/>Something like:<br/><br/> package RegexBase;<br/><br/> use overload &#39;=~&#39; =&gt; \&amp;match;<br/><br/> sub match { # here is the brains of the class<br/> # something involving:<br/> # - the guts of the interface of the Regex Engine<br/> # - the MORE_DATA method when data is needed<br/> # - the ON_FAIL method when a match is failed<br/> }<br/><br/><br/> package From_STDIN;<br/> @ISA = qw(RegexBase);<br/> sub MORE_DATA { ... }<br/> sub ON_FAIL { ... }<br/><br/><br/><br/><br/>What I&#39;m trying to say is that this is the most flexible we can do.<br/>If one wants to check success or failure, he can, if he wants to<br/>see if there was a state of Failed/Short/Exact/Long/LongFailed<br/>(words from your message), he can too. If he wants to inspect<br/>in which state the NFA stopped, he also can. Virtually anything<br/>that involves Regexp&#39;s can be built from there up. I think your `sub&#39;<br/>idea, although a bit confusing and giving more than one meaning<br/>to the same sub, is the most common case, and I think it should<br/>be implemented, yes.<br/><br/>The only thing I remark is that I believe all of Perl should be the<br/>most exposed possible, so that unseen levels of introspection<br/>can be achieved. In that philosophy I wrote my idea about<br/>exposing the engine&#39;s guts.<br/><br/>I know it&#39;s heavy to do things like I say. 
1st: that&#39;s the price of the<br/>flexibility it gives (although the two approaches can be safely<br/>implemented, they are complementary, not conflicting) and 2nd:<br/>those are meant to be used by module writers, for problems<br/>that can&#39;t be solved now in a good way, and where all the complexity<br/>it introduces would still be a big win, compared with the way it could<br/>be handled in perl5 (namely, reading the whole file in memory or<br/>breaking a regexp that would match more than one line, not to<br/>mention the spaghetti that flow control turns into!!!)<br/><br/><br/>BTW, I didn&#39;t see any comments about my second thought, the one<br/>of inspecting compiled regexps. Did you like it? That also goes in<br/>the direction of exposing all the internals to the module writers...<br/><br/>Branden.<br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg587.html Mon, 08 Jan 2001 19:16:32 +0000 Re: Exposing regexp engine & compiled regexp's by Rick Delaney <br/>Damian Conway wrote:<br/>&gt; <br/>&gt; Branden wrote:<br/>&gt; <br/>&gt; &gt; Then, what you proposed in RFC 93 through<br/>&gt; &gt;<br/>&gt; &gt; sub { ... } =~ m/.../;<br/>&gt; &gt;<br/>&gt; &gt; could be handled by<br/>&gt; &gt;<br/>&gt; &gt; my $mymatch = MyClassForMatchingFromFileHandles-&gt;new($myhandle);<br/>&gt; &gt; $mymatch =~ m/.../;<br/>&gt; <br/>&gt; This is an interesting alternative. The main problem is that matching<br/>&gt; against a blessed object already has a useful meaning in Perl: stringify<br/>&gt; the object (calling its overloaded stringification operator if possible)<br/>&gt; and match against the resulting string.<br/><br/>That may not be a problem. In the absence of overloading of =~, the<br/>object can be stringified (like now) but if =~ is overloaded then that<br/>could take precedence. 
Or, just make =~ behave the same as eq and use<br/>fallback to get the current stringifying behaviour. <br/><br/>The biggest problem is making =~ overloadable in the first place. This<br/>has a lot of potential but needn&#39;t replace your RFC:<br/><br/> use overload &quot;=~&quot; =&gt; sub {<br/> my ($obj, $pat) = @_;<br/> my $coderef = $obj-&gt;{attr};<br/> $coderef =~ /$pat/;<br/> };<br/><br/><br/>&gt; My other problem with this approach is that it&#39;s relatively heavy. Let&#39;s<br/>&gt; take the example in the RFC and implement it both ways:<br/>&gt; <br/>&gt; # As the RFC proposes:<br/>&gt; <br/>&gt; sub from_STDIN {<br/>&gt; $_[1] ? $fh-&gt;pushback($_[0]) : $fh-&gt;getn($_[0])<br/>&gt; }<br/>&gt; <br/>&gt; \&amp;from_STDIN =~ /pat/;<br/>&gt; <br/>&gt; # As Branden proposes:<br/>&gt; <br/>&gt; package From_STDIN;<br/>&gt; <br/>&gt; sub new { bless $_[1], $_[0] }<br/>&gt; <br/>&gt; sub MORE_DATA { $_[0]-&gt;getn($_[1]) }<br/>&gt; sub ON_FAIL { $_[0]-&gt;pushback($_[1]) }<br/>&gt; <br/>&gt; use overload &quot;=~&quot; =&gt; 1;<br/>&gt; <br/>&gt; package main;<br/>&gt; <br/>&gt; From_STDIN-&gt;new($fh) =~ /pat/;<br/>&gt; <br/>&gt; Hmmmm. Potentially more flexible, but also much more ponderous.<br/><br/>I wouldn&#39;t say *much* more. There is slightly more to do for a module<br/>author (in object setup) but for the module user it is the difference<br/>between:<br/><br/> my $scalar = From_STDIN-&gt;new($fh);<br/> $scalar =~ /pat/;<br/>and<br/> my $scalar = From_STDIN_closure($fh);<br/> $scalar =~ /pat/;<br/><br/>The overload technique can be made considerably less ponderous by<br/>creating one class that behaves as your proposal. Then those that like<br/>it could just<br/><br/> use RFC93 sub {<br/> $_[1] ? 
$fh-&gt;pushback($_[0]) : $fh-&gt;getn($_[0])<br/> };<br/> my $scalar = RFC93-&gt;new;<br/> $scalar =~ /pat/;<br/><br/>-- <br/>Rick Delaney<br/>rick.delaney@home.com<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg586.html Mon, 08 Jan 2001 19:11:05 +0000 Re: Exposing regexp engine & compiled regexp's by Damian Conway Branden wrote:<br/><br/> &gt; I read your RFC 93. It mentions using a sub to read from the<br/> &gt; string. I just think it uses the sub in two conflicting ways, one<br/> &gt; for requesting more data from the stream and other for telling<br/> &gt; there was a match.<br/><br/>It&#39;s really using the sub as a interface to whatever source of data it&#39;s<br/>trying to match.<br/> <br/> &gt; I thought, too, that requesting it to return<br/> &gt; _exactly_ the number of characters that was requested goes against<br/> &gt; most unix syscalls convention (like read...), where it&#39;s requested<br/> &gt; to read at most that number of characters.<br/><br/>Err. The RFC never says the request is for exactly a certain number of<br/>characters; just that the subroutine will be told how many characters<br/>are *known* to be needed in order for the regex to continue matching.<br/>The RFC specifically mentions the possibility of returning fewer than<br/>requested characters.<br/><br/> &gt; What I think is that it could be handled by a OO module. Suppose<br/> &gt; there&#39;s how to hook into the regexp engine guts, getting responses<br/> &gt; as the ones you mentioned above. One could write a OO module, with<br/> &gt; methods for reading more data, checking end of data, and<br/> &gt; acknowledging a failed or succeeded match. Then, it could overload<br/> &gt; the =~ operator, making the regexp engine call the module&#39;s methods<br/> &gt; instead of its own&#39;s.<br/> &gt; <br/> &gt; Then, what you proposed in RFC 93 through <br/> &gt; <br/> &gt; sub { ... 
} =~ m/.../; <br/> &gt; <br/> &gt; could be handled by <br/> &gt; <br/> &gt; my $mymatch = MyClassForMatchingFromFileHandles-&gt;new($myhandle); <br/> &gt; $mymatch =~ m/.../; <br/><br/>This is an interesting alternative. The main problem is that matching<br/>against a blessed object already has a useful meaning in Perl: stringify<br/>the object (calling its overloaded stringification operator if possible)<br/>and match against the resulting string.<br/><br/>My other problem with this approach is that it&#39;s relatively heavy. Let&#39;s<br/>take the example in the RFC and implement it both ways:<br/><br/> # As the RFC proposes:<br/><br/> sub from_STDIN {<br/> $_[1] ? $fh-&gt;pushback($_[0]) : $fh-&gt;getn($_[0])<br/> }<br/><br/> \&amp;from_STDIN =~ /pat/;<br/><br/><br/> # As Branden proposes:<br/><br/> package From_STDIN;<br/><br/> sub new { bless $_[1], $_[0] }<br/><br/> sub MORE_DATA { $_[0]-&gt;getn($_[1]) }<br/> sub ON_FAIL { $_[0]-&gt;pushback($_[1]) }<br/><br/> use overload &quot;=~&quot; =&gt; 1;<br/><br/> package main;<br/><br/> From_STDIN-&gt;new($fh) =~ /pat/;<br/><br/><br/>Hmmmm. Potentially more flexible, but also much more ponderous.<br/><br/><br/> &gt; BTW, if you have a C++-based regexp engine with a clean design,<br/> &gt; couldn&#39;t we use it as a base to a new regexp engine that supports<br/> &gt; current (or new) perl&#39;s regexp syntax and features and has its guts<br/> &gt; exposed?<br/><br/>It was very basic, and *very* slow. It was also DFA-based and hence<br/>unable to implement full Perl regex semantics.<br/><br/>Furthermore, the regex engine is (and should be) one of the most heavily<br/>optimized parts of Perl: probably not the place for clean, modular design. :-)<br/><br/>However, for what it&#39;s worth, I have no objection to making the code<br/>available for everyone&#39;s amusement. 
Bear in mind that this was written<br/>by a much earlier version of me (about 0.27), way back in the last<br/>millennium, before C++ was standardized and before there was an STL.<br/>Surprisingly, it still compiles and runs under g++ 2.8.1.<br/><br/>Grab it from: http://www.csse.monash.edu.au/~damian/Perl6/Regex/Regex.tar.gz<br/><br/>Damian<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg585.html Mon, 08 Jan 2001 16:01:30 +0000 Re: Exposing regexp engine & compiled regexp's by Jarkko Hietaniemi On Tue, Jan 09, 2001 at 09:53:10AM +1100, Damian Conway wrote:<br/>&gt; <br/>&gt; &gt; I once brutalized Henry Spencer&#39;s engine into telling me when I was<br/>&gt; &gt; on my way to a match. This was for a UI: I wanted to be able to say<br/>&gt; &gt; that the input should only match this RE, and if they typed something<br/>&gt; &gt; that broke the match, I could beep and disallow the character.<br/>&gt; &gt; <br/>&gt; &gt; I was just a greenhorn then, so it coredumped. But it&#39;s another use<br/>&gt; &gt; for access to internal engine states and failure reasons.<br/>&gt; <br/>&gt; Good point. Perhaps, on failure, a regex could set $@ to indicate what the<br/>&gt; problem was:<br/>&gt; <br/>&gt; while (gui_getchar($nextchar)) {<br/>&gt; ($input.$nextchar) =~ m/$pattern/;<br/>&gt; if ($@ =~ /fail/) { beep }<br/>&gt; else { $input .= $nextchar }<br/>&gt; }<br/><br/>Another related idea that has been circling in my brain lately is this:<br/>do we really need to have separate compilation and execution steps in<br/>our regex engine? Wouldn&#39;t an incremental on-demand engine be much<br/>more flexible and optimizable (e.g. finding &#39;the fast path&#39; smells<br/>like input-driven LRU to me)? (Also much worse hell to implement,<br/>but let&#39;s worry about that later :-)<br/><br/>-- <br/>$jhi++; # http://www.iki.fi/jhi/<br/> # There is this special biologist word we use for &#39;stable&#39;.<br/> # It is &#39;dead&#39;. 
-- Jack Cohen<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg584.html Mon, 08 Jan 2001 15:02:40 +0000 Re: Exposing regexp engine & compiled regexp's by Damian Conway <br/> &gt; I once brutalized Henry Spencer&#39;s engine into telling me when I was<br/> &gt; on my way to a match. This was for a UI: I wanted to be able to say<br/> &gt; that the input should only match this RE, and if they typed something<br/> &gt; that broke the match, I could beep and disallow the character.<br/> &gt; <br/> &gt; I was just a greenhorn then, so it coredumped. But it&#39;s another use<br/> &gt; for access to internal engine states and failure reasons.<br/><br/>Good point. Perhaps, on failure, a regex could set $@ to indicate what the<br/>problem was:<br/><br/> while (gui_getchar($nextchar)) {<br/> ($input.$nextchar) =~ m/$pattern/;<br/> if ($@ =~ /fail/) { beep }<br/> else { $input .= $nextchar }<br/> }<br/><br/>Damian<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg583.html Mon, 08 Jan 2001 14:53:15 +0000 Re: Exposing regexp engine & compiled regexp's by Filipe Brandenburger <br/>Damian wrote: <br/>&gt; <br/>&gt; I once wrote a C++-based regex engine (much simpler than Perl&#39;s!) <br/>&gt; just like this. 
<br/>&gt; <br/>&gt; Knowing why a regex failed *is* invaluable when matching regexes <br/>&gt; against file streams, but there are more possibilities than you <br/>&gt; mentioned: <br/>&gt; <br/>&gt; &quot;Failed&quot; Did not match because of illegal transition <br/>&gt; <br/>&gt; &quot;Short&quot; Did not match: did not reach acceptor state <br/>&gt; <br/>&gt; &quot;Exact&quot; Matched and finished in an acceptor state <br/>&gt; <br/>&gt; &quot;Long&quot; Passed through an acceptor state, continued <br/>to <br/>&gt; match, but did not finish in an acceptor state <br/>&gt; <br/>&gt; &quot;LongFailed&quot; Passed through an acceptor state, continued <br/>to <br/>&gt; match, but then found an illegal transition <br/>&gt; <br/>&gt; Ultimately, I decided that what was needed wasn&#39;t insight into the cause <br/>&gt; of failure, but rather the chance to provide more data to &quot;feed&quot; the <br/>&gt; engine so it doesn&#39;t have to fail &quot;Short&quot; or &quot;Long&quot;. That&#39;s why I <br/>&gt; proposed RFC 93 (http://dev.perl.org/rfc/93.html) instead of a mechanism <br/>&gt; such as you have suggested. <br/>&gt; <br/>&gt; Damian <br/>&gt; <br/><br/><br/>Good points, Damian. <br/><br/>I read your RFC 93. It mentions using a sub to read from the string. I just <br/>think it uses the sub in two conflicting ways, one for requesting more data <br/>from the stream and other for telling there was a match. I thought, too, <br/>that requesting it to return _exactly_ the number of characters that was <br/>requested goes against most unix syscalls convention (like read...), where <br/>it&#39;s requested to read at most that number of characters. <br/><br/>What I think is that it could be handled by a OO module. Suppose there&#39;s how <br/>to hook into the regexp engine guts, getting responses as the ones you <br/>mentioned above. 
One could write a OO module, with methods for reading more <br/>data, checking end of data, and acknowledging a failed or succeeded match. <br/>Then, it could overload the =~ operator, making the regexp engine call the <br/>module&#39;s methods instead of its own&#39;s. <br/><br/>Then, what you proposed in RFC 93 through <br/><br/> sub { ... } =~ m/.../; <br/><br/>could be handled by <br/><br/> my $mymatch = MyClassForMatchingFromFileHandles-&gt;new($myhandle); <br/> $mymatch =~ m/.../; <br/><br/>What I mean is, by exposing the guts of the regexp engine, we could <br/>implement all that&#39;s wanted in RFC 93, with a cleaner interface, and even do <br/>more, because we can hook up every call to the regexp engine! <br/><br/>BTW, if you have a C++-based regexp engine with a clean design, couldn&#39;t we <br/>use it as a base to a new regexp engine that supports current (or new) <br/>perl&#39;s regexp syntax and features and has its guts exposed? <br/><br/><br/>Branden. <br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg582.html Mon, 08 Jan 2001 04:17:15 +0000 Re: Exposing regexp engine & compiled regexp's by Nathan Torkington Damian Conway writes:<br/>&gt; I once wrote a C++-based regex engine (much simpler than Perl&#39;s!)<br/>&gt; just like this.<br/><br/>I once brutalized Henry Spencer&#39;s engine into telling me when I was<br/>on my way to a match. This was for a UI: I wanted to be able to say<br/>that the input should only match this RE, and if they typed something<br/>that broke the match, I could beep and disallow the character.<br/><br/>I was just a greenhorn then, so it coredumped. 
But it&#39;s another use<br/>for access to internal engine states and failure reasons.<br/><br/>Nat<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg581.html Sat, 06 Jan 2001 13:19:09 +0000 Re: Exposing regexp engine & compiled regexp's by Damian Conway <br/> &gt; 1. I think it should be possible to have ``incomplete matches&#39;&#39;.<br/> &gt; <br/> &gt; Regexp&#39;s are interpreted by a NFA, that is a state machine.<br/> &gt; I think it would be nice if, when I try to match a regexp against<br/> &gt; a string, and the string ends before the regexp matches, it<br/> &gt; would be possible to find out in which state the NFA has stopped,<br/> &gt; and it would also be possible to start another match from<br/> &gt; that state on.<br/><br/>I once wrote a C++-based regex engine (much simpler than Perl&#39;s!)<br/>just like this.<br/><br/>Knowing why a regex failed *is* invaluable when matching regexes<br/>against file streams, but there are more possibilities than you <br/>mentioned:<br/><br/> &quot;Failed&quot; Did not match because of illegal transition<br/><br/> &quot;Short&quot; Did not match: did not reach acceptor state<br/><br/> &quot;Exact&quot; Matched and finished in an acceptor state<br/><br/> &quot;Long&quot; Passed through an acceptor state, continued to<br/> match, but did not finish in an acceptor state<br/><br/> &quot;LongFailed&quot; Passed through an acceptor state, continued to<br/> match, but then found an illegal transition<br/><br/>Ultimately, I decided that what was needed wasn&#39;t insight into the cause<br/>of failure, but rather the chance to provide more data to &quot;feed&quot; the<br/>engine so it doesn&#39;t have to fail &quot;Short&quot; or &quot;Long&quot;. 
That&#39;s why I<br/>proposed RFC 93 (http://dev.perl.org/rfc/93.html) instead of a mechanism<br/>such as you have suggested.<br/><br/>Damian<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg580.html Sat, 06 Jan 2001 13:08:53 +0000 Exposing regexp engine & compiled regexp's by Filipe Brandenburger <br/>Hello.<br/><br/>I have some ideas (actually a wishlist) for the regular expression<br/>subsystem (that&#39;s what it&#39;ll be, right?). I would appreciate if<br/>perl6&#39;s regexp engine exposes the maximum possible interfaces, and<br/>I&#39;ll expose why I think this would be nice.<br/><br/><br/>1. I think it should be possible to have ``incomplete matches&#39;&#39;.<br/><br/> Regexp&#39;s are interpreted by a NFA, that is a state machine.<br/> I think it would be nice if, when I try to match a regexp against<br/> a string, and the string ends before the regexp matches, it<br/> would be possible to find out in which state the NFA has stopped,<br/> and it would also be possible to start another match from<br/> that state on.<br/><br/><br/><br/>Suppose I&#39;m reading from a file on a per-line basis. Suppose that<br/>I want to strip off XML-style comments from the file. What I need<br/>is to match the /&lt;!--.*?--&gt;/ regular expression and substitute it<br/>for, say, the empty string. The problem is if the file contains<br/>multi-line comments, where the ``&lt;!--&#39;&#39; is in one line and the ``--&gt;&#39;&#39;<br/>is in another posterior line, this method will fail. If, when I try<br/>the match, I see that I have an incomplete match, I could take<br/>the state of the regexp engine, and start matching with it in the<br/>next line, and so on, until the engine ends the match.<br/><br/>Other possibility is to tokenize a file `a la lex&#39;. For example,<br/>if I want to parse a C source code file, I break it in identifiers,<br/>constants, reserved-words, comments, etc. 
If I have the whole file<br/>in a string, I can break it using the /\G.../gc regular expressions,<br/>but reading a whole file in memory can be a problem sometimes. If<br/>we have incomplete matches, we can read a block of data from the file,<br/>try to match it against the regexp&#39;s and, if necessary, read more<br/>blocks of data from the file until a match is done.<br/><br/>Now I had another idea with the exposing of states of the NFA. It<br/>would be possible to join several /\G.../gc regular expressions into<br/>one, using |, and, after the match, use the state the NFA stopped in to<br/>determine which regexp was matched.<br/><br/>Another possibility is to implement the equivalent of awk&#39;s RS. awk has<br/>a RS variable similar to perl5&#39;s $/, but awk&#39;s RS can handle regular<br/>expressions instead of $/&#39;s strings. With NFA&#39;s states exposed it would<br/>be possible to implement the same behaviour as awk&#39;s RS.<br/><br/><br/>The possibilities are many. I think it wouldn&#39;t be difficult to expose<br/>NFA&#39;s states as I imagine it. After all, probably the C code inside of<br/>perl5 has them exposed. I have already worked with XS and the fact that the<br/>regular expression engine isn&#39;t exposed in the XS interface leaves me<br/>a little disappointed. There were times when I wanted to do things I<br/>say here and I realised that even with XS I would have to dig for perl&#39;s<br/>regexp engine or even use an external regexp engine to do the job.<br/><br/><br/><br/>2. I think compiled regexp&#39;s should be analysable by perl code.<br/><br/> Regexp&#39;s in their compiled states are (a little simplification here)<br/> a tree. I think it would be nice if there were ways of traversing<br/> this tree, to find properties of the regular expressions.<br/><br/><br/><br/>This would help several things. For example, consider a module that<br/>executes searches in a database table, matching people&#39;s names<br/>against regular expressions. 
Suppose now that the database column<br/>where the names are stored is indexed by prefix and suffix. If it<br/>were possible to traverse regular expressions, the module would<br/>be able to optimize searches like /^John/ and /Smith$/, using the<br/>indexes to reduce the number of records that the regexp should match.<br/>Of course it would have to search through all the records for a middle<br/>name, like /F\./, but the possibility to optimize some cases by<br/>inspecting the compiled regexp is a big win, at least for me.<br/><br/><br/>Traversing a compiled regexp also leads to implementing custom regexp<br/>engines, for example, an engine that calculates the difference or distance<br/>between the string and what would match the regexp. As with the first<br/>case, the possibilities are many.<br/><br/><br/><br/>I actually don&#39;t know perl5&#39;s regexp engine and compiler, but I believe<br/>it doesn&#39;t escape from my model (NFA with states &amp; a tree). I agree with<br/>the fact that code talks louder than speech, but I would like to know<br/>if there&#39;s something going on in this regexp area and if there really is<br/>interest in this. Anyway, if there&#39;s something going on here, I would like<br/>very much to join the team, if I can contribute a bit to it. If there&#39;s<br/>nothing going on and the list considers my idea good, I will try to<br/>`kind of&#39; implement it as a perl5 extension (oh, boy! that&#39;ll be hard!!!)<br/>If I do it, I hope I get help from you in doing it!<br/><br/>Thanks.<br/><br/>Branden<br/><br/><br/><br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2001/01/msg579.html Fri, 05 Jan 2001 14:15:41 +0000 An Apology by Deven T. Corzine Everyone:<br/><br/>I&#39;m sorry about the recent commotion over minor aspects of regex design and<br/>implementation. I stumbled into being the most active participant in an<br/>argument nobody wanted, myself included. 
I realized too late (and with the<br/>help of a friend) that I had been arguing for a _declarative_ viewpoint,<br/>against a prevailing iterative viewpoint. It turns out to be much more<br/>difficult to convey the declarative viewpoint clearly and precisely.<br/><br/>I still consider the declarative viewpoint to be perfectly valid, but I now<br/>realize that it&#39;s merely a different (and minority) perspective. I made<br/>the initial mistake of assuming everyone saw the same thing I did, and the<br/>further mistake of believing that the declarative viewpoint was preferable.<br/>I realize now that which is preferable depends on your frame of reference,<br/>and it&#39;s obvious now that the iterative viewpoint is the prevailing one.<br/>(That may be the result of indoctrination as much as natural inclination,<br/>but the fact remains.) What seems intuitive from _each_ viewpoint also<br/>seems counterintuitive from the other one, a good formula for dissension!<br/><br/>While I never assumed that the behavior would necessarily be changed (since<br/>historical precedent and/or implementation issues might be compelling), now<br/>I believe that it should remain as it is, because it&#39;s harder to understand<br/>the way I suggested, and changing the behavior could prove disruptive to<br/>the programmer&#39;s confidence in his/her ability to predict the proper match.<br/>(Also, the C-style comment matching example convinced me that there _can_<br/>be true utility in the current behavior, which I hadn&#39;t considered...)<br/><br/>Many people were quick to jump to conclusions and misinterpret my questions<br/>about the design as ignorance. I guess I got a little defensive because of<br/>that, and responded to more messages than was wise. I was surprised by the<br/>reaction I received, since with Perl&#39;s motto (TMTOWTDI), I rather expected<br/>divergent viewpoints to be not only accepted, but welcomed!<br/><br/>Maybe that was a naive expectation. 
Or maybe I just did a miserable job<br/>trying to explain the idea I was trying to convey. I wish I had thought<br/>of the declarative/iterative dichotomy earlier; my friend was the one who<br/>managed to label them for me. That brought the controversy clearly into<br/>focus for me, and made me realize that the perspectives were different,<br/>with neither inherently better than the other. That&#39;s when I realized that<br/>I had a personal preference for the declarative viewpoint that I didn&#39;t<br/>notice as such, and decided it was a mistake to argue this to begin with.<br/><br/>I suppose I&#39;m now tarred as the villain here, since I started the thread<br/>and (because most replies were directed at me) ended up the most active<br/>participant in this argument that got blown entirely out of proportion.<br/>I hope people can look beyond that, and not prejudge me from one incident.<br/>Only time will tell...<br/><br/>Sorry for all the trouble. If anyone wants to discuss this issue further,<br/>I&#39;ll only do it in private at this point; I don&#39;t want to further annoy<br/>anyone on this list with the topic, even if anyone DOES want to continue!<br/><br/>Deven<br/><br/>P.S. For anyone curious about the &quot;sleep debt&quot; thing I mentioned before,<br/>I&#39;ll recommend reading &quot;Sleep Thieves&quot; by Stanley Coren. It&#39;s interesting<br/>and presents some good evidence to suggest that Americans are particularly<br/>sleep-deprived in general, that it&#39;s a dangerous and less-productive state,<br/>and that we probably ought to be getting 9-10 hours of sleep, not the 7-8<br/>hours that society deems &quot;normal&quot;... (I too often get 6 hours or less!)<br/><br/>P.P.S. With my luck, it will probably turn out to have been a mistake to<br/>send this message also. 
Since I can&#39;t win anyway, I might as well send it.<br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg578.html Mon, 18 Dec 2000 08:59:30 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Bart Lateur On Fri, 15 Dec 2000 13:42:44 -0700, Kevin Walker wrote:<br/><br/>&gt;Deven seems to be advocating thinking about regular expressions <br/>&gt;without worrying too much about the implementation, even at a fairly <br/>&gt;abstract level.<br/><br/>Here&#39;s a counter example:<br/><br/> /aaaabbbbccccdddddbbbbcdddd/<br/><br/>Shouldn&#39;t a non-greedy matcher /b.*?d/, according to the OP&#39;s rules,<br/>match &quot;bcd&quot;, the second matching string? That is shorter than the first<br/>match. Oh, you want the first match. Well: same thing.<br/><br/>It is similar in nature to that question in comp.lang.perl.misc the<br/>other day: why doesn&#39;t greediness force /a|ab/ to preferably match &quot;ab&quot;?<br/>Because it doesn&#39;t. Try from left to right, first match wins. A dead<br/>simple rule.<br/><br/>If Perl&#39;s greediness/nongreediness doesn&#39;t work for you, rely on<br/>something else, to get what you want.<br/><br/>-- <br/> Bart.<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg577.html Sat, 16 Dec 2000 17:58:30 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by brian d foy On Fri, 15 Dec 2000, Simon Cozens wrote:<br/><br/>&gt; On Fri, Dec 15, 2000 at 11:39:08AM -0800, Randal L. Schwartz wrote:<br/>&gt; &gt; Tell me how you can do that without breaking much existing code.<br/>&gt; <br/>&gt; Pssst, Randal, this is Perl 6, not p5p.<br/><br/>well, we do have to translate 95% of that code to Perl 6 without a hitch.<br/>i imagine that breaking regex behaviour in favor of something new would<br/>make that quite difficult. 
perhaps that is not what Randal was talking<br/>about, but it is something to consider.<br/><br/>--<br/>brian d foy &lt;brian@smithrenaud.com&gt;<br/>Director of Technology, Smith Renaud, Inc.<br/>875 Avenue of the Americas, 2510, New York, NY 10001<br/> V: (212) 239-8985<br/><br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg576.html Sat, 16 Dec 2000 11:23:05 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by brian d foy On Fri, 15 Dec 2000, Deven T. Corzine wrote:<br/><br/>&gt; If we want the first interesting match, and we&#39;re preferring early matches<br/>&gt; and short matches, I believe that &quot;bccccd&quot; is more interesting.<br/><br/>then write a regex that describes that pattern. the pattern is<br/><br/> one b<br/> followed by<br/> some stuff that is not a d<br/> up to<br/> one d<br/><br/>you complain because the &quot;.&quot; regex special character does not do what you<br/>want. it matches any character except a newline. however, for the case<br/>you provide, you don&#39;t want any character except a newline. so, your use<br/>of &quot;.&quot; is your problem, along with your refusal to realize that what you<br/>are dealing with is a set of rules, and if you grok the rules, you can do<br/>what you want. it has nothing to do with opinion or intuition. neither<br/>of those have any place in a completely (well, almost ;) described,<br/>human-designed system. this isn&#39;t nuclear physics after all. <br/><br/>your trouble is not with greediness or shortest matches, as i said before.<br/>you just don&#39;t understand what you are doing and refuse to believe<br/>otherwise. 
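The pattern spelled out above can be tried directly. What follows is only a minimal sketch, not code from the original post: it assumes the sample string from Bart Lateur's message, reads "one b, some stuff that is not a d, one d" literally as /b[^d]*d/, and adds a third, hypothetical variant that also excludes "b" from the middle, since that seems to be what is needed to reach the "bccccd" substring under discussion. The variable names are illustrative.

```perl
use strict;
use warnings;

# Sample string taken from Bart Lateur's message in this thread.
my $text = 'aaaabbbbccccdddddbbbbcdddd';

# Leftmost match wins: the engine starts at the first 'b', and the
# non-greedy .*? stops at the first 'd' reachable from there.
my ($lazy) = $text =~ /(b.*?d)/;

# Literal reading of "one b, stuff that is not a d, one d".
# [^d]* is greedy but cannot cross a 'd', so it stops at the same place.
my ($negated) = $text =~ /(b[^d]*d)/;

# Assumption: also forbid 'b' in the middle, so the match can only
# begin at the last 'b' of a run.
my ($tight) = $text =~ /(b[^bd]*d)/;

print "$lazy\n$negated\n$tight\n";
```

Under these assumptions the first two patterns both yield "bbbbccccd", because the match still begins at the leftmost "b"; only the third yields "bccccd".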
<br/><br/>--<br/>brian d foy &lt;brian@smithrenaud.com&gt;<br/>Director of Technology, Smith Renaud, Inc.<br/>875 Avenue of the Americas, 2510, New York, NY 10001<br/> V: (212) 239-8985<br/><br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg575.html Sat, 16 Dec 2000 11:18:44 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by brian d foy On Fri, 15 Dec 2000, Deven T. Corzine wrote:<br/><br/>&gt; On 15 Dec 2000, Randal L. Schwartz wrote:<br/>&gt; <br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt; &quot;Deven&quot; == Deven T Corzine &lt;deven@ties.org&gt; writes:<br/>&gt; &gt; <br/>&gt; &gt; Deven&gt; As for special-case rules, I believe that my proposed modification would<br/>&gt; &gt; Deven&gt; REMOVE a special-case semantic rule, at the cost of added complexity at the<br/>&gt; &gt; Deven&gt; implementation level. (The cost decision of whether that added complexity<br/>&gt; &gt; Deven&gt; is worthwhile is a separate consideration.)<br/><br/>&gt; &gt; No, it would break a much higher overriding rule of &quot;left most match<br/>&gt; &gt; wins&quot;. <br/>&gt; <br/>&gt; Can you give a concrete, real-life example of code that my proposed change<br/>&gt; would actually break, not a contrived hypothetical case designed to break?<br/><br/>well, there is the trivial case that it breaks<br/><br/> m/b.*d/;<br/><br/>if i wrote that, i expect, since it is documented this way, to find the<br/>first b and everything up to and including the first d after that.<br/><br/><br/>but here&#39;s one from some code i wrote last week. i want to find all the<br/>groups of things like &quot;212-555-1212&quot;. this particular regex adequately<br/>describes the un-normalized data that i had to munge. 
<br/><br/> @matches = m/(\d.*?)\s+/g;<br/><br/>the right way (the documented way), gives me the entire phone number.<br/>yours gives me &quot;2&quot;.<br/><br/>you should be quiet now, or write your own language.<br/><br/><br/>--<br/>brian d foy &lt;brian@smithrenaud.com&gt;<br/>Director of Technology, Smith Renaud, Inc.<br/>875 Avenue of the Americas, 2510, New York, NY 10001<br/> V: (212) 239-8985<br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg574.html Sat, 16 Dec 2000 11:04:32 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Tom Christiansen &gt;Nice summary, but I&#39;m not buying what you&#39;re selling in the elaboration.<br/><br/>Then you lose, because I am not allowed to disagree with you anymore.<br/>And everyone else has already written you off.<br/><br/>And the answer to &quot;what breaks if minimal matching is overall but<br/>maximal matching is local&quot;--or even, &quot;if we change it all&quot;-- is<br/>a zillion programs, including just about any progressive match:<br/><br/> while (/.*?(\w+)=(\S+)/g) {<br/> push @{ $h{$1} }, $2;<br/> } <br/><br/>I can&#39;t wait for that to match the rightmost one and then fail. Bah.<br/><br/>&gt;&gt;/dev/null<br/><br/>--tom<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg573.html Fri, 15 Dec 2000 15:27:27 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Deven T. Corzine [I delayed responding to this message because it was the longest.]<br/><br/>On Thu, 14 Dec 2000, Tom Christiansen wrote:<br/><br/>&gt; &gt;No question that&#39;s how it&#39;s been implemented. But WHY would anyone want<br/>&gt; &gt;such behavior? 
When is it beneficial?<br/>&gt; <br/>&gt; It is beneficial because this is how it&#39;s always been, because it<br/>&gt; is faster, because it is more expressive, because it is more powerful,<br/>&gt; because it is more intuitive, and because it is more perlian.<br/><br/>Nice summary, but I&#39;m not buying what you&#39;re selling in the elaboration.<br/><br/>&gt; In elaboration:<br/>&gt; <br/>&gt; 0) All NFAs before POSIX acted this way. It is historically <br/>&gt; consistent and perfectly expected.<br/><br/>First, all those older regular expression systems were inherently greedy<br/>matching algorithms. There is absolutely no conflict between earliest<br/>match and greedy matching. The conflict only arises when non-greedy<br/>matching is introduced into the equation.<br/><br/>Second, by referring to NFA&#39;s, you&#39;re already dropping down to a lower<br/>level, and as I said in a prior message, you&#39;ve lost the gestalt by then.<br/>Once you drop to that level (and stop referencing the semantics of the<br/>whole), my point is moot, because it no longer makes sense at that level.<br/><br/>Finally, historical precedent is always a justification not to change<br/>anything. Of course, that argued against the wonderful regex extensions<br/>that Perl 5 added, since they went against the way &quot;it&#39;s always been&quot;...<br/><br/>&gt; 1) It is obviously faster to come to an answer earlier on in the<br/>&gt; execution than it would be to come to an answer later. It&#39;s<br/>&gt; like an expression whose evaluation short-circuits. Also, when<br/>&gt; the matching semantics permit back tracking and back references,<br/>&gt; the combinatoric possibilities can easily explode into virtual<br/>&gt; unsolvability as the 2**N algorithm loses its race to the heat<br/>&gt; death of the universe. 
Yes, if Perl did overall-longest or<br/>&gt; overall-shortest, this would produce a more predictable time;<br/>&gt; however, as we see with DFAs and POSIX NFAs, this prediction<br/>&gt; plays out as guaranteed *WORST-CASE* time. It is not acceptable<br/>&gt; to make everyone pay the worst-case time. Never penalize<br/>&gt; the whole world for the needs or desires of the few.<br/><br/>I don&#39;t think it should be implemented unless the cost is small or applied<br/>only to those situations where it&#39;s wanted despite the cost.<br/><br/>However, I don&#39;t believe it would NECESSARILY be slower to execute -- yes,<br/>there&#39;s more complexity. The price for that complexity must be paid, but<br/>it may be payable in memory by making a larger (but equally fast) NFA for<br/>the regular expression. Or maybe it can&#39;t. I don&#39;t know, because I have<br/>not investigated it. It&#39;s premature to simply assume it MUST be slower.<br/><br/>&gt; 2) Consider the simple case, /A|B/. In your overall longest/shortest,<br/>&gt; guaranteed worst-case time, both submatch A and submatch B must<br/>&gt; be calculated, and then the lengths of their matches both be compared.<br/>&gt; Perl, fortunately, does not do that. Rather, the first one in that<br/>&gt; sequence wins. That means that under the current scheme, the <br/>&gt; patterns /A|B/ and /B|A/ have different semantics. Under your <br/>&gt; worst-case scheme, they do not. Because /A|B/ and /B|A/ mean<br/>&gt; something different, more expressivity is provided. This is the<br/>&gt; same scenario, albeit expressed slightly differently, as your<br/>&gt; situation. The issues manifest in both are equivalent.<br/><br/>Then that is another semantic anomaly, because the alternation is supposed<br/>to mean that one or the other pattern matches. Logically, &quot;or&quot; should be<br/>commutative. If they&#39;re not quite, that&#39;s another disconnect between the<br/>high-level model and the implementation. 
Maybe /A|B/ and /B|A/ _do_ mean<br/>different things to the engine, but they do NOT mean different things in<br/>the high-level semantics -- except when you force the high-level semantics<br/>to change by fiat, to match the implementation.<br/><br/>&gt; 3) This leads to increased power. It&#39;s like the difference between<br/>&gt; a short-circuiting &quot;or&quot; and one that blindly plods ahead trying<br/>&gt; to figure something out even when all is for naught. Compare A&amp;&amp;B<br/>&gt; with A&amp;B, for example. If A is 0, then B need not be computed, <br/>&gt; yet in the second version, one runs subexpression B nevertheless.<br/>&gt; If according to the rules of one particular system, patX and<br/>&gt; patY mean different things, whereas in a second system, they are<br/>&gt; completely interchangeable, then the first system can express<br/>&gt; nuances that the second one cannot. When you have more nuances,<br/>&gt; more expressivity, then you have more power, because you can say<br/>&gt; things you could not otherwise say. Why do C and its derivatives<br/>&gt; such as Perl have short-circuiting Boolean operators? Because<br/>&gt; in older languages, such as Fortran and Pascal, where you did<br/>&gt; not have them, one quickly found that this was cumbersome and<br/>&gt; annoying.<br/><br/>It&#39;s a question of speed vs. correctness. Correctness is important, but<br/>occasionally a little incorrectness is worthwhile for increased speed.<br/><br/>Short-circuiting in C is a good example. It&#39;s *almost* correct -- the<br/>right boolean answer to the final expression WILL be returned, as will the<br/>correct matching string for /A|B/ or /B|A/. But it&#39;s not quite perfect.<br/>In C, side-effects of executing the short-circuited expressions (such as<br/>function calls) won&#39;t take effect. This is a worthwhile tradeoff, since it<br/>can save a LOT of execution time, and the programmer can work around it<br/>when it matters. 
But that semantic anomaly in C is well documented and<br/>accepted, so people get by with it.<br/><br/>If the cost of &quot;fixing&quot; this non-greedy/leftmost semantic anomaly in Perl&#39;s<br/>regular expression engine is prohibitive, it&#39;s certainly acceptable to say<br/>it&#39;s a worthwhile tradeoff and document why it&#39;s the way it is. But to<br/>deny the validity of the alternative approach is a bit more extreme.<br/><br/>The current behavior is &quot;correct enough&quot;. It&#39;s not important to change it.<br/>But, if we can fix it and make it completely correct, it&#39;s at least worth<br/>considering -- but maybe not (in the end) worth actually implementing.<br/><br/>&gt; 4) It is more intuitive to the reader and the writer to minimize<br/>&gt; strange action at a distance. It&#39;s more to remember; or, perhaps<br/>&gt; better phrased, more to forget. That&#39;s why we don&#39;t like <br/>&gt; variables set in one place magically affecting innocent code<br/>&gt; elsewhere. Maybe it&#39;s more applicable here to say that that&#39;s<br/>&gt; why having mixed precedences and associativities confuses people.<br/>&gt; If in an expression like A-&gt;B-&gt;C-&gt;D, you had to know a priori when<br/>&gt; evaluating A that D was going to be coming up, it would require <br/>&gt; greater look-ahead, more mental storage. Even if a computer could<br/>&gt; do it, people would find it harder. That&#39;s why we don&#39;t write<br/>&gt; <br/>&gt; &amp;{&amp;{$fnctbl{expr}}(arg1)}(arg2)<br/>&gt; <br/>&gt; when we can simply write<br/>&gt; <br/>&gt; $fnctbl{expr}-&gt;(arg1)-&gt;(arg2)<br/>&gt; <br/>&gt; It is not intuitive to people to have to do too much look-ahead, <br/>&gt; or too much storage. 
Having distant items interact with one<br/>&gt; another is confusing, and we&#39;ve already got that situation with<br/>&gt; backreferences, as in /(\w+)(\w+)\s+\2(\w+)/, which depending on <br/>&gt; how you start weighting those +&#39;s into +?&#39;s, can really move<br/>&gt; matters around. Let&#39;s not exacerbate the counterintuitiveness.<br/><br/>Ah, but the current behavior *is* strange action at a distance. Instead of<br/>looking at the pattern and the part of the string you expect it to match,<br/>you have to consider at-a-distance factors such as the global priority of<br/>leftmost matching as a strictly-overriding factor, or whether or not some<br/>greedy subexpression earlier in the regular expression will eat the<br/>characters or not. I believe the current behavior is counterintuitive,<br/>which is exactly why I suggested changing it.<br/><br/>&gt; 5) It is more Perlian because of the principle that things that look <br/>&gt; different should actually *be* different. /A|B/ and /B|A/ look<br/>&gt; quite different. Thus, they should likewise *be* different.<br/><br/>Really? So, the semantics of &quot;if (!(expr)) {...}&quot; should be different from<br/>&quot;unless (expr) {...}&quot; just because they look different? An optimizer would<br/>have a hell of a time getting ANYTHING done if it weren&#39;t for the semantic<br/>equivalence of different code that can accomplish the same thing.<br/><br/>If the high-level semantics are logically equivalent, constraining those<br/>semantics for the sake of the implementation is unnecessarily complex.<br/><br/>&gt; &gt;I didn&#39;t need the long-winded explanation, and I don&#39;t need help with<br/>&gt; &gt;understanding how that regexp matches what it does. I understand it<br/>&gt; &gt;perfectly well already. 
I&#39;m no neophyte with regular expressions, even if<br/>&gt; &gt;Perl 5 does offer some regexp features I&#39;ve never bothered to exploit...<br/>&gt; <br/>&gt; All NFAs prior to POSIX behaved in the fashion that Perl&#39;s continue<br/>&gt; to behave in. I am surprised that over the long course of your <br/>&gt; experiences with regexes, that you never noticed this fundamental<br/>&gt; principle before.<br/><br/>Not all NFAs are used for regexps. Moreover, it&#39;s already too low a level;<br/>once you start viewing the regexp as a series of instructions for how to<br/>move through the states of an NFA, you&#39;ve already turned it into a program<br/>and lost the semantics I&#39;m trying to preserve from the original pattern.<br/><br/>And yes, I&#39;ve used regular expressions heavily for MANY years now. It&#39;s<br/>not relevant to this discussion, except to say that I&#39;m not ignorant of<br/>the way regular expressions work.<br/><br/>&gt; &gt;My point is that the current behavior, while reasonable, isn&#39;t quite right.<br/>&gt; <br/>&gt; You&#39;re wrong. Don&#39;t call it &quot;not right&quot;. It&#39;s perfectly correct<br/>&gt; and consistent. It follows directly from historical behavior of<br/>&gt; these things, and quite simply, it&#39;s in the rules. It&#39;s preferable<br/>&gt; for the many reasons I outlined for you above.<br/><br/>Historical precedent does have bearing on whether to implement it, but not<br/>on what the ideal semantics should be. The codified rules are related to<br/>the implementation, and again shouldn&#39;t have any bearing on the idealized<br/>semantics. As for being preferable, I found your arguments unconvincing;<br/>on the contrary, I consider the current behavior counterintuitive, which is<br/>the only incentive I see for changing it.<br/><br/>&gt; You just want something different. That does not make this wrong.<br/>&gt; If I want ice cream and you want hamburgers, this does not make me<br/>&gt; wrong. 
However, for you to tell me I really want hamburgers when<br/><br/>I want the regular expressions to be intuitive and consistent across ALL<br/>levels, not just at the procedural NFA implementation level. What&#39;s wrong<br/>with that desire?<br/><br/>&gt; &gt;design flaw in Perl 5&#39;s regular expressions. I was hoping we could have a<br/>&gt; &gt;rational debate about it, rather than propagating the same design flaw into<br/>&gt; &gt;Perl 6 through inertia or for legacy reasons. <br/>&gt; <br/>&gt; It is not a design &quot;flaw&quot;. See above.<br/><br/>My mistake in terminology. It&#39;s too loaded to call it a &quot;flaw&quot;, which is<br/>why I&#39;ve taken to calling it an &quot;anomaly&quot;, which is more value-neutral.<br/><br/>&gt; &gt;&gt; You should probably read the whole chapter.<br/>&gt; <br/>&gt; &gt;That was uncalled for! Just because I disagree with a particular detail of<br/>&gt; &gt;the design doesn&#39;t mean that I didn&#39;t _understand_ it...<br/>&gt; <br/>&gt; It was hardly uncalled for. You did not prove that you understood<br/>&gt; it. You have not yet done so, as you did not cover any of the<br/>&gt; ground which I did above. Perhaps if I had said &quot;Read the source<br/>&gt; code&quot;, that might have been uncalled for. But I did not say that,<br/>&gt; and your retort was gratuitously combative.<br/><br/>Combative? I don&#39;t believe so. I simply objected to the assumption that<br/>disagreeing with the design constitutes ignorance of it.<br/><br/>I could prove that I understand regexps, but what would be the point?<br/><br/>&gt; You see, it just so happens that there&#39;s much more in that chapter,<br/>&gt; including the referenced Little Engine section, that when read as<br/>&gt; a whole, might well shed better understanding than mere extracts.<br/>&gt; You should also check out what _Mastering Regular Expressions_ says about<br/>&gt; that. I won&#39;t quote for you. 
I&#39;m tired of typing when you could<br/>&gt; be reading.<br/><br/>I&#39;ve read the 2nd edition of the camel book, but I haven&#39;t seen the new<br/>material in the 3rd edition yet. (I&#39;ve postponed buying it, to leave it as<br/>a potential Xmas present. I&#39;ll buy it immediately after Xmas if I don&#39;t<br/>get it as a present.) I glanced over the section in the bookstore. Looks<br/>like a good overview of the specific algorithm implemented, but it&#39;s not<br/>really relevant to the question of idealized semantics.<br/><br/>&gt; What is completely uncalled for, though, is to call something a bug<br/>&gt; just because you either don&#39;t understand it or because you don&#39;t<br/>&gt; like it. Lamentably, such behavior is remarkably frequent. You<br/>&gt; are hardly the first.<br/><br/>I apologize for seeming to call it a bug, and for calling it a &quot;flaw&quot; --<br/>those sound too much like value judgements, and encourage people to<br/>entrench their positions rather than contemplating alternatives.<br/><br/>Deven<br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg572.html Fri, 15 Dec 2000 15:04:28 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Simon Cozens On Fri, Dec 15, 2000 at 05:20:35PM -0500, Deven T. Corzine wrote:<br/>&gt; It&#39;s a pattern, not a program. Yes, it&#39;s straightforward to treat it as a<br/>&gt; step-by-step procedure for matching that pattern, but by doing so, you lose<br/>&gt; something of the gestalt of the whole. <br/> <br/>You may deal in patterns, but computers deal in programs. There&#39;s a reason<br/>Gestalt.pm hasn&#39;t made it yet.<br/><br/>-- <br/>Putting a square peg into a round hole can be worthwhile if you don&#39;t mind a <br/>few shavings. -- Larry Wall<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg571.html Fri, 15 Dec 2000 15:00:02 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Tom Christiansen &gt;Take. It. To. 
Private. Email. Please.<br/><br/>I&#39;m going to do better. I&#39;m taking it to /dev/null.<br/>It&#39;s not worth my wasting my life over. Nobody<br/>agrees with this guy, so it doesn&#39;t matter.<br/><br/>--tom<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg570.html Fri, 15 Dec 2000 14:31:43 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Deven T. Corzine <br/>On Fri, 15 Dec 2000, Tom Christiansen wrote:<br/><br/>&gt; &gt;At worst, this should take no more than double the amount of time that the<br/>&gt; &gt;single pass did, probably less. Hardly a cause to concern ourselves with<br/>&gt; &gt;the heat death of the universe.<br/>&gt; <br/>&gt; Oh really? We have shown that for the kind of global overall<br/>&gt; analysis that you are asking for, that in the general case, all<br/>&gt; possible paths must be taken. You cannot short-circuit, because<br/>&gt; you must first consider all possibilities and then weigh each valid<br/>&gt; result against each other valid result.<br/>&gt; <br/>&gt; Consider something like /.*/ or /.*?/. For a string of length N,<br/>&gt; there are<br/>&gt; <br/>&gt; (N+1)(N+2) / 2<br/>&gt; <br/>&gt; substrings that it matches. That means that an 80-byte string<br/>&gt; has some 3321 possible substrings, all of which must be considered.<br/>&gt; <br/>&gt; In the short-circuiting version, the Engine need consider but one<br/>&gt; single solitary case for each of those. 3321 is not the double of 1.<br/>&gt; <br/>&gt; Consider now something like /(.*)(.*)/ or /(.*?)(.*?)/ or /(.*)(.*?)/<br/>&gt; or /(.*?)(.*)/. You now have<br/>&gt; <br/>&gt; ((N+1)(N+2))^2 / 4<br/>&gt; <br/>&gt; cases to consider, or, in the case of an 80-byte string, some<br/>&gt; 11,029,041 possible choices. <br/><br/>Where does the combinatorial math have any relevance? 
I&#39;m not suggesting<br/>checking every possibility.<br/><br/>&gt; And with the current, normal, standard, short-circuiting system, <br/>&gt; the Engine has to consider, hm... could it be just one possibility?<br/>&gt; And that&#39;s just with two wildcards. People are often writing more<br/>&gt; than that.<br/><br/>You can still short-circuit. I&#39;m not suggesting examining any further into<br/>the text to be searched than it already does. For the second pass, you can<br/>scan backwards for a rightmost match and short-circuit that, if you want.<br/>(But that would leave some small chance of a shorter match in the middle.)<br/><br/>&gt; Can you now see why this would be a problem? And how even in the<br/>&gt; cases where it didn&#39;t actually break old programs (many of which it<br/>&gt; would!) that it would cause many of them to apparently hang, <br/>&gt; racing for electron death?<br/><br/>It would be a problem if you actually had a combinatorial algorithm.<br/><br/>At worst, you&#39;re re-testing the regexp N times, where N is the length of<br/>the matched string from the first pass, NOT the length of the original<br/>string to be searched.<br/><br/>Oh, and where are these &quot;many&quot; old programs it would break? I&#39;d still love<br/>to hear a real-world example that would actually break...<br/><br/>Deven<br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg569.html Fri, 15 Dec 2000 14:28:12 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Jarkko Hietaniemi Take. It. To. Private. Email. Please.<br/><br/>-- <br/>$jhi++; # http://www.iki.fi/jhi/<br/> # There is this special biologist word we use for &#39;stable&#39;.<br/> # It is &#39;dead&#39;. -- Jack Cohen<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg568.html Fri, 15 Dec 2000 14:22:52 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Deven T. 
Corzine <br/>On Fri, 15 Dec 2000, Tom Christiansen wrote:<br/><br/>&gt; &gt;That would be a strange regexp, but I never suggested it. I suggested the<br/>&gt; &gt;regexp /b.*?d/ and pointed out that I believe &quot;bccccd&quot; is a more intuitive<br/>&gt; &gt;match than &quot;bbbbccccd&quot;. That was the matching text, not the regexp, sorry<br/>&gt; &gt;if I didn&#39;t make that clear.<br/>&gt; <br/>&gt; Fine. What you said is <br/>&gt; <br/>&gt; first<br/>&gt; find a b<br/>&gt; then <br/>&gt; find any non-newline, repeated 0 to N times<br/>&gt; then <br/>&gt; find a d<br/>&gt; <br/>&gt; What part of &quot;first find a b&quot; do you expect a randomizing solution to?<br/>&gt; That&#39;s very clear.<br/><br/>It&#39;s a pattern, not a program. Yes, it&#39;s straightforward to treat it as a<br/>step-by-step procedure for matching that pattern, but by doing so, you lose<br/>something of the gestalt of the whole. I&#39;m aware of the mapping from the<br/>pattern to the procedure, but I&#39;m retaining in mind the original intent of<br/>the pattern as well as the details of how it&#39;s matched. The semantics of<br/>the step-by-step procedure are very similar to those of the pattern, but<br/>not identical.<br/><br/>The semantics I want it to mean are &quot;find a match starting with b, ending<br/>with d, with as little as necessary in between&quot;. The difference between<br/>this and the procedural description you give above is what I&#39;m talking<br/>about where we&#39;re losing something of the gestalt of the whole. While it&#39;s<br/>similar, it&#39;s not quite the same.<br/><br/>Deven<br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg567.html Fri, 15 Dec 2000 14:20:48 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! 
by Tom Christiansen And while I&#39;m at it, consider /(.*)(.*)(.*)/, which we&#39;ll call<br/>/ABC/. You need to be able to say all of these independently<br/>and in conjunction with one another:<br/><br/> whether segment A is longest or shortest overall <br/> whether segment B is longest or shortest overall <br/> whether segment C is longest or shortest overall <br/><br/> whether segment AB is longest or shortest overall <br/> whether segment BC is longest or shortest overall <br/><br/> whether segment ABC is longest or shortest overall <br/><br/>Imagine wanting, in /ABC/, A and B to be minimal, C to be maximal,<br/>AB to be maximal, BC to be minimal, and ABC to be maximal.<br/><br/>Does this not strike fear into your heart? The very notation we&#39;d<br/>have to devise should itself be plenty sufficient to give you serious<br/>pause--and that&#39;s not even considering the heat-death problem of<br/>guaranteed worst-case behavior that the word &quot;overall&quot; mandates.<br/><br/>Be very afraid.<br/>--tom<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg566.html Fri, 15 Dec 2000 13:48:59 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Tom Christiansen &gt;At worst, this should take no more than double the amount of time that the<br/>&gt;single pass did, probably less. Hardly a cause to concern ourselves with<br/>&gt;the heat death of the universe.<br/><br/>Oh really? We have shown that for the kind of global overall<br/>analysis that you are asking for, that in the general case, all<br/>possible paths must be taken. You cannot short-circuit, because<br/>you must first consider all possibilities and then weigh each valid<br/>result against each other valid result.<br/><br/>Consider something like /.*/ or /.*?/. For a string of length N,<br/>there are<br/><br/> (N+1)(N+2) / 2<br/><br/>substrings that it matches. 
That means that an 80-byte string<br/>has some 3321 possible substrings, all of which must be considered.<br/><br/>In the short-circuiting version, the Engine need consider but one<br/>single solitary case for each of those. 3321 is not the double of 1.<br/><br/>Consider now something like /(.*)(.*)/ or /(.*?)(.*?)/ or /(.*)(.*?)/<br/>or /(.*?)(.*)/. You now have<br/><br/> ((N+1)(N+2))^2 / 4<br/><br/>cases to consider, or, in the case of an 80-byte string, some<br/>11,029,041 possible choices. <br/><br/>And with the current, normal, standard, short-circuiting system, <br/>the Engine has to consider, hm... could it be just one possibility?<br/>And that&#39;s just with two wildcards. People are often writing more<br/>than that.<br/><br/>Can you now see why this would be a problem? And how even in the<br/>cases where it didn&#39;t actually break old programs (many of which it<br/>would!) that it would cause many of them to apparently hang, <br/>racing for electron death?<br/><br/>--tom<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg565.html Fri, 15 Dec 2000 13:37:56 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Tom Christiansen &gt;That would be a strange regexp, but I never suggested it. I suggested the<br/>&gt;regexp /b.*?d/ and pointed out that I believe &quot;bccccd&quot; is a more intuitive<br/>&gt;match than &quot;bbbbccccd&quot;. That was the matching text, not the regexp, sorry<br/>&gt;if I didn&#39;t make that clear.<br/><br/>Fine. What you said is <br/><br/> first<br/> find a b<br/> then <br/> find any non-newline, repeated 0 to N times<br/> then <br/> find a d<br/><br/>What part of &quot;first find a b&quot; do you expect a randomizing solution to?<br/>That&#39;s very clear.<br/><br/>--tom<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg564.html Fri, 15 Dec 2000 13:25:36 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Deven T. 
Corzine <br/>On Fri, 15 Dec 2000, Tom Christiansen wrote:<br/><br/>&gt; &gt;You can&#39;t explain why &quot;bbbbccccd&quot; matches without making reference to the<br/>&gt; &gt;absolute priority of the leftmost rule. &quot;bccccd&quot; would still make sense<br/>&gt; &gt;(locally) without reference to that rule.<br/>&gt; <br/>&gt; Nope. Nope, nope, and nope.<br/>&gt; <br/>&gt; This /bbbbccccd/ thing, which is completely unrealistic and<br/>&gt; non-real-worldly, says:<br/>&gt; <br/>&gt; find a <br/>&gt; b<br/>&gt; such that this is immediately followed by <br/>&gt; b<br/>&gt; such that this is immediately followed by <br/>&gt; b<br/>&gt; such that this is immediately followed by <br/>&gt; b<br/>&gt; such that this is immediately followed by <br/>&gt; c<br/>&gt; such that this is immediately followed by <br/>&gt; c<br/>&gt; such that this is immediately followed by <br/>&gt; c<br/>&gt; such that this is immediately followed by <br/>&gt; c<br/>&gt; such that this is immediately followed by <br/>&gt; d<br/>&gt; <br/>&gt; If you think that people expect &quot;find a b&quot; to suddenly mean <br/>&gt; something stochastic, you know different people than I do.<br/><br/>That would be a strange regexp, but I never suggested it. I suggested the<br/>regexp /b.*?d/ and pointed out that I believe &quot;bccccd&quot; is a more intuitive<br/>match than &quot;bbbbccccd&quot;. That was the matching text, not the regexp, sorry<br/>if I didn&#39;t make that clear.<br/><br/>Deven<br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg563.html Fri, 15 Dec 2000 13:14:22 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Deven T. Corzine <br/>On Fri, 15 Dec 2000, Nathan Torkington wrote:<br/><br/>&gt; Tom Christiansen writes:<br/>&gt; &gt; &gt;We may have to &quot;agree to disagree&quot;. 
<br/>&gt; &gt; <br/>&gt; &gt; I shan&#39;t be doing that.<br/>&gt; <br/>&gt; I think you should, or at least agree to take it private and report<br/>&gt; back to the list once you both come to a decision. Once you&#39;ve stated<br/>&gt; your position twice, there&#39;s not really much point in saying it a<br/>&gt; third time. It&#39;s a sign that the discussion is turning on itself.<br/><br/>Well, I&#39;ve tried to clarify each restatement further and/or reframe it,<br/>rather than repeating exactly the same thing. That said, if it comes to<br/>the point where Tom and myself are the only ones interested in hashing out<br/>the issue, I&#39;d be happy to take it offline until the two of us either agree<br/>or get tired of spinning in circles, then report back.<br/><br/>&gt; I&#39;m glad this thread hasn&#39;t turned into a flamewar, but the stakes are<br/>&gt; so low that I&#39;m sure it could easily become one. I&#39;d like to make<br/>&gt; sure that everyone, Deven, Tom, and the rest of the list, relax and<br/>&gt; realize that it&#39;s just a programming language.<br/><br/>This _is_ a minor issue. I&#39;m very much looking forward to Perl 6; since<br/>Perl 5 is such a joy to use, I fully expect Perl 6 to be even better. And<br/>I&#39;ll be using it, even if the semantic anomaly I believe I see is there.<br/><br/>Deven<br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg562.html Fri, 15 Dec 2000 13:13:07 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Deven T. Corzine <br/>On Fri, 15 Dec 2000, Tom Christiansen wrote:<br/><br/>&gt; &gt;We may have to &quot;agree to disagree&quot;. 
<br/>&gt; <br/>&gt; I shan&#39;t be doing that.<br/><br/>Well, I&#39;m still willing to discuss it, as long as it remains a discussion<br/>and doesn&#39;t become a flame war.<br/><br/>&gt; &gt;I understand why people believe in<br/>&gt; &gt;the current semantics, but I&#39;ve seen no indication that anyone else<br/>&gt; &gt;understands why I believe in these alternative semantics, or has tried.<br/>&gt; &gt;(Disagreeing with my conclusion doesn&#39;t preclude understanding where I&#39;m<br/>&gt; &gt;coming from, but nobody seems to.)<br/>&gt; <br/>&gt; You have not addressed the heat death of the universe as I and<br/>&gt; others have illustrated. Finding all possible matches is very often<br/>&gt; completely infeasible. Please solve the electron decay problem<br/>&gt; before continuing.<br/><br/>Where does the heat death of the universe come in? I can give you a SIMPLE<br/>way to implement it, but I doubt it&#39;s the best way: apply the current rules<br/>first, then take the matching substring and search within THAT for a match<br/>with the priority of the rules inverted -- prefer non-greediness OVER<br/>leftmost matching for the second pass. This WILL get the result I suggest,<br/>preferring leftmost matching in general while still maximizing the amount of<br/>non-greediness (stinginess) within those constraints.<br/><br/>At worst, this should take no more than double the amount of time that the<br/>single pass did, probably less. Hardly a cause to concern ourselves with<br/>the heat death of the universe.<br/><br/>Note, I do NOT recommend that implementation; it imposes an obvious speed<br/>penalty that shouldn&#39;t be imposed on people who don&#39;t care. It might make<br/>sense as an option, however.<br/><br/>However, it does bring another possibility to mind. For those who are<br/>willing to pay a 100% speed penalty for simplicity, this sort of two-pass<br/>mode could be allowed, and allow juggling of both preferences? 
Maybe it<br/>would be useful to allow a &quot;rightmost matching preference&quot; option for<br/>people who could use it. (Might be helpful when working with Hebrew?)<br/><br/>&gt; &gt;Well, obviously we could. Maybe we shouldn&#39;t, but we could do it. Many,<br/>&gt; &gt;many existing programs depended on Perl 4&#39;s magic behavior with @&#39;s in<br/>&gt; &gt;double-quoted strings, yet Perl 5 broke them all with a fatal error during<br/>&gt; &gt;the compile phase. People survived. They adapted and moved on. <br/>&gt; <br/>&gt; Red herring.<br/><br/>Counterexample to the assumption that we can&#39;t break existing code by<br/>changing the semantics. It&#39;s been done before, it could happen again.<br/><br/>&gt; &gt;Unlike that incompatibility, this one would probably affect few<br/>&gt; &gt;programs.<br/>&gt; <br/>&gt; You&#39;re wrong. Incredibly wrong. <br/><br/>Really? Do you have a real-world example that it would break, which would<br/>demonstrate how common such breakage would be?<br/><br/>Deven<br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg561.html Fri, 15 Dec 2000 13:09:53 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Nathan Torkington Tom Christiansen writes:<br/>&gt; &gt;We may have to &quot;agree to disagree&quot;. <br/>&gt; <br/>&gt; I shan&#39;t be doing that.<br/><br/>I think you should, or at least agree to take it private and report<br/>back to the list once you both come to a decision. Once you&#39;ve stated<br/>your position twice, there&#39;s not really much point in saying it a<br/>third time. It&#39;s a sign that the discussion is turning on itself.<br/><br/>I&#39;m glad this thread hasn&#39;t turned into a flamewar, but the stakes are<br/>so low that I&#39;m sure it could easily become one. 
I&#39;d like to make<br/>sure that everyone, Deven, Tom, and the rest of the list, relax and<br/>realize that it&#39;s just a programming language.<br/><br/>Nat<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg560.html Fri, 15 Dec 2000 13:08:16 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Tom Christiansen &gt;You can&#39;t explain why &quot;bbbbccccd&quot; matches without making reference to the<br/>&gt;absolute priority of the leftmost rule. &quot;bccccd&quot; would still make sense<br/>&gt;(locally) without reference to that rule.<br/><br/>Nope. Nope, nope, and nope.<br/><br/>This /bbbbccccd/ thing, which is completely unrealistic and<br/>non-real-worldly, says:<br/><br/> find a <br/> b<br/> such that this is immediately followed by <br/> b<br/> such that this is immediately followed by <br/> b<br/> such that this is immediately followed by <br/> b<br/> such that this is immediately followed by <br/> c<br/> such that this is immediately followed by <br/> c<br/> such that this is immediately followed by <br/> c<br/> such that this is immediately followed by <br/> c<br/> such that this is immediately followed by <br/> d<br/><br/>If you think that people expect &quot;find a b&quot; to suddenly mean <br/>something stochastic, you know different people than I do.<br/><br/>--tom<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg559.html Fri, 15 Dec 2000 13:02:05 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Tom Christiansen &gt;On Fri, 15 Dec 2000, Tom Christiansen wrote:<br/><br/>&gt;&gt; &gt;As for special-case rules, I believe that my proposed modification would<br/>&gt;&gt; &gt;REMOVE a special-case semantic rule, at the cost of added complexity at the<br/>&gt;&gt; &gt;implementation level. <br/>&gt;&gt; <br/>&gt;&gt; What is this alleged &quot;special-case rule&quot; you are talking about?<br/>&gt;&gt; There is no such thing. None. 
When you write /pat/, it means to<br/>&gt;&gt; find the first such pattern. There is no special case here.<br/><br/>&gt;The special case is &quot;as long as it has the earliest starting position&quot;.<br/><br/>&gt;There may be many, many possible matches for a regexp in a given string,<br/>&gt;especially with an expression as inclusive as &quot;.*&quot;. <br/><br/>You want to change things from &quot;find a match&quot;, which has the obviously<br/>deterministic semantics of finding the first match, and alter that<br/>to mean &quot;find all possible matches; now, amongst those...&quot;. This<br/>is much more complicated, at many levels.<br/><br/>You have yet to address my long mail to you.<br/><br/>You have yet to read MRE.<br/><br/>&gt;So, you have to apply some disambiguating rules to identify which matches<br/>&gt;are &quot;interesting&quot; enough to be worth paying attention to. <br/><br/>There is no ambiguity. Short-circuiting is not ambiguity. Stopping when<br/>you have an answer is not ambiguity. You are mistaken.<br/><br/>--tom<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg558.html Fri, 15 Dec 2000 13:00:10 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Deven T. Corzine <br/>On Fri, 15 Dec 2000, Simon Cozens wrote:<br/><br/>&gt; On Fri, Dec 15, 2000 at 11:39:08AM -0800, Randal L. Schwartz wrote:<br/>&gt; &gt; Tell me how you can do that without breaking much existing code.<br/>&gt; <br/>&gt; Pssst, Randal, this is Perl 6, not p5p.<br/><br/>That&#39;s why I never suggested fixing it in Perl 5 -- the chance of breaking<br/>any existing code did NOT seem worth considering for something so minor.<br/><br/>But if Perl 6 is likely to break a few things anyway, then this would be an<br/>appropriate time to consider such a change. 
(If it actually makes sense.)<br/><br/>Deven<br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg557.html Fri, 15 Dec 2000 12:59:38 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Deven T. Corzine <br/>On Fri, 15 Dec 2000, Tom Christiansen wrote:<br/><br/>&gt; &gt;Actually, I&#39;m not sure -- it&#39;s conceivable that the ending point would ALSO<br/>&gt; &gt;move inward for a different starting point within the original match. But<br/>&gt; &gt;the ending point should NEVER be advanced further -- that&#39;s where the<br/>&gt; &gt;&quot;leftmost over nongreedy&quot; rule should apply instead...<br/>&gt; <br/>&gt; Please show us your implementation for a pattern matching engine<br/>&gt; that lets the current end-point vary. This is very exciting,<br/>&gt; because now you can relax the restriction that lookbehinds<br/>&gt; must be constant width.<br/><br/>I don&#39;t know if it can be implemented. I&#39;m just not assuming out of hand<br/>that it CANNOT. Just because we can&#39;t see how to implement it doesn&#39;t mean<br/>there isn&#39;t a technique we haven&#39;t thought of yet. I won&#39;t go so far as to<br/>say that &quot;I can&#39;t think of one&quot; until such time as I&#39;ve seriously TRIED to.<br/><br/>When I do try, maybe then I&#39;ll be convinced that the answer is beyond me,<br/>or that it doesn&#39;t exist. Until then, I consider it an interesting, yet<br/>unresolved question.<br/><br/>However, that wasn&#39;t the point of the above paragraph. I meant that you<br/>might have a match that the current engine would return, which starts at<br/>position x and ends at position y. Obviously, a less-greedy match is<br/>possible which starts at position x&#39;, where x &lt; x&#39; &lt;= y. 
It might be<br/>possible that such a match could also end at position y&#39; where y&#39; &lt; y.<br/>I don&#39;t have an example in mind, but I can&#39;t rule out the possibility.<br/><br/>In both cases above, it&#39;s the same thing -- I&#39;m reluctant to jump to the<br/>conclusion that something doesn&#39;t exist just because _I_ don&#39;t see it yet.<br/><br/>Deven<br/><br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg556.html Fri, 15 Dec 2000 12:57:58 +0000 Re: Perl 5's "non-greedy" matching can be TOO greedy! by Tom Christiansen &gt;We may have to &quot;agree to disagree&quot;. <br/><br/>I shan&#39;t be doing that.<br/><br/>&gt;I understand why people believe in<br/>&gt;the current semantics, but I&#39;ve seen no indication that anyone else<br/>&gt;understands why I believe in these alternative semantics, or has tried.<br/>&gt;(Disagreeing with my conclusion doesn&#39;t preclude understanding where I&#39;m<br/>&gt;coming from, but nobody seems to.)<br/><br/>You have not addressed the heat death of the universe as I and<br/>others have illustrated. Finding all possible matches is very often<br/>completely infeasible. Please solve the electron decay problem<br/>before continuing.<br/><br/>&gt;Well, obviously we could. Maybe we shouldn&#39;t, but we could do it. Many,<br/>&gt;many existing programs depended on Perl 4&#39;s magic behavior with @&#39;s in<br/>&gt;double-quoted strings, yet Perl 5 broke them all with a fatal error during<br/>&gt;the compile phase. People survived. They adapted and moved on. <br/><br/>Red herring.<br/><br/>&gt;Unlike<br/>&gt;that incompatibility, this one would probably affect few programs.<br/><br/>You&#39;re wrong. Incredibly wrong. <br/><br/>--tom<br/> http://www.nntp.perl.org/group/perl.perl6.language.regex/2000/12/msg555.html Fri, 15 Dec 2000 12:56:08 +0000
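Editor's illustration for the thread above: the behaviour under dispute can be reproduced in a few lines. This is a minimal sketch, not code posted to the list. It uses Python's `re` module as a stand-in for Perl 5's engine, on the assumption that for a pattern like /b.*?d/ both engines settle on the leftmost starting position before non-greediness is applied. The helper names `leftmost_match`, `two_pass_shortest` and `count_substrings` are hypothetical. `two_pass_shortest` is one possible reading of the two-pass idea Deven describes (rescan only inside the first match), and `count_substrings` evaluates Tom's (N+1)(N+2)/2 formula.

```python
import re


def leftmost_match(pattern, text):
    """Return what a standard backtracking engine reports: the match at the
    leftmost possible start, with quantifier greediness applied from there."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


def two_pass_shortest(pattern, text):
    """Hypothetical sketch of the two-pass idea from the thread.

    Pass 1: ordinary leftmost search.
    Pass 2: retry the pattern at every later start position inside the
    first match (never looking past its end) and keep the shortest result.
    This retries at most len(first match) times; it does not enumerate
    every possible substring.
    """
    first = re.search(pattern, text)
    if first is None:
        return None
    regex = re.compile(pattern)
    best = first.group(0)
    for start in range(first.start() + 1, first.end() + 1):
        # pos/endpos confine the attempt to the span of the first match.
        m = regex.match(text, start, first.end())
        if m and len(m.group(0)) < len(best):
            best = m.group(0)
    return best


def count_substrings(n):
    """Number of (start, end) pairs in a string of length n, counting the
    empty match at each position: (n+1)(n+2)/2."""
    return (n + 1) * (n + 2) // 2


if __name__ == "__main__":
    print(leftmost_match(r"b.*?d", "bbbbccccd"))
    print(two_pass_shortest(r"b.*?d", "bbbbccccd"))
    print(count_substrings(80), count_substrings(80) ** 2)
```

On the string "bbbbccccd" the first helper returns the whole string and the second returns "bccccd", which are the two answers the participants are arguing over. The sketch relies on the non-greedy attempt from each start position already being the shortest match from that start, which holds for this simple pattern but is not guaranteed for arbitrary ones.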