Re: how to get two matches out

From: Lawrence Statton
Date: April 29, 2012 09:10
Subject: Re: how to get two matches out
Message ID: 4F9D67F9.1050707@cluon.com

On 04/29/2012 10:41 AM, lina wrote:
> On Sun, Apr 29, 2012 at 11:26 PM, Lawrence Statton<lawrence@cluon.com>  wrote:
>> On 04/29/2012 10:21 AM, lina wrote:
>>>
>>> Hi,
>>>
>>> I have a text file like:
>>>
>>> $ more sample.tex
>>>
>>> aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}
>>> bbb\cite{inhibitor}aaa
>>>
>>>
>>> sub read_tex{
>>>         open my $fh, '<', @_;
>>>         while(<$fh>){
>>>                 if(/cite\{(.+?)\}/){
>>>                 push @citeditems,split/,/,$1;
>>>                 }
>>>         }
>>>         close($fh);
>>> }
>>>
>>> It only extract the first \cite part out, failed to extract the e1,
>>> f1, f2, f3 and uncertain number of being cited item out.
>>>
>>> Can someone give me some suggest regarding how to upgrade the match part?
>>>
>>> Thanks with best regards,
>>>
>>
>> Your regexp only asks for a single \cite... match in each line.
>>
>> perldoc perlretut
>>
>> Search for "Global Matching" roughly half-way down the page.
>
> 	if($_ =~ m/cite\{(.+?)\}/g){
>

I have never, in a decade of using Perl every day, used $1 or backrefs.

My personal preference is to always do matching in a list context, e.g.

    my @thing = $target =~ m/Foo: (\w+) bar: (\w+) baz: (\w+)/;
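
As a minimal sketch of that habit (the input string and the variable names $foo, $bar and $baz are made up for illustration), a list-context match hands back the captured groups in order, and an empty list when the match fails, so the result can be checked directly:

```perl
use strict;
use warnings;

# Hypothetical input string, invented for this illustration.
my $target = 'Foo: alpha bar: beta baz: gamma';

# In list context the match returns the captured groups in order,
# or an empty list when the match fails (which makes the "or die" fire).
my ($foo, $bar, $baz) = $target =~ m/Foo: (\w+) bar: (\w+) baz: (\w+)/
    or die "no match in '$target'\n";

print "foo=$foo bar=$bar baz=$baz\n";
```

No $1, $2 or $3 appear anywhere; the captures go straight into named variables.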

From perlretut, I quote:

        In list context, "//g" returns a list of matched groupings,
        or if there are no groupings, a list of matches to
        the whole regexp.  So if we wanted just the words, we could use

            @word = ($x =~ /(\w+)/g);  # matches,
                                        # $word[0] = 'cat'
                                        # $word[1] = 'dog'
                                        # $word[2] = 'house'

(Note that the docs (at least on my copy of perl) have a typo ... it 
says @words, not @word.)
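
A small sketch of what that paragraph describes. The string is chosen to agree with the comments in the excerpt; the second half, with two groups and the %pair hash, is my own illustrative addition and not from the docs:

```perl
use strict;
use warnings;

my $x = 'cat dog house';   # sample string matching the comments above

# One capture group plus //g: one list element per match.
my @word = ($x =~ /(\w+)/g);
print scalar(@word), " words: @word\n";

# Two capture groups plus //g: the groups of every match are
# flattened into a single list, which fits nicely into a hash.
my %pair = ('a=1 b=2' =~ /(\w)=(\d)/g);
print "$_ => $pair{$_}\n" for sort keys %pair;
```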

So - let's take a quick pass at your problem, breaking it down into pieces.

First - let's get a list of the contents of each cite{...} into an array 
per line.


  while (my $line = <$fh>) {
    my @cite_for_line = $line =~ m/cite\{(.+)\}/g;
    print ">>$_<<\n" for @cite_for_line;
  }

Which produces (incorrectly) ...

   >>d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3<<
   >>inhibitor<<

Hrm ... it looks like the regexp is matching "too much".

If we go back to perlretut, we'll find around line 746 (your page may 
vary) the helpful paragraph:

        For all of these quantifiers, Perl will try to match as much of
        the string as possible, while still allowing the regexp to
        succeed.  Thus with "/a?.../", Perl will first try to match the
        regexp with the "a" present; if that fails, Perl will try to
        match the regexp without the "a" present.  For the quantifier
        "*", we get the following:

            $x = "the cat in the hat";
            $x =~ /^(.*)(cat)(.*)$/; # matches,
                                     # $1 = 'the '
                                     # $2 = 'cat'
                                     # $3 = ' in the hat'

        Which is what we might expect, the match finds the only "cat"
        in the string and locks onto it.  Consider, however, this
        regexp:

            $x =~ /^(.*)(at)(.*)$/; # matches,
                                    # $1 = 'the cat in the h'
                                    # $2 = 'at'
                                    # $3 = ''   (0 characters match)


So, we are getting "greedy" matches (search for greedy in perlretut for 
more information on that.)

What we want in our regexp in YOUR case is a "non greedy" match, which 
(cutting to the chase) looks like THIS

     my @cite_for_line = $line =~ m/cite\{(.+?)\}/g;


The only difference is there is now a "?" after the "+" inside the match 
grouping, which says "rather than pick the LARGEST segment that matches, 
pick the SMALLEST".

Running this code gives the output

 >>d1,d2<<
 >>e1<<
 >>f1,f2,f3<<
 >>inhibitor<<
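
To see the two quantifiers side by side, here is a sketch that runs both against the first line of the sample file. The third pattern, using the negated character class [^}]+, is an alternative I am adding for comparison, not something from the thread: it stays greedy but simply cannot run past a closing brace.

```perl
use strict;
use warnings;

# First line of sample.tex; single quotes keep the backslashes literal.
my $line = 'aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}';

my @greedy  = $line =~ m/cite\{(.+)\}/g;      # runs on to the LAST "}"
my @lazy    = $line =~ m/cite\{(.+?)\}/g;     # stops at the FIRST "}"
my @negated = $line =~ m/cite\{([^}]+)\}/g;   # greedy, but "}" is excluded

print "greedy:  ", join(' | ', @greedy),  "\n";
print "lazy:    ", join(' | ', @lazy),    "\n";
print "negated: ", join(' | ', @negated), "\n";
```

On this input the lazy and negated versions should give the same three groups, while the greedy one gives a single over-long match.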


Great .. now, we can break up each of those elements using the split /,/ 
that you used before....

   my @cite;

   while (my $line = <$fh>) {
     my @cite_for_line = $line =~ m/cite\{(.+?)\}/g;
     push @cite, split(/,/, $_) for @cite_for_line;
   }

   print "$_\n" for @cite;


Which gives the (one assumes correct) answer of:

d1
d2
e1
f1
f2
f3
inhibitor


If you want to be even more clever, you can remove the intermediate 
temporary variable @cite_for_line by doing this:

   while (my $line = <$fh>) {
     push @cite, split(/,/, $_) for $line =~ m/cite\{(.+?)\}/g;
   }
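
For something you can paste and run, here is a self-contained sketch of the whole thing. Treat it as one way to do it under some assumptions: the sub name cited_keys is made up, the file contents are held in an in-memory string rather than read from sample.tex, and I use map instead of the statement-modifier "for" to flatten the pieces.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every key cited on the given file handle.
sub cited_keys {
    my ($fh) = @_;
    my @cite;
    while (my $line = <$fh>) {
        # //g in list context yields the contents of each \cite{...};
        # map then splits each of those on commas.
        push @cite, map { split /,/, $_ } $line =~ m/cite\{(.+?)\}/g;
    }
    return @cite;
}

# In-memory stand-in for sample.tex (contents copied from the question).
my $sample = <<'END_TEX';
aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}
bbb\cite{inhibitor}aaa
END_TEX

open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @cite = cited_keys($fh);
close $fh;

print "$_\n" for @cite;
```

To read the real file, open it by name instead (open my $fh, '<', 'sample.tex' or die ...) and pass the handle to the sub in the same way.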


--L




