Re: how to get two matches out
From:
Lawrence Statton
Date:
April 29, 2012 09:10
Subject:
Re: how to get two matches out
Message ID:
4F9D67F9.1050707@cluon.com
On 04/29/2012 10:41 AM, lina wrote:
> On Sun, Apr 29, 2012 at 11:26 PM, Lawrence Statton<lawrence@cluon.com> wrote:
>> On 04/29/2012 10:21 AM, lina wrote:
>>>
>>> Hi,
>>>
>>> I have a text file like:
>>>
>>> $ more sample.tex
>>>
>>> aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}
>>> bbb\cite{inhibitor}aaa
>>>
>>>
>>> sub read_tex{
>>> open my $fh, '<', @_;
>>> while(<$fh>){
>>> if(/cite\{(.+?)\}/){
>>> push @citeditems,split/,/,$1;
>>> }
>>> }
>>> close($fh);
>>> }
>>>
>>> It only extract the first \cite part out, failed to extract the e1,
>>> f1, f2, f3 and uncertain number of being cited item out.
>>>
>>> Can someone give me some suggest regarding how to upgrade the match part?
>>>
>>> Thanks with best regards,
>>>
>>
>> Your regexp only asks for a single \cite... match in each line.
>>
>> perldoc perlretut
>>
>> Search for "Global Matching" roughly half-way down the page.
>
> if($_ =~ m/cite\{(.+?)\}/g){
>
I have never, in a decade of using Perl every day, used $1 or backrefs.
My personal preference is to always do matching in a list context, e.g.
my @thing = $target =~ m/Foo: (\w+) bar: (\w+) baz: (\w+)/;
From perlretut, I quote:
In list context, "//g" returns a list of matched groupings,
or if there are no groupings, a list of matches to
the whole regexp. So if we wanted just the words, we could use
@word = ($x =~ /(\w+)/g); # matches,
# $word[0] = 'cat'
# $word[1] = 'dog'
# $word[2] = 'house'
(Note that the docs (at least on my copy of perl) have a typo ... it
says @words, not @word.)
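To make the list-context idea concrete, here is a small sketch. The sample strings and the variable names $foo, $bar and $baz are made up for illustration; only the patterns come from the text above.

```perl
use strict;
use warnings;

# Hypothetical sample string shaped to fit the pattern shown earlier.
my $target = 'Foo: one bar: two baz: three';

# Without /g, a match in list context returns the captured groups,
# so there is no need to touch $1, $2 or $3 afterwards.
my ($foo, $bar, $baz) = $target =~ m/Foo: (\w+) bar: (\w+) baz: (\w+)/;

# With /g, list context returns the capture from every match in turn.
my $x = 'cat dog house';
my @word = ($x =~ /(\w+)/g);

print "$foo $bar $baz\n";        # one two three
print join(',', @word), "\n";    # cat,dog,house
```

If the match fails, the list is empty, so the assignment can also be used directly as the condition of an if.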
So - let's take a quick pass at your problem, breaking it down into pieces.
First - let's get a list of the contents of each cite{...} into an array
per line.
while (my $line = <$fh>) {
    my @cite_for_line = $line =~ m/cite\{(.+)\}/g;
    print ">>$_<<\n" for @cite_for_line;
}
Which produces (incorrectly) ...
>>d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3<<
>>inhibitor<<
Hrm ... it looks like the regexp is matching "too much".
If we go back to perlretut, we'll find around line 746 (your page may
vary) the helpful paragraph:
For all of these quantifiers, Perl will try to match as much of
the string as possible, while still allowing the regexp to
succeed. Thus with "/a?.../", Perl will first try to match the
regexp with the "a" present; if that fails, Perl will try to
match the regexp without the "a" present. For the quantifier
"*", we get the following:
$x = "the cat in the hat";
$x =~ /^(.*)(cat)(.*)$/; # matches,
# $1 = 'the '
# $2 = 'cat'
# $3 = ' in the hat'
Which is what we might expect, the match finds the only "cat"
in the string and locks onto it. Consider, however, this
regexp:
$x =~ /^(.*)(at)(.*)$/; # matches,
# $1 = 'the cat in the h'
# $2 = 'at'
# $3 = '' (0 characters match)
So, we are getting "greedy" matches (search for greedy in perlretut for
more information on that.)
What we want in our regexp in YOUR case is a "non greedy" match, which
(cutting to the chase) looks like THIS
my @cite_for_line = $line =~ m/cite\{(.+?)\}/g;
The only difference is that there is now a "?" inside the match grouping,
which says "rather than pick the LARGEST segment that matches, pick the
SMALLEST".
Running this code gives the output:
>>d1,d2<<
>>e1<<
>>f1,f2,f3<<
>>inhibitor<<
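For a side-by-side look at the difference, the following sketch runs both patterns against the first sample line. The third pattern, which uses a negated character class instead of a non-greedy quantifier, is not from the discussion above; it is a common alternative and is shown only for comparison. The variable names are illustrative.

```perl
use strict;
use warnings;

# First sample line from the question (single quotes keep the backslashes).
my $line = 'aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}';

# Greedy: .+ runs to the LAST "}" on the line, so there is only one match.
my @greedy = $line =~ m/cite\{(.+)\}/g;

# Non-greedy: .+? stops at the FIRST "}" after each "cite{".
my @lazy = $line =~ m/cite\{(.+?)\}/g;

# Alternative (an assumption, not the approach above): match anything
# except "}", which cannot run past the closing brace either.
my @negated = $line =~ m/cite\{([^}]+)\}/g;

print scalar(@greedy),  " greedy match\n";     # 1
print scalar(@lazy),    " lazy matches\n";     # 3
print scalar(@negated), " negated matches\n";  # 3
```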
Great .. now, we can break up each of those elements using the split /,/
that you used before....
my @cite;
while (my $line = <$fh>) {
    my @cite_for_line = $line =~ m/cite\{(.+?)\}/g;
    push @cite, split /,/, $_ for @cite_for_line;
}
print "$_\n" for @cite;
Which gives the (one assumes correct) answer of:
d1
d2
e1
f1
f2
f3
inhibitor
If you want to be even more clever, you can remove the intermediate
temporary variable @cite_for_line by doing this:
while (my $line = <$fh>) {
    push @cite, split /,/, $_ for $line =~ m/cite\{(.+?)\}/g;
}
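Putting the pieces together, here is one way the whole thing might look as a runnable script. It is a sketch, not a definitive version: the helper name cited_keys is made up, and the sample data is read from an in-memory filehandle (a string opened by reference) so that it runs without sample.tex on disk.

```perl
use strict;
use warnings;

# Sample data from the original question. The quoted heredoc tag
# prevents interpolation, so the backslashes stay literal.
my $tex = <<'END_TEX';
aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}
bbb\cite{inhibitor}aaa
END_TEX

# Hypothetical helper: takes an open filehandle and returns every
# cited key, in the order the keys appear.
sub cited_keys {
    my ($fh) = @_;
    my @cite;
    while (my $line = <$fh>) {
        # Each non-greedy capture is one "{...}" body; split it on commas.
        push @cite, split /,/, $_ for $line =~ m/cite\{(.+?)\}/g;
    }
    return @cite;
}

# Open the string as if it were a file (assumption made for the demo).
open my $fh, '<', \$tex or die "Cannot open in-memory file: $!";
my @cite = cited_keys($fh);
close $fh;

print "$_\n" for @cite;
```

To read a real file instead, replace \$tex in the open call with the file name.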
--L