
[perl #132197] regexp fails on large string captures entire file instead

From:
James E Keenan via RT
Date:
October 2, 2017 01:01
Subject:
[perl #132197] regexp fails on large string captures entire file instead
Message ID:
rt-4.0.24-4496-1506906070-531.132197-15-0@perl.org
On Sun, 01 Oct 2017 22:17:24 GMT, wilyarti@gmail.com wrote:
> Message-Id: <5.26.1_17663_1506895572@pinebook>
> Subject: regexp fails on large string captures entire file instead
> Reply-To: wilyarti@gmail.com
> To: perlbug@perl.org
> From: wilyarti@gmail.com
> 
> 
> This is a bug report for perl from wilyarti@gmail.com,
> generated with the help of perlbug 1.40 running under perl 5.26.1.
> 
> 
> -----------------------------------------------------------------
> [Please describe your issue here]
> I wrote a script to match some text and copy it from large mhtml
> files. However it only works if I give it one file at a time.
> 

Processing one file at a time is almost always the best way to proceed.  What is your rationale for not doing so?

> Attempting to get it to process all 155MB of files causes the
> matching to fail and perl starts to match and print the entire file.
> 

Your post doesn't give any specific examples of the failure, so it's difficult to say what's happening.  However, trying to process 155MB of data in one fell swoop is almost certainly not the way to go.

> Fails if I pipe the files in via cat as well. I have 5.22 installed
> so I used perlbrew to install 5.26 to test the problem to see if it
> had been fixed before submitting the bug. The problem exist in
> 5.26 as well.
> 
> Here is the code that causes the glitch:
> 
> use Modern::Perl;
> use HTML::Strip;

Red flags right there.  The Perl 5 Porters mailing list/newsgroup, perlbug and rt.perl.org exist for reporting problems in the Perl 5 core distribution.  Modern::Perl and HTML::Strip are CPAN libraries that are not maintained by the Perl 5 Porters.  So if you're having a problem with a program in which you are 'use'-ing them, you first need to rule out the possibility that the problem comes from those libraries rather than from the Perl 5 core.

(Side note:  I don't see anything in your program that requires the use of Modern::Perl.  For the purpose of reporting a bug you're better off simply saying: 'use strict; use warnings;'.)

> binmode (STDOUT, ':utf8');
> binmode (STDIN, ':utf8');
> my $MYFILE;
> open ($MYFILE, ">>ccnawordlist.txt") or die "Error opening";
> 

You don't attempt to print to your output file until further down.  Hence, there is no point in opening the append filehandle at this point.
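
A minimal sketch of deferring the open until there is something to write.  The helper name append_to_wordlist and its arguments are hypothetical, not part of the original script; it uses the three-argument form of open with a lexical filehandle and reports $! on failure.

```perl
use strict;
use warnings;

# Hypothetical helper: append already-extracted text to a word list.
sub append_to_wordlist {
    my ($path, @chunks) = @_;
    return 0 unless @chunks;    # nothing to write, so never open the file

    open my $out, '>>:encoding(UTF-8)', $path
        or die "Cannot open $path for appending: $!";
    print {$out} @chunks;
    close $out
        or die "Cannot close $path: $!";

    return scalar @chunks;      # number of chunks written
}
```

It would be called only after the matches have been collected, for example as append_to_wordlist('ccnawordlist.txt', @cleaned), where @cleaned is an assumed array of stripped text.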

> my $strip = HTML::Strip->new();
> 

You don't make use of $strip until further down.  You should move this constructor down to just before you call a method on $strip.

> my $line;
> while (<<>>) {
> $line .= $_;
> }

So why are you trying to read 155MB of data into a single record?

> $line =~ s/=\R//g;

Why are you using the \R escape (which matches a generic line break rather than a character class)?
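
For reference, a small sketch of what that substitution does.  The sample string is invented for illustration; it assumes the input contains quoted-printable soft line breaks (an "=" at the end of a line), which is a guess about the MHTML files rather than something the report states.

```perl
use strict;
use warnings;

# Hypothetical sample: text broken by soft line breaks, once with CRLF
# and once with a bare LF.
my $text = "Key Te=\r\nrms You Sho=\nuld Know";

# \R matches CRLF as a single unit, as well as a lone LF or CR,
# so one pattern covers both line-ending styles.
(my $joined = $text) =~ s/=\R//g;

print "$joined\n";    # prints "Key Terms You Should Know"
```

If the files only ever use one line-ending style, a literal "=\n" or "=\r\n" in the pattern would do the same job.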

> my @matches = $line =~ /Key Terms You Should Know(.*?)Command
> References/msg or die "error";

If your pattern does not match, then the expression above, up through the '/msg' regex modifier, assigns an empty list to @matches and evaluates to Perl-false.  Because you follow that with 'or die "error";', your program will then die with nothing more informative than "error".
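
A short sketch of why a failed match reaches the 'or die'.  The input string is made up, and the pattern is written with a single space between "Command" and "References", which assumes the line break in the quoted code is only mail wrapping.

```perl
use strict;
use warnings;

my $line = "no markers here";    # hypothetical input that cannot match

# "or" binds more loosely than "=", so the original statement parses as
#   (my @matches = ($line =~ /.../msg)) or die "error";
# A list assignment in scalar context yields the number of elements on
# its right-hand side, which is 0 when the match returns nothing.
my $ok = ( my @matches =
    $line =~ /Key Terms You Should Know(.*?)Command References/msg );

print "captures: ", scalar(@matches), "\n";                  # captures: 0
print "assignment was ", ( $ok ? "true" : "false" ), "\n";   # assignment was false
```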

My hunch is that your reach exceeds your grasp.  Try your regex out on just one file at a time until it produces the matches you want.
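
One way to do that, as a rough sketch only: loop over the files named on the command line, slurp each one separately, and report which files fail to match instead of dying on the first failure.  The sub name extract_sections is hypothetical, and the single space in "Command References" is an assumption about the original pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: return every capture between the two markers
# found in one file's contents.
sub extract_sections {
    my ($text) = @_;
    $text =~ s/=\R//g;    # same clean-up step as the original script
    return $text =~ /Key Terms You Should Know(.*?)Command References/msg;
}

for my $file (@ARGV) {
    open my $in, '<', $file
        or die "Cannot open $file: $!";
    my $text = do { local $/; <$in> };    # slurp this one file only
    close $in;

    my @matches = extract_sections($text);
    if ( !@matches ) {
        warn "No match in $file\n";
        next;
    }
    printf "%s: %d section(s)\n", $file, scalar @matches;
}
```

Run against a single file first; once the count looks right for that file, pass the rest of the files on the same command line.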


> 
> for my $match  (@matches) {
> my $cleantext = $strip->parse( $match);
> $strip->eof;
> #$cleantext =~ s/=\R//g;
> print $MYFILE $cleantext;
> }
> 

I further recommend that you pose questions like this in a forum such as perlmonks.org or on a mailing list such as https://lists.perl.org/list/beginners.html.

Closing this ticket.  Thank you very much.

-- 
James E Keenan (jkeenan@cpan.org)

---
via perlbug:  queue: perl5 status: new
https://rt.perl.org/Ticket/Display.html?id=132197
