develooper Front page | perl.perl5.porters | Postings from April 2007

Re: perlfaq9.pod, "remove HTML from a string" example

From: Dr.Ruud
Date: April 11, 2007 07:19
Subject: Re: perlfaq9.pod, "remove HTML from a string" example
Message ID: 20070411141923.16182.qmail@lists.develooper.com

On Wed, Apr 11, 2007 at 01:49:08PM +0200, Dr.Ruud wrote:
> Abigail:

>> See `perldoc -q remove.HTML`:
>>
>> <quote>
>> Here's one "simple-minded" approach, that works
>> for most files:
>>
>>     #!/usr/bin/perl -p0777
>>     s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
>>
>> </quote>
>>
>> That regex has a ** structure, which can make it very slow, for
>> instance on an "incomplete" string like
>> qq{> <META http-equiv=3DContent-Type content=3D"text/html; =\n>}.
>
> Well, it wasn't called 'simple-minded' without reason.

Yes, but it could use a warning, like "Beware that incorrect HTML can
slow down this regex significantly (it has a double layer of *
quantifiers to try and please the backtracker).".


>> So maybe we should put a "safer" variant in the documentation?
>
> Maybe, but I'd still keep the "simple-minded" regexp there.

Fine.


> Or even (5.10 style + loop-unrolling):
>
>    s/< [^>"']++
>        (?:
>             (?:"[^"]*+")*+
>             (?:'[^']*+')*+
>             [^>'"]*+
>        )*+
>      >
>     //xg;

Good one for the FAQ.
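
To make the behaviour concrete, here is a minimal sketch that wraps the pattern quoted above in a small sub and runs it on two sample strings. The sub name strip_tags and both inputs are made up for illustration; the pattern itself is the one from the quote, with an empty replacement. It needs Perl 5.10 or later for the possessive quantifiers.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers (++, *+) arrived in Perl 5.10

# strip_tags is a hypothetical helper name, not something from the FAQ.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/< [^>"']++
               (?:
                   (?:"[^"]*+")*+    # double-quoted attribute values
                   (?:'[^']*+')*+    # single-quoted attribute values
                   [^>'"]*+          # anything else up to the next quote or >
               )*+
               >
              //xg;
    return $text;
}

# A '>' inside a quoted attribute value does not end the tag.
say strip_tags(q{<p class="a>b">Hello, <b>world</b>!</p>});   # Hello, world!

# Comments are not handled: the first '>' ends the "tag".
say strip_tags(q{<!-- a > b -->text});                        #  b -->text
```

The second call shows the comment problem raised further down: everything up to the first '>' is removed and the rest of the comment stays in the output.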


> None of these actually remove HTML comments correctly, do they?

Yes. :)
The FAQ also mentions Tom Christiansen's striphtml, but that has
problems too.


> My guess is that people would prefer that over fast parsing incorrect
> HTML.

That always depends on what it is used for. I use the coarse method to
quickly strip email messages, just enough to keep a rough idea of what
they are about. These messages can contain fragments of HTML, even quoted
by inserting '> ' at the start of the lines, etc.

-- 
Affijn, Ruud

"Gewoon is een tijger."

