develooper Front page | perl.perl5.porters | Postings from April 2007

Re: perlfaq9.pod, "remove HTML from a string" example

April 11, 2007 07:19
On Wed, Apr 11, 2007 at 01:49:08PM +0200, Dr.Ruud wrote:
> Abigail:

>> See `perldoc -q remove.HTML`:
>> <quote>
>> Here's one "simple-minded" approach, that works
>> for most files:
>>     #!/usr/bin/perl -p0777
>>     s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
>> </quote>
>> That regex has a ** structure, which can make it very slow, for
>> example on an "incomplete" string like
>> qq{> <META http-equiv=3DContent-Type content=3D"text/html; =\n>}.
> Well, it wasn't called 'simple-minded' without reason.

Yes, but it could use a warning, like "Beware that incorrect HTML can
slow down this regex significantly (it has a double layer of *
quantifiers to try and please the backtracker)."
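
To make the "double layer of *" concrete, here is the FAQ regex spread out with /x and commented. This is only a sketch: the sub name strip_simple is made up for illustration, and the pattern itself is unchanged from the FAQ apart from whitespace and comments. It is run here on a short, well-formed tag only.

```perl
use strict;
use warnings;

# Hypothetical helper name; the pattern is the FAQ regex, only laid out with /x.
sub strip_simple {
    my ($text) = @_;
    $text =~ s{
        <
        (?:                 # outer group, repeated by the outer *
            [^>'"]*         # inner *: a run of plain characters
          |
            (['"]) .*? \1   # or a quoted string, closed by the same quote
        )*
        >
    }{}gsx;
    return $text;
}

# A ">" inside a quoted attribute value does not end the tag early.
print strip_simple(q{<img alt="a > b" src='x.png'>text}), "\n";
```

When the closing ">" can never match, for instance because a quote is never closed, the engine may retry many ways of dividing the plain-character run between the inner and the outer *, which is where the slowdown comes from.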

>> So maybe we should put a "safer" variant in the documentation?
> Maybe, but I'd still keep the "simple-minded" regexp there.


> Or even (5.10 style + loop-unrolling):
>    s/< [^>"']++
>        (?:
>             (?:"[^"]*+")*+
>             (?:'[^']*+')*+
>             [^>'"]*+
>        )*+
>      >
>     /xg;

Good one for the FAQ.
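
For anyone who wants to try it, a minimal sketch that wraps the quoted regex in a sub (the name strip_tags is mine, not from the FAQ) and runs it on a well-formed tag and on the "incomplete" string from earlier in the thread. It assumes Perl 5.10 or later, since it relies on possessive quantifiers.

```perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers (++, *+) need Perl 5.10 or later

# Hypothetical helper name; the pattern is the unrolled variant quoted above.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/< [^>"']++
                 (?:
                     (?:"[^"]*+")*+
                     (?:'[^']*+')*+
                     [^>'"]*+
                 )*+
               >
              //xg;
    return $text;
}

# Well-formed input: the ">" inside the quoted attribute is skipped over.
print strip_tags(q{<a href="x>y">link</a> text}), "\n";

# The "incomplete" string: the quote is never closed, so nothing matches.
# The possessive quantifiers never give characters back, so the attempt
# fails quickly and the string comes back unchanged.
my $incomplete = qq{> <META http-equiv=3DContent-Type content=3D"text/html; =\n>};
print strip_tags($incomplete) eq $incomplete ? "unchanged\n" : "changed\n";
```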

> None of these actually remove HTML comments correctly, do they?

Yes. :)
The FAQ also mentions Tom Christiansen's striphtml, but that has
problems too.
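
A small illustration of the comment problem, using the FAQ regex as quoted earlier in the thread. The sample input is made up: a comment that contains a ">" is cut at that first ">", and the rest of the comment is left in the output.

```perl
use strict;
use warnings;

# Made-up sample: an HTML comment with a ">" inside it.
my $in = '<!-- a > b -->keep';

# The "simple-minded" FAQ regex, unchanged.
(my $out = $in) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

# Only "<!-- a >" is removed; " b -->" survives in front of "keep".
print "[", $out, "]\n";
```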

> My guess is that people would prefer that over fast parsing incorrect

That always depends on what it is used for. I use the coarse method to
quickly strip email messages, keeping just enough to get the idea of
what they are about. These messages can contain fragments of HTML, even
ones quoted by inserting '> ' at the start of the lines, etc.

Affijn, Ruud

"Gewoon is een tijger."
