Re: perlfaq9.pod, "remove HTML from a string" example

From: Abigail
Date: April 11, 2007 05:46
Subject: Re: perlfaq9.pod, "remove HTML from a string" example
Message ID: 20070411124648.GA4231@abigail.nl

On Wed, Apr 11, 2007 at 01:49:08PM +0200, Dr.Ruud wrote:
> See `perldoc -q remove.HTML`:
> 
> <quote>
> Here's one "simple-minded" approach, that works
> for most files:
> 
>     #!/usr/bin/perl -p0777
>     s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
> 
> </quote>
> 
> That regex has a ** structure, which can make it very slow, for instance
> on an "incomplete" string like
> qq{> <META http-equiv=3DContent-Type content=3D"text/html; =\n>}.


Well, it wasn't called 'simple-minded' without reason. 

The advantage of the given substitution is that it's reasonably simple,
while still catching most HTML tags. That it isn't as fast as it could
be on files that do not contain correct HTML isn't something I consider
to be that important.
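
For illustration, here is a minimal sketch of the FAQ substitution applied
to a small string. The wrapper name strip_tags_simple and the sample input
are assumptions made for this example; only the regexp itself comes from
perlfaq9.

```perl
use strict;
use warnings;

# Hypothetical helper (not from perlfaq9): applies the FAQ's
# "simple-minded" substitution to a copy of the input.
sub strip_tags_simple {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# A '>' inside a quoted attribute value does not end the tag,
# because the second alternative consumes the whole quoted string.
print strip_tags_simple(qq{<a title="x > y" href='z'>link</a> text\n});
# prints "link text"
```

The quoted-string alternative is what lets the pattern cope with a '>'
inside an attribute value; it is also the part that can backtrack heavily
when a quote is never closed.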

> So maybe we should put a "safer" variant in the documentation?

Maybe, but I'd still keep the "simple-minded" regexp there.

> Yves suggested to improve it to something like:
> 
>   s/<
>     (?:
>       (?> [^>'"]+ )
>       |
>       (?> " (?> [^"]* ) " )
>       |
>       (?> ' (?> [^']* ) ' )
>     )*
>     >
>    //xgs

No /s needed: the pattern contains no dot.
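
To spell out why: /s only changes what the dot matches, and a negated
character class already matches newlines. The sketch below uses a made-up
sample tag to show the difference. Note that the original FAQ regexp does
use `.*?`, so it still wants /s.

```perl
use strict;
use warnings;

# Sample tag spanning several lines (made up for this example).
my $tag = qq{<img\n  alt="a\nb">};

# A negated character class matches newlines with or without /s.
my $class_no_s = $tag =~ /\A<[^>]*>\z/ ? 1 : 0;

# The dot matches a newline only under /s.
my $dot_no_s   = $tag =~ /\A<.*>\z/  ? 1 : 0;
my $dot_with_s = $tag =~ /\A<.*>\z/s ? 1 : 0;

print "class without /s: $class_no_s\n";   # 1
print "dot without /s:   $dot_no_s\n";     # 0
print "dot with /s:      $dot_with_s\n";   # 1
```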


Or even (5.10 style + loop-unrolling):

    s/< [^>"']++
        (?:
             (?:"[^"]*+")*+
             (?:'[^']*+')*+
             [^>'"]*+
        )*+
      >
     //xg;


The above regexp is untested, but doesn't contain any alternations,
which ought to make it faster.
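
A quick way to sanity-check the two variants against each other is
sketched below. The sub names and the sample strings are assumptions for
this example; the patterns are Yves's (without /s) and the possessive one,
with an empty replacement part. The two are not strictly equivalent: the
possessive version needs at least one non-quote character after the `<`,
so it will not match `<>`.

```perl
use strict;
use warnings;
use 5.010;   # possessive quantifiers need Perl 5.10 or later

# Yves's variant: atomic groups plus alternation.
sub strip_atomic {
    my ($s) = @_;
    $s =~ s/<
            (?:
              (?> [^>'"]+ )
              |
              (?> " (?> [^"]* ) " )
              |
              (?> ' (?> [^']* ) ' )
            )*
            >
           //xg;
    return $s;
}

# Possessive, loop-unrolled variant: no alternation.
sub strip_possessive {
    my ($s) = @_;
    $s =~ s/< [^>"']++
            (?:
                 (?:"[^"]*+")*+
                 (?:'[^']*+')*+
                 [^>'"]*+
            )*+
          >
         //xg;
    return $s;
}

my @samples = (
    qq{<b>bold</b>},
    qq{<a title="x > y" href='z'>link</a>},
    # The "incomplete" string from this thread: the quote never closes,
    # so neither variant should match, and neither should backtrack much.
    qq{> <META http-equiv=3DContent-Type content=3D"text/html; =\n>},
);

for my $s (@samples) {
    my ($x, $y) = (strip_atomic($s), strip_possessive($s));
    print $x eq $y ? "same\n" : "DIFFERENT\n";
}
```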

> I am currently using something more like:
> 
>   1 while
>       s~ < [/!]? \w+
>            (?: \s+
>               (?: \w+ = )?
>               (?: " [^"]* "
>                 | ' [^']* '
>                 | \w+
>               )
>            )*
>          >
>        ~~xs;

No /s needed here either.


None of these actually remove HTML comments correctly, do they? My guess
is that people would prefer correct comment handling over fast parsing
of incorrect HTML.
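
A small illustration of the comment problem, using the FAQ regexp. The
helper names and the sample document are made up for this example, and the
comment pre-pass is only a crude assumption (it treats every comment as a
plain `<!-- ... -->` pair, which real HTML does not guarantee); it is not
something proposed in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: the FAQ's "simple-minded" substitution.
sub strip_tags_simple {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Hypothetical helper: crude comment pre-pass, then the same substitution.
# Assumes every comment is a plain "<!-- ... -->" pair.
sub strip_comments_then_tags {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;
    return strip_tags_simple($html);
}

my $doc = q{<!-- if a > b --><p>text</p>};

# The first '>' inside the comment is taken as the end of a tag,
# so the tail of the comment leaks into the output.
print strip_tags_simple($doc), "\n";         # prints " b -->text"
print strip_comments_then_tags($doc), "\n";  # prints "text"
```

For anything beyond quick filtering, a real parser such as HTML::Parser is
the safer route.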


Abigail
