perlfaq9.pod, "remove HTML from a string" example

From: Dr.Ruud
Date: April 11, 2007 05:01
Subject: perlfaq9.pod, "remove HTML from a string" example
Message ID: 20070411120103.29392.qmail@lists.develooper.com

See `perldoc -q remove.HTML`:

<quote>
Here's one "simple-minded" approach, that works
for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

</quote>

That regex has a ** structure (a starred group whose first alternative is
itself starred), which can make it very slow through heavy backtracking,
for instance on an "incomplete" string like
qq{> <META http-equiv=3DContent-Type content=3D"text/html; =\n>}.

So maybe we should put a "safer" variant in the documentation?


Yves suggested improving it to something like:

  s/<
    (?:
      (?> [^>'"]+ )
      |
      (?> " (?> [^"]* ) " )
      |
      (?> ' (?> [^']* ) ' )
    )*
    >
   //xgs
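
As a rough sketch of how that variant behaves, the snippet below wraps it in a helper and runs it on one well-formed string and on the "incomplete" string from above. The name strip_tags and the first sample string are made up for illustration and do not come from the thread. Because each alternative is an atomic group, a failed match at the unclosed quote is given up almost immediately instead of being retried over every way of splitting the preceding text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (the name is not from the thread): applies the
# atomic-group variant to a copy of its argument and returns the result.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<
        (?:
          (?> [^>'"]+ )            # unquoted text
          |
          (?> " (?> [^"]* ) " )    # double-quoted value
          |
          (?> ' (?> [^']* ) ' )    # single-quoted value
        )*
        >
       //xgs;
    return $text;
}

# Made-up sample with a '>' inside a quoted attribute value.
my $ok  = strip_tags(qq{<p class="a>b">Hi</p> there});

# The "incomplete" string from the message: the quote is never closed.
my $bad = qq{> <META http-equiv=3DContent-Type content=3D"text/html; =\n>};

print "$ok\n";                                              # Hi there
print strip_tags($bad) eq $bad ? "unchanged\n" : "changed\n";  # unchanged
```

Note that the incomplete tag is simply left in place: the pattern fails fast there, it does not try to repair or remove it.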


I am currently using something more like:

  1 while
      s~ < [/!]? \w+
           (?: \s+
              (?: \w+ = )?
              (?: " [^"]* "
                | ' [^']* '
                | \w+
              )
           )*
         >
       ~~xs;
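
A minimal sketch of that loop in use, assuming it is applied to a lexical variable rather than $_ (the wrapper name strip_tags_loop and both sample strings are invented for illustration). Without /g each pass removes only the first tag and the `1 while` repeats until nothing matches, so a tag that only appears after an earlier removal is caught on a later pass.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper (the name is not from the thread) around the loop.
sub strip_tags_loop {
    my ($text) = @_;
    1 while $text =~
      s~ < [/!]? \w+
           (?: \s+
              (?: \w+ = )?
              (?: " [^"]* "
                | ' [^']* '
                | \w+
              )
           )*
         >
       ~~xs;
    return $text;
}

# Quoted and bare attribute values, then a closing tag.
print strip_tags_loop(qq{<a href="x>y" target=_blank>link</a>}), "\n";  # link

# Removing <b> leaves <i>, which the next pass removes.
print strip_tags_loop(qq{a<<b>i>b}), "\n";                               # ab
```

With the same pattern and a single /g pass, the second sample would come out as `a<i>b`, since the scan continues after the removed `<b>` and never revisits the text before it.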

-- 
Affijn, Ruud

"Gewoon is een tijger."

