On Wed, Apr 11, 2007 at 01:49:08PM +0200, Dr.Ruud wrote:
> Abigail:
>> See `perldoc -q remove.HTML`:
>>
>> <quote>
>> Here's one "simple-minded" approach, that works
>> for most files:
>>
>>     #!/usr/bin/perl -p0777
>>     s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
>>
>> </quote>
>>
>> That regex has a ** structure, which can make it very slow, for
>> instance on an "incomplete" string like
>> qq{> <META http-equiv=3DContent-Type content=3D"text/html; =\n>}.
>
> Well, it wasn't called 'simple-minded' without reason.

Yes, but it could use a warning, like "Beware that incorrect HTML can
slow down this regex significantly (it has a double layer of *
quantifiers to try and please the backtracker)."

>> So maybe we should put a "safer" variant in the documentation?
>
> Maybe, but I'd still keep the "simple-minded" regexp there.

Fine.

> Or even (5.10 style + loop-unrolling):
>
>     s/< [^>"']++
>         (?:
>           (?:"[^"]*+")*+
>           (?:'[^']*+')*+
>           [^>'"]*+
>         )*+
>       >
>     //xg;

Good one for the FAQ.

> None of these actually remove HTML comments correctly, do they?

Yes. :) The FAQ also mentions Tom Christiansen's striphtml, but that has
problems too.

> My guess is that people would prefer that over fast parsing of
> incorrect HTML.

That always depends on what it is used for. I use the coarse method to
quickly strip email messages, just to barely keep the idea of what they
are about. These messages can contain fragments of HTML, even quoted by
inserting '> ' at the start of the lines, etc.

-- 
Affijn, Ruud

"Gewoon is een tijger."
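
For anyone who wants to see the difference in behaviour, here is a
minimal sketch (the sample strings and sub names are made up for
illustration; the possessive quantifiers need perl 5.10 or later) that
runs the FAQ pattern and the variant quoted above side by side:

    #!/usr/bin/perl
    # Minimal sketch: the FAQ pattern vs. the possessive 5.10-style variant.
    # Sample strings and sub names are illustrative only.
    use strict;
    use warnings;
    use 5.010;

    my $ok     = q{<p>Hello <a href="x.html">world</a></p>};
    my $broken = qq{> <META http-equiv=3DContent-Type content=3D"text/html; =\n>};

    # perlfaq9's "simple-minded" pattern: fine on well-formed HTML, but the
    # nested quantifiers backtrack heavily on malformed input.
    sub strip_faq {
        my $s = shift;
        $s =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
        return $s;
    }

    # The 5.10-style variant: possessive quantifiers never give back what
    # they matched, so on malformed input the match simply fails instead of
    # retrying every possible way to split up the attribute text.
    sub strip_possessive {
        my $s = shift;
        $s =~ s/< [^>"']++
                (?:
                  (?:"[^"]*+")*+
                  (?:'[^']*+')*+
                  [^>'"]*+
                )*+
                >//xgs;
        return $s;
    }

    say strip_faq($ok);             # Hello world
    say strip_possessive($ok);      # Hello world
    say strip_possessive($broken);  # unchanged: the unclosed quote makes the
                                    # match fail at once, with no backtracking

On well-formed input both subs give the same result. On the broken
fragment the possessive pattern fails to match immediately, while the
FAQ pattern only gives up after the backtracker has tried many ways to
carve up the attribute text, which is presumably what makes it slow on
longer malformed fragments.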