On Wed, Apr 11, 2007 at 01:49:08PM +0200, Dr.Ruud wrote:
> See `perldoc -q remove.HTML`:
>
> <quote>
>     Here's one "simple-minded" approach, that works
>     for most files:
>
>         #!/usr/bin/perl -p0777
>         s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
>
> </quote>
>
> That regex has a ** structure, which can make it very slow, for instance
> on an "incomplete" string like
> qq{> <META http-equiv=3DContent-Type content=3D"text/html; =\n>}.

Well, it wasn't called 'simple-minded' without reason. The advantage of
the given substitution is that it's reasonably simple, while still
catching most of the HTML tags. That it isn't as fast as it can be on
files that do not contain correct HTML isn't something I consider to be
that important.

> So maybe we should put a "safer" variant in the documentation?

Maybe, but I'd still keep the "simple-minded" regexp there.

> Yves suggested to improve it to something like:
>
>   s/<
>     (?:
>        (?> [^>'"]+ )
>      |
>        (?> " (?> [^"]* ) " )
>      |
>        (?> ' (?> [^']* ) ' )
>     )*
>   >
>   //xgs

No /s needed.

Or even (5.10 style + loop-unrolling):

    s/<
       [^>"']++
       (?:
          (?:"[^"]*+")*+
          (?:'[^']*+')*+
          [^>'"]*+
       )*+
      >
    //xg;

The above regexp is untested, but doesn't contain any alternations,
which ought to make it faster.

> I am currently using something more like:
>
>   1 while
>    s~ < [/!]? \w+
>        (?: \s+
>            (?: \w+ = )?
>            (?: " [^"]* "
>              | ' [^']* '
>              | \w+
>            )
>        )*
>       >
>    ~~xs;

No /s needed here either.

None of these actually remove HTML comments correctly, do they? My
guess is that people would prefer that over fast parsing of incorrect
HTML.


Abigail
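
P.S.: A quick way to check both claims above (the comment text that
survives, and the slowdown on a broken tag) is a throwaway script along
the lines below. It assumes perl 5.10 or later for the possessive
quantifiers, and the sample strings are made up for illustration, so
treat it as a sketch rather than a reference.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use 5.010;

    # The "simple-minded" regexp from perldoc -q "remove HTML":
    my $simple = qr/<(?:[^>'"]*|(['"]).*?\1)*>/s;

    # 1) Comments: a comment may legally contain '>', so a tag-oriented
    #    regexp stops stripping too early and part of the comment survives.
    my $html = q{before <!-- a comment with a > inside --> after};
    (my $stripped = $html) =~ s/$simple//g;
    say $stripped;    # prints: before  inside --> after

    # 2) Backtracking: an unterminated quote inside an unterminated tag
    #    lets the nested stars try a huge number of ways to split the
    #    prefix before giving up.
    my $broken = '<META http-equiv=Content-Type content="text/html; '
               . ('x' x 30);
    # $broken =~ $simple;    # uncomment to watch it crawl
    $broken =~ /< [^>"']++ (?: (?:"[^"]*+")*+ (?:'[^']*+')*+ [^>'"]*+ )*+ >/x
        or say "the possessive variant fails fast on the broken tag";

Getting comments stripped as well would need a separate pass first,
something like s/<!--.*?-->//gs, ignoring the SGML subtleties around
"--" inside comments.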