Re: Is this fun?

Keith C. Ivey
July 15, 2003 07:23
A. Pagaltzis wrote:

> More than those you mention - because it doesn't parse HTML,
> just looks for some string bits. It will blow up on
> <img alt="<a r g h>" ...>

True, but in the real world (or at least that part of it I 
experience), you're more likely to run into something like

   <img src=>

which will be handled by the regex but may cause a parser to 
blow up (though some parsers are more tolerant than others).  
It's sad that such code exists, and it's sad that browsers 
tolerate it without complaint, but we have to deal with it.
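
To make the two cases concrete, here is a minimal sketch in Python's `re` module. The regex under discussion in this thread isn't reproduced in this message, so the pattern `TAG` and the helper `find_tags` below are hypothetical stand-ins for "looks for some string bits", not the original code.

```python
import re

# Hypothetical stand-in for a naive tag matcher: everything from '<'
# up to the first '>' counts as one tag. No attribute parsing at all.
TAG = re.compile(r"<[^>]*>")

def find_tags(html):
    """Return every substring the naive pattern takes for a tag."""
    return TAG.findall(html)

# Sloppy markup with an empty attribute value is matched without fuss.
sloppy = find_tags('<p><img src=></p>')

# A '>' inside a quoted attribute value ends the match too early,
# so the tag is cut off in the middle of the alt text.
tricky = find_tags('<img alt="<a r g h>" src="x.png">')

print(sloppy)
print(tricky)
```

The first call yields the three tags intact; the second returns a single truncated match that stops at the `>` inside the `alt` value. Which failure matters more depends on the input you actually expect to see.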

Unfortunately, stuff on pages encountered in the wild often 
isn't valid HTML -- in fact that was the whole point of the 
exercise here.  Valid HTML would have had the closing tags 
already.  And the stuff being produced isn't valid HTML 
either, since the tags may be misnested.

Sometimes parsing is overkill.  If regexes are good enough for 
Tim Bray, they're good enough for me:

|   That leaves input data munging, which I do a lot of, and a
|   lot of input data these days is XML. Now here's the dirty
|   secret; most of it is machine-generated XML, and in most
|   cases, I use the perl regexp engine to read and process
|   it. I've even gone to the length of writing a prefilter to
|   glue together tags that got split across multiple lines,
|   just so I could do the regexp trick.
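
The prefilter described in that quote could look roughly like the sketch below. It is a guess at the general idea, written in Python, not Tim Bray's actual code; `glue_split_tags` and `SPLIT_TAG` are hypothetical names, and the sketch assumes no literal `<` or `>` appears inside attribute values, comments, or CDATA sections.

```python
import re

# Hypothetical pattern: a '<' through the next '>', possibly spanning
# several lines (a negated character class also matches newlines).
SPLIT_TAG = re.compile(r"<[^<>]*>")

def glue_split_tags(text):
    """Join tags that were wrapped across lines onto a single line.

    Line breaks in the text between tags are left untouched.
    """
    def join(match):
        # Collapse each line break, plus surrounding indentation,
        # into a single space inside the tag.
        return re.sub(r"\s*\n\s*", " ", match.group(0))
    return SPLIT_TAG.sub(join, text)

wrapped = '<item\n   id="1"\n   name="x">text\nmore</item>'
print(glue_split_tags(wrapped))
```

After this pass every tag sits on one line, so a line-at-a-time regex can process the data without worrying about tags that straddle a line break.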

Keith C. Ivey
Washington, DC
