Re: Is this fun?
From: Keith C. Ivey
Date: July 15, 2003 07:23
Subject: Re: Is this fun?
Message ID: 3F13D626.3394.26BD2126@localhost
A. Pagaltzis <pagaltzis@gmx.de> wrote:
> More than those you mention - because it doesn't parse HTML,
> just looks for some string bits. It will blow up on
>
> <img alt="<a r g h>" ...>
True, but in the real world (or at least that part of it I
experience), you're more likely to run into something like
<img src=http://www.example.com/images/abcd.gif>
which will be handled by the regex but may cause a parser to
blow up (though some are more tolerant than others). It's sad
that such code exists, and it's sad that browsers tolerate it
without complaint, but we have to deal with it.
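
To make the trade-off concrete, here is a minimal sketch in Python (not the regex from earlier in this thread, which isn't reproduced here). The pattern IMG_SRC and the helper img_sources are hypothetical names invented for illustration. The pattern accepts a quoted or unquoted src value, and it also shows the failure A. Pagaltzis describes, because it stops at the first ">" it sees.

```python
import re

# Hypothetical pattern: captures an <img> tag's src value whether it is
# double-quoted, single-quoted, or not quoted at all.
IMG_SRC = re.compile(
    r"""<img\b[^>]*?\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)

def img_sources(html):
    """Return every src value found in <img> tags, quoted or not."""
    # findall yields one tuple per match; only one of the three groups is set.
    return [dq or sq or bare for dq, sq, bare in IMG_SRC.findall(html)]

# The unquoted attribute from the message above is picked up.
print(img_sources("<img src=http://www.example.com/images/abcd.gif>"))

# A ">" inside an earlier attribute value ends the scan too soon,
# so nothing is found here.
print(img_sources('<img alt="<a r g h>" src="x.gif">'))
```

Which of the two inputs matters more depends entirely on the pages you actually have to process.
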
Unfortunately, stuff on pages encountered in the wild often
isn't valid HTML -- in fact, that was the whole point of the
exercise here. Valid HTML would have had the closing tags
already. And the stuff being produced isn't valid HTML
either, since the tags may be misnested.
Sometimes parsing is overkill. If regexes are good enough for
Tim Bray, they're good enough for me:
| That leaves input data munging, which I do a lot of, and a
| lot of input data these days is XML. Now here's the dirty
| secret; most of it is machine-generated XML, and in most
| cases, I use the perl regexp engine to read and process
| it. I've even gone to the length of writing a prefilter to
| glue together tags that got split across multiple lines,
| just so I could do the regexp trick.
http://www.tbray.org/ongoing/When/200x/2003/03/16/XML-Prog
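
For illustration only, a prefilter of the kind described in that quote might look roughly like the Python sketch below. This is not Tim Bray's code. The function name glue_split_tags is made up, and it assumes machine-generated input in which "<" and ">" never appear inside attribute values or comments.

```python
def glue_split_tags(lines):
    """Yield lines, joining any tag that was split across line breaks."""
    pending = ""
    for line in lines:
        line = line.rstrip("\n")
        # Append a continuation line to the unfinished tag, separated by one space.
        pending = f"{pending} {line.strip()}" if pending else line
        # A tag is still open if the last "<" comes after the last ">".
        if pending.rfind("<") > pending.rfind(">"):
            continue
        yield pending
        pending = ""
    if pending:
        # Input ended inside a tag; pass the fragment through unchanged.
        yield pending

sample = ['<item id="1"\n', '  name="a">text</item>\n', "<b/>\n"]
for glued in glue_split_tags(sample):
    print(glued)
```

Once every tag sits on a single line, a line-at-a-time regex can treat each tag as a unit.
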
--
Keith C. Ivey <kcivey@cpcug.org>
Washington, DC