
Re: Conditional handling in HTML::Parser

From: Gisle Aas
Date: July 10, 2001 11:17
Subject: Re: Conditional handling in HTML::Parser
Message ID: lrk81gejix.fsf@caliper.ActiveState.com

Brent Baccala <baccala@freesoft.org> writes:

> I've got a set of scripts that alter HTML content (expected to be in
> Spanish) by adding a link to every word that triggers a lookup in a
> Spanish/English dictionary.  I use HTML::Parser.
> 
> Anyway, I've come across some documents that don't parse right.  They
> appear to have been generated by Microsoft Office, and include tags like
> this:
> 
> <![if !supportEmptyParas]>&nbsp;<![endif]>
> 
> The "if" and "supportEmptyParas" end up getting flagged as text, even
> though I've called marked_sections(1).

This markup does not follow SGML marked section syntax, so I'm not
surprised that marked_sections(1) doesn't help.  As a marked section it
would have to be expressed something like:

  <![ %supportEmptyParas; [ &nbsp; ]]>

where the parameter entity %supportEmptyParas; expands to either
"IGNORE" or "INCLUDE".

I don't know SGML well enough to tell if this is something worth
supporting or if this stuff is valid SGML at all.  Does anybody else
know?

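A simple hack to avoid this stuff might be to run something like
s/<!\[(?:if|endif)\b[^\]>]*\]>//gi on the text before feeding it to
HTML::Parser.  That strips only the <![if ...]> and <![endif]> markers
and leaves the content between them in place.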
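
For illustration, the pre-filter could be sketched roughly as below.
This is only a sketch under assumptions: the helper name
strip_office_conditionals is made up for this example (it is not part
of HTML::Parser), and the pattern has only been checked against the
single sample tag quoted above, not against the full range of Office
output.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Parser: remove Microsoft Office
# "<![if ...]>" and "<![endif]>" markers, keeping whatever sits between them.
sub strip_office_conditionals {
    my ($html) = @_;
    # "<![" + "if" or "endif" + anything up to the closing "]>"
    $html =~ s/<!\[(?:if|endif)\b[^\]>]*\]>//gi;
    return $html;
}

my $sample = '<p><![if !supportEmptyParas]>&nbsp;<![endif]></p>';
my $clean  = strip_office_conditionals($sample);
print "$clean\n";    # prints <p>&nbsp;</p>

# The cleaned string would then be handed to the parser as usual,
# for example with $p->parse($clean).
```

Because the pattern requires "]>" right after the keyword part, it
should leave ordinary comment-style blocks such as
<!--[if gte mso 9]> ... <![endif]--> untouched; those are plain comments
as far as the parser is concerned.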

> Since I don't really know SGML, I'm not sure how this should be handled,
> or even if it can be handled without having the Microsoft schema (which
> I can't find) available to be parsed.  Anyway, I thought I'd let you
> know.  The URL of the original document is:
> 
> 	http://www.sgci.mec.es/uk/Pub/Tecla/2001/julio2b.htm
> 
> and the page for my scripts is:
> 
> 	http://vyger.freesoft.org/software/spanish
> 
> Thanks for your work with HTML::Parser, it's made this script fairly
> easy to write.

Good to hear!

Regards,
Gisle
