perl.libwww | Postings from January 2002

Fixing opening/closing tags.

Bill Moseley
January 5, 2002 07:31
(I wonder if this will display oddly in Outlook-type broken mail clients.)

I've got a string of HTML that I need to split up into separate parts (they
will end up in links), yet I want the formatting to get fixed up.  For
example:

Starting text:

  <tag>This is a -- bunch</tag> of words
  <tag>where maybe -- some have</tag> tags.

Splitting on the double dash:

  <tag>This is a
  bunch</tag> of words <tag>where maybe
  some have</tag> tags.

Which should then be corrected to:

  <tag>This is a</tag>
  <tag>bunch</tag> of words <tag>where maybe</tag>
  <tag>some have</tag> tags.

Now, it's a bit more tricky than that, since you can't just look at the
individual split strings.  That is, if the starting string is:

  <tag>first section -- second section -- third section</tag>

The entire string is within the tag, so when splitting you want the results
to be:

  <tag>first section</tag>
  <tag>second section</tag>
  <tag>third section</tag>

So you need to keep track of the current "state" of the tag across all
split text.  
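
A minimal sketch of that state tracking, written in Python for illustration
rather than Perl and using a regex instead of a real parser. The names
`TOKEN` and `split_balanced` are made up for this example. It assumes the
input holds only simple paired tags: no void elements such as <br>, no
comments, and no "--" or ">" inside attribute values.

```python
import re

# Hypothetical tokenizer: an opening tag, a closing tag, or a "--" separator
# together with the whitespace around it.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|\s*--\s*")


def split_balanced(html):
    open_tags = []  # (name, raw opening tag) for every tag currently in force
    parts = []
    current = ""
    pos = 0
    for m in TOKEN.finditer(html):
        current += html[pos:m.start()]  # plain text before this token
        pos = m.end()
        name = m.group(2)
        if name is None:
            # Separator: close whatever is open, innermost first, then
            # reopen the same tags at the start of the next part.
            current += "".join(f"</{n}>" for n, _ in reversed(open_tags))
            parts.append(current)
            current = "".join(raw for _, raw in open_tags)
        elif m.group(1):
            # Closing tag: pop it if it matches the innermost open tag.
            if open_tags and open_tags[-1][0] == name:
                open_tags.pop()
            current += m.group(0)
        else:
            # Opening tag: remember it so later parts can reopen it.
            open_tags.append((name, m.group(0)))
            current += m.group(0)
    parts.append(current + html[pos:])
    return parts
```

Because the stack of open tags survives from one part to the next, a tag
that wraps the whole string is reopened in every part, and the tag names
never need to be known in advance.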

One problem, of course, is I don't really know what the actual <tag> will
be, and there might be a mixture of tags:

   <b>This <em>is something -- really</em> -- awkward</b> without doubt

Ends up:

   <b>This <em>is something</em></b>
   <b><em>really</em></b>
   <b>awkward</b> without doubt

All I can think of is to use HTML::Parser and tokenize, where each token is
a structure that has the text, plus a list of tags currently in force.
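
The token idea can be sketched roughly as follows. This is an illustration
in Python, with the standard library's html.parser standing in for
HTML::Parser; it is not a tested Perl solution. The names
`TagStateTokenizer`, `render` and `split_parts` are invented for the
example, and it assumes only paired tags (a void element such as <br> would
stay on the stack forever, and tags that wrap no text are dropped).

```python
from html import escape
from html.parser import HTMLParser


class TagStateTokenizer(HTMLParser):
    """Collect (text, tags-in-force) tokens; each tag is (name, raw start tag)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, self.get_starttag_text()))

    def handle_endtag(self, tag):
        # Drop the nearest matching open tag and anything nested inside it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.tokens.append((data, tuple(self.stack)))


def render(tokens):
    """Emit one part, opening and closing tags only where the state changes."""
    out, open_tags = [], ()
    for text, tags in tokens:
        keep = 0  # length of the shared prefix of open_tags and tags
        while keep < min(len(open_tags), len(tags)) and open_tags[keep] == tags[keep]:
            keep += 1
        out.extend(f"</{name}>" for name, _ in reversed(open_tags[keep:]))
        out.extend(raw for _, raw in tags[keep:])
        out.append(escape(text, quote=False))  # text arrives unescaped
        open_tags = tags
    out.extend(f"</{name}>" for name, _ in reversed(open_tags))
    return "".join(out)


def split_parts(html_text, sep="--"):
    parser = TagStateTokenizer()
    parser.feed(html_text)
    parser.close()
    parts = [[]]
    for text, tags in parser.tokens:
        pieces = text.split(sep)
        for i, piece in enumerate(pieces):
            if i:  # every separator inside the text starts a new part
                parts.append([])
                piece = piece.lstrip()
            if i < len(pieces) - 1:
                piece = piece.rstrip()
            if piece:
                parts[-1].append((piece, tags))
    return [render(part) for part in parts]
```

Calling split_parts() on the mixed <b>/<em> string above yields three
parts, each with its own balanced tags. One side effect of comparing tags by
value: two adjacent elements with identical start tags are merged into one
when rendered.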

But I'm wondering if HTML::TreeBuilder might be able to rescue me.

It doesn't seem like an uncommon problem, so I'm asking here for advice.

BTW -- This isn't a huge issue, but I was originally looking for a
non-HTML::Parser solution because this will be running under mod_perl so
I'd like to avoid bringing in the HTML::Parser module(s) to save memory.
Minor issue, though.


Bill Moseley
