develooper Front page | perl.libwww | Postings from January 2002

Re: Fixing opening/closing tags.

From: Reinier Post
Date: January 6, 2002 08:42
Subject: Re: Fixing opening/closing tags.
Message ID: 20020106174158.D14160@win.tue.nl

On Sat, Jan 05, 2002 at 07:31:07AM -0800, Bill Moseley wrote:

[...]
 
> I've got a string of HTML that I need to split up into separate parts (they
> will end up in links), yet I want the formatting to get fixed up.  For
> example:
> 
> Starting text:
> 
> <tag>This is a -- bunch</tag> of words 
>     <tag>where maybe -- some have</tag> tags. 
> 
> Splitting on the double dash:
> 
>   <tag>This is a
>   bunch</tag> of words <tag>where maybe
>   some have</tag> tags.
> 
> Which should then be corrected to:
> 
>   <tag>This is a</tag>
>   <tag>bunch</tag> of words <tag>where maybe</tag>
>   <tag>some have</tag> tags.
> 
> Now, it's a bit more tricky than that, since you can't just look at the
> individual split strings.  That is, if the starting string is:
> 
>   <tag>first section -- second section -- third section</tag>
> 
> The entire string is within the tag, so when splitting you want the results
> to be:
> 
>   <tag>first section</tag>
>   <tag>second section</tag>
>   <tag>third section</tag>
> 
> So you need to keep track of the current "state" of the tag across all
> split text.  
> 
> One problem, of course, is I don't really know what the actual <tag> will
> be, and there might be a mixture of tags:
> 
>    <b>This <em>is something -- really</em> -- awkward</b> without doubt
> 
> Ends up:
> 
>    <b>This <em>is something</em></b>
>    <b><em>really</em></b>
>    <b>awkward</b> without doubt

This looks complex enough to merit an exact specification before you
look for solutions.  As far as I can see you want to do two separate
transformations:

 1) all '--' within text content are replaced with "\n"

 2a) (purely technical) all text content elements containing "\n" are split
    such that the "\n" ends up in a separate text element that I'll call
    "a newline element"
 2b) all elements containing a newline element as child are split on the
    newline element, pushing the newline element one level up, unless
    such a split is invalid according to the DTD

where 2b is applied until it no longer applies.

Example: 

   <b>shouting: <em>hello\nworld</em></b>

is
       +- "shouting: "
       |
    b -+
       |
       +- em -+- "hello\nworld"

which after 2a becomes

       +- "shouting: "
       |
    b -+
       |      +- "hello"
       |      |
       +- em -+- "\n"
              |
              +- "world"

then after 2b

       +- "shouting: "
       |
    b -+
       |
       +- em -+- "hello"
       |
       +- "\n"
       |
       +- em -+- "world"

and after another 2b

       +- "shouting: "
       |
    b -+
       |
       +- em -+- "hello"

    "\n"

    b -+- em -+- "world"

and assuming the <b> is itself within a <body>, it would stop here because
while having two <b>s is valid, having two <body>s isn't.
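
The stages above can be reproduced with a minimal sketch. It is written in
Python purely for illustration and does not use HTML::TreeBuilder. The
tuple-based node shape, the helper names (el, split_text, lift, render) and
the SPLITTABLE set standing in for the DTD check are all assumptions of this
sketch, not part of any library.

```python
import re

NEWLINE = "\n"
# Assumption: a hand-picked set that stands in for "splitting is valid per the DTD".
SPLITTABLE = {"b", "em", "i", "strong", "span"}

def el(tag, *children, **attrs):
    # An element is (tag, attrs, children); a text node is a plain str.
    return (tag, attrs, list(children))

def split_text(text):
    # Step 1: replace each '--' (and the whitespace around it) with a newline.
    text = re.sub(r"\s*--\s*", NEWLINE, text)
    # Step 2a: isolate every newline as its own text node.
    return [part for part in re.split(r"(\n)", text) if part]

def lift(node):
    # Step 2b, applied bottom-up: return the nodes that replace `node`.
    if isinstance(node, str):
        return split_text(node)
    tag, attrs, children = node
    kids = [piece for child in children for piece in lift(child)]
    if tag not in SPLITTABLE or NEWLINE not in kids:
        return [(tag, attrs, kids)]  # newline elements stay inside this element
    out, group = [], []
    for kid in kids + [NEWLINE]:     # trailing sentinel flushes the last group
        if kid == NEWLINE:
            if group:
                out.append((tag, dict(attrs), group))  # copy the attributes
            out.append(NEWLINE)
            group = []
        else:
            group.append(kid)
    out.pop()                        # drop the sentinel again
    return out

def render(node):
    # Serialize without escaping; good enough for this illustration only.
    if isinstance(node, str):
        return node
    tag, attrs, children = node
    attr_text = "".join(f' {k}="{v}"' for k, v in attrs.items())
    return f"<{tag}{attr_text}>" + "".join(render(c) for c in children) + f"</{tag}>"

tree = el("body", el("b", "shouting: ", el("em", "hello -- world")))
result = render(lift(tree)[0])
print(result)
```

Because children are processed before their parent, each newline element
rises one level at a time until it reaches an element outside SPLITTABLE
(here <body>), which is the "apply 2b until it no longer applies" rule.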

Is this specification correct?

> All I can think of is to use HTML::Parser and tokenize, where each token is
> a structure that has the text, plus a list of tags currently in force.

A stack; yes, that would be the most efficient solution.  You just maintain the
element path from the last splittable tag (in this case, <b> not <body>) to the
current element, and write out the closing and opening tags at any "\n" or "--"
within a text content; of course you'd have to copy any element attributes.
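
A rough sketch of that stack approach follows. It uses Python's standard
html.parser as a stand-in for HTML::Parser, so none of the names below are
HTML::Parser API. The class DashSplitter, the function split_on_dashes and
the SPLITTABLE and VOID sets are assumptions made for the example. It
assumes reasonably well-formed input and drops comments and declarations.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumption: elements that may safely be closed and reopened at a split.
SPLITTABLE = {"b", "em", "i", "strong", "span", "tag"}
# Elements that never get an end tag, so they must not go on the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}

class DashSplitter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # open elements as (tag, raw start-tag text)
        self.out = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()  # verbatim tag, attributes included
        self.out.append(raw)
        if tag not in VOID:
            self.stack.append((tag, raw))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>; nothing to track

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        parts = re.split(r"\s*--\s*", data)
        self.out.append(escape(parts[0], quote=False))
        for part in parts[1:]:
            self._break()
            self.out.append(escape(part, quote=False))

    def _break(self):
        # Element path from the current element up to the last splittable one.
        path = []
        for item in reversed(self.stack):
            if item[0] not in SPLITTABLE:
                break
            path.append(item)
        self.out.extend(f"</{tag}>" for tag, _ in path)      # close innermost first
        self.out.append("\n")
        self.out.extend(raw for _, raw in reversed(path))    # reopen outermost first

def split_on_dashes(html_text):
    parser = DashSplitter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

print(split_on_dashes("<b>This <em>is something -- really</em> -- awkward</b> without doubt"))
```

Reusing the raw start-tag text is what copies the attributes on reopening.
A '--' directly after an opening tag leaves an empty element behind, which
a real implementation would probably want to suppress.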

> But I'm wondering if HTML::TreeBuilder might be able to rescue me.

I think it would be cleaner to use HTML::TreeBuilder and manipulate the tree,
and if you intend to do more difficult transformations that may require
lookahead, it's worthwhile.
 
> It doesn't seem like an uncommon problem, so I'm asking here for advice.
> 
> BTW -- This isn't a huge issue, but I was originally looking for a
> non-HTML::Parser solution because this will be running under mod_perl so
> I'd like to avoid bringing in the HTML::Parser module(s) to save memory.
> Minor issue, though.

This also depends on how clean your input HTML is.  Running it through
HTML::Parser/::TreeBuilder will enforce *their* ideas of what HTML should
look like.

You may also want to have a look at HTML::PrettyPrint for related code.

> Thanks,
> 
> -- 
> Bill Moseley
> mailto:moseley@hank.org

-- 
Reinier Post
TU Eindhoven
