develooper Front page | perl.libwww | Postings from January 2002

Re: Fixing opening/closing tags.

Bill Moseley
January 6, 2002 09:24
At 05:41 PM 01/06/02 +0100, Reinier Post wrote:
>>    <b>This <em>is something -- really</em> -- awkward</b> without doubt
>> Ends up:
>>    <b>This <em>is something</em></b>
>>    <b><em>really</em></b>
>>    <b>awkward</b> without doubt
>This looks complex enough to merit an exact specification before you
>look for solutions.  As far as I can see you want to do two separate
>things:
> 1) all '--' within text content are replaced with "\n"
> 2a) (purely technical) all text content elements containing "\n" are split
>    such that the "\n" ends up in a separate text element that I'll call
>    "a newline element"
> 2b) all elements containing a newline element as child are split on the
>    newline element, pushing the newline element one level up, unless
>    such a split is invalid according to the DTD
>where 2b is applied until it no longer applies.

Yes, I think that's correct (you are using \n instead of a double dash as
the split point).

In simple terms, I'm taking a string that may contain some (correctly
balanced) markup, splitting it on /\s*--\s*/, and using each resulting part
as the text of a link.

my @tag_stack;   # tags still open at the end of the previous part
my @links;

for my $part ( split /\s*--\s*/, $orig_string ) {
    my $text = balance_tags( $part, \@tag_stack );
    my $href = build_href( $part );
    push @links, qq[<a href="$href">$text</a>];
}
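For reference, here is a rough sketch of what balance_tags() might look like. It is only an illustration under some assumptions: the input is correctly balanced overall, the tags carry no attributes (they would be lost when a tag is re-opened), and there are no empty elements such as <br>. The stack holds the names of the tags left open by the previous parts.

```perl
use strict;
use warnings;

# Sketch only: assumes balanced input, attribute-free tags, no <br>-style
# empty elements. $stack is a reference to the list of tag names still open.
sub balance_tags {
    my ( $part, $stack ) = @_;

    # Re-open the tags left open by earlier parts, outermost first.
    my $prefix = join '', map { "<$_>" } @$stack;

    # Track the tags opened and closed inside this part.
    while ( $part =~ m{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}g ) {
        my ( $closing, $tag ) = ( $1, lc $2 );
        if ($closing) {
            pop @$stack if @$stack && $stack->[-1] eq $tag;
        }
        else {
            push @$stack, $tag;
        }
    }

    # Close whatever is still open, innermost first.
    my $suffix = join '', map { "</$_>" } reverse @$stack;

    return $prefix . $part . $suffix;
}
```

Run over the three parts of the example string from the top of this message, this gives "<b>This <em>is something</em></b>", "<b><em>really</em></b>" and "<b>awkward</b> without doubt", and the stack is empty again at the end.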

>   <b>shouting: <em>hello\nworld</em></b>

>and after another 2b
>       +- "shouting: "
>       |
>    b -+
>       |
>       + em -+- "hello"
>    "\n"
>    b -+- em -+- "world"


So my next question is how to make those transformations with
HTML::TreeBuilder.  I don't see much problem writing that balance_tags()
sub above, but it would be nice to see how to do it with TreeBuilder.


Bill Moseley
