perl.libwww | Postings from January 2002

Re: Fixing opening/closing tags.

From: Bill Moseley
Date: January 10, 2002 11:54
Subject: Re: Fixing opening/closing tags.
Message ID: 3.0.3.32.20020110115358.02604b34@pop3.hank.org

At 10:55 PM 01/08/02 -0700, Sean M. Burke wrote:
>>But that seems way too awkward.
>
>It sounds like a fine way to do this.  What you have in mind is an
>inherently complex task, and this is the most straightforward
>implementation of it.

I thought I had missed your point, since you mentioned checking "parentage", which I'm not really doing.  Maybe we solved it in different ways.

You also recommended using $h->objectify_text, which helps, although I suspect I could just use look_down() to find the text content items directly and avoid the conversions.  I didn't have time to benchmark the different methods.
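
For anyone who hasn't used it, here is a minimal sketch of the objectify_text round trip, assuming HTML::TreeBuilder is installed.  The sample markup and the variable names are made up for illustration; this is not taken from the script further down.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<em>Three <b>sections</b></em>');
$tree->eof;

my $body = $tree->look_down( '_tag', 'body' );

# Wrap every text segment in a '~text' pseudo-element so that
# look_down() can find it like any other node.
$body->objectify_text;

# Each pseudo-element keeps its string in the 'text' attribute.
my @texts = map { $_->attr('text') } $body->look_down( '_tag', '~text' );
print join( '|', @texts ), "\n";

# Turn the pseudo-elements back into plain strings before output.
$body->deobjectify_text;
my $plain = $body->as_text;
print "$plain\n";

$tree->delete;
```

The point of the conversion is that, while the tree is objectified, each piece of text can be read or changed with attr('text', ...) without touching the surrounding markup.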

I'll attach what I ended up with (in the hope that people can point out better ways to do things ;).  For example, I can end up with empty content in tags (<b></b>), so I prune them like this, though I'm not sure it's the best way.

    $_->delete for $seg->look_down( sub
        {
            my @content = shift->content_list;
            # A single empty string; compare with eq so that "0" is kept.
            return @content == 1 && !ref( $content[0] ) && $content[0] eq '';
        }
    );
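
One possible variation, offered only as a sketch: repeat the pruning until nothing matches, so that a wrapper left holding only empty strings after its child is deleted goes too.  The helper names is_blank and prune_blank and the sample markup are hypothetical, and it assumes HTML::TreeBuilder is installed.  It deliberately leaves alone any element that has no content at all, so tags like <br> survive.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: true if the element holds at least one item and
# every item is an empty string.  Elements with no content at all
# (<br>, <hr>, ...) are left alone, and so is text such as "0".
sub is_blank {
    my @content = shift->content_list;
    return 0 unless @content;
    return !grep { ref $_ || length $_ } @content;
}

# Hypothetical helper: delete blank elements below $root, repeating
# because a deletion can leave the parent holding only empty strings.
# The root itself (which has no parent) is never deleted.
sub prune_blank {
    my $root = shift;
    while ( my @blank = grep { defined $_->parent } $root->look_down( \&is_blank ) ) {
        $_->delete for @blank;
    }
    return $root;
}

my $tree = HTML::TreeBuilder->new;
$tree->parse('<p><em>gone <b>too</b></em> <i>kept</i></p>');
$tree->eof;
my $body = $tree->look_down( '_tag', 'body' );

# Blank out every text node inside <em>, much as the script does for
# segments it is not interested in.
$body->objectify_text;
for my $el ( $body->look_down( '_tag', '~text' ) ) {
    $el->attr( 'text', '' ) if defined $el->look_up( '_tag', 'em' );
}
$body->deobjectify_text;

prune_blank($body);

# List the tags that are left, in document order.
my $remaining = join ',', map { $_->tag } $body->look_down( sub { 1 } );
print "$remaining\n";

$tree->delete;
```

The first pass removes the <b>, the second removes the <em> that is then left with only an empty string, and the third finds nothing and stops.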

To review, what I'm trying to do is take a bit of HTML:

   <em>Three <b>sections -- are</b> <b>shown -- here</b></em>

and split it on the " -- " while keeping the markup applied correctly.  The result looks like this:

$VAR1 = [
          {
            'text' => 'Three sections -- ',
            'html' => '<em>Three <b>sections</b></em>'
          },
          {
            'text' => 'Three sections -- are shown -- ',
            'html' => '<em><b>are</b> <b>shown</b></em>'
          },
          {
            'text' => 'Three sections -- are shown -- here',
            'html' => '<em><b>here</b></em>'
          }
        ];
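
The way the 'text' values accumulate can be sketched on a plain string with no markup at all.  This uses only core Perl; the helper name cumulative_text is made up for illustration and is not part of the script below.

```perl
use strict;
use warnings;

use constant MATCH_STRING => ' -- ';

# Hypothetical helper: return the cumulative 'text' value for each
# segment of a plain string (no markup involved).
sub cumulative_text {
    my $string   = shift;
    my $match_re = quotemeta( MATCH_STRING );

    # The capturing parentheses make split() return the separators too.
    my @tokens = split /($match_re)/, $string;

    my @texts;
    my $so_far = '';
    for my $token (@tokens) {
        $so_far .= $token;
        # A separator closes the current segment.
        push @texts, $so_far if $token eq MATCH_STRING;
    }
    push @texts, $so_far;    # the last segment has no trailing separator
    return @texts;
}

my @texts = cumulative_text('Three sections -- are shown -- here');
print "'$_'\n" for @texts;
```

Each entry is the previous one plus the next segment, which is why every 'text' except the last ends with the separator.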

Think of breadcrumbs: I'll use the 'text' to build HREFs for searching, with each successive search more specific than the last.
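
A rough sketch of that last step, using only core Perl.  The '/search?query=' URL, the helper names uri_escape_simple and breadcrumb_links, and the hand-rolled escaping (which assumes ASCII-range input) are all assumptions for illustration; a real version would more likely use a module such as URI::Escape.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode everything except unreserved
# characters (assumes byte-oriented, ASCII-range input).
sub uri_escape_simple {
    my $str = shift;
    $str =~ s/([^A-Za-z0-9\-_.~])/sprintf '%%%02X', ord $1/ge;
    return $str;
}

# Hypothetical helper: wrap each segment's HTML in a link that searches
# for that segment's cumulative text.  The '/search?query=' URL is made up.
sub breadcrumb_links {
    my @segments = @_;
    return map {
        sprintf '<a href="/search?query=%s">%s</a>',
            uri_escape_simple( $_->{text} ), $_->{html};
    } @segments;
}

my @links = breadcrumb_links(
    { text => 'Three sections -- ', html => '<em>Three <b>sections</b></em>' },
);
print "$_\n" for @links;
```

The 'html' value is used as the visible link and the 'text' value as the query, so the markup never ends up inside the URL.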


#!/usr/local/bin/perl -w
use strict;

use HTML::TreeBuilder;
use constant MATCH_STRING => ' -- ';


use Data::Dumper;
while ( <DATA> ) {
    chomp;
    next if /^#/;
    print "\n-------------------------\n\n ** $_ **\n";
    
    my $tree = HTML::TreeBuilder->new;
    $tree->parse( $_ );
    $tree->eof;

    my @new_str = split_html( $tree );

    $tree->delete;

    print Dumper \@new_str;
}    


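# Return one { text => ..., html => ... } hash per MATCH_STRING-delimited
# segment found in the body of $tree.
sub split_html {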
    my $tree = shift;

    my @segments;
    my $show_segment = 0;  # Current segment to show

    while ( 1 ) {
        # Start with fresh copy
        my $seg = $tree->look_down('_tag', 'body')->clone;

        my $cur_segment = 0;
        
        my $combined_text  = '';

        # Traverse text 
        $seg->objectify_text;

        for my $el ( $seg->look_down('_tag', '~text' ) ) {


            # Blank out segments past what we are interested in.
            if ( $cur_segment > $show_segment ) {
                $el->attr('text', '');
                next;
            }

            my $match_string = '';

            my $match_re = quotemeta( MATCH_STRING );

            for my $token ( split /($match_re)/, $el->attr('text') ) {
                if ( $token eq MATCH_STRING ) {
                    $cur_segment++;
                    $combined_text .= MATCH_STRING;
                    next;
                }

                last if $cur_segment > $show_segment;
                $combined_text .= $token;
                next unless $cur_segment == $show_segment;
                $match_string = $token;
            }

            $el->attr('text', $match_string );
        }


        $seg->deobjectify_text;

        # Prune empty content.

        $_->delete for $seg->look_down( sub
            {
                my @content = shift->content_list;
                # A single empty string; compare with eq so that "0" is kept.
                return @content == 1 && !ref( $content[0] ) && $content[0] eq '';
            }
        );


        
        # Save the extracted HTML segment, and stored text
        push @segments, {
            text => $combined_text,
            html => join '',
                        map { tr/\n//sd; $_ }  # Can I turn this off for as_HTML?
                            map { ref($_) ? $_->as_HTML : $_ } $seg->content_list,
            };
            
        $seg->delete;

        last unless $cur_segment > $show_segment;
        $show_segment++;
    }
    return @segments;
}

__DATA__
single phrase
<b>single phrase</b>
two sections -- are here
three sections -- are shown -- here
<b>two sections -- are here</b>
<b>three sections -- are shown -- here</b>
<b>two sections</b> -- are here
<b>three sections -- are</b> <b>shown -- here</b>
<em>Three <b>sections -- are</b> <b>shown -- here</b></em>

-- 
Bill Moseley
mailto:moseley@hank.org
