
reencoding ampersands for html, longhand

From: Andy_Bach
Date: April 2, 2003 20:26
Subject: reencoding ampersands for html, longhand
Message ID: OF7AD50E81.E69205E4-ON86256CFD.0016B8EF@uscmail.dcn

We have a database where names and addresses have been 'polluted' by 
partial html escaping.  It's a mess, in that some entities are escaped 
(<, back ticks and some others), some are sometimes escaped and 
ampersands are sometimes escaped.  The chickens come home to roost in the 
java pdf generation process: a name like "M&M Electric" chokes the (yes, 
by hand) parser - it turns it into "M&M; Electric" and complains there's 
no &M; entity defined.  Bleah.

It uses htmldoc (http://www.easysw.com/htmldoc/index.html) to turn html 
into a pdf and as it's the java data extract and html generation that 
chokes, I thought I'd try to do it in perl.  That way I can still create 
the html and use htmldoc, which seems like a pretty slick package.

Anyway, I got all the named entities (the numbered ones aren't a problem), 
created a hash:
%html_entities = (
"quot" => 1,
"amp" => 1,
"lt" => 1,
"gt" => 1,
"nbsp" => 1,
... [ 200 more entities ]

and came up w/:
sub clean_html
{
  my $string = shift;
  my @ents = split(/&/, $string);
  my $ent;
  my $new_str = shift(@ents);  # for strings starting w/ an amp
  foreach $ent ( @ents ) {
   print "Ent: $ent\n" if $verbose > 3;
   if ( $ent =~ /^(\w{2,7});/  ) {
       print STDERR "got a possible ent, $1\n"
       if $verbose;
      my $val = lc($1);
      if ( $html_entities{$val} ) {
        $ent =~ s/^/\&/;         # valid, leave alone
      } else {
        print "Nope: $val\n"
          if $verbose > 3;
        $ent =~ s/^/\&amp;/;     # unknown name, escape the amp
      }     # if html_entity

    } elsif ( $ent =~ /^#\d{3};/ ) {
      $ent =~ s/^/\&/;          # valid, leave alone
    } else {
      $ent =~ s/^/\&amp;/;      # bare amp, escape it
    }
    $new_str .= $ent;
  }
  $new_str .= "&amp;" if $string =~ /&$/;   # ending amp
  return $new_str;

}    # sub clean_html

The idea was just to fix/replace the regular '&' w/ "&amp;", regular being 
any that don't mark a valid named/numbered entity.  I didn't necessarily 
want to golf it but it certainly isn't the most wieldy sort of mess.  I 
tried CGI (un)escapeHTML but due to the mix of valid and invalid, as in:
L&L Electric&quot;s Work
well, they didn't work.
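
For comparison, the same idea can probably be done in one substitution 
instead of the split/rejoin loop.  This is only a sketch, not the code above: 
escape_bare_amps is a made-up name, the hash holds just the five names quoted 
earlier rather than the full list, and it assumes numbered entities can have 
any number of digits, not only three.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated table: only the five names quoted in the post.
my %html_entities = map { $_ => 1 } qw(quot amp lt gt nbsp);

# Escape each '&' unless it starts a numbered entity (&#169;) or a
# named entity found in %html_entities (names are lowercased first,
# as in clean_html).
sub escape_bare_amps {
    my ($string) = @_;
    $string =~ s{&(?:(#\d+;)|(\w{2,7});)?}{
        my ($num, $name) = ($1, $2);
        if ( defined $num ) {
            "&$num";                       # numbered entity, leave alone
        } elsif ( defined $name ) {
            # known name stays as is, unknown name gets its amp escaped
            ( $html_entities{ lc $name } ? '&' : '&amp;' ) . "$name;";
        } else {
            '&amp;';                       # bare amp
        }
    }ge;
    return $string;
}

print escape_bare_amps('L&L Electric&quot;s Work'), "\n";
```

Since each '&' is handled where it is found, leading and trailing ampersands 
need no special cases.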

a

Andy Bach, Sys. Mangler
Internet: andy_bach@wiwb.uscourts.gov 
VOICE: (608) 261-5738  FAX 264-5030

"Believe nothing, no matter where you read it or who has said it, not even 
if I have said it, unless it agrees with your own reason and your own 
common sense." Buddha
