develooper Front page | perl.fwp | Postings from April 2003

reencoding ampersands for html, longhand

April 2, 2003 20:26
We have a database where names and addresses have been 'polluted' by 
partial html escaping.  It's a mess, in that some entities are escaped 
(<, back ticks and some others), some are sometimes escaped, and 
ampersands are sometimes escaped.  The chickens come home to roost in the 
java pdf generation process: a name like "M&M Electric" chokes the (yes, 
by hand) parser - it turns it into "M&M; Electric" and complains there's 
no &M; entity defined.  Bleah.

It uses htmldoc to turn html into a pdf, and as it's the java data 
extract and html generation that chokes, I thought I'd try to do it in 
perl.  That way I can still create the html and use htmldoc, which seems 
like a pretty slick package.

Anyway, I got all the named entities (the numbered ones aren't a problem), 
created a hash:
%html_entities = (
  "quot" => 1,
  "amp" => 1,
  "lt" => 1,
  "gt" => 1,
  "nbsp" => 1,
  # ... [ 200 more entities ]
);

and came up w/:
sub clean_html {
  my $string = shift;
  my @ents = split(/&/, $string);
  my $ent;
  my $new_str = shift(@ents);  # text before the first amp (empty for strings starting w/ an amp)
  foreach $ent ( @ents ) {
    print "Ent: $ent\n" if $verbose > 3;
    if ( $ent =~ /^(\w{2,7});/ ) {
      print STDERR "got a possible ent, $1\n"
        if $verbose;
      my $val = lc($1);
      if ( $html_entities{$val} ) {
        $ent =~ s/^/\&/;         # valid, leave alone (put the amp back)
      } else {
        print "Nope: $val\n"
          if $verbose > 3;
        $ent =~ s/^/\&amp;/;     # not a known entity, escape the amp
      }     # if html_entity

    } elsif ( $ent =~ /^#\d{3};/ ) {   # test $ent, not $_
      $ent =~ s/^/\&/;          # valid, leave alone
    } else {
      $ent =~ s/^/\&amp;/;      # bare amp, escape it
    }
    $new_str .= $ent;
  }     # foreach
  $new_str .= "&amp;" if $string =~ /&$/;   # ending amp (split drops it)
  return $new_str;

}    # sub clean_html

The idea was just to fix/replace the regular '&' w/ "&amp;", regular being 
any that don't mark a valid named/numbered entity.  I didn't necessarily 
want to golf it but it certainly isn't the most wieldy sort of mess.  I 
tried CGI (un)escapeHTML but due to the mix of valid and invalid, as in:
L&L Electric&quot;s Work
well, they didn't work.
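
For comparison, here is a minimal sketch of the same idea done as one substitution instead of split-and-rejoin. It is only an illustration under some assumptions: the names clean_amps and is_known_entity are made up for this example, the five-entry table stands in for the full list of 200-odd names, and it accepts numeric references of any length (decimal or hex) rather than exactly three digits. Like the sub above, it lowercases the name before the lookup.

```perl
use strict;
use warnings;

# Stand-in for the full table of named entities (assumption: the real
# one holds ~200 more names).
my %html_entities = map { $_ => 1 } qw(quot amp lt gt nbsp);

# Hypothetical helper: true if $body (e.g. "quot;" or "#169;") is the
# tail of a reference we want to leave alone.
sub is_known_entity {
    my ($body) = @_;
    return 0 if $body eq '';          # bare ampersand
    return 1 if $body =~ /^#/;        # numeric, shape already checked by the caller's pattern
    (my $name = $body) =~ s/;$//;     # strip the trailing semicolon
    return $html_entities{ lc $name } ? 1 : 0;
}

# Hypothetical replacement for clean_html: visit every '&', grab an
# optional "name;" or "#digits;" right after it, and escape the '&'
# unless that tail is a known entity.
sub clean_amps {
    my ($string) = @_;
    $string =~ s{&((?:#\d+|#[xX][0-9a-fA-F]+|\w+);)?}{
        my $body = defined $1 ? $1 : '';
        is_known_entity($body) ? "&$body" : "&amp;$body";
    }ge;
    return $string;
}

print clean_amps('M&M Electric'), "\n";              # M&amp;M Electric
print clean_amps('L&L Electric&quot;s Work'), "\n";  # L&amp;L Electric&quot;s Work
print clean_amps('R&D&'), "\n";                      # R&amp;D&amp;
```

Because the substitution walks the string in place, leading and trailing ampersands need no special handling, and already-escaped entities such as &quot; are not escaped a second time.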


Andy Bach, Sys. Mangler
VOICE: (608) 261-5738  FAX 264-5030

"Believe nothing, no matter where you read it or who has said it, not even 
if I have said it, unless it agrees with your own reason and your own 
common sense." Buddha
