
Security bug in CGI.pm (pursuant to CERT #CA-2000-02)

Tom Christiansen
March 8, 2000 09:36
In the CGI.pm module included in perl-5.5.670, at this module revision

    $CGI::revision = '$Id: CGI.pm,v 1.19 1999/08/31 17:04:37 lstein Exp $';

The following code occurs in the CGI::escapeHTML() function

	$toencode =~ s/&/&amp;/g;
	$toencode =~ s/\"/&quot;/g;
	$toencode =~ s/>/&gt;/g;
	$toencode =~ s/</&lt;/g;

The problem I am about to demonstrate is exacerbated by this line
from the CGI::header() function

    $type ||= 'text/html' unless defined($type);

which I suspect should instead read more like

    $type ||= 'text/html;charset=ISO-8859-1';

except that you may wish to honor some sort of -CHARSET parameter,
and then interpolate that if set.
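A sketch of what honoring such a parameter might look like; the
subroutine name and the -charset key here are my own invention, not
CGI.pm's actual API:

```perl
use strict;

# Hypothetical helper: fold an optional -charset argument into the
# Content-Type, defaulting to ISO-8859-1 rather than leaving the
# charset unspecified.
sub content_type_header {
    my (%args) = @_;
    my $type    = $args{-type}    || 'text/html';
    my $charset = $args{-charset} || 'ISO-8859-1';
    return "Content-Type: $type; charset=$charset";
}

print content_type_header(), "\r\n\r\n";
```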

To demo the exploit, merely put this in a CGI script and observe
the result in a browser (well, in a browser that has the bug)

    print "\x8bH1\x9bTest\x8b/H1\x9b";

You are welcome to write that as:

    use CGI qw/:standard escapeHTML/;
    print p(escapeHTML("\x8bH1\x9bTest\x8b/H1\x9b"));

But it still sneaks through, because the code for escaping there isn't
good enough.  Even folks who know to escape "<" and ">" and "&" are in
trouble.  Here's the exploit: Netscape (and, I hear tell, the
MS-Exploder; there may be other offenders, too) improperly infers that
char 0x8b is a "<" (start of tag) and that 0x9b is a ">" (end of tag).
Consequently, the escapeHTML() function above misses those, and the
nefarious tags detailed in the CERT alert can propagate through.  I
should like to imagine that this stems from following a misguided
policy of obsequiously honoring some evil and rude MS-HTMLism.

I'll explain more further down this missive, but if you would like,
background reading on these matters can be found in the recent CERT
alert of February 2000:

    "Malicious HTML Tags Embedded in Client Web Requests" 

And these tech tips:

    "How To Remove Meta-characters From User-Supplied Data In CGI Scripts"

    "Understanding Malicious Content Mitigation for Web Developers"

The issue is one of encoding types.  First, from the second tech tip:

    Many web pages leave the character encoding ("charset" parameter
    in HTTP) undefined.  In earlier versions of HTML and HTTP, the
    character encoding was supposed to default to ISO-8859-1 if it wasn't
    defined. In fact, many browsers had a different default, so it was
    not possible to rely on the default being ISO-8859-1. HTML version
    4 legitimizes this - if the character encoding isn't specified,
    any character encoding can be used.

    If the web server doesn't specify which character encoding is in
    use, it can't tell which characters are special. Web pages with
    unspecified character encoding work most of the time because most
    character sets assign the same characters to byte values below 128.
    But which of the values above 128 are special?

Now, on strictly conforming browsers (doubtless the empty set), one
would imagine that an explicit charset such as Latin-1 or ASCII would
suffice to dodge this evil and illicit interpretation of 0x8b and 0x9b.

Alas, it does not!  These wicked browsers do this wrongness even when
the server dutifully sends charset=iso-8859-1.  They do this when
it sends charset=utf-8.  They do this when it sends charset=ascii.
They do this when it sends charset=tengwar.  It doesn't matter, and it's
terribly wrong.  It's against the standards, it's a grave security hole,
but you have to code around it or you open yourself up to serious badness.

That means you're really hosed whenever you output user input back to
them, or echo entries from a guestbook or database.  The only safe
thing to do is to strip out anything that's not printable 7-bit ASCII,
or printable 8-bit Latin-1.

So what do you do?  Well, you could end up writing code that retains
only very mundane things:

    $new_guestbook_entry =~ tr[_a-zA-Z0-9 ,./!?()@+*-][]dc;

or, praying that ctype.h does the right thing (but on BSD and Linux,
at least, apparently it's only 7-bit even with use locale)

    $new_guestbook_entry =~ s/[^[:print:]]//g;

or retaining just Latin1:

    $new_guestbook_entry =~ s/[^\x20-\x7e\xa1-\xff]//g;

And *then* you do your "<" etc translation.  

Isn't that just horrible?  
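Put together, the strip-then-escape dance looks something like this (a
sketch; the subroutine name is mine, and the character ranges keep
printable ASCII plus high Latin-1):

```perl
use strict;

# First strip anything outside printable 7-bit ASCII and 8-bit Latin-1,
# *then* do the usual entity escaping -- ampersand first, so the
# entities we introduce don't get double-escaped by a later pass.
sub sanitize_and_escape {
    my ($s) = @_;
    $s =~ s/[^\x20-\x7e\xa1-\xff]//g;   # kills 0x8b, 0x9b, and friends
    $s =~ s/&/&amp;/g;
    $s =~ s/"/&quot;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/</&lt;/g;
    return $s;
}

print sanitize_and_escape("\x8bH1\x9bTest\x8b/H1\x9b"), "\n";   # H1Test/H1
```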

This is a real annoyance as the world moves toward Unicode.

What I did learn was that if you encode these non-7-bit chars as pure
ASCII numeric character references, you have a chance.  For example:

    <P>Alpha omega test: &#945; &#969; (should be lower case alpha
    followed by lower case omega).</P>
    <P>Cyrillic test: &#1071; (should be an upper case Cyrillic
    letter which looks like a mirror image of R)</P>
    <P>Extended Latin test: &#353; (should be an s with a
    caron (inverted circumflex, "hat"))</P>

That means that if you do this:

    print "&#139;H1&#155;Test&#139;/H1&#155;"

you're ok, whereas if you do this:

    print "\x8bH1\x9bTest\x8b/H1\x9b"

then you're hosed.  
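Since 0x8b is decimal 139 and 0x9b is decimal 155, the safe form can be
generated mechanically from the raw bytes; a sketch:

```perl
use strict;

# Rewrite each byte in the 0x80-0x9f control range as a decimal
# character reference, so the browser can't misread it as a tag
# delimiter.
my $raw = "\x8bH1\x9bTest\x8b/H1\x9b";
(my $safe = $raw) =~ s/([\x80-\x9f])/sprintf("&#%d;", ord $1)/ge;
print $safe, "\n";   # &#139;H1&#155;Test&#139;/H1&#155;
```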

I wonder whether you shouldn't write the escaping thing more like:

    $new_guestbook_entry =~ s/([^\x20-\x7e\xa1-\xff])/"&#" . ord($1) . ";"/ge;

But then you have to be careful not to change the & into &amp; later, too,
so you'll need to do it all in one pass, probably with a translation table.
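One way to get it all done in a single pass, sketched with a
translation table (the names here are mine):

```perl
use strict;

# One alternation, one pass: either a special ASCII character (looked
# up in the table) or an out-of-range byte (turned into a numeric
# reference).  Nothing the substitution emits can be re-matched, so
# there's no double-escaping.
my %escape = ('&' => '&amp;', '"' => '&quot;', '<' => '&lt;', '>' => '&gt;');

sub escape_html_once {
    my ($s) = @_;
    $s =~ s{([&"<>])|([^\x20-\x7e\xa1-\xff])}
           {defined $1 ? $escape{$1} : sprintf("&#%d;", ord $2)}ge;
    return $s;
}

print escape_html_once("\x8b<H1>\x9b"), "\n";   # &#139;&lt;H1&gt;&#155;
```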

I don't quite know what should be done about all this.  Suggestions?
One thing I'm pretty sure of is that CGI.pm should fix its escapeHTML()
function, and that this function should be advertised to the world so
the world can just use it instead of having to recreate what may turn
out to be semi-tricky logic, and also so that people stop imagining
themselves safe behind their own more simplistic escaping.

--tom