Re: [perl #69973] Invalid and tainted utf-8 char crashes perl 5.10.1 in regexp evaluation

From: demerphq
Date: October 23, 2009 13:15
Subject: Re: [perl #69973] Invalid and tainted utf-8 char crashes perl 5.10.1 in regexp evaluation
Message ID: 9b18b3110910231315j41ca06dbleb4ba6f74aa75ff2@mail.gmail.com

2009/10/22 Mark Martinec <perlbug-followup@perl.org>:
> # New Ticket Created by  Mark Martinec
> # Please include the string:  [perl #69973]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=69973 >
>
>
>
> This is a bug report for perl from Mark.Martinec@ijs.si,
> generated with the help of perlbug 1.39 running under perl 5.10.1.
>
>
> -----------------------------------------------------------------
> [Please describe your issue here]
>
> While tracking down the reason for crashes of a perl process processing
> certain obfuscated spam messages, it turned out that a UTF-8 character
> with a large (and invalid) codepoint causes perl 5.10.1 to crash
> while matching such a string against a particular regular expression.
>
> This is happening on a FreeBSD 7.2, using perl as installed from ports
> with no special settings.
>
> Reducing the actual crashing application to a small test case,
> here it is:
>
>
> #!/usr/bin/perl -T
>  use strict;
>
>  # Here is a HTML snippet from a malicious/obfuscated mail message.
>  # Note the last character has an invalid and huge UTF-8 code
>  # (as a result of an unrelated bug in HTML::Parser).
>  #
>  my $t = '<a>Attention Home&#959&#969n&#1257rs...1&#1109t '.
>          'T&#1110&#1084e E&#957&#1257&#1075075</a>';
>
>  $t =~ s/&#(\d+)/chr($1)/ge;    # convert HTML entities to UTF8
>  $t .= substr($ENV{PATH},0,0);  # make it tainted
>
>  # show character codes in the resulting string
>  print join(", ", map {ord} split(//,$t)), "\n";
>
>  # The following regexp evaluation crashes perl 5.10.1 on FreeBSD.
>  # Note that $t must be tainted and must have the UTF8 flag on,
>  # otherwise the crash seems to be avoided.
>
>  $t =~ /( |\b)(http:|www\.)/i;
>
>
> and here is the result (hand wrapped):
>
>  60, 97, 62, 65, 116, 116, 101, 110, 116, 105, 111, 110, 32, 72, 111,
>  109, 101, 959, 969, 110, 1257, 114, 115, 46, 46, 46, 49, 1109, 116,
>  32, 84, 1110, 1084, 101, 32, 69, 957, 1257, 1075075, 60, 47, 97, 62
>  Segmentation fault: 11 (core dumped)
>
>
> Here is a backtrace as obtained from a core dump
> (cut/pasted from screen, the actual 8-bit characters may be wrong):
> $ gdb -c perl5.10.1.core /usr/local/bin/perl5.10.1
> GNU gdb 6.1.1 [FreeBSD]
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "amd64-marcel-freebsd"...
> Core was generated by `perl5.10.1'.
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /usr/local/lib/perl5/5.10.1/mach/CORE/libperl.so...done.
> Loaded symbols for /usr/local/lib/perl5/5.10.1/mach/CORE/libperl.so
> Reading symbols from /lib/libm.so.5...done.
> Loaded symbols for /lib/libm.so.5
> Reading symbols from /lib/libcrypt.so.4...done.
> Loaded symbols for /lib/libcrypt.so.4
> Reading symbols from /lib/libutil.so.7...done.
> Loaded symbols for /lib/libutil.so.7
> Reading symbols from /lib/libc.so.7...done.
> Loaded symbols for /lib/libc.so.7
> Reading symbols from /libexec/ld-elf.so.1...done.
> Loaded symbols for /libexec/ld-elf.so.1
> #0  0x00000000408bb101 in S_regmatch (reginfo=0x7fffffffe590, prog=0x411143a4) at regexec.c:3049
> 3049                            REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,

Unfortunately that location is just masking the cause; I'm pretty sure the
actual problem is in utf8.c.

You would have ended up in this code:

    case trie_utf8_fold:                                                    \
	if ( foldlen>0 ) {                                                  \
	    uvc_unfolded = uvc = utf8n_to_uvuni( uscan, UTF8_MAXLEN, &len, uniflags ); \
	    foldlen -= len;                                                 \
	    uscan += len;                                                   \
	    len=0;                                                          \
	} else {                                                            \
	    uvc_unfolded = uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags ); \
	    uvc = to_uni_fold( uvc, foldbuf, &foldlen );                    \
	    foldlen -= UNISKIP( uvc );                                      \
	    uscan = foldbuf + UNISKIP( uvc );                               \
	}                                                                   \
	break;

I'm guessing the crash happens in the second clause, probably in to_uni_fold().
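
For reference, the second clause adjusts foldlen and uscan by the UTF-8
byte length of a code point (UNISKIP). The standalone sketch below shows
that kind of length calculation and the matching encoding for the code
point from the report. It is not perl's implementation: utf8_skip_len()
and utf8_encode() are hypothetical names invented for illustration, and
the sketch covers only standard UTF-8 up to U+10FFFF, whereas perl's own
routines are laxer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the idea behind UNISKIP(): the number of
 * bytes needed to encode code point cp as standard UTF-8.
 * Returns 0 for values above U+10FFFF, which this sketch does not handle.
 * Surrogates are not rejected here. */
static size_t utf8_skip_len(uint32_t cp)
{
    if (cp < 0x80)      return 1;
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Encode cp into out, which must have room for at least 4 bytes.
 * Returns the number of bytes written (0 if cp is out of range). */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    size_t len = utf8_skip_len(cp);

    assert(out != NULL);
    switch (len) {
    case 1:
        out[0] = (unsigned char)cp;
        break;
    case 2:
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 3:
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 4:
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    default:
        break; /* out of range: nothing written */
    }
    return len;
}
```

Calling utf8_encode(1075075, buf) returns 4 and fills buf with the bytes
F4 86 9E 83. Those are the same bytes gdb prints as "ô\206\236\203" in the
backtrace, and 1075075 is the 0x106783 shown in the -Dr trace.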

> (gdb) bt
> #0  0x00000000408bb101 in S_regmatch (reginfo=0x7fffffffe590, prog=0x411143a4) at regexec.c:3049
> #1  0x00000000408b7b0a in S_regtry (reginfo=0x7fffffffe590, startpos=0x7fffffffe6d8) at regexec.c:2355
> #2  0x00000000408b6a7a in Perl_regexec_flags (prog=0x4114f1a0,
>    stringarg=0x4111d6c0 "<a>Attention HomeοÏ\211nÓ©rs...1Ñ\225t TÑ\226мe Eνөô\206\236\203</a>",
>    strend=0x4111d6f3 "/a>",
>    strbeg=0x4111d6c0 "<a>Attention HomeοÏ\211nÓ©rs...1Ñ\225t TÑ\226мe Eνөô\206\236\203</a>", minend=0,
>    sv=0x4113ec48, data=0x0, flags=3) at regexec.c:2146
> #3  0x00000000407864a3 in Perl_pp_match () at pp_hot.c:1356
> #4  0x000000004073fa4c in Perl_runops_debug () at dump.c:1968
> #5  0x00000000406905d8 in S_run_body (oldscope=1) at perl.c:2431
> #6  0x000000004068f9b0 in perl_run (my_perl=0x41102104) at perl.c:2349
> #7  0x0000000000400bf4 in main (argc=3, argv=0x7fffffffea90, env=0x7fffffffeab0) at perlmain.c:117
>
> (gdb)
>
>
>
> And lastly, here is a perl debug output using the -Dr command line option:

Thanks, your report is very complete.

> Compiling REx "( |\b)(http:|www\.)"
> Final program:
>   1: OPEN1 (3)
>   3:   BRANCH (6)
>   4:     EXACTF < > (8)
>   6:   BRANCH (FAIL)
>   7:     BOUND (8)
>   8: CLOSE1 (10)
>  10: OPEN2 (12)
>  12:   TRIE-EXACTF[HWhw] (19)
>        <http:>
>        <www.>
>  19: CLOSE2 (21)
>  21: END (0)
> minlen 4
> Omitting $` $& $' support.
>
> EXECUTING...
[...]
>  46 <E%x{3bd}%x{4e9}> <%x{106783}>|  1:OPEN1(3)
>  46 <E%x{3bd}%x{4e9}> <%x{106783}>|  3:BRANCH(6)
>  46 <E%x{3bd}%x{4e9}> <%x{106783}>|  4:  EXACTF < >(8)
>                                    failed...
>  46 <E%x{3bd}%x{4e9}> <%x{106783}>|  6:BRANCH(8)
>  46 <E%x{3bd}%x{4e9}> <%x{106783}>|  7:  BOUND(8)
>  46 <E%x{3bd}%x{4e9}> <%x{106783}>|  8:  CLOSE1(10)
>  46 <E%x{3bd}%x{4e9}> <%x{106783}>| 10:  OPEN2(12)
>  46 <E%x{3bd}%x{4e9}> <%x{106783}>| 12:  TRIE-EXACTF[HWhw](19)
>  46 <E%x{3bd}%x{4e9}> <%x{106783}>|      State:    1 Accepted:    0

I think the regex engine is the only place that uses the Unicode
folding logic. I'll try to dig further.

cheers,
Yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
