Re: [perl #69973] Invalid and tainted utf-8 char crashes perl 5.10.1 in regexp evaluation
From: demerphq
Date: October 23, 2009 13:15
Subject: Re: [perl #69973] Invalid and tainted utf-8 char crashes perl 5.10.1 in regexp evaluation
Message ID: 9b18b3110910231315j41ca06dbleb4ba6f74aa75ff2@mail.gmail.com

2009/10/22 Mark Martinec <perlbug-followup@perl.org>:
> # New Ticket Created by Mark Martinec
> # Please include the string: [perl #69973]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=69973 >
>
>
>
> This is a bug report for perl from Mark.Martinec@ijs.si,
> generated with the help of perlbug 1.39 running under perl 5.10.1.
>
>
> -----------------------------------------------------------------
> [Please describe your issue here]
>
> Tracking down a reason for crashes of a perl process while processing
> certain obfuscated spam messages, it turns out that an utf-8 character
> with a large (and invalid) codepoint is causing a perl 5.10.1 crash
> while matching such string to a particular regular expression.
>
> This is happening on a FreeBSD 7.2, using perl as installed from ports
> with no special settings.
>
> Reducing the actual crashing application to a small test case,
> here it is:
>
>
> #!/usr/bin/perl -T
> use strict;
>
> # Here is a HTML snippet from a malicious/obfuscated mail message.
> # Note the last character has an invalid and huge UTF-8 code
> # (as a result of an unrelated bug in HTML::Parser).
> #
> my $t = '<a>Attention Home&#959&#969n&#1257rs...1&#1109t '.
>         'T&#1110&#1084e E&#957&#1257&#1075075</a>';
>
> $t =~ s/&#(\d+)/chr($1)/ge; # convert HTML entities to UTF8
> $t .= substr($ENV{PATH},0,0); # make it tainted
>
> # show character codes in the resulting string
> print join(", ", map {ord} split(//,$t)), "\n";
>
> # The following regexp evaluation crashes perl 5.10.1 on FreeBSD.
> # Note that $t must be tainted and must have the UTF8 flag on,
> # otherwise the crash seems to be avoided.
>
> $t =~ /( |\b)(http:|www\.)/i;
>
>
> and here is the result (hand wrapped):
>
> 60, 97, 62, 65, 116, 116, 101, 110, 116, 105, 111, 110, 32, 72, 111,
> 109, 101, 959, 969, 110, 1257, 114, 115, 46, 46, 46, 49, 1109, 116,
> 32, 84, 1110, 1084, 101, 32, 69, 957, 1257, 1075075, 60, 47, 97, 62
> Segmentation fault: 11 (core dumped)
>
>
> Here is a backtrace as obtained from a core dump
> (cut/pasted from screen, the actual 8-bit characters may be wrong):
> $ gdb -c perl5.10.1.core /usr/local/bin/perl5.10.1
> GNU gdb 6.1.1 [FreeBSD]
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "amd64-marcel-freebsd"...
> Core was generated by `perl5.10.1'.
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /usr/local/lib/perl5/5.10.1/mach/CORE/libperl.so...done.
> Loaded symbols for /usr/local/lib/perl5/5.10.1/mach/CORE/libperl.so
> Reading symbols from /lib/libm.so.5...done.
> Loaded symbols for /lib/libm.so.5
> Reading symbols from /lib/libcrypt.so.4...done.
> Loaded symbols for /lib/libcrypt.so.4
> Reading symbols from /lib/libutil.so.7...done.
> Loaded symbols for /lib/libutil.so.7
> Reading symbols from /lib/libc.so.7...done.
> Loaded symbols for /lib/libc.so.7
> Reading symbols from /libexec/ld-elf.so.1...done.
> Loaded symbols for /libexec/ld-elf.so.1
> #0 0x00000000408bb101 in S_regmatch (reginfo=0x7fffffffe590, prog=0x411143a4) at regexec.c:3049
> 3049 REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
Unfortunately this location just masks the real cause; I'm pretty sure
the problem is in utf8.c.
You would have ended up in this code:
    case trie_utf8_fold:                                                     \
        if ( foldlen>0 ) {                                                   \
            uvc_unfolded = uvc = utf8n_to_uvuni( uscan, UTF8_MAXLEN, &len, uniflags ); \
            foldlen -= len;                                                  \
            uscan += len;                                                    \
            len=0;                                                           \
        } else {                                                             \
            uvc_unfolded = uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags ); \
            uvc = to_uni_fold( uvc, foldbuf, &foldlen );                     \
            foldlen -= UNISKIP( uvc );                                       \
            uscan = foldbuf + UNISKIP( uvc );                                \
        }                                                                    \
        break;
I'm guessing the crash happens in the second clause (the else branch),
probably in to_uni_fold().
> (gdb) bt
> #0 0x00000000408bb101 in S_regmatch (reginfo=0x7fffffffe590, prog=0x411143a4) at regexec.c:3049
> #1 0x00000000408b7b0a in S_regtry (reginfo=0x7fffffffe590, startpos=0x7fffffffe6d8) at regexec.c:2355
> #2 0x00000000408b6a7a in Perl_regexec_flags (prog=0x4114f1a0,
> stringarg=0x4111d6c0 "<a>Attention HomeοÏ\211nÓ©rs...1Ñ\225t TÑ\226мe Eνөô\206\236\203</a>",
> strend=0x4111d6f3 "/a>",
> strbeg=0x4111d6c0 "<a>Attention HomeοÏ\211nÓ©rs...1Ñ\225t TÑ\226мe Eνөô\206\236\203</a>", minend=0,
> sv=0x4113ec48, data=0x0, flags=3) at regexec.c:2146
> #3 0x00000000407864a3 in Perl_pp_match () at pp_hot.c:1356
> #4 0x000000004073fa4c in Perl_runops_debug () at dump.c:1968
> #5 0x00000000406905d8 in S_run_body (oldscope=1) at perl.c:2431
> #6 0x000000004068f9b0 in perl_run (my_perl=0x41102104) at perl.c:2349
> #7 0x0000000000400bf4 in main (argc=3, argv=0x7fffffffea90, env=0x7fffffffeab0) at perlmain.c:117
>
> (gdb)
>
>
>
> And lastly, here is a perl debug output using the -Dr command line option:
Thanks, your report is very complete.
> Compiling REx "( |\b)(http:|www\.)"
> Final program:
> 1: OPEN1 (3)
> 3: BRANCH (6)
> 4: EXACTF < > (8)
> 6: BRANCH (FAIL)
> 7: BOUND (8)
> 8: CLOSE1 (10)
> 10: OPEN2 (12)
> 12: TRIE-EXACTF[HWhw] (19)
> <http:>
> <www.>
> 19: CLOSE2 (21)
> 21: END (0)
> minlen 4
> Omitting $` $& $' support.
>
> EXECUTING...
[...]
> 46 <E%x{3bd}%x{4e9}> <%x{106783}>| 1:OPEN1(3)
> 46 <E%x{3bd}%x{4e9}> <%x{106783}>| 3:BRANCH(6)
> 46 <E%x{3bd}%x{4e9}> <%x{106783}>| 4: EXACTF < >(8)
> failed...
> 46 <E%x{3bd}%x{4e9}> <%x{106783}>| 6:BRANCH(8)
> 46 <E%x{3bd}%x{4e9}> <%x{106783}>| 7: BOUND(8)
> 46 <E%x{3bd}%x{4e9}> <%x{106783}>| 8: CLOSE1(10)
> 46 <E%x{3bd}%x{4e9}> <%x{106783}>| 10: OPEN2(12)
> 46 <E%x{3bd}%x{4e9}> <%x{106783}>| 12: TRIE-EXACTF[HWhw](19)
> 46 <E%x{3bd}%x{4e9}> <%x{106783}>| State: 1 Accepted: 0
I think the regex engine is the only place that uses the Unicode
folding logic. I'll try to dig further.
cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"