
[PATCH] feel the the baß (encoding problems in the regex engine)

From:
demerphq
Date:
March 19, 2007 17:40
Subject:
[PATCH] feel the the baß (encoding problems in the regex engine)
Message ID:
9b18b3110703191740m6bf21942p6521f3016ed8092f@mail.gmail.com
There is a problem with how the regexp engine handles certain types of
escapes and strings of different encodings. For instance:

perl -wle'$x=qq(\x{DF}); $x=~/$x|\x{100}/ and print qq(ok)'

produces the following:

Malformed UTF-8 character (unexpected non-continuation byte 0x7c,
immediately after start byte 0xdf) in regexp compilation at -e line 1.
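
The byte sequence named in that message can be checked outside the regex
engine. This is only a minimal sketch: the helper name decodes_as_utf8 is
made up for illustration, and it is only meant for strings that contain a
non-ASCII byte. 0xDF is a valid UTF-8 start byte for a two-byte sequence,
so the "|" (0x7C) that follows it is not a legal continuation byte.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the octets decode as UTF-8.
# utf8::decode works in place, so we operate on a copy of the argument.
sub decodes_as_utf8 {
    my ($octets) = @_;
    return utf8::decode($octets) ? 1 : 0;
}

my $latin1 = "\xDF|";        # ß as one Latin-1 byte, then the alternation bar
my $utf8   = "\xC3\x9F|";    # ß as two UTF-8 octets, then the bar

printf "DF 7C    : %s\n", decodes_as_utf8($latin1) ? "well-formed" : "malformed";
printf "C3 9F 7C : %s\n", decodes_as_utf8($utf8)   ? "well-formed" : "malformed";
```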

As far as I can tell, in blead this happens because when \x{100} is
parsed during the sizing phase it sets the pattern-is-utf8 flag to
true but doesn't upgrade the string to utf8. On the second pass the
engine tries to read the string as utf8 and fails. The attached patch
fixes this: when it notices this might happen, it upgrades the string
to utf8 and then redoes[2] the sizing phase, since the recoding might
have altered the required allocation. The stale size could have caused
a buffer overrun error.[1]
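
Why the recoding can change the required allocation can be seen from
plain Perl. This is a rough sketch, not the patch itself: upgrade_sizes is
a hypothetical helper that reports the internal byte length of a string
before and after utf8::upgrade.

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length without switching on byte semantics

# Hypothetical helper: (bytes before upgrade, bytes after upgrade, characters).
sub upgrade_sizes {
    my ($s) = @_;                     # a copy, so the caller's string is untouched
    my $before = bytes::length($s);
    utf8::upgrade($s);                # same characters, UTF-8 internal storage
    return ($before, bytes::length($s), length $s);
}

my ($before, $after, $chars) = upgrade_sizes("\x{DF}");
print "chars=$chars bytes before=$before after=$after\n";
# prints: chars=1 bytes before=1 after=2
```

One character stays one character, but it needs two bytes once upgraded,
so a size computed before the upgrade is too small.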

With the patch applied:

D:\dev\perl\ver\zoro\t\win32>..\perl -wle"$x=qq(\x{DF}); $x=~/$x|\x{100}/ and print 'ok'"
ok

\x{DF} is ß by the way. Pesky thing.

As a bonus this patch includes two bug fixes which I came across while
working out the utf8 encoding problem. One is for the trie code
charclass logic which was doing the wrong thing under utf8 and the
other was in some debugging output code that was using the wrong utf8
flag.

Not bad for number of bugs per single test case really. :-)

Yves

[1] I almost wonder if this could have been responsible for the sizing
bug in the xml code from a while back. I'll have to try reverting that
patch with this patch applied and see.

[2] This is far from the most efficient way to deal with this. It
would be nice to fail-fast the parse somehow, so that the least work
possible is done in the first parse pass after the point where we know
we have to upgrade the string. That point could be far into the parse
recursion stack, so it's a bit difficult to do.
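
To illustrate the fail-fast idea only: the toy below is not regcomp.c and
not what the patch does. It is a hypothetical recursive "sizer" over
nested array refs that throws a sentinel exception as soon as it meets a
code point above 0xFF in byte mode, so the caller can restart in utf8 mode
without finishing the first pass. All names here ($RESTART, size_node,
size_pattern) are assumptions made up for the sketch.

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length without switching on byte semantics

my $RESTART = "need-utf8\n";   # sentinel exception (hypothetical)

# Size one node: an array ref is a group, anything else is a literal.
sub size_node {
    my ($node, $is_utf8) = @_;
    if (ref $node) {
        my $total = 0;
        $total += size_node($_, $is_utf8) for @$node;
        return $total;
    }
    # Fail fast: a wide character in byte mode aborts the whole pass.
    die $RESTART if !$is_utf8 && $node =~ /[^\x00-\xFF]/;
    my $copy = $node;
    utf8::upgrade($copy) if $is_utf8;
    return bytes::length($copy);
}

# Returns (size in bytes, number of passes started).
sub size_pattern {
    my ($tree) = @_;
    my $size = eval { size_node($tree, 0) };
    return ($size, 1) if defined $size;
    die $@ unless $@ eq $RESTART;          # rethrow anything unexpected
    return (size_node($tree, 1), 2);
}

my ($size, $passes) = size_pattern(["\x{DF}", "|", ["\x{100}"]]);
print "size=$size passes=$passes\n";
# prints: size=5 passes=2
```

In Perl the exception unwinds the recursion for free; doing the same from
deep inside a C parser is the part that takes more care.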

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
