
RFC: Adding UTF-8 validity checks to the regex engine

From: Karl Williamson
Date: December 23, 2012 22:04
Subject: RFC: Adding UTF-8 validity checks to the regex engine
Message ID: 50D77FF7.3090904@khwilliamson.com

In thinking about [Perl #116148]: "Pattern utf8ness sticks around 
globally", it seems to me that the regex engine should do some 
self-protection.

It seems to me that the target string should be checked for valid 
UTF-8 once, upon entry to the engine.  That way we don't have to worry 
about testing for malformations in the middle of matching, where 
backtracking could cause the same test to be done gazillions of times.
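
To make the "check once on entry" idea concrete, here is a minimal 
standalone sketch in C.  It is only an illustration: the function name 
utf8_is_well_formed() is hypothetical and not part of the Perl core, 
and it enforces strict RFC 3629 UTF-8, which is an assumption; Perl's 
internal encoding accepts more (surrogates and above-Unicode code 
points, for example), and the perl API already provides 
is_utf8_string() for this kind of test.  The point is that a single 
linear pass before matching makes every later step safe from running 
past the end of the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not a Perl core function.
 * Returns true if s[0..len) is well-formed UTF-8 per RFC 3629:
 * correct lead/continuation structure, no truncated sequence,
 * no overlong form, no surrogate, nothing above U+10FFFF. */
static bool utf8_is_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected */
        unsigned long cp;   /* code point being assembled */
        unsigned long min;  /* smallest code point for this length */

        if (c < 0x80) {                 /* ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;               /* stray continuation or bad lead */
        }

        if (need > len - i - 1)
            return false;               /* sequence runs off the buffer */

        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;           /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return false;               /* overlong encoding */
        if (cp > 0x10FFFF)
            return false;               /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;               /* surrogate */

        i += need + 1;
    }
    return true;
}
```

The cost is one pass over the target per match attempt, independent of 
how much backtracking happens afterwards.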

I added an assert() to do this, and the test suite hangs; there are a 
number of assertion failures as well.  I haven't debugged anything yet.

But I'm thinking that this should probably not be an assertion, but 
something that is done in production code to guard the engine from 
reading off the end of the buffer, etc.  I would think that the right 
thing to do would be to raise a warning and fail the match when 
malformed UTF-8 input is encountered.

However, this bug report is a case where the pattern isn't valid UTF-8 
(it isn't UTF-8 at all; the engine just thinks it is).  Hopefully, we 
have enough control over regex compilation that we generate only valid 
UTF-8, but this bug shows that can fail.  I am proposing adding a 
-DDEBUGGING-only check to the regex engine that, at the start of each 
match, goes through the pattern and checks each text node for valid 
UTF-8.  I am presuming that this would have caught this bug before 
release, and production code would not be slowed down.
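
A rough sketch of what such a debug-only check could look like follows. 
Everything in it is an assumption made for illustration: struct 
text_node is a hypothetical stand-in for a compiled pattern's literal 
nodes (the real engine's node layout is different), and 
utf8_structure_ok() is a simplified structural test, not Perl's own 
validator.  Only the DEBUGGING symbol comes from the proposal above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a literal text node in a compiled pattern. */
struct text_node {
    const unsigned char *bytes;
    size_t len;
    bool is_utf8;   /* what the engine believes about this node */
};

/* Structural check only: each lead byte must be followed, inside the
 * buffer, by the number of continuation bytes it announces. */
static bool utf8_structure_ok(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t need = c < 0x80           ? 0
                    : (c & 0xE0) == 0xC0 ? 1
                    : (c & 0xF0) == 0xE0 ? 2
                    : (c & 0xF8) == 0xF0 ? 3
                    : (size_t)-1;        /* invalid lead byte */
        if (need == (size_t)-1 || need > len - i - 1)
            return false;
        for (size_t k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
        i += need + 1;
    }
    return true;
}

/* Walk every text node; nodes not flagged as UTF-8 are skipped. */
static bool pattern_nodes_ok(const struct text_node *nodes, size_t count)
{
    for (size_t n = 0; n < count; n++)
        if (nodes[n].is_utf8
            && !utf8_structure_ok(nodes[n].bytes, nodes[n].len))
            return false;
    return true;
}

/* Compiled in only for DEBUGGING builds; a no-op otherwise. */
#ifdef DEBUGGING
#  define ASSERT_PATTERN_UTF8(nodes, count) \
       assert(pattern_nodes_ok((nodes), (count)))
#else
#  define ASSERT_PATTERN_UTF8(nodes, count) ((void)0)
#endif
```

A node holding the single Latin-1 byte 0xE9 but flagged as UTF-8, which 
is the shape of the bug discussed here, fails this check because 0xE9 
announces two continuation bytes that are not there.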


