Re: [PATCH] Re: [perl #74022] Parser hangs on some Unicode numbers and symbols in identifiers

From:
Father Chrysostomos
Date:
April 26, 2010 03:09
Subject:
Re: [PATCH] Re: [perl #74022] Parser hangs on some Unicode numbers and symbols in identifiers
Message ID:
0764ECF6-1485-4679-A530-F44C759A2FD8@cpan.org

On Apr 25, 2010, at 2:21 PM, Father Chrysostomos wrote:

> I think scan_word should be using is_utf8_idcont, rather than  
> isALNUM_utf8. The attached patch makes it do just this.

The tests had not finished running when I sent that. lib/utf8.t is  
failing. It turns out that things are not as simple as I thought.  
toke.c has 23 instances of isIDFIRST_lazy_if, so it seems that most of  
the code is expecting S_scan_word to match something like

   /^(?!\p{IsDigit})[\p{ID_Continue}_]+/

whereas what it actually matches (ignoring package separators) is

   /^([\p{IsWord}_]\pM?)*/

My patch prevents qq·aaa· from being valid syntax, because U+B7 is
part of \p{ID_Continue} (hence the lib/utf8.t failure). One thing my
patch didn’t address was the \pM? (is_utf8_mark) part of scan_word.
\p{ID_Continue} contains all of \pM except for the thirteen characters
in \p{Me}.
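
To make the difference concrete, here is a rough sketch that runs the two
patterns quoted above against a few sample strings. It assumes a reasonably
recent perl; the helper name matched_length and the sample strings are made
up for illustration and are not part of toke.c or of the patch.

```perl
use strict;
use warnings;

# The two patterns from the message, copied as written.
my $expected = qr/^(?!\p{IsDigit})[\p{ID_Continue}_]+/;   # what callers seem to assume
my $actual   = qr/^([\p{IsWord}_]\pM?)*/;                 # what scan_word matches now

# Hypothetical helper: length of the prefix matched by $re, or 0 if no match.
sub matched_length {
    my ($re, $str) = @_;
    my ($m) = $str =~ /($re)/;
    return defined $m ? length $m : 0;
}

# U+00B7 MIDDLE DOT is in \p{ID_Continue} but is not a word character.
printf "%d %d\n", matched_length($expected, "a\x{B7}b"),
                  matched_length($actual,   "a\x{B7}b");     # 3 1

# U+20DD COMBINING ENCLOSING CIRCLE is in \p{Me}: a mark, but not ID_Continue.
printf "%d %d\n", matched_length($expected, "a\x{20DD}b"),
                  matched_length($actual,   "a\x{20DD}b");   # 1 3

# A leading digit is rejected only by the lookahead in the first pattern.
printf "%d %d\n", matched_length($expected, "1abc"),
                  matched_length($actual,   "1abc");         # 0 4
```

The first line of output is the qq·aaa· case: under \p{ID_Continue} the
middle dot becomes part of the word instead of acting as a delimiter.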

So there is a potential for breakage if we make everything match
Unicode. The macro in handy.h is already explicitly looser than Unicode.
Fixing this bug requires an arbitrary decision from someone more
knowledgeable than I.

