On Apr 25, 2010, at 2:21 PM, Father Chrysostomos wrote:

> I think scan_word should be using is_utf8_idcont, rather than
> isALNUM_utf8. The attached patch makes it do just this.

The tests had not finished running when I sent that. lib/utf8.t is
failing.

It turns out that things are not as simple as I thought. toke.c has 23
instances of isIDFIRST_lazy_if, so it seems that most of the code is
expecting S_scan_word to match something like

    /^(?!\p{IsDigit})[\p{ID_Continue}_]+/

whereas what it actually matches (ignoring package separators) is

    /^([\p{IsWord}_]\pM?)*/

My patch prevents qq·aaa· from being valid syntax, because U+B7 is part
of \p{ID_Continue} (hence the lib/utf8.t failure). With the patch, the
middle dot is scanned as part of the word instead of being left over as
the quote delimiter.

One thing my patch didn’t address was the \pM? (is_utf8_mark) part of
scan_word. \p{ID_Continue} contains all of \pM except for the thirteen
characters in \p{Me}. So there is a potential for breakage if we make
everything match Unicode. The macros in handy.h are already explicitly
looser than Unicode.

Fixing this bug requires an arbitrary decision from someone more
knowledgeable than I.
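
For anyone who wants to see the two behaviours side by side without building perl, here is a rough sketch in Python. It is only an approximation: it uses Python's Unicode tables rather than Perl's, str.isidentifier() tests XID_Continue (a close relative of ID_Continue), and Python's \w is narrower than \p{IsWord} where marks are concerned. The names scan_word, is_word_char and is_id_continue are made up for this illustration and are not the functions in toke.c.

```python
import re
import unicodedata

MIDDLE_DOT = "\u00b7"  # U+B7, the delimiter in qq·aaa·


def is_word_char(ch):
    """Stand-in for an isALNUM_utf8-style test, using Python's \\w."""
    return re.fullmatch(r"\w", ch) is not None


def is_id_continue(ch):
    """Stand-in for is_utf8_idcont: 'a' + ch is an identifier exactly
    when ch may continue one (XID_Continue in Python's tables)."""
    return ("a" + ch).isidentifier()


def scan_word(text, cont):
    """Hypothetical scanner: take the longest prefix whose characters
    all satisfy the continuation test `cont`."""
    i = 0
    while i < len(text) and cont(text[i]):
        i += 1
    return text[:i]


source = "qq" + MIDDLE_DOT + "aaa" + MIDDLE_DOT

# Word-character test: the scan stops at the middle dot, leaving it
# available as the quote delimiter.
print(repr(scan_word(source, is_word_char)))

# ID_Continue-style test: the middle dot is swallowed into the word.
print(repr(scan_word(source, is_id_continue)))

# Marks: a nonspacing mark (Mn) may continue an identifier, while an
# enclosing mark (Me) may not.
for mark in ("\u0301", "\u20dd"):
    print(unicodedata.category(mark), is_id_continue(mark))
```

The first scan yields just the qq, and the second yields the whole string, which is the lib/utf8.t failure in miniature. The final loop shows the \p{Me} gap: a \pM? test would accept U+20DD, but an ID_Continue test would not.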