Re: [PATCH] Re: [perl #74022] Parser hangs on some Unicode numbers and symbols in identifiers
From: karl williamson
Date: April 27, 2010 18:57
Subject: Re: [PATCH] Re: [perl #74022] Parser hangs on some Unicode numbers and symbols in identifiers
Message ID: 4BD795DB.5090803@khwilliamson.com

Father Chrysostomos wrote:
>
> On Apr 25, 2010, at 2:21 PM, Father Chrysostomos wrote:
>
>> I think scan_word should be using is_utf8_idcont, rather than
>> isALNUM_utf8. The attached patch makes it do just this.
>
> The tests had not finished running when I sent that. lib/utf8.t is
> failing. It turns out that things are not as simple as I thought. toke.c
> has 23 instances of isIDFIRST_lazy_if, so it seems that most of the code
> is expecting S_scan_word to match something like
>
> /^(?!\p{IsDigit})[\p{ID_Continue}_]+/
>
> whereas what it actually matches (ignoring package separators) is
>
> /^([\p{IsWord}_]\pM?)*/
>
> My patch prevents qq·aaa· from being valid syntax, because U+B7 is part
> of \p{ID_Continue} (hence the lib/utf8.t failure). One thing my patch
> didn’t address was the \pM? (is_utf8_mark) part of scan_word.
> \p{ID_Continue} contains all of \pM except for the thirteen characters
> in \p{Me}.
>
> So there is a potential for breakage if we make everything match
> Unicode. The macro in handy.h is already explicitly looser than Unicode.
> Fixing this bug requires an arbitrary decision from someone more
> knowledgeable than I.
>
>

Thanks for finding this. I've wondered about the handy.h comment you
quoted, which documents our decision to use a home-grown Perl definition
rather than the official Unicode one. To repeat, it is
/* The ID_Start of Unicode is quite limiting: it assumes a L-class
* character (meaning that you cannot have, say, a CJK character).
* Instead, let's allow ID_Continue but not digits. */

Jarkko wrote that comment in 2002. Since then (actually quite a long
time ago), Unicode has fixed this problem, and the official ID_Start
does include Han characters and Korean syllables.
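
One way to confirm this against the Unicode version a given perl ships with is to test a few code points directly. This is only a minimal sketch: the sample code points and the helper name is_id_start are chosen here for illustration and do not come from handy.h or the test suite.

```perl
use strict;
use warnings;

# Sample code points chosen for illustration only.
my @samples = (
    [ 0x4E2D, 'CJK UNIFIED IDEOGRAPH-4E2D' ],
    [ 0xD55C, 'HANGUL SYLLABLE HAN' ],
    [ 0x0041, 'LATIN CAPITAL LETTER A' ],
    [ 0x0030, 'DIGIT ZERO' ],
);

# Hypothetical helper: returns 1 if the code point has the
# ID_Start property in this perl's Unicode tables, else 0.
sub is_id_start {
    my ($cp) = @_;
    return chr($cp) =~ /\p{ID_Start}/ ? 1 : 0;
}

for my $s (@samples) {
    my ($cp, $name) = @$s;
    printf "U+%04X %-28s ID_Start=%d\n", $cp, $name, is_id_start($cp);
}
```

With current Unicode tables the Han and Hangul samples should report 1 and the digit 0, which is the behaviour the old comment assumed was missing.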

Jarkko wrote me last year that "Unicode knows best". In other words,
the Unicode Consortium will, in general, do a better job than we at Perl
possibly could at figuring out what's best. It hasn't always, but the
Standard has improved as it has evolved, and it is stabilizing over
time. More and more safeguards have been put in place to minimize
errors, though that's not to say errors have gone to zero.

In 5.12, I took Jarkko's advice, and changed our definitions of \p
properties to be identical to Unicode's. And people agreed with that
decision, so that's what got shipped.

I had been planning to look at this area too, and your posts spurred me
to do it. What I think is that we should move to Unicode's definitions,
even if it means breaking some existing code. Going forward, then, we
won't have to worry about it, as those definitions get added to (and
perhaps modified); we just follow the Standard.

The middle dot that caused your test to fail is a character that Unicode
has had trouble deciding how to handle. I haven't checked whether its
status has changed with regard to this property, but it has in others.
That has since settled down, and it has remained unchanged in recent
releases.
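
To make the middle dot case concrete, the two patterns from the quoted message can be run against "aaa" followed by U+00B7. This is a rough sketch rather than a model of what toke.c does: the patterns are copied as quoted, while the helper matched_prefix and the test string are assumptions made for the example.

```perl
use strict;
use warnings;

# The two patterns from the quoted message, copied as written there.
my $expected = qr/^(?!\p{IsDigit})[\p{ID_Continue}_]+/;
my $actual   = qr/^([\p{IsWord}_]\pM?)*/;

# Hypothetical helper: return the matched prefix as a list of code
# points, so that no wide characters need to be printed.
sub matched_prefix {
    my ($re, $str) = @_;
    return '' unless $str =~ $re;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $&;
}

my $str = "aaa\x{B7}";    # "aaa" followed by U+00B7 MIDDLE DOT
utf8::upgrade($str);      # make sure Unicode rules apply to U+00B7

print "expected-style: ", matched_prefix($expected, $str), "\n";
print "actual-style:   ", matched_prefix($actual,   $str), "\n";
```

The ID_Continue-based pattern consumes the middle dot as part of the word, while the IsWord-based one stops before it. That difference is why qq·aaa· stops parsing as a quoted string once the dot is treated as an identifier character.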

Actually, I think we should move not to ID_Start, but to Unicode's
revised property, XID_Start, which they recommend over the earlier one.
It is nearly identical, but better handles a few weirdly behaving
characters, mostly in Thai, Lao, Greek, and Arabic. 5.12's regex \X
construct uses a similar Unicode definition that takes these into
account, and it automatically fixes the issue with marks that you
pointed out.
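
For anyone who wants to see exactly which characters differ, a short scan of the Basic Multilingual Plane lists the code points that are ID_Start but not XID_Start. This is a sketch under the assumption that the perl in use supports both \p{ID_Start} and \p{XID_Start}; the function name id_start_only is made up for the example, and the list it prints depends on the Unicode version bundled with that perl.

```perl
use strict;
use warnings;

# Hypothetical helper: collect BMP code points that are ID_Start
# but not XID_Start in this perl's Unicode tables.
sub id_start_only {
    my @found;
    for my $cp (0 .. 0xFFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        my $ch = chr $cp;
        push @found, $cp
            if $ch =~ /\p{ID_Start}/ && $ch !~ /\p{XID_Start}/;
    }
    return @found;
}

printf "U+%04X\n", $_ for id_start_only();
```

The list is short, and it should include U+0E33 THAI CHARACTER SARA AM and U+0EB3 LAO VOWEL SIGN AM, which remain XID_Continue but are no longer allowed to start an identifier.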

Unicode is keeping ID_Start around for backwards compatibility. I don't
know if they intend to do so indefinitely or not. My guess is that it
will be there for quite some time to come.

To summarize, I propose that we use Unicode's XID_Start and XID_Continue
properties in 5.14, even though that breaks one of our tests, and
possibly existing code.