
Re: [perl #74022] What to do in 5.12.1? -- Parser hangs on some Unicode numbers and symbols in identifiers

From: karl williamson
Date: May 2, 2010 15:39
Subject: Re: [perl #74022] What to do in 5.12.1? -- Parser hangs on some Unicode numbers and symbols in identifiers
Message ID: 4BDDFF13.1080901@khwilliamson.com

Since this causes Perl to hang, I think it should be addressed somehow 
in 5.12.1.  It may be that the thing to do is just document it.  It's 
been around since 2007.  I'm still looking at how things are done 
currently, and a number of things appear wrong to me, but that's an 
initial take, subject to further consideration.

Father Chrysostomos wrote:
> 
> On Apr 27, 2010, at 6:56 PM, karl williamson wrote:
> 
>> To summarize, I propose that we use Unicode's XID_Start and 
>> XID_Continue properties in 5.14, even though that breaks one of our 
>> tests, and possibly existing code.
> 
> Would we change the meanings of is_utf8_idcont and is_utf8_idfirst, or 
> introduce new functions?

My first take is that I think we would just change the meanings.  The 
differences are quite minimal.  ID_Start contains 23 more characters 
than XID_Start:
037A   GREEK YPOGEGRAMMENI
0E33   THAI CHARACTER SARA AM
0EB3   LAO VOWEL SIGN AM
309B   KATAKANA-HIRAGANA VOICED SOUND MARK
309C   KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
FC5E   ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
FC5F   ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
FC60   ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
FC61   ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
FC62   ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
FC63   ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
FDFA   ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
FDFB   ARABIC LIGATURE JALLAJALALOUHOU
FE70   ARABIC FATHATAN ISOLATED FORM
FE72   ARABIC DAMMATAN ISOLATED FORM
FE74   ARABIC KASRATAN ISOLATED FORM
FE76   ARABIC FATHA ISOLATED FORM
FE78   ARABIC DAMMA ISOLATED FORM
FE7A   ARABIC KASRA ISOLATED FORM
FE7C   ARABIC SHADDA ISOLATED FORM
FE7E   ARABIC SUKUN ISOLATED FORM
FF9E   HALFWIDTH KATAKANA VOICED SOUND MARK
FF9F   HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

And ID_Continue contains 19 more characters than XID_Continue:
037A   GREEK YPOGEGRAMMENI
309B   KATAKANA-HIRAGANA VOICED SOUND MARK
309C   KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
FC5E   ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
FC5F   ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
FC60   ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
FC61   ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
FC62   ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
FC63   ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
FDFA   ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
FDFB   ARABIC LIGATURE JALLAJALALOUHOU
FE70   ARABIC FATHATAN ISOLATED FORM
FE72   ARABIC DAMMATAN ISOLATED FORM
FE74   ARABIC KASRATAN ISOLATED FORM
FE76   ARABIC FATHA ISOLATED FORM
FE78   ARABIC DAMMA ISOLATED FORM
FE7A   ARABIC KASRA ISOLATED FORM
FE7C   ARABIC SHADDA ISOLATED FORM
FE7E   ARABIC SUKUN ISOLATED FORM
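
The two lists overlap almost entirely. As a rough cross-check, the following Python sketch copies the code points from the lists above into two sets (the variable names are made up for illustration) and prints the characters that appear only in the first list. Assuming the lists are complete, those characters would stop being legal as the first character of an identifier but would stay legal in later positions.

```python
import unicodedata

# Code points copied from the first list: in ID_Start but not in XID_Start.
start_only = {
    0x037A, 0x0E33, 0x0EB3, 0x309B, 0x309C,
    0xFC5E, 0xFC5F, 0xFC60, 0xFC61, 0xFC62, 0xFC63,
    0xFDFA, 0xFDFB,
    0xFE70, 0xFE72, 0xFE74, 0xFE76, 0xFE78, 0xFE7A, 0xFE7C, 0xFE7E,
    0xFF9E, 0xFF9F,
}

# Code points copied from the second list: in ID_Continue but not in XID_Continue.
continue_only = {
    0x037A, 0x309B, 0x309C,
    0xFC5E, 0xFC5F, 0xFC60, 0xFC61, 0xFC62, 0xFC63,
    0xFDFA, 0xFDFB,
    0xFE70, 0xFE72, 0xFE74, 0xFE76, 0xFE78, 0xFE7A, 0xFE7C, 0xFE7E,
}

# Every character dropped from the continue set is also dropped from the start set.
print("second list is a subset of the first:", continue_only <= start_only)

# Characters that lose start status only.
for cp in sorted(start_only - continue_only):
    print(f"{cp:04X}   {unicodedata.name(chr(cp))}")
```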

So the differences are minimal; by going with the X versions we would be 
recognizing 23 fewer start characters and 19 fewer continue characters.  
You can tell from some of the names why it was wrong to put them in the 
original versions.
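
One way to see the problem beyond the names: Unicode derives the X properties from ID_Start and ID_Continue with adjustments so that identifiers stay identifiers under NFKC normalization. The Python sketch below is only an illustration using the standard `unicodedata` module, not anything from the Perl core. It normalizes a few of the listed characters and shows that the result either contains a space or begins with a combining mark. The sample selection and the helper name `nfkc_codepoints` are my own.

```python
import unicodedata

# A handful of the characters listed above, chosen for illustration.
samples = [0x037A, 0x309B, 0xFE70, 0xFDFA, 0x0E33, 0xFF9E]

def nfkc_codepoints(cp):
    """Return the code points of the NFKC normalization of one character."""
    return [ord(ch) for ch in unicodedata.normalize("NFKC", chr(cp))]

for cp in samples:
    expansion = nfkc_codepoints(cp)
    first_category = unicodedata.category(chr(expansion[0]))
    print(f"{cp:04X}   {unicodedata.name(chr(cp))}")
    print("       NFKC:", " ".join(f"{c:04X}" for c in expansion))
    # Zs is a space separator, Mn a non-spacing combining mark.
    print("       general category of first code point:", first_category)
    print("       contains a space:", 0x20 in expansion)
```

A normalized form that starts with a space or a combining mark cannot begin an identifier, and one that contains a space cannot appear inside an identifier at all.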

But I need to study things further before I can come up with a recommendation.


> 
> In anticipation of this change, I’ve attached a patch that corrects the 
> test in utf8.t to use ¡ instead of ·. I’ve also moved the test outside 
> of the eval, so it will still run (and fail) if the compilation fails, 
> instead of causing an invalid test count.
> 
Thanks.  Have you considered adding a timeout? test.pl has one that will 
kill the test script if Perl hangs.
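
For readers unfamiliar with the idea, a watchdog is just an outside timer that kills the script if it has not finished in time. The Python sketch below shows that general pattern only. It is not how test.pl does it, and `run_with_watchdog` is a hypothetical name. The "hanging" child here is a Python busy loop standing in for a Perl process stuck in the parser.

```python
import subprocess
import sys

def run_with_watchdog(cmd, timeout_seconds):
    """Run cmd; return its exit status, or None if it had to be killed."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising TimeoutExpired.
        return None
    return completed.returncode

# Stand-in for a script that hangs: a child process that loops forever.
hanging = [sys.executable, "-c", "while True: pass"]
status = run_with_watchdog(hanging, timeout_seconds=1)
print("killed by watchdog" if status is None else f"exited with {status}")
```

The point of doing the timing from outside the hung process is that the hang is in the parser, so nothing inside the script itself can be relied on to notice it.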

