develooper Front page | perl.perl5.porters | Postings from June 2015

Unicode's "solution" to the Catalan middle dot vs the Greek ano teleia problem

From: Karl Williamson
Date: June 23, 2015 22:31
Subject: Unicode's "solution" to the Catalan middle dot vs the Greek ano teleia problem
Message ID: 5589DE11.3010200@khwilliamson.com

Unicode has addressed this with some new text in 8.0.

Recall that these two dot characters are visually indistinguishable from 
one another, each being a full stop raised above the baseline of the text. 
Also, every normalization applied to input text containing them will 
map both to the same code point, U+00B7.  The 
problem is that they are not equivalent: they are from different 
scripts and have different purposes.  The Catalan one is a \w character, and 
the Greek one is punctuation, roughly equivalent to the English semicolon.
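
To make the normalization point concrete, here is a minimal sketch in Python (chosen only for illustration; it is not from the Unicode text).  It assumes the Greek ano teleia is encoded as U+0387, and the helper name normalized_forms is made up for this example.

```python
import unicodedata

ANO_TELEIA = "\u0387"   # GREEK ANO TELEIA
MIDDLE_DOT = "\u00b7"   # MIDDLE DOT, the character used in Catalan

def normalized_forms(ch):
    """Return the result of each of the four normalization forms for ch."""
    return {
        form: unicodedata.normalize(form, ch)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }

# Every form turns the ano teleia into U+00B7, so the distinction
# between the two characters does not survive normalization.
for form, result in normalized_forms(ANO_TELEIA).items():
    print(form, f"U+{ord(result):04X}")
```

After any of these normalizations, nothing in the text itself records which of the two characters was originally typed.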

What Unicode now says basically comes down to this: if you care, you should 
treat a U+00B7 surrounded by \w characters as the Catalan character, since it 
legitimately occurs only between the letters of a word, and otherwise 
treat it as punctuation.  In identifier parsing you can essentially 
create a third property, beyond IDStart and IDContinue, namely 
IDInterior.  I think one would also have to have an IDTerminal and just 
not use IDContinue.
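
A rough sketch of both ideas, again in Python and only as an illustration under stated assumptions.  Python's \w and str.isidentifier() stand in for \w and for IDStart/IDContinue; they are not identical to Perl's definitions.  The function names are hypothetical, and the IDInterior/IDTerminal split is one reading of the suggestion above, not something taken from the Unicode text.

```python
import re

MIDDLE_DOT = "\u00b7"

def is_word_char(ch):
    # Assumption: Python's \w approximates the \w mentioned in the post.
    return re.fullmatch(r"\w", ch) is not None

def classify_middle_dots(text):
    """Label each U+00B7 as 'word' (between word chars) or 'punctuation'."""
    results = []
    for i, ch in enumerate(text):
        if ch != MIDDLE_DOT:
            continue
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        kind = "word" if is_word_char(before) and is_word_char(after) else "punctuation"
        results.append((i, kind))
    return results

def is_id_start(ch):
    return ch.isidentifier()

def is_id_terminal(ch):
    # Anything that may continue an identifier, except the middle dot.
    return ch != MIDDLE_DOT and ("a" + ch).isidentifier()

def is_id_interior(ch):
    # Terminal characters plus the middle dot itself.
    return ch == MIDDLE_DOT or is_id_terminal(ch)

def is_identifier(s):
    """Start char, then interior chars, ending on a terminal char."""
    if not s or not is_id_start(s[0]):
        return False
    if len(s) == 1:
        return True
    return all(is_id_interior(c) for c in s[1:-1]) and is_id_terminal(s[-1])

print(classify_middle_dots("col\u00b7lega"))   # [(3, 'word')]
print(classify_middle_dots("ab\u00b7 cd"))     # [(2, 'punctuation')]
print(is_identifier("col\u00b7lega"))          # True
print(is_identifier("col\u00b7"))              # False
```

Under this reading a middle dot is accepted inside an identifier but can never be its first or last character, which is why IDContinue is no longer needed once IDInterior and IDTerminal exist.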



