On 08/21/2012 02:50 PM, Jarkko Hietaniemi wrote:
>>
>> Thank you for this idea. I did it for Russian, and it showed the current
>> scheme had between 20-25% advantage over my proposed one, so I won't be
>> pursuing the proposal as-is.
>
> Glad it gave you some results. In the meanwhile I remembered another
> source for more Unicode text,
> but this time it is much shorter (though you can probably just
> self-concat it enough times), and at least
> in principle the same text:
>
> http://www.unicode.org/standard/WhatIsUnicode-more.html
>

In the meantime, I looked at the Unicode 6.2 properties. There are 842
that match distinct sets of code points (not including the ones that are
complements of them). (Some Unicode properties match the exact same set
of code points as others. For example, Line_Break=CR and
Grapheme_Cluster_Break=CR both match exactly a carriage return; there
are others that are non-trivial.)

Of those, 47% have just one or two elements in their inversion lists.
55% have up to four
63% have up to eight
70% have up to 16
77% have up to 32
81% have up to 64
85% have up to 128
90% have up to 256

This indicates that we shouldn't be generating swashes for most official
Unicode properties.
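
To make the element counts above concrete, here is a minimal sketch of an inversion list and its lookup. It is written in Python purely for illustration and is not the Perl core implementation; the names make_inversion_list and contains are hypothetical. It assumes the usual definition: a sorted list of code points in which each element marks where membership in the set flips, so a property matching only a carriage return needs just two elements.

```python
from bisect import bisect_right


def make_inversion_list(ranges):
    """Build an inversion list from sorted, non-overlapping, non-adjacent
    inclusive (start, end) code point ranges.

    Simplification for this sketch: a closing element is always emitted,
    even for a range that runs to the highest code point.
    """
    inv = []
    for start, end in ranges:
        inv.append(start)    # membership switches on at 'start'
        inv.append(end + 1)  # and switches off just past 'end'
    return inv


def contains(inv, cp):
    """Return True if code point 'cp' is in the set described by 'inv'."""
    # Count the boundaries at or below cp; an odd count means we are
    # inside one of the ranges.
    return bisect_right(inv, cp) % 2 == 1


# A property that matches exactly a carriage return, such as Line_Break=CR.
CR = make_inversion_list([(0x0D, 0x0D)])

# Hypothetical example set for illustration: the ASCII digits 0-9.
ASCII_DIGITS = make_inversion_list([(0x30, 0x39)])
```

With a list this short, the binary search finishes in one or two comparisons, which is the reason a separate cached lookup structure looks unnecessary for the small properties counted above.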