>> I particularly liked the concept of doing Unicode classes as DFAs on the
>> octets of the UTF-8 representations of the code points. That might be a
>> win for us.
>
> But notice that he only implements two Unicode properties, gc and sc. This
> is likely much faster than our mechanism, but I think it takes significantly
> more space.

Hand-waving away the effort involved in producing something usable, I wonder
whether a DFA on nybbles, bit pairs, or even bits instead of octets would be
useful. The common cases would compress together, and the transition tables
would be much, much smaller: a state needs only 16, 4, or 2 entries instead
of 256. Using flexible offsets, they might even be sharable. The trees would
be deeper, with more states overall, but have fewer branches at each stage.
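
To make the size trade-off concrete, here is a rough Python sketch, not
anything taken from the implementation discussed above. It builds an acyclic
DFA over the UTF-8 encoding of a set of code points, using symbols of 8, 4,
2, or 1 bits, and merges states whose transition rows are identical. All the
names (units, build_dfa, matches, table_cells) and the sample set are made up
for illustration. Sharing here is whole-row only; the "flexible offsets" idea
would go further and is not attempted.

```python
ACCEPT = -1   # sentinel: the whole code point has been matched
DEAD = None   # sentinel: no transition on this symbol


def units(cp, bits):
    """Split the UTF-8 bytes of code point cp into symbols of `bits` bits.

    bits must divide 8 (8, 4, 2 or 1). Surrogates cannot be encoded.
    """
    mask = (1 << bits) - 1
    out = []
    for byte in chr(cp).encode("utf-8"):
        for shift in range(8 - bits, -1, -bits):
            out.append((byte >> shift) & mask)
    return out


def build_dfa(code_points, bits):
    """Return (start_state, rows) for the given set of code points."""
    # Step 1: a plain trie over the symbol sequences.
    trie = {}
    for cp in code_points:
        node = trie
        for sym in units(cp, bits):
            node = node.setdefault(sym, {})

    # Step 2: number the states bottom-up, reusing any row already seen.
    rows = []     # rows[i] is the transition row of state i
    seen = {}     # row -> state number

    def intern(node):
        if not node:              # leaf: a complete code point
            return ACCEPT
        row = tuple(intern(node[s]) if s in node else DEAD
                    for s in range(1 << bits))
        if row not in seen:
            seen[row] = len(rows)
            rows.append(row)
        return seen[row]

    return intern(trie), rows


def matches(dfa, cp, bits):
    """True if cp is in the set the DFA was built from."""
    state, rows = dfa
    for sym in units(cp, bits):
        if state is DEAD or state == ACCEPT:
            return False
        state = rows[state][sym]
    return state == ACCEPT


def table_cells(dfa, bits):
    """Total number of transition-table entries across all states."""
    return len(dfa[1]) * (1 << bits)


# Toy class: ASCII digits plus the block U+0391..U+03A9.
sample = set(range(0x30, 0x3A)) | set(range(0x391, 0x3AA))
octet_dfa = build_dfa(sample, 8)
nybble_dfa = build_dfa(sample, 4)
```

On this toy set the nybble version has more states but fewer table entries
in total than the octet version, and two of its states collapse into one
because their rows happen to be identical. Whether that carries over to real
property tables, and what the extra steps per character cost, would need
measuring.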