Ilya Zakharevich writes:
: On Sun, Feb 13, 2000 at 09:54:43PM +0200, Jarkko Hietaniemi wrote:
: > Food for thought: should Perl always make its utf8 data to be in the
: > decomposed form to be canonical?  Or, the other way, should it always
: > try to find the composite form (to be more compact)?
:
: No and no.

That's correct--eventually we need to support both Normalization Form D
(canonical decomposition) and Normalization Form C (canonical composition).

: > A canonical form would make searching the data rather easier.
:
: I think there is a Consortium's document on "Levels of
: internationalization support in REx engines".  I think there are 3 or 4
: levels, and we are on the first one now.  IIRC, what you propose is
: similar to the level 2.

Y'all want to see Unicode Technical Report #15, Unicode Normalization Forms:

    http://www.unicode.org/unicode/reports/tr15/

: I would think that such things should be treated by pessimizers for
: RExen: "mutate this REx to support composition/decomposition too".

I think the pessimization (when it's more than just an assertion) would
most naturally happen in the input disciplines, since it only has to
happen once there, whereas the regular expression engine would have to
pay the price over and over.  That being said, we might choose to do it
lazily, and compose/decompose just before we execute an instruction that
requires a particular form.  Of course, that would require another bit
or two in SVs.  And I will give you that REx munging is potentially a
more general solution, since it could also be taught to handle any of
various not-so-canonical forms.

But the thing we definitely need is a way to declare a script to be a
Normalization-Form-C Zone, since most people feel that that's going to
be the preferred form for most day-to-day text storage, being more
compact, and closer to national character sets, and easier to render
correctly.  So let's make sure we can force the issue with input disciplines.
If we need to do lazy conversion or REx munging as well, then we can.
(But first we need to polymorphize the REx engine in terms of SvUTF8.)

Larry
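
To make the Form C / Form D distinction concrete, here is a small sketch of
why unnormalized data makes searching harder and how normalizing once at
input removes the problem.  It is written in Python purely for illustration
and uses the standard unicodedata module; it is not Perl's mechanism.  The
helper name normalize_on_input is hypothetical and merely stands in for
whatever an input discipline would do.

```python
import unicodedata


def normalize_on_input(text: str, form: str = "NFC") -> str:
    """Hypothetical stand-in for an input discipline: normalize text once,
    at the point where it enters the program."""
    return unicodedata.normalize(form, text)


# The same visible character in its two canonical spellings.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (Form C spelling)
decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT (Form D spelling)

# Without normalization, a search for one spelling misses the other.
raw_match = composed in ("caf" + decomposed)

# Normalizing both the data and the pattern to one form makes them comparable.
nfc = normalize_on_input("caf" + decomposed, "NFC")   # composed spelling
nfd = normalize_on_input("caf" + composed, "NFD")     # decomposed spelling
nfc_match = normalize_on_input(composed, "NFC") in nfc

print(raw_match, nfc_match, len(nfc), len(nfd))
```

The composed string is one code point shorter than the decomposed one, which
is the compactness argument for Form C.  Doing the conversion in one place
at input, rather than inside every match, is the trade-off discussed above.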