perl.perl5.porters | Postings from February 2000

Re: Unicode character composition

Larry Wall
February 13, 2000 18:17
Ilya Zakharevich writes:
: On Sun, Feb 13, 2000 at 09:54:43PM +0200, Jarkko Hietaniemi wrote:
: > Food for thought: should Perl always make its utf8 data to be in the
: > decomposed form to be canonical?  Or, the other way, should it always
: > try to find the composite form (to be more compact)? 
: No and no.

That's correct--eventually we need to support both Normalization Form D
(canonical decomposition) and Normalization Form C (canonical composition).
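
For readers who have not met the two forms, here is a minimal sketch of the difference. It is written in Python with the standard `unicodedata` module purely for illustration; nothing here is Perl's API, and the variable names are made up for the example.

```python
import unicodedata

# The same user-visible character, "e with acute", in two encodings.
composed = "\u00e9"      # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Form C composes where it can; Form D decomposes fully.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# The raw strings differ code point by code point...
raw_equal = (composed == decomposed)

# ...but agree once both are brought to the same form.
same_in_nfc = (unicodedata.normalize("NFC", composed) == nfc)
same_in_nfd = (unicodedata.normalize("NFD", decomposed) == nfd)
```

Form C gives the shorter string (one code point here instead of two), which is the compactness argument made above.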

: > A canonical form would make searching the data rather easier.
: I think there is a Consortium's document on "Levels of
: internationalization support in REx engines".  I think there are 3 or 4
: levels, and we are on the first one now.  IIRC, what you propose is
: similar to the level 2.

Y'all want to see Unicode Technical Report #15, Unicode Normalization Forms.

: I would think that such things should be treated by pessimizers for
: RExen: "mutate this REx to support composition/decomposition too".

I think the pessimization (when it's more than just an assertion) would
most naturally happen in the input disciplines, since it only has to
happen once there, and the regular expression engine would have to pay
the price over and over.
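
A rough sketch of the "pay once on input" idea, again in Python for illustration only. The `read_normalized` helper is hypothetical and merely stands in for an input discipline; it is not an existing Perl or Python facility.

```python
import io
import re
import unicodedata

def read_normalized(stream, form="NFC"):
    """Yield lines from stream, normalizing each one exactly once on input."""
    for line in stream:
        yield unicodedata.normalize(form, line)

# Two spellings of the same word: decomposed first, composed second.
raw = io.StringIO("cafe\u0301\ncaf\u00e9\n")

# The pattern is written once, in Form C, and never rewritten.
pattern = re.compile("caf\u00e9")

lines = list(read_normalized(raw))
matches = [bool(pattern.search(line)) for line in lines]

# For contrast: the same pattern against the un-normalized decomposed spelling.
unnormalized_match = bool(pattern.search("cafe\u0301"))
```

With the data normalized on the way in, an unmodified pattern finds both lines, while the raw decomposed spelling is missed. The regular expression engine itself never has to know about composition.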

That being said, we might choose to do it lazily, and compose/decompose just
before we execute an instruction that requires a particular form.  Of course,
that would require another bit or two in SVs.
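
The lazy variant might look something like the sketch below (Python, illustrative only). The `LazyText` class and its `form` attribute are assumptions invented for the example; `form` plays the role of the extra bit or two in the SV and is not a description of how Perl stores anything.

```python
import unicodedata

class LazyText:
    """A string that remembers which normalization form, if any, it is in."""

    def __init__(self, data, form=None):
        self.data = data
        self.form = form        # None means "unknown"; stands in for the flag bits
        self.conversions = 0    # counts how often we actually had to normalize

    def require(self, form):
        """Return the text in the requested form, converting only if needed."""
        if self.form != form:
            self.data = unicodedata.normalize(form, self.data)
            self.form = form
            self.conversions += 1
        return self.data

text = LazyText("e\u0301")      # form unknown on arrival
first = text.require("NFC")     # converts and records the form
second = text.require("NFC")    # flag already set, so no work is done
third = text.require("NFD")     # a different form is needed, so convert again
```

The cost is only paid when an operation actually asks for a particular form, and repeated requests for the same form are free after the first.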

And I will give you that REx munging is potentially a more general
solution, since it could also be taught to handle any of various
not-so-canonical forms.

But the thing we definitely need is a way to declare a script to be a
Normalization-Form-C Zone, since most people feel that that's going
to be the preferred form for most day-to-day text storage, being more
compact, and closer to national character sets, and easier to render.

So let's make sure we can force the issue with input disciplines.  If
we need to do lazy conversion or REx munging as well, then we can.
(But first we need to polymorphize the REx engine in terms of SvUTF8.)

Larry