
Re: The State of The Unicode

From: andrew
Date: February 19, 2001 15:07
Subject: Re: The State of The Unicode
Message ID: 20010219180714.G17705@pimlott.ne.mediaone.net

Thank you for your prompt reply--you did read the whole thing,
right?  ;-)

On Mon, Feb 19, 2001 at 04:47:53PM -0600, Jarkko Hietaniemi wrote:
> (1) The current model, both externally and internally,
>     follows what is described by the Camel Mk3.

Camel III has zero complete examples of Unicode support (unless
there are examples outside of the Unicode section, which I have not
read).  The Unicode chapter is a scant nine pages.  There is nothing
there to violate.

Ok, I lie.  There is one complete example:

    $bytelen = bytes::length("\x{ffff_ffff}");   # returns 7

It is plainly (and you seem to agree) pointless.
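
For reference, here is a minimal sketch of what that kind of call measures on a current perl. The code point U+263A and the variable names are my own choices for illustration, not taken from the Camel: length() counts characters, while bytes::length() counts the bytes of the internal UTF-8 encoding.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without switching on byte semantics

# One character, U+263A (an arbitrary code point chosen for illustration).
my $s = "\x{263A}";

my $chars = length($s);          # character count
my $bytes = bytes::length($s);   # byte count of the internal UTF-8 form

print "chars=$chars bytes=$bytes\n";   # prints "chars=1 bytes=3"
```

The two counts differ only because the internal representation happens to be UTF-8, which is exactly the detail under discussion.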

>     As the pumpkin
>     I'm somewhat obligated to abide by that, at least that's the
>     first degree approximation.  (Incidentally, the reason I think
>     the Camel is so vague was that when it was written the Unicode
>     model was being ripped to shreds to be rebuilt, in a discussion
>     not unlike the one we are having.)
> 
> (2) The basic Unicode support seems to be in a rather good shape now.
>     What I mean by "basic" is that as long as you don't start pulling your hair
>     over this very bytes vs UTF-8 vs characters issue, and just concatenate
>     strings, compare them, take their length, do regexes on them, etc, pretty
>     much everything seems to be working.

I agree that I have seen no examples of breakage as far as pure string
manipulation goes.  But the relationship between strings and numbers must
be clear.  It was fairly clear in pre-Unicode perl, but was never
(to my knowledge) made explicit.  That is why it is confusing now.

> Combine (1) and (2) and I see it as "what is broken, so what's there to
> fix" situation, let's call it (3).
> 
> As far as "what is broken", I do understand the concern of "exposing too
> much of the internal representation" (which at the moment happens to
> be UTF-8) to the user, having bytes and characters is confusing at
> best.  However, I'm not fully convinced that completely hiding it is
> wise, either.  If from Perl level one cannot reach back to the bytes
> comprising the UTF-8 representation of the characters, I feel we are
> trying to pad the cell too softly.

My kingdom for one example.

> What this recent furor seems to be over is mainly two things: the new
> qu operator, and that the methods for "creating characters" were made
> consistent over the contentious 0x80..0xff range.
> 
> If it is decided to hide everything (note my above reservation over
> *everything*) about the UTF-8-ness of the characters, fine.  But
> because of (3), I am personally reluctant to start unless someone
> presents me a list of things to hide.  It need not be an essay or
> a 500-page tech report, just a comprehensive list of things to do.

Working on it.

Andrew


