develooper Front page | perl.perl5.porters | Postings from February 2001

Re: Perl-Unicode fundamentals

Jarkko Hietaniemi
February 21, 2001 13:07
> IMO, this is a bad thing.  You do not want to document (an extremely

Ilya, I have tried to keep silent and let Nick do the talking since he
seems to be able to understand both of us, but you are being so silly
that I must stand up.  As the pumpking I guess I should stay calm and
count to ten, but I guess I have an ego.  You seem to misinterpret
everything I try to say about the matter, and based on your wild
misinterpretations proceed to shovel FUD and conspiracy theories all
over the place.  One would think that we are speaking the same
language, English, but apparently that is not the case.

I have several times asked you to supply either the list of things to do
or the list of things you see as broken to be removed.  You have failed
to supply such a list.

You keep on froth-mouthing about "transparency" without ever giving a
clear definition of it, while at the same time accusing me of fatally
breaking that transparency, without listing item by item the things
that are not "transparent".  As Nick translated to you, we now have
transparency: Unicode works.

You claim that "Jarkko needs to be convinced of that we are 99.9%
done".  Well, count me as really, really confused because that is
exactly what I thought before you started your "Unicode is broken and
Jarkko is leading a secret campaign to break it even more" mongering
a couple of days ago.  One of my main goals for 5.7, leading into 5.8,
was to fix Unicode.  Now you are telling me that I both don't understand
Unicode and that I lie.

You claim to have invented the current Unicode model.  In 5.6.0
practically everything about it was either broken or unimplemented.
You also claim to have a clear vision of both external (to the user)
and internal (to the implementor) aspects of the matter.  Somehow,
strangely, you have failed to contribute almost any related patches
or design documents or user documents (you did supply charnames) for
about a year, since 5.6.0.  No, wait, let me guess, you have a 10-line
patch that can be applied on top of 5.6.0 and which fixes all Unicode
problems and gets it as Unicode-functional as we are now?

You seem to be all panicked about qu, and how I sneakily floated it into
the language.  In case you haven't noticed, this IS A DEVELOPMENT RELEASE.
I can as easily take features away as I can introduce them.  Some
people liked it, some people don't.  You stomping your feet and
refusing to discuss the matter in sensible tones, without using words
like "disaster" and "madness" (which to me signal serious fatal flaws,
and are very rarely worth using in relation to any feature of any
programming language) almost makes me want to leave the feature in
and document it as "this feature is in the language at the explicit
request by Ilya not to have it"...

You keep on talking about locales and Unicode and EBCDIC, with some
magic "locale-think" as the solution, without any technical details
that would make any sense, not at least in the context of the current
locale implementation.  What 'use locale' currently affects are the
following things:

(1) definition of some character classes like \w are changed,
    similarly for uc() et al
(2) the collation order of strings gets changed
(3) the decimal number separator (".", ",", ..) used in output
    and input is changed, similarly date format may change (strftime)
(4) the error messages given by various libraries may be changed
    (to be in languages other than English), so "$!" may change

Why do I keep repeating that this has nothing to do with Unicode?
Because with locales we are at the mercy of what the vendor of our
current execution platform has seen fit to implement.  All we do on
the Perl side is that we have separate execution paths for non-locale
and locale things, and based on the locale pragma we use one or the
other, end of story.
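
The four effects listed above can be observed with a short script.  This
is only a sketch: locale_report() is a hypothetical helper, the locale
names in the loop are guesses that may not exist on a given platform,
and what each locale actually does is up to the vendor's C library.

```perl
use strict;
use warnings;
use POSIX qw(setlocale localeconv LC_ALL);
use Errno qw(ENOENT);

# Hypothetical helper: switch the process to the named locale and record
# what the four locale-sensitive areas look like under it.
# Returns nothing if the platform does not know the locale name.
sub locale_report {
    my ($name) = @_;
    my $got = setlocale(LC_ALL, $name);
    return unless defined $got;

    use locale;    # lexically scoped: only this sub takes the locale path
    my %r = (name => $got);
    $r{e9_is_word} = ("\xE9" =~ /\w/) ? 1 : 0;             # (1) character classes
    $r{sorted}     = [ sort { $a cmp $b } qw(b A a B) ];   # (2) collation order
    $r{decimal}    = localeconv()->{decimal_point};        # (3) decimal separator
    { local $! = ENOENT; $r{enoent_text} = "$!"; }         # (4) error messages
    return \%r;
}

# The names below are assumptions; only "C" is guaranteed to exist.
for my $name ('C', 'fr_FR.ISO8859-1', 'de_DE.UTF-8') {
    my $r = locale_report($name);
    unless ($r) {
        print "$name: not available here\n";
        next;
    }
    printf "%s: \\w matches 0xE9: %d, sort: %s, decimal: '%s', ENOENT: %s\n",
        $r->{name}, $r->{e9_is_word}, join(' ', @{ $r->{sorted} }),
        $r->{decimal}, $r->{enoent_text};
}
```

The output differs from system to system, which is rather the point:
the Perl side only chooses the code path, the platform supplies the
behaviour.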

It's worse than this: there is no standard way to ask for, say,
"French", "French, Belgian", "French, Canadian", or "Russian", or
"Russian, KOI8-R", or "Japanese, SJIS", or "Japanese, Unicode".
There is no standard way to even ask whether such locales
exist, for a very simple reason: the names of the locales, and
how the country/language/encoding is encoded in those names,
are not standardized.
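
In practice about the only thing one can do is try candidate names and
see which ones setlocale() accepts.  A minimal sketch of that probing,
where first_available_locale() is a hypothetical helper and the French
candidate names are guesses, not a list any standard guarantees:

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_ALL);

# Hypothetical helper: return the first candidate name the platform
# accepts, or undef if none is accepted.  The previous locale is
# restored afterwards because the setting is process-wide.
sub first_available_locale {
    my @candidates = @_;
    my $saved = setlocale(LC_ALL);    # query the current setting
    my $found;
    for my $name (@candidates) {
        if (defined(setlocale(LC_ALL, $name))) {
            $found = $name;
            last;
        }
    }
    setlocale(LC_ALL, $saved) if defined $saved;
    return $found;
}

# Guessed spellings of "French"; a vendor may use none of them.
my $french = first_available_locale(qw(fr_FR.UTF-8 fr_FR.ISO8859-1 fr_FR fr French));
my $msg = defined($french) ? "French locale: $french" : "no French locale found";
print "$msg\n";
```

Even a successful probe only says that the name was accepted, not what
the locale behind it does.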

Notice also that there is no correlation between a locale and a
character codeset or its encoding.  You can have "fr" or "es_AR" and
you have no way of knowing whether they are using ISO 8859-1, -15, or
Unicode, if Unicode, whether they are using UTF-8, UTF-16, or UTF-32.
Yes, sometimes you get lucky and you have names like "da_DK.ISO8859-1"
or even "da_DK.ISO8859-1@UTF8", there are some standards for building
those names, but not all vendors use them, so basically they are
opaque tags.  Heuristics like /ISO(8859|Latin)/ will help, of course,
but they are heuristics.
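
A sketch of such heuristics, assuming names that follow the common
language[_TERRITORY][.codeset][@modifier] shape.  Both helpers,
parse_locale_name() and looks_like_latin(), are hypothetical; anything
that does not fit the shape is treated as the opaque tag it is.

```perl
use strict;
use warnings;

# Hypothetical helper: split a locale name of the form
# language[_TERRITORY][.codeset][@modifier] into its parts.
# Returns an empty list for names that do not follow that shape.
sub parse_locale_name {
    my ($name) = @_;
    return unless $name =~ /\A([A-Za-z]{2,3})(?:_([A-Za-z]{2}))?(?:\.([^\@]+))?(?:\@(.+))?\z/;
    return (language => $1, territory => $2, codeset => $3, modifier => $4);
}

# Hypothetical helper: apply the /ISO(8859|Latin)/ guess to a codeset.
sub looks_like_latin {
    my ($codeset) = @_;
    my $hit = defined($codeset) && $codeset =~ /ISO(?:8859|Latin)/;
    return $hit ? 1 : 0;
}

for my $name ('fr', 'es_AR', 'da_DK.ISO8859-1', 'da_DK.ISO8859-1@UTF8', 'French') {
    my %p = parse_locale_name($name);
    if (!%p) {
        print "$name: opaque tag\n";
        next;
    }
    printf "%s: language=%s codeset=%s latin-ish=%d\n",
        $name, $p{language}, $p{codeset} // '(unknown)',
        looks_like_latin($p{codeset});
}
```

For "fr" and "es_AR" the codeset simply comes back undefined, which is
the case described above: the name tells you nothing about the encoding.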

There's a multitude of other problems related to locales: they are
per-process (as opposed to per-thread), they are per-process (as
opposed to per-data), they are known to be buggy (things like which
characters are \w), the language-country "coordinates in the locale
space" sometimes do not make sense, you can't customize them easily
("I want USA otherwise but dates in Italian style"), and so on.

That's why I keep warning us not to go there.  I have been there and
I can tell that locales are broken and non-standard, let's not mix
Unicode with the concept.  If we want to fix all that and have some
vendor independent new solution for cultural things, fine, but let's
call that something other than "locale".

$jhi++; #
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
