Re: Perl-Unicode fundamentals
From: Ilya Zakharevich
February 21, 2001 15:35
Message ID: 20010221183519.A18801@math.ohio-state.edu
On Wed, Feb 21, 2001 at 03:06:49PM -0600, Jarkko Hietaniemi wrote:
> You seem to misinterpret everything I try to say about the matter
To calm yourself down, change "misinterpret" to "misunderstand". I
explicitly asked you several times to clarify things which I find so
doubtful, to no avail.
> I have several times asked you to supply either the list of things to do
> or the list of things you see as broken to be removed. You have failed
> to supply such a list.
I supplied this list many times already. Cannot make it definite now,
just some things:
x) h2xs not producing backward-compatible code by default;
x) v-thingies which are neither numbers nor strings;
x) "support for integer operations" - at least it needs some more
thinking about; (this is the situation now - the real problem
with p5p is that it was *included*)
x) all non-transparent operations w.r.t. byte/utf8 duality in the core;
x) unneeded obfuscation of the REx engine;
x) the qu// operator, which in a transparent world is equivalent to qq//
(if I understood correctly what it does).
> You keep on froth-mouthing about "transparency" without ever giving a
> clear definition of it
Are you not froth-mouthing now? You did see my original proposal on
Unicode, right? What is the reason for all this sudden animosity?
Transparency is a very simple concept.
An operation is byte/utf8 transparent if it produces the "same"
output given the "same" arguments.
Here two strings are the "same" if they contain the same characters.
[Characters are numbers in 0..FFFFFFFFFFFFFFFFFF range, but this is
probably not that important for this discussion. Take them to be in
the 0..FFFF range if this can make things simpler.]
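The definition above can be illustrated with a small sketch. It is written in Python rather than Perl, and it is only a model: the PStr class, its is_utf8 flag and the two sample operations are hypothetical names invented for this illustration, not anything in the Perl core. Two strings count as the "same" when they decode to the same character numbers, whatever their internal encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PStr:
    """Hypothetical model of a string stored in one of two internal forms."""
    raw: bytes       # internal buffer
    is_utf8: bool    # True: buffer is UTF-8; False: one byte per character

    def chars(self):
        # Decode the internal buffer into a list of character numbers.
        text = self.raw.decode("utf-8" if self.is_utf8 else "latin-1")
        return [ord(c) for c in text]


def same(a, b):
    """Two strings are the "same" if they contain the same characters."""
    return a.chars() == b.chars()


def char_length(s):
    # Depends only on the characters: transparent.
    return len(s.chars())


def buffer_length(s):
    # Depends on the internal encoding: not transparent.
    return len(s.raw)


def is_transparent_on(op, a, b):
    """Check the definition on one pair of "same" arguments."""
    assert same(a, b)
    return op(a) == op(b)


# The same four characters, stored in the two internal forms.
as_bytes = PStr("caf\u00e9".encode("latin-1"), is_utf8=False)
as_utf8 = PStr("caf\u00e9".encode("utf-8"), is_utf8=True)
```

In this model char_length passes the check for the pair above, while buffer_length does not, because the two buffers hold 4 and 5 bytes for the same 4 characters.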
> while at the same time accusing me of fatally breaking that
> transparency, without listing item by item to things that are not
I have no idea which things are not "transparent". The Unicode stuff
is not documented (or at least I did not see any documentation), so I
have no way to figure out what works how.
What I know is that were there *plans* to make things transparent,
there would be no qu// operator (equivalent to )
> As Nick translated to you, we now have transparency: Unicode works.
If we had it now, how would you explain
<20010220163757.Q22349@chaos.wustl.edu>? Why the rush to fix things
which are not broken?
> Now you are telling me that I both don't understand Unicode and that
> I lie.
Well, this "lie" thing is completely new to me. Sorry if I said
something which led you to this impression. About understanding
Unicode: either you do not understand it, or I+Nick+Graham do not
understand it. I'm quite ready to admit that I may be confused, but
what was your answer to all my pleas to unconfuse me?
> You seem to be all panicked about qu, how I sneakily floated it to the
Yes, very much. Putting it in says volumes about how you understand
Unicode - and how "we" do not understand this understanding of yours.
> You keep on talking about locales and Unicode and EBCDIC, with some
> magic "locale-think" as the solution, without any technical details
> that would make any sense, not at least in the context of the current
> locale implementation. What 'use locale' currently affects are the
> following things:
> (1) definition of some character classes like \w are changed,
> similarly for uc() et al
This is the change of the "cultural info" which I mention.
> (2) the collation order of strings gets changed
Here you use buzzwords which I think are not relevant to the operation
of Perl. I think it is better to restate it as "the results of cmp
and related operations change to a system-defined locale-cognizant
> (3) the decimal number separator (".", ",", ..) used in output
> and input is changed, similarly date format may change (strftime)
I thought this behaviour changes from one version of Perl to another,
so it may be considered an implementation detail.
> (4) the error messages given by various libraries may be changed
> (to be in languages other than English), so "$!" may change
This is outside of the control of Perl. The string value of $! was
never defined, and may fluctuate based on other parameters as well
(compiler vendor etc).
The only effects which I want people to concentrate their attention on
are (1) and (2). After having done this, I also throw away (2). ;-)
[It is not very hard to bring (2) into the discussion later, but let
us discuss simple things first.]
Only *then*, after this great simplification, do the effects of 'use
locale' become parallel to Larry's other idea: that it may make
sense to let Perl use "cultural info" tables other than Unicode's. I note
that we already do it on EBCDIC and after 'use locale' - when working
in the 0..255 range.
My new proposal *defines* the interaction of the effects of (1) and
Unicode by defining an *extension* of the cultural info modification
done in 0..255 range to the whole 0..whatever range.
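This message does not spell out the extension rule itself, so the following is only one possible reading, sketched in Python: whatever the locale changed in 0..255 takes precedence there, and every other character falls back to the default Unicode tables. The names make_is_word and hypothetical_locale, and the choice of 0xD7 as an overridden code, are assumptions made purely for illustration.

```python
def unicode_is_word(cp):
    """Default "cultural info": a rough \\w test using Unicode properties."""
    ch = chr(cp)
    return ch.isalnum() or ch == "_"


def make_is_word(locale_overrides):
    """Build a \\w test over the whole character range.

    locale_overrides maps codes in 0..255 to the classification the
    locale gives them; all other characters use the Unicode default.
    """
    def is_word(cp):
        if cp <= 0xFF and cp in locale_overrides:
            return locale_overrides[cp]
        return unicode_is_word(cp)
    return is_word


# Hypothetical locale: pretend it classifies 0xD7 as a word character.
hypothetical_locale = {0xD7: True}
is_word = make_is_word(hypothetical_locale)
```

With this table, is_word(0xD7) follows the locale, while a character outside 0..255 such as 0x416 is classified by the Unicode default.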
Sorry that I bombastized what 'use locale' does, but the "other 2"
effects of 'use locale' are not (?) well defined, so I instinctively
omitted them... ;-)
> Why I keep repeating that this has nothing to do with Unicode or
> EBCDIC is that WE HAVE NO CONTROL OVER WHAT ACTUALLY CHANGES.
Who cares? All we need is to *deduce* which changes happened in the
0..255 range. And we already do it.
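A minimal Python sketch of such a deduction, assuming the system classification can be probed one code at a time: compare the probe against a baseline over 0..255 and keep only the differences. Both classifier functions here are hypothetical stand-ins; a real probe would call the C library's character-class routines under the active locale.

```python
def baseline_is_word(cp):
    """Baseline classification, used when no locale is in effect."""
    ch = chr(cp)
    return ch.isalnum() or ch == "_"


def hypothetical_locale_is_word(cp):
    """Stand-in for probing the system under some locale (assumed)."""
    if cp == 0xD7:
        return True
    return baseline_is_word(cp)


def deduce_changes(probe, baseline):
    """Return {code: new_value} for every code in 0..255 that changed."""
    return {cp: probe(cp) for cp in range(256) if probe(cp) != baseline(cp)}


changes = deduce_changes(hypothetical_locale_is_word, baseline_is_word)
```

Nothing here needs to know which codeset or encoding the locale uses; only the observed differences in 0..255 are recorded.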
> Notice also that there is no correlation between a locale and a
> character codeset or its encoding. You can have "fr" or "es_AR" and
> you have no way of knowing whether they are using ISO 8859-1, -15, or
> Unicode, if Unicode, whether they are using UTF-8, UTF-16, or UTF-32.
My proposal on the behaviour of the "cultural info" table does not
need this information. This information is vital in *other* aspects
of Perl operations, like i/o filters, but it is needed for these
operations without the proposal as well. So I see no reason to bring
this murky topic into the discussion.