From: Aristotle Pagaltzis
September 1, 2010 00:04
Message ID: 20100901070351.GI429@klangraum
* karl williamson <firstname.lastname@example.org> [2010-08-31 15:30]:
> karl williamson wrote:
> >Aristotle Pagaltzis wrote:
> >>* karl williamson <email@example.com> [2010-07-17 19:55]:
> >>>People have made this assertion before, that Perl is always
> >>>Unicode characters. I don't know where it comes from;
> >>>I don't understand how one could claim that. If it were
> >>>true, then feature unicode_strings would be a no-op
> >> $a = $b = chr(0xE9);
> >> utf8::upgrade($b);
> >> print $a eq $b;
> >> __END__
> >> 1
> >>That’s where.
> >>The intent AIUI was always that this equality should extend
> >>to all behaviours.
> >If you were to run that on an EBCDIC platform it would not
> >print 1.
> Actually, I'm wrong. It would print 1 (but it would not be the
> same character as on a POSIX platform). And intent does not
> necessarily mean fact. And the fact is that that equality is
> missing from a number of behaviors
The string model as implemented is simply schizophrenic. It is
impossible to write Perl code that behaves entirely consistently
under all inputs given the current semantics.
(An application author has a fighting chance to do that, though
not without Herculean effort, because s/he can control all
interfaces. A module author, however, is just plain out of luck.)
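
To make the inconsistency concrete, here is a minimal sketch. It assumes an ASCII-based platform and a perl of 5.12 or later (so that the `unicode_strings` feature exists), and it should be run as a script rather than via `perl -E`, so that the feature stays off outside the inner block. The variable names are illustrative only. It takes two strings that compare equal, as in the quoted example, and uppercases each:

```perl
use strict;
use warnings;

# Two strings holding the same single character, U+00E9.
my $native   = chr(0xE9);
my $upgraded = chr(0xE9);
utf8::upgrade($upgraded);    # changes only the internal representation

# The strings compare equal, as in the quoted example.
print "eq: ", ( $native eq $upgraded ? "same" : "different" ), "\n";

# Outside the scope of `unicode_strings` (and of locale), uc() applies
# ASCII-only rules to the non-upgraded string and Unicode rules to the
# upgraded one. Assuming an ASCII platform, the first result stays
# U+00E9 and the second becomes U+00C9.
printf "uc(native):   U+%04X\n", ord( uc $native );
printf "uc(upgraded): U+%04X\n", ord( uc $upgraded );

{
    # With the feature in scope, both strings get Unicode rules.
    use feature 'unicode_strings';
    printf "with feature, uc(native):   U+%04X\n", ord( uc $native );
    printf "with feature, uc(upgraded): U+%04X\n", ord( uc $upgraded );
}
```

The two strings are equal, yet `uc` treats them differently unless the feature is in scope; that is the sort of deviation at issue here.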
> with no apparent urgency to fix it, since it's been that way
> now for many years.
I don’t think the reason is a lack of urgency. I think the reason
is a lack of understanding on many sides. Most of the people who
worked on perl’s implementation of the Perl string model (and on
other bits of perl that are relevant to it) apparently did not
really understand that string model in the abstract (and likely,
in many cases, those working on it now still don’t).
That is why we now have the abominations known as the `bytes` and
`encoding` pragmata, why Encode has `_utf8_*`, etc.
That situation, I think, has not substantially improved.
You are a rare exception to that rule.
(The UTF8 flag should have been completely transparent for the
purposes of user code, because Perl only has one string type, not
two. But perl has two string implementations. And users *do* have
need for a second string type! Perl does not provide for that.
Predictably, model and implementation got (and gets) confused, so
people seize on the attractive nuisance known as the UTF8 flag in
the misguided belief that it will do what they’re looking for.)
Things like `bytes`, `encoding` etc. resulted – well intentioned, but
they only served to make things that much worse.
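
As a small illustration of how these tools expose the implementation rather than the model, here is a hedged sketch, again assuming an ASCII-based platform and with illustrative variable names. It uses only `utf8::is_utf8` and `bytes::length`, both of which ship with perl:

```perl
use strict;
use warnings;
use bytes ();    # make bytes::length available without enabling the pragma

my $plain = chr(0xE9);
my $wide  = chr(0xE9);
utf8::upgrade($wide);    # same string, different internal storage

# As far as the string model is concerned, these are the same string.
print "equal\n" if $plain eq $wide;
print "length: ", length($plain), " vs ", length($wide), "\n";

# These calls report on the internal storage instead, so they
# distinguish two strings that the language considers identical.
print "UTF8 flag: ",
    ( utf8::is_utf8($plain) ? 1 : 0 ), " vs ",
    ( utf8::is_utf8($wide)  ? 1 : 0 ), "\n";
print "bytes::length: ",
    bytes::length($plain), " vs ", bytes::length($wide), "\n";
```

Code that branches on either of the last two results behaves differently for inputs that are, by the model, the same string.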
By now we have a clusterfuck on our hands.
(Let me state explicitly here that I am extremely grateful for
your taking on the task and for the work you have done.)
> But, I guess you're saying that the intent is what matters
> here, so saying they are Unicode characters in the
> documentation is correct, even though the effort falls short.
Long term, it is my opinion that all deviations from the
consistent and coherent model (using Unicode semantics under
all circumstances) should be treated as bugs.
(The only headache we can’t foreseeably discourage at some point
in the future (because there’s no real alternative at this point
in time) is locales.)
And I think users should already be encouraged to think in those
terms, rather than taught the current schizophrenia. When that
doesn’t work for them, the fix should be in perl, not in the
(Although it may well be necessary to cover the deviations for
the benefit of maintainers of legacy code. Along the same lines,
legacy support may force the introduction of more pragmata along
the way. That’s OK, I think allowing correctness by default in
user code is still a worthy goal.)
With regrets over ranting again, yours truly,
Aristotle Pagaltzis // <http://plasmasturm.org/>