RE: unicode support and perl
From: Moore, Paul
September 14, 2000 01:49
Message ID: 714DFA46B9BBD0119CD000805FC1F53B012A82AC@UKRUX002.rundc.uk.origin-it.com
From: Marc Lehmann [mailto:email@example.com]
> While the discussion about Encode seems right to me, I'd like
> to diagnose the following problems:
[I'm now going to ignore most of the posting, and just pick up on the
"internal representation" bits...]
From my (uninformed) reading of the various Unicode discussions, it seems to
me that there is a confusion over what a "Perl string" is supposed to be. To
the best of my knowledge, the intention is that a string in Perl is simply a
sequence of characters, where ord() of each character is *not* limited to
0..255. The internal representation is irrelevant, except to low-level
"guts" type code.
In practice, there are two main issues. The first relates to the "low-level
guts" comment above, and is simply that the internal representation exposes
itself at too high a level to be entirely comfortable - specifically, at the
XS level, where most C functions do *not* expect UTF-8, so that XS interface
code has to deal with representation checking and conversion. It could be
argued that this issue is a failing in XS, which needs to be updated
(possibly just with new typemaps?) to make the representation changes
transparent. [Note - I'm not implying that this is necessarily easy...]
The second issue is that there is a significant body of code which treats a
Perl string as a "byte buffer" - sometimes inadvertently. The obvious case
is where the address of an SV's string area (you know what I mean...) is
passed directly to a C-level function for the function to use or fill in.
Another case is where the Perl code receives a byte stream, say from a file
or socket - in this case, it is even possible (Graham's LDAP case) for
*part* of that byte stream to be a sequence of bytes encoding characters -
in UTF-8, Latin-1, Shift-JIS, or anything at all.
Um. I was just about to say that encoding is not an issue, as at this level
a "byte stream" is simply another encoding like UTF-8 (it's just limited to
chars < 256, with strings containing chars >= 256 being "not well formed").
But there's also the (orthogonal) character set issue - Latin-1 vs Shift-JIS
vs Unicode, vs EBCDIC, etc. It's orthogonal as (for example) Latin-1 can be
encoded in raw bytes, or UTF-8 (and the bit patterns differ!!!). There is an
issue in that raw bytes are too limited to encode all target character sets
(notably Unicode) - but there's also an issue in that converting Shift-JIS
to Latin-1 probably has a *lot* of unconvertible code points.
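
To make the "bit patterns differ" point concrete, here is a minimal sketch. It is written in Python only because its bytes/str split shows the distinction compactly; nothing in it is Perl API, and the sample strings are chosen purely for illustration.

```python
# The same four characters, stored under two different encodings.
text = "caf\u00e9"               # last character is U+00E9 (e with acute)

raw = text.encode("latin-1")     # raw bytes: one byte per character
utf8 = text.encode("utf-8")      # UTF-8: U+00E9 takes two bytes

print(raw.hex())                 # 636166e9
print(utf8.hex())                # 636166c3a9

# Different bit patterns, identical characters once decoded correctly.
assert raw.decode("latin-1") == utf8.decode("utf-8") == text

# The character-set issue is separate: a Shift-JIS character may have
# no Latin-1 equivalent at all, whatever the encoding machinery does.
kana = b"\x82\xa0".decode("shift_jis")   # HIRAGANA LETTER A, U+3042
try:
    kana.encode("latin-1")
    convertible = True
except UnicodeEncodeError:
    convertible = False

print(convertible)               # False
```

The first half is the encoding question (how characters become bytes); the second half is the character-set question (whether the target set contains the character at all).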
OK, so encoding issues are hard. But if we work entirely with "perl strings"
of characters (possibly >255) it's a user-level issue.
This effectively leaves us with a single internals problem, in two parts:
1. We need a pair of API calls, which say
   a) Convert this block of bytes into a Perl string, stored here
   b) Convert this Perl string into a block of bytes here.
   In-place versions would probably be convenient, too (although a Perl
   SV pointing to an unconverted block of bytes should NEVER be allowed
   to escape "into the wild"). Also, error handling needs to be well
   defined for case (b) (in case the string contains characters >255).
2. Code needs to use these APIs exclusively.
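
A rough sketch of what the pair of calls in (1) might look like, again in Python rather than Perl or C. The names `bytes_to_string` and `string_to_bytes` are hypothetical, and "raise an error" is just one possible policy for case (b); the point is only that the policy is defined.

```python
def bytes_to_string(block: bytes) -> str:
    """(a) Each byte 0..255 becomes the character with that ordinal."""
    # Latin-1 maps byte values 0..255 straight onto code points 0..255.
    return block.decode("latin-1")


def string_to_bytes(text: str) -> bytes:
    """(b) Inverse of (a); fails loudly if any character is > 255."""
    for index, ch in enumerate(text):
        if ord(ch) > 255:
            # Assumed policy: refuse rather than silently truncate.
            raise ValueError(
                f"character U+{ord(ch):04X} at index {index} "
                "does not fit in one byte"
            )
    return text.encode("latin-1")


# Every possible byte value survives a round trip unchanged.
every_byte = bytes(range(256))
assert string_to_bytes(bytes_to_string(every_byte)) == every_byte
```

Whether (b) should raise, substitute, or return a status is exactly the error-handling question raised above; the sketch picks one answer only to have something runnable.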
Of course, (2) is the problem. But this is a bit like the threading issue of
which modules are thread-safe. Ultimately, we can't influence the
Unicode-safety of modules; all we can do is build (and document) APIs which
allow people to write Unicode-safe modules, and then hope... (It helps if we
make using the safe APIs easier than not doing so, of course :-)
BTW, the pair of API calls in (1) will be heavily used, and so should be
optimised as much as possible. Having a UTF-8 vs Byte flag in Perl's SVs is
one way of doing this. To some extent, this is where we are already, but we
seem to be too aware of this entirely internal optimisation at the moment.
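
As a toy model of the flag idea (not Perl's actual SV layout; `ToyString` and its fields are made up for illustration), the flag only decides how the buffer is read, and lets the bytes conversion skip all work in the common case:

```python
from dataclasses import dataclass


@dataclass
class ToyString:
    buf: bytes
    is_utf8: bool  # False: one byte per character; True: buf holds UTF-8

    def chars(self) -> str:
        """The character sequence, independent of how it is stored."""
        return self.buf.decode("utf-8" if self.is_utf8 else "latin-1")

    def to_bytes(self) -> bytes:
        """Block of bytes; raises UnicodeEncodeError if a char is > 255."""
        if not self.is_utf8:
            return self.buf                    # fast path: nothing to convert
        return self.chars().encode("latin-1")  # slow path: re-encode


as_bytes = ToyString(b"\xe9", is_utf8=False)
as_utf8 = ToyString(b"\xc3\xa9", is_utf8=True)

# Same string as far as user-level code is concerned.
assert as_bytes.chars() == as_utf8.chars()
assert as_bytes.to_bytes() == as_utf8.to_bytes() == b"\xe9"
```

Code that only ever calls `chars()` and `to_bytes()` never needs to know which representation it was handed, which is the sense in which the flag is an internal optimisation.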
This is just my uninformed opinion (I have no experience whatsoever with
i18n issues or Unicode). I'm offering it as an "innocent bystander's" view -
a couple of recent comments have made me think that maybe my idiot-level
understanding isn't too far off base, and we may have a case of people being
too close to the problem. If so, and my comments help, I'm glad. Otherwise
feel free to ignore this posting totally.
Thanks for your time,