Alternate Proposal for Encoding API
From: Jost Krieger
Date: September 18, 2000 03:20
Subject: Alternate Proposal for Encoding API
Message ID: 20000918122007.D10899@ruhr-uni-bochum.de
After reading a lot of messages about UTF8 and Encoding, I think
that the currently proposed API
1. is still way too low-level
2. has function names that can still be misunderstood.
In my understanding, Perl strings now have the following "user-visible" properties:
1. Perl strings are used as byte buffers. This will stay so (in some way or other).
2. Perl strings are used for character handling, characters being encoded in an
unspecified character set that is an 8-bit superset of ASCII. Some people
think it is ISO-8859-1, but perl doesn't care. This use is indistinguishable
from 1. All this is unfortunate, but not changeable.
3. Nowadays, Perl has support for strings in a "large enough" character set
with an internal encoding that makes chars with ords up to 2**31-1 possible.
For a naive user, it is hardly important that we're talking about Unicode (which
will hopefully stay so), and still less important that the encoding is UTF8.
4. For historical reasons, perl has problems doing IO and other stuff with
these new strings.
5. "Byte strings" (and, less commonly, "wide strings") can obviously contain
data in any of a large array of character sets and encodings, a fact that only
the user knows.
6. All these "strings", except pure byte strings, can potentially be invalid in
various ways (encoding errors or invalid code points being the main cases).
On the other hand, people have data in various encodings; some data streams
even mix character encodings, or binary and character data
(we've seen LDAP, and SNMP comes to mind).
Most of you obviously know all this. Now to my minimal proposal:
Just provide two function constructors:
from_encoding_function($encoding, $invalid_handling)
to_encoding_function($encoding, $invalid_handling)
These constructors would return CODEREFS that are used to do the real
conversions.
The from_encoding functions would always return "wide strings", the to_encoding functions
would typically return byte buffers with the fitting interpretation, although there
is no reason not to handle encodings that demand 16-bit code points without encoding.
None of the functions should ignore any internal flags on the source, though.
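To make the shape of this concrete, here is a minimal sketch of the two constructors. It leans on the Encode module as it exists in later Perl releases, which is an assumption and not part of this proposal. The constants INVALID_SUBSTITUTE and INVALID_DIE and the helper _check_for are hypothetical names, since the values of $invalid_handling are left open above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical values for $invalid_handling; the proposal does not define any.
use constant {
    INVALID_SUBSTITUTE => 0,   # replace invalid input with a substitution character
    INVALID_DIE        => 1,   # die on the first invalid sequence
};

# Map the hypothetical policy onto Encode's CHECK argument.
sub _check_for {
    my ($invalid_handling) = @_;
    return $invalid_handling == INVALID_DIE
        ? Encode::FB_CROAK()
        : Encode::FB_DEFAULT();
}

# Returns a CODEREF that turns a byte buffer into a "wide string".
sub from_encoding_function {
    my ($encoding, $invalid_handling) = @_;
    my $codec = Encode::find_encoding($encoding)
        or die "Unknown encoding: $encoding\n";
    my $check = _check_for($invalid_handling);
    return sub {
        my ($buffer) = @_;   # a copy: Encode may modify its argument when CHECK is set
        return $codec->decode($buffer, $check);
    };
}

# Returns a CODEREF that turns a "wide string" into a byte buffer.
sub to_encoding_function {
    my ($encoding, $invalid_handling) = @_;
    my $codec = Encode::find_encoding($encoding)
        or die "Unknown encoding: $encoding\n";
    my $check = _check_for($invalid_handling);
    return sub {
        my ($string) = @_;   # a copy, for the same reason as above
        return $codec->encode($string, $check);
    };
}

# Usage: build the converters once, then call them as often as needed.
my $from_latin1 = from_encoding_function('iso-8859-1', INVALID_DIE);
my $to_utf8     = to_encoding_function('UTF-8', INVALID_DIE);
my $wide  = $from_latin1->("caf\xE9");   # four characters
my $bytes = $to_utf8->($wide);           # "caf\xC3\xA9", five bytes
```

The encoding lookup and the policy decision happen once, in the constructor; the returned closure only does the conversion.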
If you think this interface is too cumbersome to work with for simple uses,
extend it with the obvious
from_encoding($buffer, $encoding, $invalid_handling)
to_encoding($string, $encoding, $invalid_handling)
but I wouldn't recommend doing that.
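For comparison, the one-shot variants might look like the sketch below. It again assumes the later Encode module, and it assumes that $invalid_handling is simply an Encode CHECK value such as Encode::FB_CROAK; neither assumption comes from the proposal itself.

```perl
use strict;
use warnings;
use Encode ();

# Assumption: $invalid_handling is an Encode CHECK value (e.g. Encode::FB_CROAK).
sub from_encoding {
    my ($buffer, $encoding, $invalid_handling) = @_;
    my $codec = Encode::find_encoding($encoding)
        or die "Unknown encoding: $encoding\n";
    return $codec->decode($buffer, $invalid_handling // Encode::FB_DEFAULT());
}

sub to_encoding {
    my ($string, $encoding, $invalid_handling) = @_;
    my $codec = Encode::find_encoding($encoding)
        or die "Unknown encoding: $encoding\n";
    return $codec->encode($string, $invalid_handling // Encode::FB_DEFAULT());
}
```

In this sketch the encoding is looked up on every call, whereas a constructor does that work once.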
The function constructors could return versions as optimized as you like;
for example, from_encoding_function('utf8', NO_CHECKING) could be a plain byte copy,
provided the input string is not already a "wide string", of course.
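As an illustration of such a specialised version, the closure a constructor hands back for ('utf8', NO_CHECKING) might look roughly like this. The name fast_utf8_decoder is hypothetical, and passing an already "wide" string through unchanged is an assumption. Note also that the built-in utf8::decode still checks well-formedness and leaves malformed input untouched, so this is close to, but not literally, an unchecked copy.

```perl
use strict;
use warnings;

# Hypothetical helper: the specialised converter for UTF-8 input.
sub fast_utf8_decoder {
    return sub {
        my ($buffer) = @_;                          # copy of the caller's data
        return $buffer if utf8::is_utf8($buffer);   # already "wide": pass through (assumption)
        utf8::decode($buffer);                      # reinterpret the bytes in place as UTF-8
        return $buffer;
    };
}

my $decode_utf8 = fast_utf8_decoder();
my $smiley = $decode_utf8->("\xE2\x98\xBA");   # one character, U+263A
```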
Does this make any sense?
Jost
--
| Jost.Krieger@ruhr-uni-bochum.de Please help stamp out spam! |
| Postmaster, JAPH, resident answer machine am RZ der RUB |
| Pluralitas non est ponenda sine necessitate |
| William of Ockham (1285-1347/49) |