perl.perl5.porters | Postings from September 2000

Alternate Proposal for Encoding API

Jost Krieger
September 18, 2000 03:20
After reading a lot of messages about UTF8 and Encoding, I think
that the currently proposed API

1. is still way too low-level
2. also has function names that still can be misunderstood.

In my understanding, Perl strings now have the following "user-visible" properties:

1. Perl strings are used as byte buffers. This will stay so (in some way or other).
2. Perl strings are used for character handling, characters being encoded in an
   unspecified character set that is an 8-bit super set of ASCII. Some people
   think it is ISO-8859-1, but perl doesn't care. This is indistinguishable
   from 1. All this is unfortunate, but not changeable.
3. Nowadays, Perl has support for strings in a "large enough" character set
   with an internal encoding that makes chars with ords up to 2**31-1 possible.
   For a naive user, it is hardly important we're talking about Unicode (which
   will hopefully stay so), and still less important the encoding is UTF8.
4. For historical reasons, perl has problems doing IO and other stuff with
   these new strings.
5. "Byte strings" (and, less commonly, "wide strings") can obviously contain
   data in any of a large array of character sets and encodings where only the
   user knows that fact.
6. All these "strings", except pure byte strings, can potentially be invalid in
   various ways (encoding errors or invalid code points being the main cases).

On the other hand, people have data in various encodings, some data streams
are even mixed between character encodings or between binary and character
encodings (we've seen LDAP, and SNMP comes to mind).

Most of you obviously know all this. Now to my minimal proposal:

Just provide two function constructors:

from_encoding_function($encoding, $invalid_handling)
to_encoding_function($encoding, $invalid_handling)

These constructors would return CODEREFS that are used to do the real
work.

The from_encoding functions would always return "wide strings", the to_encoding functions
would typically return byte buffers with the fitting interpretation, although there
is no reason not to handle encodings that demand 16-bit code points without encoding.

None of the functions should ignore any internal flags on the source, though.
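
As a rough illustration of the two constructors, here is a minimal sketch
built on the Encode module that ships with later Perls. The proposal does not
define the values of $invalid_handling, so the NO_CHECKING and STRICT
constants and their mapping onto Encode's CHECK argument are assumptions made
for this example, as is the helper _check_for.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical policy values: the proposal names NO_CHECKING but defines nothing.
use constant {
    NO_CHECKING => 0,    # assumed: malformed input gets a substitution character
    STRICT      => 1,    # assumed: malformed input is a fatal error
};

# Map the assumed policy onto Encode's CHECK argument.
sub _check_for {
    my ($invalid_handling) = @_;
    return $invalid_handling == STRICT
        ? Encode::FB_CROAK()
        : Encode::FB_DEFAULT();
}

# Returns a code ref that turns a byte buffer in $encoding into a wide string.
sub from_encoding_function {
    my ($encoding, $invalid_handling) = @_;
    my $enc = Encode::find_encoding($encoding)
        or die "Unknown encoding '$encoding'\n";
    my $check = _check_for($invalid_handling);
    return sub {
        my ($buffer) = @_;    # a copy: decode() may modify its argument when CHECK is set
        return $enc->decode($buffer, $check);
    };
}

# Returns a code ref that turns a wide string into a byte buffer in $encoding.
sub to_encoding_function {
    my ($encoding, $invalid_handling) = @_;
    my $enc = Encode::find_encoding($encoding)
        or die "Unknown encoding '$encoding'\n";
    my $check = _check_for($invalid_handling);
    return sub {
        my ($string) = @_;    # a copy, for the same reason as above
        return $enc->encode($string, $check);
    };
}

# Build the converters once, then reuse them.
my $from_latin1 = from_encoding_function('iso-8859-1', STRICT);
my $to_utf8     = to_encoding_function('UTF-8', STRICT);

my $wide  = $from_latin1->("caf\xE9");    # 4 characters, the last is U+00E9
my $bytes = $to_utf8->($wide);            # 5 bytes, ending in \xC3\xA9
```

The encoding lookup and the policy decision happen once, in the constructor;
the returned code ref only does the conversion.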

If you think this interface is too cumbersome to work with for simple uses,
extend it with the obvious

from_encoding($buffer, $encoding, $invalid_handling)
to_encoding($string, $encoding, $invalid_handling)

but I wouldn't recommend doing that.
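
For comparison, the one-shot variants could be layered on top of the
constructors roughly as follows. This is only a sketch: the helper
_make_converter, the %cache hash, and the NO_CHECKING/STRICT values are
assumptions of this example, and Encode stands in for whatever would do the
real conversion.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical policy values, not defined by the proposal.
use constant { NO_CHECKING => 0, STRICT => 1 };

# Shared builder: $method is 'decode' or 'encode'.
sub _make_converter {
    my ($method, $encoding, $invalid_handling) = @_;
    my $enc = Encode::find_encoding($encoding)
        or die "Unknown encoding '$encoding'\n";
    my $check = $invalid_handling == STRICT
        ? Encode::FB_CROAK()
        : Encode::FB_DEFAULT();
    return sub {
        my ($data) = @_;    # a copy: Encode may modify its argument when CHECK is set
        return $enc->$method($data, $check);
    };
}

sub from_encoding_function { return _make_converter('decode', @_) }
sub to_encoding_function   { return _make_converter('encode', @_) }

# Cache of constructed converters, keyed by direction, encoding and policy.
my %cache;

sub from_encoding {
    my ($buffer, $encoding, $invalid_handling) = @_;
    my $f = $cache{"from\0$encoding\0$invalid_handling"}
        ||= from_encoding_function($encoding, $invalid_handling);
    return $f->($buffer);
}

sub to_encoding {
    my ($string, $encoding, $invalid_handling) = @_;
    my $f = $cache{"to\0$encoding\0$invalid_handling"}
        ||= to_encoding_function($encoding, $invalid_handling);
    return $f->($string);
}

my $wide  = from_encoding("caf\xE9", 'iso-8859-1', STRICT);
my $bytes = to_encoding($wide, 'UTF-8', STRICT);
```

Even with the cache, every call pays for a key construction and a hash lookup
that the constructor form avoids.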

The function constructors could return versions as optimized as you like:
from_encoding_function('utf8', NO_CHECKING) could be a plain byte copy,
provided the input string is not already a "wide string", of course.
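
One way to picture this specialization: the constructor inspects its
arguments once and hands back a different closure for the cheap case. The
sketch below is an approximation under several assumptions. The NO_CHECKING
and STRICT values are invented for the example, utf8::is_utf8 is used as a
stand-in test for "already a wide string", and the fast path uses
utf8::decode, which still checks well-formedness and so is not the truly
unchecked copy described above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical policy values, not defined by the proposal.
use constant { NO_CHECKING => 0, STRICT => 1 };

sub from_encoding_function {
    my ($encoding, $invalid_handling) = @_;
    my $enc = Encode::find_encoding($encoding)
        or die "Unknown encoding '$encoding'\n";

    # Fast path, selected once at construction time.
    if ($enc->name eq 'utf8' && $invalid_handling == NO_CHECKING) {
        return sub {
            my ($buffer) = @_;                            # the copy
            return $buffer if utf8::is_utf8($buffer);     # assumed test for "already wide"
            utf8::decode($buffer);    # in place; leaves the bytes untouched if malformed
            return $buffer;
        };
    }

    # General path: let Encode do the conversion and the checking.
    my $check = $invalid_handling == STRICT
        ? Encode::FB_CROAK()
        : Encode::FB_DEFAULT();
    return sub {
        my ($buffer) = @_;
        return $enc->decode($buffer, $check);
    };
}

my $fast = from_encoding_function('utf8', NO_CHECKING);
my $wide = $fast->("caf\xC3\xA9");    # 4 characters
my $same = $fast->("\x{263A}");       # already wide, returned unchanged
```

The caller sees the same code-ref interface either way, so the choice of
fast or general path stays an implementation detail of the constructor.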

Does this make any sense?

|      Please help stamp out spam! |
| Postmaster, JAPH, resident answer machine          am RZ der RUB |
| Pluralitas non est ponenda sine necessitate                      |
|                                 William of Ockham (1285-1347/49) |
