Encode, my final take for a while
From:
Jarkko Hietaniemi
Date:
September 13, 2000 07:26
Subject:
Encode, my final take for a while
Message ID:
20000913092558.B16519@chaos.wustl.edu
I'm personally running out of time in churning out these API proposals
(my vacation is coming up, tra-la-la). I'm personally also very much
of the opinion that it's time to get *something* like this into the
core. I'm not committing myself (or my trusty deputy Nick) to having
it in 5.7.1, but as an incentive I have now checked a skeleton for the
Encode extension into the source code repository so that it will haunt
us until we do something about it.
[ --- cut here --- ]
=pod
=head1 NAME
Encode - character encodings
=head2 TERMINOLOGY
=over
=item *
I<char>: a character in the range 0..maxint (at least 2**32-1)
=item *
I<byte>: a character in the range 0..255
=back
The marker [INTERNAL] marks Internal Implementation Details, in
general meant only for those who think they know what they are doing,
and such details may change in future releases.
=head2 bytes
=over 4
=item *
bytes_to_utf8(STRING [, FROM])
The bytes in STRING are recoded in-place into UTF-8. If no FROM is
specified the bytes are expected to be encoded in US-ASCII or ISO
8859-1 (Latin 1). Returns the new size of STRING, or C<undef> if
there's a failure.
[INTERNAL] Also the UTF-8 flag of STRING is turned on.
=item *
utf8_to_bytes(STRING [, TO [, CHECK]])
The UTF-8 in STRING is decoded in-place into bytes. If no TO encoding
is specified the bytes are expected to be encoded in US-ASCII or ISO
8859-1 (Latin 1). Returns the new size of STRING, or C<undef> if
there's a failure.
What if there are characters > 255? What if the UTF-8 in STRING is
malformed? See L</"Handling Malformed Data">.
[INTERNAL] The UTF-8 flag of STRING is not checked.
=back
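The in-place convention and the "new size or undef" return value described above can be sketched in plain Perl. This is only an illustration under assumptions: the names C<sketch_bytes_to_utf8> and C<sketch_utf8_to_bytes> are hypothetical, only the default ISO 8859-1 case (no FROM or TO) is covered, no CHECK argument is handled, and leaving STRING untouched on failure is a choice of this sketch, not something the proposal specifies.

```perl
use strict;
use warnings;

# Hypothetical stand-in for bytes_to_utf8(STRING) without FROM: every byte is
# taken as an ISO 8859-1 code point and STRING is rewritten as UTF-8 octets.
sub sketch_bytes_to_utf8 {
    return undef unless defined $_[0];
    my $out = '';
    for my $byte (unpack 'C*', $_[0]) {
        $out .= $byte < 0x80
            ? chr($byte)
            : chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
    }
    $_[0] = $out;            # @_ aliases the caller's variable: "in place"
    return length $out;      # new size in bytes
}

# Hypothetical stand-in for utf8_to_bytes(STRING) without TO or CHECK: only
# code points 0..255 fit, so anything else is reported as a failure.
sub sketch_utf8_to_bytes {
    return undef unless defined $_[0];
    my ($in, $out) = ($_[0], '');
    while (length $in) {
        if ($in =~ s/\A([\x00-\x7F])//) {
            $out .= $1;
        }
        elsif ($in =~ s/\A([\xC2\xC3])([\x80-\xBF])//) {
            $out .= chr(((ord($1) & 0x1F) << 6) | (ord($2) & 0x3F));
        }
        else {
            return undef;    # malformed UTF-8, or a character > 255
        }
    }
    $_[0] = $out;
    return length $out;
}

my $text = "caf\xE9";                        # 4 bytes of Latin 1
my $size = sketch_bytes_to_utf8($text);      # $text is now "caf\xC3\xA9"
```

Modifying C<$_[0]> rather than a copy is what gives the caller the in-place behaviour; the sketch does not touch any internal UTF-8 flag.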
=head2 chars
=over 4
=item *
chars_to_utf8(STRING)
The chars in STRING are encoded in-place into UTF-8. Returns the new
size of STRING, or C<undef> if there's a failure.
No assumptions are made on the encoding of the chars. If you want to
assume that the chars are Unicode and to trap illegal Unicode
characters, you must use C<from_to('Unicode', ...)>.
[INTERNAL] Also the UTF-8 flag of STRING is turned on.
=item *
utf8_to_chars(STRING)
The UTF-8 in STRING is decoded in-place into chars. Returns the new
size of STRING, or C<undef> if there's a failure.
If the UTF-8 in STRING is malformed C<undef> is returned, and also an
optional lexical warning (category utf8) is given.
[INTERNAL] The UTF-8 flag of STRING is not checked.
=item *
utf8_to_chars_check(STRING [, CHECK])
(Note the special naming of this interface, since a two-argument
utf8_to_chars() has different semantics.)
The UTF-8 in STRING is decoded in-place into chars. Returns the new
size of STRING, or C<undef> if there is a failure.
What if the UTF-8 in STRING is malformed? See L</"Handling Malformed Data">.
[INTERNAL] The UTF-8 flag of STRING is not checked.
=back
=head2 chars With Encoding
=over 4
=item *
chars_to_utf8(STRING, FROM [, CHECK])
The chars in STRING encoded in FROM are recoded in-place into UTF-8.
Returns the new size of STRING, or C<undef> if there's a failure.
No assumptions are made on the encoding of the chars. If you want to
assume that the chars are Unicode and to trap illegal Unicode
characters, you must use C<from_to('Unicode', ...)>.
[INTERNAL] Also the UTF-8 flag of STRING is turned on.
=item *
utf8_to_chars(STRING, TO [, CHECK])
The UTF-8 in STRING is decoded in-place into chars encoded in TO.
Returns the new size of STRING, or C<undef> if there's a failure.
What if the UTF-8 in STRING is malformed? See L</"Handling Malformed Data">.
[INTERNAL] The UTF-8 flag of STRING is not checked.
=item *
bytes_to_chars(STRING, FROM [, CHECK])
The bytes in STRING encoded in FROM are recoded in-place into chars.
Returns the new size of STRING in bytes, or C<undef> if there's a
failure.
What if the mapping is impossible? See L</"Handling Malformed Data">.
=item *
chars_to_bytes(STRING, TO [, CHECK])
The chars in STRING are recoded in-place to bytes encoded in TO.
Returns the new size of STRING in bytes, or C<undef> if there's a
failure.
What if the mapping is impossible? See L</"Handling Malformed Data">.
=item *
from_to(STRING, FROM, TO [, CHECK])
The chars in STRING encoded in FROM are recoded in-place into TO.
Returns the new size of STRING, or C<undef> if there's a failure.
What if mapping between the encodings is impossible?
See L</"Handling Malformed Data">.
[INTERNAL] If TO is UTF-8, also the UTF-8 flag of STRING is turned on.
=back
=head2 Testing For UTF-8
=over 4
=item *
is_utf8(STRING [, CHECK])
[INTERNAL] Test whether the UTF-8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being
well-formed UTF-8. Returns true if successful, false otherwise.
=back
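The difference between testing the flag and testing the data can be sketched with Perl's built-in C<utf8::is_utf8> and C<utf8::valid> functions. The wrapper name C<sketch_is_utf8> is hypothetical, and mapping CHECK onto C<utf8::valid> is an assumption of this sketch rather than the proposal's stated implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for is_utf8(STRING [, CHECK]).
sub sketch_is_utf8 {
    my ($string, $check) = @_;
    # Without the flag the answer is false, whatever the data looks like.
    return 0 unless defined $string && utf8::is_utf8($string);
    return 1 unless $check;
    # With CHECK, also require the internal representation to be consistent.
    return utf8::valid($string) ? 1 : 0;
}

my $plain   = "caf\xE9";     # held as bytes, flag off
my $flagged = "caf\xE9";
utf8::upgrade($flagged);     # same four characters, now held as UTF-8

my @answers = (
    sketch_is_utf8($plain),        # flag off
    sketch_is_utf8($flagged),      # flag on
    sketch_is_utf8($flagged, 1),   # flag on and data well-formed
);
```

Note that C<$plain eq $flagged> still holds: the flag describes the internal representation, not the characters in the string.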
=head2 Toggling UTF-8-ness
=over 4
=item *
on_utf8(STRING)
[INTERNAL] Turn on the UTF-8 flag in STRING. The data in STRING is
B<not> checked for being well-formed UTF-8. Do not use unless you
B<know> that the STRING is well-formed UTF-8. Returns the previous
state of the UTF-8 flag (so please do I<not> treat the return value as
indicating success or failure), or C<undef> if STRING is not a string.
=item *
off_utf8(STRING)
[INTERNAL] Turn off the UTF-8 flag in STRING. Do not use frivolously.
Returns the previous state of the UTF-8 flag (so please do I<not> treat the
return value as indicating success or failure), or C<undef> if STRING is
not a string.
=back
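Why the return value is the previous state, and what toggling does to a string, is easier to see in a short run. The sketch below assumes the Encode module bundled with modern Perl and its C<_utf8_on> and C<_utf8_off> functions; those names are not the ones proposed here, and they are used only as the nearest available stand-ins for C<on_utf8> and C<off_utf8>.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "caf\xC3\xA9";                    # well-formed UTF-8, flag off

my $len_off = length $octets;                  # counted as 5 bytes
my $was_on  = Encode::_utf8_on($octets);       # previous state: flag was off
my $len_on  = length $octets;                  # now counted as 4 characters

my $was_on_again = Encode::_utf8_off($octets); # previous state: flag was on
my $len_back     = length $octets;             # counted as 5 bytes again
```

The first call returns a defined but false value even though it did exactly what was asked, which is why the return value cannot be read as success or failure.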
=head2 UTF-16 and UTF-32 Encodings
=over 4
=item *
utf_to_utf(STRING, FROM, TO [, CHECK])
The data in STRING is converted from Unicode Transfer Encoding FROM to
Unicode Transfer Encoding TO. Both FROM and TO may be any of the
following tags (case-insensitive, with or without 'utf' or 'utf-' prefix):
    tag     meaning
    '7'     UTF-7
    '8'     UTF-8
    '16be'  UTF-16 big-endian
    '16le'  UTF-16 little-endian
    '16'    UTF-16 native-endian
    '32be'  UTF-32 big-endian
    '32le'  UTF-32 little-endian
    '32'    UTF-32 native-endian
UTF-16 is also known as UCS-2, 16-bit or 2-byte chunks, and UTF-32 as
UCS-4, 32-bit or 4-byte chunks. Returns the new size of STRING, or
C<undef> if there's a failure.
What if FROM is UTF-8 and the UTF-8 in STRING is malformed? See
L</"Handling Malformed Data">.
[INTERNAL] Even if CHECK is true and FROM is UTF-8, the UTF-8 flag of
STRING is not checked. If TO is UTF-8, also the UTF-8 flag of STRING is
turned on. Identical FROM and TO are fine.
=back
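The tag rule above (case-insensitive, with or without a 'utf' or 'utf-' prefix) can be sketched as a small normalizer. The helper name C<sketch_normalize_utf_tag> and the lookup hash are hypothetical; returning C<undef> for an unknown tag is an assumption of this sketch.

```perl
use strict;
use warnings;

# The tags from the table above, keyed by their bare form.
my %UTF_NAME = (
    '7'    => 'UTF-7',
    '8'    => 'UTF-8',
    '16be' => 'UTF-16 big-endian',
    '16le' => 'UTF-16 little-endian',
    '16'   => 'UTF-16 native-endian',
    '32be' => 'UTF-32 big-endian',
    '32le' => 'UTF-32 little-endian',
    '32'   => 'UTF-32 native-endian',
);

# Hypothetical helper: reduce a FROM or TO argument to its bare tag.
sub sketch_normalize_utf_tag {
    my ($tag) = @_;
    return undef unless defined $tag;
    my $key = lc $tag;           # case-insensitive
    $key =~ s/\Autf-?//;         # optional 'utf' or 'utf-' prefix
    return exists $UTF_NAME{$key} ? $key : undef;
}

my @tags = map { sketch_normalize_utf_tag($_) } ('UTF-16LE', 'utf8', '32');
# @tags is ('16le', '8', '32')
```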
=head2 Handling Malformed Data
If CHECK is not set, C<undef> is returned. If the data is supposed to
be UTF-8, an optional lexical warning (category utf8) is also given. If
CHECK is true but not a code reference, the function dies. If CHECK is
a code reference, it is called with the arguments
(MALFORMED_STRING, STRING_FROM_SO_FAR, STRING_TO_SO_FAR)
Two return values are expected from the call: the string to be used in
the result string in place of the malformed section, and the length of
the malformed section in bytes.
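The three CHECK behaviours can be sketched with a deliberately trivial converter. Everything here is hypothetical: C<sketch_to_ascii> treats any byte above 0x7F as a malformed section, returns the converted string instead of working in place, and passes the rest of the input (starting at the offending byte) as MALFORMED_STRING, which is only one possible reading of the description above. The optional utf8 warning is left out.

```perl
use strict;
use warnings;

# Hypothetical converter used only to exercise the CHECK protocol.
sub sketch_to_ascii {
    my ($from, $check) = @_;
    my ($pos, $to) = (0, '');
    while ($pos < length $from) {
        my $byte = substr $from, $pos, 1;
        if (ord($byte) < 0x80) {          # well-formed: copy it through
            $to .= $byte;
            $pos++;
            next;
        }
        return undef unless $check;       # CHECK not set
        die "malformed data at offset $pos\n"
            unless ref $check eq 'CODE';  # CHECK true, not a code reference
        # CHECK is a code reference:
        # (MALFORMED_STRING, STRING_FROM_SO_FAR, STRING_TO_SO_FAR)
        my ($replacement, $skip) =
            $check->(substr($from, $pos), substr($from, 0, $pos), $to);
        # Refuse a handler that would not advance past the malformed section.
        return undef unless defined $replacement && $skip && $skip > 0;
        $to  .= $replacement;
        $pos += $skip;
    }
    return $to;
}

# A handler that substitutes '?' for one malformed byte at a time.
my $fixed = sketch_to_ascii("a\xFF\xFEb", sub { ('?', 1) });
```

Because the handler returns both the replacement and the number of bytes it consumed, it can also swallow a longer malformed run in one call by returning a larger length.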
=cut
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen