Front page | perl.perl5.porters |
Postings from September 2000
Encode, take five
From:
Jarkko Hietaniemi
Date:
September 12, 2000 16:43
Subject:
Encode, take five
Message ID:
20000912184258.A23238@chaos.wustl.edu
I have now tried to explore the boundary cases and error conditions thoroughly.
A new feature is customizable error handling. Note also the
s/strict/check/g.
=pod
=head1 NAME
Encode - character encodings
=head2 TERMINOLOGY
=over
=item *
I<byte>: a B<number> in the range 0..255
=item *
I<char>: a B<character> in the range 0..maxint (at least 2**32-1)
=back
The marker [INTERNAL] marks Internal Implementation Details, in
general meant only for those who think they know what they are doing,
and such details may change in future releases.
=head2 bytes
=over 4
=item *
bytes_to_utf8(STRING [, CHECK])
The bytes in STRING are encoded in-place into UTF-8. The bytes are
assumed to be encoded in US-ASCII, bytes between 0 and 127, inclusive.
Returns the new size of STRING, or C<undef> if there's a failure.
If there are bytes > 127? See L</"Handling Malformed Data">.
If you want to recode some eight-bit legacy encoding to UTF-8, you
must use C<from_to(STRING, ..., 'utf8')>.
[INTERNAL] Also the UTF-8 flag of STRING is turned on.
=item *
utf8_to_bytes(STRING [, CHECK])
The UTF-8 in STRING is decoded in-place into bytes. Returns the new
size of STRING, or C<undef> if there's a failure.
If the UTF-8 in STRING is malformed? See L</"Handling Malformed Data">.
[INTERNAL] The UTF-8 flag of STRING is not checked.
=back
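As a rough illustration of the C<bytes_to_utf8> semantics described above, here is a minimal sketch built on core Perl's C<utf8::upgrade()>. The name C<bytes_to_utf8_sketch> is hypothetical, and only the "CHECK not set" and "CHECK true but not a code reference" cases are covered; it is not the proposed implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed bytes_to_utf8(STRING [, CHECK]).
# STRING is modified in place through the @_ alias.
sub bytes_to_utf8_sketch {
    my $check = $_[1];
    if ($_[0] =~ /[^\x00-\x7F]/) {
        # Input is not pure US-ASCII: die if CHECK is true, else fail quietly.
        die "bytes_to_utf8_sketch: byte > 127 in input\n" if $check;
        return undef;
    }
    # utf8::upgrade() converts in place, turns the UTF-8 flag on and
    # returns the number of octets in the UTF-8 representation.
    return utf8::upgrade($_[0]);
}

my $ascii  = "abc";
my $size   = bytes_to_utf8_sketch($ascii);     # defined: the new size
my $latin1 = "caf\xE9";
my $failed = bytes_to_utf8_sketch($latin1);    # undef: byte 0xE9 is > 127
```

For anything beyond US-ASCII the text above points to C<from_to(STRING, ..., 'utf8')> instead.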
=head2 chars
=over 4
=item *
chars_to_utf8(STRING)
The chars in STRING are encoded in-place into UTF-8. Returns the new
size of STRING, or C<undef> if there's a failure.
No assumptions are made on the encoding of the chars. If you want to
assume that the chars are Unicode and to trap illegal Unicode
characters, you must use C<from_to(STRING, 'Unicode', ...)>.
[INTERNAL] Also the UTF-8 flag of STRING is turned on.
=item *
utf8_to_chars(STRING)
The UTF-8 in STRING is decoded in-place into chars. Returns the new
size of STRING, or C<undef> if there's a failure.
If the UTF-8 in STRING is malformed C<undef> is returned, and also an
optional lexical warning (category utf8) is given.
[INTERNAL] The UTF-8 flag of STRING is not checked.
=item *
utf8_to_chars_check(STRING [, CHECK])
(Note that this interface is exceptionally named since a two-argument
utf8_to_chars() has different semantics.)
The UTF-8 in STRING is decoded in-place into chars. Returns the new
size of STRING, or C<undef> if there is a failure.
If the UTF-8 in STRING is malformed? See L</"Handling Malformed Data">.
[INTERNAL] The UTF-8 flag of STRING is not checked.
=back
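The one-argument C<utf8_to_chars> behaviour (decode in place, C<undef> on malformed input) can be sketched with core Perl's C<utf8::decode()>. The name C<utf8_to_chars_sketch> is hypothetical, and returning the length in chars as the "new size" is an assumption, since the draft does not say which unit is meant.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the one-argument utf8_to_chars(STRING).
# Assumption: the "new size" is reported as the length in chars.
sub utf8_to_chars_sketch {
    # utf8::decode() decodes in place and returns false on malformed UTF-8.
    return utf8::decode($_[0]) ? length($_[0]) : undef;
}

my $good = "caf\xC3\xA9";                 # five octets of UTF-8
my $n    = utf8_to_chars_sketch($good);   # four chars, the last is U+00E9
my $bad  = "caf\xC3";                     # truncated two-byte sequence
my $r    = utf8_to_chars_sketch($bad);    # undef, $bad is left untouched
```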
=head2 chars With Encoding
=over 4
=item *
chars_to_utf8(STRING, ENCODING[, CHECK])
The chars in STRING encoded in ENCODING are recoded in-place into
UTF-8. Returns the new size of STRING, or C<undef> if there's a failure.
No assumptions are made on the encoding of the chars. If you want to
assume that the chars are Unicode and to trap illegal Unicode
characters, you must use C<from_to(STRING, 'Unicode', ...)>.
[INTERNAL] Also the UTF-8 flag of STRING is turned on.
=item *
utf8_to_chars(STRING, ENCODING [, CHECK])
The UTF-8 in STRING is decoded in-place into chars encoded in
ENCODING. Returns the new size of STRING, or C<undef> if there's a
failure.
If the UTF-8 in STRING is malformed? See L</"Handling Malformed Data">.
[INTERNAL] The UTF-8 flag of STRING is not checked.
=item *
from_to(STRING, FROM_ENCODING, TO_ENCODING [, CHECK])
The chars in STRING encoded in FROM_ENCODING are recoded in-place into
TO_ENCODING. Returns the new size of STRING, or C<undef> if there's a
failure.
If mapping between the encodings is impossible?
See L</"Handling Malformed Data">.
[INTERNAL] If TO_ENCODING is UTF-8, also the UTF-8 flag of STRING is
turned on.
=back
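For comparison, the Encode module as it later shipped with Perl has a C<from_to> with the same argument order, so the in-place recoding can be tried out directly. This assumes that released module rather than the draft interface above; in the released version the result is a plain octet string and the UTF-8 flag is left off.

```perl
use strict;
use warnings;
use Encode ();

my $string = "caf\xE9";                                       # ISO 8859-1 octets
my $size   = Encode::from_to($string, 'iso-8859-1', 'utf8');  # recoded in place
# $string now holds the octets "caf\xC3\xA9"; $size is its length in octets.
```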
=head2 Testing For UTF-8
=over 4
=item *
is_utf8(STRING [, CHECK])
[INTERNAL] Test whether the UTF-8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being
well-formed UTF-8. Returns true if successful, false otherwise.
=back
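The released Encode module also provides C<is_utf8(STRING [, CHECK])>, so the flag test can be sketched as follows. This assumes the released module, not the draft, and uses the core function C<utf8::upgrade()> only to get a string with the flag turned on.

```perl
use strict;
use warnings;
use Encode ();

my $string = "abc";
my $before = Encode::is_utf8($string);      # false: the flag is off
utf8::upgrade($string);                     # core function, turns the flag on
my $after  = Encode::is_utf8($string);      # true: the flag is on
my $valid  = Encode::is_utf8($string, 1);   # true: flag on and data well-formed
```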
=head2 Toggling UTF-8-ness
=over 4
=item *
on_utf8(STRING)
[INTERNAL] Turn on the UTF-8 flag in STRING. The data in STRING is
B<not> checked for being well-formed UTF-8. Do not use unless you
B<know> that the STRING is well-formed UTF-8. Returns the previous
state of the UTF-8 flag, so please do I<not> treat the return value as
indicating success or failure.
=item *
off_utf8(STRING)
[INTERNAL] Turn off the UTF-8 flag in STRING. Do not use frivolously.
Returns the previous state of the UTF-8 flag, so please do I<not> treat
the return value as indicating success or failure.
=back
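The effect of toggling the flag can be seen with C<_utf8_on> and C<_utf8_off> from the released Encode module, which are assumed here to behave like the C<on_utf8> and C<off_utf8> of this draft. The data itself does not change, only how Perl interprets it.

```perl
use strict;
use warnings;
use Encode ();

my $string   = "\xC3\xA9";                   # two octets, well-formed UTF-8
my $prev_on  = Encode::_utf8_on($string);    # previous state: off (false)
my $chars    = length $string;               # 1: read as the char U+00E9
my $prev_off = Encode::_utf8_off($string);   # previous state: on (true)
my $bytes    = length $string;               # 2: read as two bytes again
```

Note that both return values describe the old flag state, which is why they cannot be used as a success test.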
=head2 UTF-16 and UTF-32 Encodings
=over 4
=item *
utf_to_utf(STRING, FROM, TO [, CHECK])
The data in STRING is converted from Unicode Transfer Encoding FROM to
Unicode Transfer Encoding TO. Both FROM and TO may be any of the
following tags (case-insensitive, with or without 'utf' or 'utf-' prefix):
    tag     meaning

    '7'     UTF-7
    '8'     UTF-8
    '16be'  UTF-16 big-endian
    '16le'  UTF-16 little-endian
    '16ne'  UTF-16 native-endian
    '32be'  UTF-32 big-endian
    '32le'  UTF-32 little-endian
    '32ne'  UTF-32 native-endian
UTF-16 is also known as UCS-2, 16-bit or 2-byte chunks, and UTF-32 as
UCS-4, 32-bit or 4-byte chunks. Returns the new size of STRING, or
C<undef> if there's a failure.
If FROM is UTF-8 and the UTF-8 in STRING is malformed? See
L</"Handling Malformed Data">.
[INTERNAL] Even if CHECK is true and FROM is UTF-8, the UTF-8 flag of
STRING is not checked. If TO is UTF-8, also the UTF-8 flag of STRING is
turned on. Identical FROM and TO are fine.
=back
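One way the tag handling could work is sketched below: the tag is lowercased, the optional prefix is dropped, native-endian tags are resolved, and the result is handed to C<from_to> from the released Encode module. Both C<tag_to_name> and C<utf_to_utf_sketch> are hypothetical names, CHECK is left out, and the mapping to Encode's encoding names is an assumption.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn a tag from the table above into an Encode name.
sub tag_to_name {
    my ($tag) = @_;
    (my $t = lc $tag) =~ s/^utf-?//;          # drop optional 'utf' or 'utf-'
    my $native = unpack('C', pack('S', 1)) ? 'le' : 'be';
    $t =~ s/ne$/$native/;                     # resolve native-endian tags
    die "unknown tag '$tag'\n"
        unless $t =~ /^(?:7|8|(?:16|32)(?:be|le))$/;
    return 'UTF-' . uc $t;                    # '16le' becomes 'UTF-16LE'
}

# Hypothetical stand-in for utf_to_utf(STRING, FROM, TO), without CHECK.
sub utf_to_utf_sketch {
    my (undef, $from, $to) = @_;
    return Encode::from_to($_[0], tag_to_name($from), tag_to_name($to));
}

my $string = "A\xC3\xA9";                                  # UTF-8 octets
my $size   = utf_to_utf_sketch($string, 'utf-8', '16BE');  # in place
# $string is now "\x00A\x00\xE9" and $size is 4.
```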
=head2 Handling Malformed Data
If CHECK is not set, C<undef> is returned and an optional lexical
warning (category utf8) is given. If CHECK is true but not a code
reference, the function dies. If CHECK is a code reference, it is
called with the arguments

    (MALFORMED_STRING, STRING_FROM_SO_FAR, STRING_TO_SO_FAR)

Two return values are expected from the call: the string to be used in
the result string in place of the malformed section, and the length of
the malformed section in bytes.
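The three CHECK modes can be sketched as a small dispatcher. C<handle_malformed> is a hypothetical helper, not part of the proposed interface, and the optional lexical warning for the "CHECK not set" case is left out for brevity.

```perl
use strict;
use warnings;

# Hypothetical dispatcher for the three CHECK modes described above.
# Returns (REPLACEMENT, LENGTH) from a code-reference CHECK, or an empty
# list when the calling function should give up and return undef.
sub handle_malformed {
    my ($check, $malformed, $from_so_far, $to_so_far) = @_;
    if (ref $check eq 'CODE') {
        return $check->($malformed, $from_so_far, $to_so_far);
    }
    die "Malformed data\n" if $check;   # CHECK true, not a code reference
    return;                             # CHECK not set; warning omitted here
}

# A CHECK callback that substitutes '?' and skips one malformed byte.
my $replace = sub {
    my ($malformed, $from_so_far, $to_so_far) = @_;
    return ('?', 1);
};
my ($with, $skip) = handle_malformed($replace, "\xFF", "ab", "ab");
```

A callback like C<$replace> lets the conversion continue past bad input instead of failing outright.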
=cut
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen