
Unicode fundamentals

From: Karsten Sperling
Date: February 26, 2001 11:43
Subject: Unicode fundamentals
Message ID: "iraun1.ira.0095101:010226.194300"@ira.uka.de

I ripped the strict 'strings' proposal out of the document and
changed some things (like s/builtin/function of Encode.pm/g).
It should pretty much reflect the current state of the discussion
now.

Two things still need specification, namely pack("U") and
\x{...} (with the braces). From what I can see from the source,
\x{c1} is "A" on EBCDIC, which makes sense, because \x{} is then just
a longer version of \x. \N should be extended to take numbers as well
as names, where the numbers would be Unicode code points on every
platform (just like the names).

--snip--

1) Perl Unicode / wide-character string semantics (Note: most of this
is already implemented in current bleadperl).


1.1) A string is a sequence of arbitrarily sized unsigned integers
(capable of holding at least 32 bits). Throughout this document, these
are often referred to as "integer elements" or "integer items" of a
string, to explicitly avoid giving wrong semantic hints by calling
them "characters" or "octets".

Perl internally represents characters in a charset that is capable of
holding all Unicode characters. This may be Unicode itself, but it may
also be EBCDIC-twisted-Unicode or something completely different. This
internal character set depends on the platform Perl is running on. The
programmer does not need to care about this character set if he doesn't
want to. This is possible by using the functions provided by the Encode
module.

Internally, the integers comprising a string (which can be larger than
bytes) may be stored in some encoding, which currently happens to be
utf8. This encoding is completely hidden from any perl code. In no way
will any perl code ever be permitted to see what bytes are actually
used on C level to store a string. Even if an integer element happens
to be stored in multiple bytes on C level, it is still just one atomic
item and will never appear to be more than one.
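
For example, under these semantics the length of a string is always
the number of integer elements, never the number of bytes used for
the internal encoding (a sketch of the intended behaviour, not a
statement about what any particular perl version does today):

	my $s = chr(0x100) . "x";    # two integer elements: 0x100 and ord("x")
	print length($s), "\n";      # 2, however the string is stored on C level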


1.2) Strings may be used to represent binary data; in this case every
element of the string holds one octet. If a string that contains
elements > 255 is passed to a function that can only handle octets,
the function still doesn't operate on the bytes that make up the
internal encoding of that string, but unpacks the encoding and
ignores everything but the lowest 8 bits of every element. In -w
mode, a warning like  "wide character encountered during <operation>"
is issued if there are elements > 255 in the string. Note that the
fact that a string is encoded in utf8 doesn't necessarily mean that
it contains elements > 255.
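
A small illustration of the intended behaviour (BYTES is assumed to
be a handle that can only take octets, see the file IO section
below):

	my $s = chr(0x1E3) . chr(0x41);    # elements 0x1E3 and 0x41
	print BYTES $s;                    # writes the octets 0xE3 0x41 and warns
	                                   # about the wide character under -w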

[[[ Meta: current bleadperl dies on some operations while others pass
silently. Warnings seem to be a good compromise. Making all these
fatal seems a bit harsh for a language where "34zy" + "1ns1d3" is 35.
]]]


1.3) ord() takes a string and returns the value of the first integer
element of that string. There is no translation taking place.
ord("A") is 0xC1 on EBCDIC for example, but 0x41 on Latin1.

chr() does the inverse thing. It takes an integer and returns the
string aka integer sequence that contains that single integer element.

chr() and ord() are not limited to the 0..255 range.
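
For illustration (these are the semantics described above, assuming a
perl that implements them):

	my $c = chr(0x263A);    # a one-element string holding the value 0x263A
	print ord($c), "\n";    # 9786 == 0x263A, no charset translation either way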


1.4) pack() always returns strings that hold octets, so all the
integer elements of a string returned by pack are in range 0..255.

pack("C") packs one octet (8 bits), not one character; it only looks
at the lower 8 bits of the number that it is passed. "a"/"A" only pack
the lower 8 bits, just like "C".

unpack() takes a string and expects it to hold only octets. Only the
lower 8 bits of every integer element of the input string will be
used. If any of the elements in the input string is > 255, the
warning  "wide character in unpack()"  will be issued in -w mode.
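
For example (again a sketch of the proposed behaviour, not a claim
about current perls):

	my $octets = pack("C", 0x1FF);       # only the low 8 bits are used: 0xFF
	print unpack("C", $octets), "\n";    # 255
	print unpack("C", v300), "\n";       # 44 (low 8 bits of 300), warns in -w mode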

[[[ Meta: "U" remains to be defined. I really dislike pack("U",300) eq
v300, because the output of pack should be suitable for passing to
things that require octets. Having "U" as a shortcut to
Encode::encode("unicode/utf8") is an option. ]]]

1.5) Encode::uchr() (referred to as uchr() throughout this document)
takes the number of a Unicode code point and returns a string
representing that character in Perl's internal charset. uchr(0x41) eq
"A" on every platform.

Encode::uord() (referred to as uord() throughout this document) does
the inverse thing of uchr(), it takes a string and returns the Unicode
code point number of the first character in the string (which is
internally represented in Perl's own charset, so there may be a
translation taking place). uord("A") == 0x41 on every platform.
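
Note that uchr()/uord() are the Encode functions proposed in this
document, not something a released Encode.pm is guaranteed to
provide. The point is the contrast with chr()/ord():

	print Encode::uchr(0x41);    # "A" on every platform (Unicode code point)
	print chr(0x41);             # "A" on Latin1 machines, but not on EBCDIC
	print Encode::uord("A");     # 0x41 (65) everywhere
	print ord("A");              # 0x41 on Latin1, 0xC1 (193) on EBCDIC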


1.6) Quoted strings represent themselves, that is, whatever was read
from the DATA file handle (see section 1.10 on that).

"\123" style octal escapes and "\x.." style hex escapes emit the
specified integer value into the string. "\c." works only if the next
character $c is in range [a-zA-Z] and emits the integer value
uord(uc($c))-0x40  into the string. These escapes do not do a
translation, just like chr(). This is what these types of escapes
have always done.

The v1.2.3 construct returns a string that is made up of the integer
elements specified. There is no translation taking place here either.

"\N{...}" emits the specified Unicode character into the string, but
internally stores it in Perl's own charset. It can take either a name
(with use charnames) or a hex number. If a name is given, it is
converted to the corresponding number first. uchr() of that number is
then emitted into the string.
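
To spell the difference out with an example (assuming use charnames
is in effect for the name lookup):

	"\x41"      # the integer element 0x41, whatever character that is locally
	"\101"      # the same element, specified in octal
	v193        # the single integer element 193, no translation
	"\N{LATIN CAPITAL LETTER A}"    # uchr(0x41), i.e. "A" on every platform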

[[[ Meta: Does \x{} have unicode semantics or is it just \x with a
variable sized parameter? IMO the latter makes more sense as it seems
DWIM to have both \x escapes do the same, and \N can be passed a
number to create a Unicode code point. What about qu//? ]]]


1.7) Bitwise string operators ~ & | ^ work on the lower 8 bits of
every string element only, any higher bits are ignored. ~v257 eq ~v1
eq v254. If there are any elements > 255 in the string, the warning 
"wide characters in bitwise string operation"  is issued in -w mode.


1.8) Encode::encode(NAME, EXPR [, CHECK]) returns the string EXPR
encoded according to the named charset/encoding. NAME is in format
"<charset>/<encoding>", for example "unicode/utf8" or "latin1/plain".
The integer elements of the input string are assumed to be characters
in Perl's internal charset and are translated from Perl's charset to
the specified charset before encoding them with the specified
encoding. If only a charset but no encoding is given, only the
translation step is done.

If a character from the input can not be represented in the chosen
charset, the behaviour depends on the CHECK parameter. If CHECK is not
given, encode() fails and returns undef. If CHECK is true, encode()
dies with an appropriate message. If CHECK is a subref, it is called
to resolve the situation.

[[[ Meta: calling conventions need to be defined ]]]

Encoding to the platform's native charset/encoding may be a no-op
(other than failing on everything > 255) if Perl's internal charset
has been adjusted to match the native charset in the relevant range.
This is what makes legacy programs work. Note that encoding to
Unicode may actually need to do some work on platforms where the
internal character set is twisted (EBCDIC for example).

If an encoding name was given (as opposed to giving just a charset
name), the string returned by encode contains only elements in range
0..255, so it is suitable for passing to functions that require
octets.
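
Usage would look something like this (a sketch of the interface
described above, not of any shipped Encode.pm):

	# octets, ready for passing to octet functions
	my $utf8   = Encode::encode("unicode/utf8", $string);
	# no CHECK given: returns undef if a character doesn't fit
	my $ebcdic = Encode::encode("ebcdic/plain", $string)
	    or die "not representable in EBCDIC\n";
	# CHECK true: dies if a character doesn't fit
	my $latin1 = Encode::encode("latin1/plain", $string, 1);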

If the special NAME "native" is given, the platform's native charset
and encoding are used (which can be specified during Configure). This
is "latin1/plain" for most machines, but may as well be "ebcdic/plain"
or "unicode/utf8". Some names may be specified in abbreviated form:
	"utf8"    -> "unicode/utf8"
	"utf16"   -> "unicode/utf16"
	"latin1"  -> "latin1/plain"
	"ebcdic"  -> "ebcdic/plain"
	

1.9) Encode::decode(NAME, EXPR [, CHECK]) does the reverse of
encode(). NAME is in format "<charset>/<encoding>". If an encoding
name is given (as opposed to specifying only a charset name) decode()
only looks at the low 8 bits of every element of the input string. If
any of those elements happens to be > 255, the warning  "wide
character in decode()"  is issued in -w mode.

The string elements of the input string will be treated as characters
of the specified character set and will be converted from that set to
Perl's internal charset, after unpacking the encoding (if one was
given). The NAME and CHECK arguments are handled in the same way
encode() handles them.

If the input string is not a valid sequence in the specified encoding,
or if one of the decoded characters in the input sequence can not be
represented in Perl's internal charset, the error handling again
depends on CHECK.

Decoding from the platform's native charset/encoding may be a no-op if
Perl's internal charset has been adjusted to match the native charset
in the relevant range. This is what makes legacy programs work.
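
And the other direction, with the same caveat that this is only the
proposed interface:

	my $chars  = Encode::decode("unicode/utf8", $octets_from_disk);
	my $native = Encode::decode("native", $legacy_octets);   # usually close to a no-op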


1.10) What happens with file IO depends on the mode the file handle is
in: binary mode (aka byte mode aka octet mode) handles will return
strings containing octets (all elements in range 0..255) when read,
and will only write the lowest 8 bits of every string element. If a
string to be written contains elements > 255, the warning  "wide
character in print"  is issued in -w mode.
                              
In character mode, a string that is written will be run through
Encode::encode(). The charset/encoding to be passed can be specified
by the user when opening the file, or by modifying it later using the
interface that PerlIO provides for this purpose. If no
charset/encoding name is given, "native" is the default. Similarly,
data that is read will be run through decode().

syswrite()/send() and sysread()/recv() always take octets; there is
no implicit encode()/decode() done there.
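
So code that uses the sys* functions has to do the conversion itself,
along these lines ($socket and $message are just placeholders):

	syswrite($socket, Encode::encode("unicode/utf8", $message));
	sysread($socket, my $buf, 1024);
	my $reply = Encode::decode("unicode/utf8", $buf);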

If a file handle is opened without specifying a mode, it is in
character mode, and expects to read data in the host-native encoding.
This does not break legacy code, because Perl will usually have its
internal charset twisted in a way that keeps the mapping of the host
native charset to 0..255. Thus having character or octet mode doesn't
make a difference for those programs. However it does matter if Perl
is configured with host-native encoding "unicode/utf8", for example.

[[[ Meta: Binary mode could be made the default as well, though for a
text processing language, text mode should be a reasonable default.
Plus, text mode has always been the default on platforms where it
makes a difference regarding the line ends. ]]]

Note that this assumption is only valid for the platforms Perl
currently runs on. When Perl is ported to new platforms, it is likely
that their internal charset will be Unicode, so it's important to
specify the correct modes when opening files.

The DATA file handle is in character mode with host native
charset/encoding (namely the default thing). The encoding of the DATA
file handle can be changed using the encoding pragma:

	use encoding "unicode/utf8";
	
This will tell Perl that the source file is not written in the host's
default encoding, but really uses the unicode charset with utf8
encoding from the next line onward. The utf8 pragma introduced by Perl
5.6.0 is the same as  use encoding "unicode/utf8".


1.11) The utf8 and bytes pragmata are only kept for compatibility with
Perl 5.6.0 and are deprecated. New code should not use them. Using
utf8 is exactly equivalent to saying

	use encoding "unicode/utf8";

which changes the encoding of the DATA filehandle. It does NOT affect
the semantics of any operations, as it used to do in some cases in
Perl 5.6.0. As use utf8 does not modify any semantics, there is
really nothing that use bytes could switch back, so it's a 100%
no-op.

[[[ Meta: use bytes could however be used to enable enhanced
compatibility with pre-5.6.0 Perl, like having chr() wrap at 255, and
parsing \x{} and \N{} literally in strings. ]]]


1.12) m// s/// and tr/// work on the integer elements that make up a
string, because there really is no other thing that they could work
on. They NEVER work on the individual bytes that may be making up the
internal encoding of the string on C level.

\123, \x.., \c., \x{...} and \N{...} escapes work exactly like in
qq//, that means that \123, \x.., \x{...} and \c. look for the
specified integer element, while \N{...} looks for the integer
element that Perl's internal charset uses to represent the specified
Unicode character.


The semantics of [$a-$b] character ranges in s/// and tr/// are a bit
tricky, because they can be used to work on character data as well as
on octet data.

There are three magic ranges, namely [a-z], [A-Z], and [0-9]. If both
end points are specified via a literal character and are both inside
the same one of these magic ranges, the range is magic and matches
the specified subrange of the lower case or upper case latin letters,
or of the digits, even if the charset happens to have gaps somewhere
in those ranges.

In any other case, the range matches any character of

	map { chr } ord($a) .. ord($b)

If the -w option is on, a warning is issued if a literal character or
an \N{...} escape is used as an end point of a range that is not
magic.
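
For example (under the semantics just described):

	tr/A-Z/a-z/;       # magic ranges: exactly the 26 letters, even on EBCDIC
	tr/\x41-\x5a//d;   # not magic: the elements 0x41..0x5A, whatever they
	                   # happen to be in the internal charset
	tr/A-\x5a//;       # not magic either, and warns under -w because a
	                   # literal letter is an end point of a non-magic range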

\X tries to match a locale-independent Unicode grapheme aka "combining
character sequence". It is currently equivalent to  (?:\PM\pM*)  which
means a non-mark character followed by any number (including none) of
mark characters. This may be modified in a future version to do more
exact grapheme matching.
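
For example (with use charnames for the name lookup):

	# "e" followed by one combining mark: two integer elements, one grapheme
	"e\N{COMBINING ACUTE ACCENT}" =~ /^\X$/;    # matches, like /^\PM\pM*$/ would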

\C matches any character (a complete one) and is deprecated. This
isn't exactly what it used to mean in Perl 5.6.0, but the 5.6.0
interpretation breaks character atomicity, so we can't let it slip
through.

The /U and /C options of tr/// have no effect and are deprecated.

[[[ Meta: Any more backwards compatible interpretation of \C that
keeps character atomicity is welcome. Maybe \C == [\x00-\xff]? ]]]


1.13) Magical string ++ works on non-empty strings that match
	/^[A-Za-z]*[0-9]*$/
Each character is preserved in its range. Note that these are the
three magic ranges.
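
For example:

	my $s = "Az9";
	$s++;    # "Ba0": the digit wraps and carries into the letters, each
	         # position staying within its own magic range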



2) Answers to recent questions on p5p. This section tries to answer
some of the "how do you do xyz with that strings model" style
questions that recently popped up on p5p.


2.1) If my character string is 100 characters long, how many bytes
will it need when stored to disk?

That depends on the encoding you use to store it of course.

	my $bytes = Encode::encode("unicode/utf8", $characters);
	print("storing ", length($bytes), " to disk ...\n");
	...

Note that while this example really does the encoding to unicode/utf8
to get the length, the Encode module could easily provide a length
function that would compute the length of a string in a specified
encoding without actually performing the encoding itself.

Also note that this is not an unnecessary complication. Encoding your
character strings (which may contain high code points) to octets
before giving them to a function that expects a string of octets
(like syswrite()) is mandatory, because those functions have no
sensible way of handling strings with elements > 255.

	
2.2) How do I encode to an Encoding that has not yet been done in
Encode.pm?

Let's assume you want to encode to unicode/utf-16 (big endian)
yourself. All you need to do is look at the string character by
character, get the Unicode code point number for the character, and
encode that into bytes.

	my $bytes = "";
	foreach my $c (split //, $characters) {
	    my $x = uord($c);    # Unicode code point of this element

	    if ($x <= 0xffff) {
	        # BMP character: a single 16-bit big-endian unit
	        $bytes .= pack("n", $x);
	    }
	    elsif ($x <= 0x10ffff) {
	        # above the BMP: encode as a surrogate pair
	        $bytes .= pack("nn", 0xd800 + (($x - 0x10000) >> 10),
	                             0xdc00 + (($x - 0x10000) & 0x3ff));
	    }
	    else {
	        die("character out of range for utf-16\n");
	    }
	}

--snip--



