develooper Front page | perl.perl6.internals | Postings from April 2004

Strings Manifesto

Thread Next
From:
Jeff Clites
Date:
April 28, 2004 04:22
Subject:
Strings Manifesto
In light of ongoing discussions of Parrot's string model, I've decided 
to prepare a document spelling out my general viewpoint on the subject 
of strings. It's intended also to supply a self-consistent set of 
terminology. It will frame my future comments. Some notes:

1) This is explicitly arguing for one viewpoint, not against another. 
(That is, I've not included any such arguments.)
2) I've written it in a manner which doesn't directly mention Parrot, 
so that I can use it in other contexts.
3) There are two separate issues with respect to Parrot: the strings 
semantics, and the internal representation/implementation. This 
addresses only the former. It also doesn't include an API 
recommendation. This is because the semantics need to be sorted out 
first.
4) There are a few topics yet-to-be-covered.
5) I should probably provide a (much shorter) synopsis.

----------------------------------------------------------------------
What's a String?

This article is intended to clarify some concepts related to strings. 
The motivation is that they are a frequent source of confusion for 
developers, working in various programming language. This lays out a 
framework for thinking about strings--a viewpoint, if you will. It 
reflects a conceptual model that's already embraced by some programming 
environments, but even those environments often lack a detailed 
explanation. This document also aims to disambiguate various 
terms--clarifying some and providing self-consistent definitions of 
others which are used inconsistently in different contexts.

So, what is a string?

In the most general sense, a string is a data type in a programming 
language, meant to model text. It's meant to allow manipulation, 
viewing, analysis, searching, of the stuff you read, or which computer 
programs read on your behalf.

A bit more concretely:

     Definition: A string represents a sequence of abstract characters.

So we can't go much further until we say a few words about what a 
"character" is. But the two important words to remember from that 
definition are "represents" and "characters". (And, "sequence" is 
important too, but you won't overlook that one.)

What's a character?

A character, in fuzzy terms, is the smallest unit of meaning in a 
language.
A character is an abstract concept. Characters are things like letters, 
numbers, punctuation marks, the Japanese symbol for "middle"--stuff 
like that. The concept of a character is supposed to capture an 
intuitive notion--the notion ususally understood by the common language 
speaker. (We are, after all, modelling what's essentially text--the 
stuff that humans read. That's the whole point.) That said, precisely 
determining what constitutes a character, and what doesn't, is 
essentially a judgement call, and a decision. Fortunately, essentially 
all existing standards agree on this, in a general sense. For instance, 
the lowercase "a" and the uppercase "A" are considered to be different 
characters, but stylistic variants (the letter "a" in different fonts, 
for instance) are not (instead, they're just different graphical 
representations of the same abstract character). These could have been 
modelled differently--you could consider "a" and "A" just variants of 
the same character. But, as it turns out, no one does it that way.

And I mention that now to prepare you for some arbitrariness--some 
conventions that could have gone another way, but which are actually 
still intuitively meaningful.

Notice that we haven't said anything about how a computer program (or 
language) might model a character. That's important. So far we've just 
said *what* a computer program is trying to model, not how it might do 
it.

And also, just to hit the point home, we're not talking about the 
"char" datatype in C. To keep things clear, we'll refer to that as a 
"byte", when it comes up.

What is Unicode?

At this point I'd better explain what "Unicode" means, before I slip 
and use the word. People like to throw the word around a lot, and it's 
not always clear which of the following related (but different) things 
they mean. There are at least 6 different possibilities:

1) The Unicode Consortium, a group of companies and organizations that 
got together to sort out the handling of international text. ("Sort 
out" in the sense of making it trivial to allow production of text 
documents containing a mixture of different langauges, rather than 
nearly impossible.)

2) The standard produced by that group, currently at version 4.0.

3) The character model which forms the basis of that standard, and 
which also forms the basis of what I'm talking about here.

4) The database of character properties (or the properties themselves) 
assembled by the Unicode Consortium. These properties are things like, 
"this character has an uppercase version over here", "this character 
represents the decimal digit 3", etc.

5) A family of encodings (word defined below) used for the interchange 
of international text between computer programs (by mean of files or 
network transmissions, for instance). Often, "Unicode" is used to refer 
to a specific one; that usage is inherently ambiguous, but harkens back 
to the first version of the Unicode standard, which defined only one 
encoding.

6) International (multi-language) text handling, in general. This usage 
arises because internaltionalization-aware applications tend to base 
their implementations on the Unicode character model, and tend to 
support the Unicode encodings. But it's important to recognize this 
usage, because sometimes the term "Unicode" is used in contexts which 
don't involve any of the above per se, but rather just indicate an 
awareness and handling of international text issues.

Of these, really (3) is the only one which someone could really 
"disagree" with. (That is, the consortium, standard, database, and 
encodings exist and are precisely defined--there's nothing 
philosophical there to reject.) But the other important point is that 
these are all separate things--sometimes people will cite the need to 
support other encodings as an argument against the character model, 
which doesn't follow logically.

So the string/character model I'm espousing is based largely on the 
Unicode Character Model, but it would be misleading to say that I'm 
advocating, "use Unicode". (What would that mean, anyway?) But it's (3) 
that does the conceptual heavy lifting, and (4) which does the grunt 
work. Or, (3) is the concept, and (4) is the details.

I'll try to avoid using the word "Unicode" by itself, to avoid 
confusion, but if I do, then I'm probably referring to the Unicode 
Standard, or something it specifies.

So let's review. A string is a concrete thing--a data type in a 
language; it represents a sequence of (abstract) characters, a 
character is a unit of meaning in a language, and a character has 
properties which have been collected into a database. To go a bit 
further, the fundamental question you can ask a string is, "what's your 
Nth character". All else flows from that.

What's a code point?

So if the fundamental question you can ask a string is, "what's your 
Nth character", we have to talk a bit about how one would 
programmatically express that answer--that is, what sort of data type 
is/can/should be use to represent a character? Fundamentally, a 
character can of course be represented opaquely--in an object-oriented 
language, you could have character object. That's quite reasonable. As 
it turns out, people find it convenient to programmatically represent a 
character by an integer (think "whole number", not a specific data type 
here). It's convenient for several reasons--it's compact and easy to 
refer to in speech. And if the fundamental thing you can ask a string 
is what its Nth character is, then the fundamental things you do with a 
character is look up its properties, and test it for equality against 
other characters. So if you just go through and give each character a 
little serial number, then you can find the properties of a character 
by using its number as an index into a property table (i.e., character 
3's properties are at slot number 3 in the table), and you can tell 
that 2 characters are different characters by checking whether they are 
represented by different numbers.

A number used to represent a character like this is called a "code 
point". Which number you choose to assign to which character is 
fundamentally arbitrary--its one and only use is as a unique lookup 
key. The letter "A" could be assigned the number 1, or 42, or 31337, or 
2001, or anything else. It doesn't matter. Now, various standards have 
numbered various characters in various ways, though many don't actually 
think in these terms, so to be clear if you are going to refer to 
"character number 42" or "the code point for the letter 'A'", then you 
really need to point out what numbering scheme you are talking about. 
Fortunately, the Unicode Standard has numbered *all* of them--it's 
given a number to essentially every character in every 
digitally-represented langauge in the world. Since its numbering scheme 
is comprehensive, and since the numbering scheme is arbitrary, without 
loss of generality we are going to use the Unicode standard's numbering 
scheme for the rest of this discussion, and we'll go ahead and pretend 
it's the only one. So when we say, "the code point for the letter 'A'", 
we'll mean the code point that the Unicode Standard assigns to it, 
which (in case you are wondering by now) is 65. (See, the number itself 
is not very interesting!)

So, let's review again. For various practical reasons, it's preferable 
to programatically represent characters using integers, you have to 
pick an arbitrary numbering scheme, and somebody's done that, and it's 
a good one. This numbering scheme defines a one-to-one correspondence 
between numbers (code points) and characters, and that makes it 
tempting to pretend that characters *are* numbers. But it's important 
to keep in the back of your mind an awareness that the numbers merely 
help you pick out the characters, and it's the characters themselves 
which are important, and characters are *abstract*--they never actually 
live inside of a computer program. [Note: Of course, some numbers don't 
represent any character--there are only so many characters. So to be 
mathematically precise, there's a one-to-one correspondence between a 
subset of integers and all characters.]

What's an encoding

So, what we've convered so far is the type of stuff you need to work 
with text in-memory. We can use the number 65 to represent the letter 
"A", and we have a properties database to tell us whether A is a 
number, or whitespace, or whatever.

But, text being what it is, people tend to want to save it to disk, or 
transfer it between applications. That is, they want to do IO 
operations on strings.

Now as it turns out, IO is alway an operation which involves bytes. 
Even if your IO is hidden under an abstraction layer, at the bottom 
you're always reading and writing bytes. Strings are not, 
fundamentally, bytes. So how do you create bytes from some high-level 
construct like a string? You define a serialization algorithm, that's 
how. That's an encoding:

     Definition: An encoding is a mapping from sequences of abstract 
character to sequences of bytes (and vice versa).

Since we've said that a string is supposed to represent a sequence of 
abstract characters, we could also just say, "An encoding is a mapping 
from strings to sequences of bytes (and vice versa)."

*Important Note*: This is, in particular, an area where there is much 
conflicting and inconsistent terminology. The concept that I just 
defined is the exact same thing that the Unicode Standard refers to as 
a "Character Map". That would probably be the preferred term, since I 
don't believe it's been used elsewhere with a differing meaning. But, 
it's not a commonly used term, so I'm going to use "encoding" instead, 
with some hesitation, so that someone jumping into the middle of this 
discussion has a chance of figuring out what I'm talking about. My 
usage of "encoding" matches the usage of the XML specification, and the 
Objective-C Foundation framework. Java and MIME use the word "charset" 
for the same concept, and IANA uses "character set". Others use the 
term "character set" for something different. But the important thing 
to remember is just that, for the purposes of this document, I'm using 
the definition above, and when you are reading elsewhere and encounter 
any of those terms, you should look for a definition of what is meant 
there.

The absolutely crucial thing the remember about an encoding is that it 
defines an interchange format--that's 100% what it's for. It's purpose 
is to let you communicate textual data between processes (or, store it 
to the filesystem for later retrieval by some process). Encodings have 
nothing, fundamentally, to do with any sort of in-memory representation 
of a string.

The other thing to remember about an encoding is that there are lots of 
them--lots and lots of national and industry standard define lots of 
different ones. They tend to have names like, "ASCII" or "ISO-8859-1" 
or "Shift-JIS" or "Big5" or "UTF-8".

The reason there are so many of them is that, in the early days of 
computing, most everyone got it into their heads that they needed to 
represent textual data using only one byte per character, which meant 
that an encoding could only handle 256 different characters. This was 
fine for English (and in fact ASCII only encodes 128 characters), but 
as soon as you hit Europe you needed more characters--for French and 
German and Greek and Icelandic and Polish and Turkish and so on. You 
need more than 256 characters to handle all of that, so different 
encodings where invented to handle different collections of characters. 
This doesn't sound so bad, and it's fine as long as everyone just talks 
to the guy next door, but it quickly turns into a mess if you try to 
step outside--you can't create a single text document with words from 
multiple languages (for some combinations), and (potentially worse) 
when you read in a file created by someone else, you need to know what 
encoding they used (that is, you need metadata, of the sort supplied by 
MIME and HTTP headers, but not supplied by most filesystems). This 
wasn't fun for anyone.

The obvious thing to do, in retropect, was to come up with a different 
plan, one which would allow you to handle all languages without a 
separate song-and-dance for each one. That's in fact what the Unicode 
Consortium came together to do. And they did it. It was an enormous 
undertaking.

Okay, back to the point. Not only are there lots of different 
encodings, but by their nature most only know how to encode strings 
containing a limited collection of characters. ASCII, for example, 
doesn't know how to encode Japanese Kanji. But the Unicode Standard 
defines a collection of encodings (UTF-8, UTF-16, and UTF-32, with big- 
and little-endian variant of the latter two) which can encode strings 
containing basically any character.

It's important to take a moment to point out how general the above 
definition of "encoding" is, and what it doesn't imply:

1) As mentioned, an encoding doesn't have to know how to encode every 
possible string. And reciprocally, not every sequence of bytes will be 
decodable by every encoding.

2) Although an encoding is really a pair of algorithms (one to 
serialize, one to de-serialize), that doesn't imply that an encoding 
has to be strictly invertible. That is, if I encode a string into a 
sequence of bytes, and then decode that sequence of bytes into a 
string, I might not get back the same string I started with. For 
example, and encoding might, conceptually, strip accent characters.

3) An encoding doesn't necessarily operate on a character-by-character 
basis. That is, the bytes to encode "ab" might not be the concatenation 
of the bytes to encode "a" with the bytes to encode "b". It's even 
possible, based on the above definition, that an encoding might be able 
to encode the string "ab", but not the string "ba". (In practice, I 
don't know if this ever occurs "in the wild".)


So the key points here are: (1) an encoding is all about IO, (2) not 
all encodings guarantee data integrity, and (3) the Unicode encodings 
_do_ guarantee data integrity. Another important point is that the 
choice of an encoding is a crucial piece of metadata for bytes 
undergoing IO, and thus it's the sort of thing which is indicated 
explicitly in higher-level protocols and formats (HTTP, MIME, and XML, 
to name a few), and it is almost always indicated *by name*. IANA 
mantiains a registry of these, and many protocols use the IANA names to 
specify encodings.

Another important point: The Unicode Character Model (unlike many other 
standards) thinks of the process of going from strings to bytes as a 
multi-step process. That's useful from a pedagogical point of view, but 
in fact no one much cares about the results of the intermediate steps. 
(That is, Unicode describes the process as: characters --> code points 
--> code units --> bytes, and I'm saying that the code units aren't 
very useful to think about.) The reason for this is very 
simple--encodings are only about data interchange, and data interchange 
protocols and formats want to specify the encoding with a single name, 
which picks out the whole strings-to-bytes mapping. They don't much 
care if two different encodings could be thought of as agreeing on one 
of those steps, and differing on another. If you're writing a library 
to convert between sequences of bytes in different encodings, then this 
can also be useful (to minimize code duplication), but it really has 
nothing to do with the semantics of a string.

What's a Character Set?

As mentioned above, some people use the term "character set" to mean 
what I'm calling "encoding".

In my usage, a "character set" is something simpler--it's a set of 
characters, in the mathematical sense of "set". It's an unordered 
collection of characters. The letter "A", the comma, and the Japanese 
character for "middle"--there, that's a character set. The Unicode 
Standard uses the term "abstract character repertoire" for this notion. 
That's actually less ambigous terminology, but quite a mouthful. I'll 
try to use "repertoire".

A character set, in this sense, isn't a terribly interesting concept, 
and it's a good term to stay away from, given the ambiguity. But people 
often say things like, "the ASCII character set". Now, in my usage, the 
ASCII standard primarily defines an encoding. However, subject to a few 
caveats mentioned above, an encoding implicitly picks out a character 
set--specifically, the set of characters that it can encode. So in 
light of this, I can meaningfully say "the ASCII character set", and 
talk about "how the Shift-JIS encoding handles characters in the 
ISO-8859-1 character set". I mentioning this not to particularly 
encourage the usage, but because you'll hear it, and because it's a bit 
less awkward to say than things like, "how the Shift-JIS encoding 
handles characters encodable in the ISO-8859-1 encoding". And also, 
it's convenient to say, "ASCII characters" (through a small abuse of 
language) to talk about "ABC...", even in the context of another 
encoding.

Wow, so we've covered a lot of ground. Let's sum up again: Conceptually 
strings model a sequence of abstract characters--that's their job. 
Encodings (or Character Maps) define interchange formats, and are only 
important to IO--to let you transmit strings between processes (via the 
network or via files). Encodings are basically algorithms, but in usage 
they tend to be identified by conventional names.

What's a Locale?

One of the things that people like to do with strings is to manipulate 
them. Two prototypical things they want to do are case transformations 
(uppercasing and lowercasing) and sorting. And they also want to do 
things like create string representations of numbers and dates, and the 
inverse--interpret strings as representing numbers and dates.

Now as it turns out, different people want to do these things in 
different ways--think of "1/31/2004" v. "31/1/2004" v. "2004-1-31".

Consequently, that means the sorting algorithms and number formatting 
algorithms need to know which way you want to do things (and which way 
I want to do them). Traditionally that means that these types of 
algorithms need to take a parameter (explicitly or implicitly) to 
specify this choice. Traditionally, this parameter is called a 
"locale".

Now this word can cause a lot of confusion. It doesn't need to. In the 
most concrete sense, a "locale" just specifies a set a algorithms and 
settings--things like the format to use for dates, the comparison 
operation to use for sorting, and the characters to use for the decimal 
and thousands separators for numbers (think of "1,000.50" v. 
"1.000,50"). In fact, to reflect this definition, I'm going to coin a 
new term. I'm going to call this a "Text Preferences Set", or "TPS" for 
short, to emphasize this defintion.

Now where the complexity tends to come in also explains where the term 
"locale" came from. People, as it turns out, don't (often) sit around 
thinking up new sort orderings, or new formats to display numbers and 
dates. More often, they want to do these things in a way that matches 
the custom of some language or country or region--they want to sort 
strings to match somebody's phone book or dictionary. Consequently, 
they find it convenient to specify a TPS by specifying a langauge or 
country, and use notations such as "en" for English and "en_US" for US 
English (v. British English). But from an API perspective, this is just 
a way to go look up a TPS--once you have one, it doesn't much matter 
where it came from (looked up by some name or build up 
programmatically, piecemeal), you just use it.

Sorting, and Tailorings

Now as I hinted at above, people in different countries like to sort 
strings in different ways. Fair enough. That makes it sound like I need 
to define a whole bunch of binary string comparison functions--one for 
each different custom. One for German, one for Swedish, one for 
English, one for Japanese, etc. As it turns out, there's a more compact 
way to do this.

The Unicode Standard defines a base collation algorithm (sort 
algorithm)--you can use it to sort strings composed of any characters 
at all, but it doesn't necessarily give you an order that matches any 
particular langauge's custom (though for many languages it's at least 
reasonable). This is called the Unicode Collation Algorithm. Now, this 
algorithms supports the concept of "tailorings"--tweaks to change the 
sort behavior for certain characters (or sequences of characters). The 
idea is that you start with the UCA, and supply one set of tailorings 
to get the customary English sort order, another set to get the German 
sort order, etc. There are three really nice things about this 
approach:

1) Rather than writing a whole new algorithm, you just tweak a base 
algorithm, and these tweaks are usually data-driven. (That is, they 
usually take the form of modifications to a weighting table, rather 
than an algorithmic difference.)

2) Under any set of tailorings, you can sort strings containing any 
characters. So, you can meaningfully talk about how Japanese words sort 
under the English tailorings (or English locale or traditional English 
TPS). It's just that, naturally, the sorting of Japanese words won't 
differ under English v. German sorting rules (but sorting of some 
English words might), and English words probably won't sort differently 
under Japanese v. Taiwanese sorting rules (but Japanese and Chinese 
words might).

3) In contexts where you don't know what language is relevant, you have 
a reasonable fallback sort order.

So for sorting, a language (or more generally, a TPS) just provides a 
bit of tuning on top of a default algorithm, and you can sort with or 
without taking language into account.

Now, it's important to note that the UCA isn't just sorting in terms of 
the numerical values of the Unicode code points assigned to charactes. 
But you can sort this way too--it's sometimes called "binary 
order"--and it can make sense to do this in situations where you don't 
much care _what_ the sort order is, just that you have something 
canonical (for instance, so that you can count duplicates in a list). 
The reason you might want to use binary order in cases like this is 
that it's faster, and simpler to implement. (Also, you might know that 
it's an acceptable sort order for your problem domain--maybe you're 
sorting part numbers, for instance.)


So, another sum-up: A TPS (or "locale") is a set of preferences 
relating to how you want to carry out certain string operations, and 
often these preferences take the form of "tailorings" or tweaks on top 
of a base algorithm.

Another place where this sort of thing comes up is with case 
mappings--how you uppercase or lowercase strings depends somewhat on 
language-based conventions.

Another use for uppercasing or lowercasing in English is to perform 
case-insensitive string comparison, so that "foo" and "FOO" and "Foo" 
and "FoO" compare as the same. In English, you can either uppercase 
everything before comparison, or lowercase everything--it doesn't much 
matter which you do, the result will be the same. In other languages, 
that's not necessarily the case. Conveniently, the Unicode Standard 
defines a process called "case folding", which is designed to let you 
do case-insensitive comparisons. Case folding transforms strings into 
case-canonicalized representations (conceptually like uppercasing then 
lowercasing everything), and it does this in a single, 
non-language-sensitive way. So even though language is relevant to case 
mappings, you can use case folding to do case-insensitive string 
comparisons without specifying a language (or TPS).

So a take-away point here is that the Unicode Specification provides 
infrastructure in a number of areas for doing operations in a 
TPS-sensitive or TPS-insensitive manner, and provides a smooth 
transtion between the two.

A note about Language

There's one thing about langauge-sensitive operations which people tend 
to overlook, and which it vitally important: for string operations, 
it's not the language _of_the_string_ which is important, it's the 
language _of_the_reader_. The canonical example here is this: Consider 
a list of names, some of them German, and some of them Swedish. I, as 
an English-speaking reader, would want to see these sorted in the 
_English_ sort order. I don't much care how Germans or Swedes would 
sort them. My local phone book doesn't take the national origin of a 
person into account when organzing the listing. It doesn't matter--the 
point of having it sorted it to allow people in the US to look up a 
name. This means that there isn't a problem of how to compare "English 
strings" and "German strings"--the language of the sort operation, not 
of the individual strings, determines the binary comparison algorithm 
used in the sort. That also means that it's a non-problem to have a 
single string containing words from different languages.

This also means that, in the US, the German word "STRASSE" (on a sign 
or in the name of a company, for instance) would lowercase as 
"strasse", even though in Germany it might be preferable to use 
"straße".


What's a Grapheme Cluster?

More about this later, but briefly: As it turns out, certain langauges 
have certain funny conventions about grouping sequences of characters 
into what they see as one "thing". For instance, Spanish considers the 
character sequence "ll" to be a separate letter (not two "l"'s), and 
similarly for "rr". Now, they don't really treat them specially from an 
"information processing" point of view--I don't believe that there are 
any encodings which treat "ll" as a single character, and I don't think 
that Spanish typewriters have a separate key for "ll". It's just that 
they think of it as a separate letter--it shows up when they are 
reciting the alphabet, and they see the word "llama" as containing four 
letters (the first letter being called "ell-yay", with the word 
pronounced "yama", not "lama"), and they would sort that word after 
"luego" (because "ll" comes after "l").

The Unicode Standard calls this a "grapheme cluster", and (as mentioned 
above) it's relevant to localized sorting and word-length counting. 
This is a somewhat unfortunate term, since "grapheme" is a linguistics 
term, and it means something different to linguists. Previous versions 
of the Unicode Standard used to term "grapheme", but they've migrated 
to using "grapheme cluster" to minimize confusion, though the standard 
is still somewhat jumbled in this regard.

But the brief take-home idea here is: "grapheme cluster" ("grapheme" 
for short, with reservations), in this context, is a sequence of one or 
more characters (a range in a string) which natural-language users 
would see as a unit. This is again a situation where language (or TPS) 
is relevant, but where the Unicode Standard provides a default 
TPS-independent implementation which allows for tailorings. So again 
there's a smooth transition for language-independent to 
language-dependent settings.

Also, importantly, a grapheme cluster is a notion built on top of 
characters (it's a cluster of characters), and choosing a langauge lets 
you refine how you break up a string into grapheme clusters, but it's 
just a refinement--"adding a language into the mix" doesn't pick out a 
different semantic construct, it just help you customize your choice of 
what ranges make up single graphemes.


I haven't yet covered a few important topics, such as different 
character sequences representing equivalent graphemes, canonical and 
compatability equivalence, and Unicode normalization forms. I also 
haven't said anything yet about concrete implementation or API 
guidelines.

JEff

Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About