Re: Files, Directories, Resources, Operating Systems

From: Darren Duncan
Date: November 26, 2008 19:34
Subject: Re: Files, Directories, Resources, Operating Systems
Message ID: 492E1531.7080401@darrenduncan.net

Tom Christiansen wrote:
> I believe database folks have been doing the same with character data, but
> I'm not up-to-date on the DB world, so maybe we have some metainfo about
> the locale to draw on there.  Tim?

AFAIK, modern databases are all strongly typed, at least to the point that each 
value you store in or fetch from them is explicitly character data or binary 
data or a number or what-have-you.  So when you deal with a DBMS in terms of 
character data, it is explicitly specified somewhere (either locally for the 
data or globally/hardcoded for the DBMS) that each character value belongs to a 
particular character repertoire and text encoding.  The DBMS therefore knows 
what encoding etc the character data is in, or at least it treats the data 
consistently based on what the user said it was when they input it.  The only 
time this information isn't really remembered is if the data is supplied as 
binary data.

Maybe some older or unusual DBMSs aren't this way, and of course technically a 
filesystem etc *is* a database ... I think the example mentioned, of filename 
storage being locale dependent, probably meant that at the actual filesystem 
level the names were just being handled as binary data.

> There is ABSOLUTELY NO WAY I've found to tell whether these utf-8
> string should test equal, and when, nor how to order them, without
> knowing the locale:
> 
>     "RESUME",
>     "Resume"
>     "resume"
>     "Resum\x{e9}"
>     "r\x{E9}sum\x{E9}"
>     "r\x{E9}sume\x{301}"
>     "Re\x{301}sume\x{301}"
> 
> Case insensitively, in Spanish they should be identical in all
> regards.  In French, they should be identical but for ties, 
> in which case you work your way right to left on the diactricals.

This leads me to talk about my main point about sensitivity etc.

I believe that the most important issues here, those having to do with identity, 
can be discussed and solved without unduly worrying about matters of collation; 
identity is a lot more important than collation, as well as a precondition for 
collation, and collation is a lot more difficult and can be put off.  With 
respect to dealing with a file system, generally it is just identity that 
matters and collation is a concern that can typically be just tacked on after 
identity is solved.

That is, with a file system you need to know whether or not a file name you hold 
will or won't match a file in the system, and matching or not-matching is the 
main function of an identity.  Similarly, the file system has to make sure that 
no 2 distinct files in it have the same file name, that is the same public 
identity.  In contrast, the order in which you sort a list of files by their 
names usually isn't so important; while all work with a file system requires 
working with identities, most work does not need to deal with collation.  In 
practice several parties can agree on a single means of identifying files, while 
still having their own favorite collations, so the same list can be ordered in 
different ways.

Collation criteria are something that can be naturally applied externally to a 
file system, such as by a user program, and only identity criteria need to be 
built-in to the file system.

So collation doesn't need to be considered in Perl's file-system interface, 
while identity does; collation can be a layer on top of the core interface that 
just cares about identity.
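
To illustrate that layering, here is a minimal sketch (in Python rather than 
Perl, purely for illustration).  The set "names" and the function "exists" are 
hypothetical stand-ins for a directory and its lookup; the point is only that 
lookup uses exact identity, while each caller can apply its own ordering on top.

```python
# Hypothetical stand-in for a directory: a set of distinct names.
names = {"Resume", "resume", "RESUME", "r\u00e9sum\u00e9"}

def exists(name):
    # Identity test: an exact, everything-sensitive match.
    return name in names

# Two caller-side collations over the same set of identities.
# Plain codepoint order:
by_codepoint = sorted(names)
# Case-folded order with lowercase spellings first among ties
# (swapcase is just a cheap trick to invert the case ordering):
by_lower_first = sorted(names, key=lambda n: (n.casefold(), n.swapcase()))

print(exists("Resume"), exists("ReSuMe"))
print(by_codepoint)
print(by_lower_first)
```

Both orderings list the same four names; only the sequence differs, and neither 
affects whether a given name matches.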

One maxim I apply in my database work, and that I believe applies to this 
discussion, is "any logical difference is a big difference".  If you have 2 
distinct value literals such that you consider the difference in each literal's 
spelling to be significant, such that you can't for all use cases substitute one 
literal for the other, then the 2 literals denote 2 distinct values; in the 
other case, where you can always substitute one for the other harmlessly, then 
they denote the same value.  The concepts of 'value' and 'identity' are the 
same, and any value is its own identity.

And so, with your 7 'resume' literals, I would say that if there is a reason for 
any of the spellings to exist that couldn't be handled by one of the other 
spellings, then all 7 literals are distinct/non-identical taken as-is.

If you *know* that the 7 strings are all UTF-8, then locale doesn't have to be 
considered for equality; just your Unicode abstraction level matters, such as 
whether you're defining the values in terms of graphemes vs codepoints vs bytes.
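
As a rough sketch of how the abstraction level changes the answer, the 
following Python snippet compares two of the spellings from the list at the 
byte level, the codepoint level, and (approximating the grapheme level by 
canonical composition) after NFC.  Using NFC as a stand-in for grapheme 
comparison is an assumption made for this illustration.

```python
import unicodedata

precomposed = "r\u00e9sum\u00e9"   # both accents as single codepoints
decomposed = "r\u00e9sume\u0301"   # last accent as 'e' + COMBINING ACUTE ACCENT

# Codepoint level: the sequences differ, so the strings are not identical.
codepoints_equal = precomposed == decomposed

# Byte level (UTF-8): different codepoints encode to different bytes.
bytes_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# Approximate grapheme level: compare after canonical composition.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

print(codepoints_equal, bytes_equal, nfc_equal)  # False False True
```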

When talking about identity, there is no such thing as case-insensitivity or 
accent insensitivity or whitespace insensitivity or what have you.  If you have 
any reason to not replace every "E" with an "e" or vice-versa in your character 
string, then you consider those 2 non-identical and so they wouldn't match; by 
contrast, true case-insensitivity means you can replace every "e" with an "E" 
(for example) and forget that an "e" ever existed; the actual equality test is 
then the same since all comparands would only have the "E".

And so I brought up before that the generalization of case-concerning matters is 
about normalization and folding.  Where the normal situation of 
everything-sensitive character data is just "$foo eq $bar", case-insensitive is 
really "lc($foo) eq lc($bar)", accent-insensitive is "strip_accents($foo) eq 
strip_accents($bar)", and whitespace-insensitive is "strip_ws($foo) eq 
strip_ws($bar)"; in every case, the actual "eq" is everything-sensitive, but 
since its arguments have been normalized, the actual domain of characters they 
would compare is smaller.  In each normalized case, you aren't comparing $foo 
and $bar at all, but rather you are comparing 2 other values.
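
A minimal sketch of that idea, written in Python for illustration: each 
"insensitive" comparison is an ordinary exact comparison applied to folded 
values.  The helper names fold_case, strip_accents and strip_ws mirror the 
pseudocode above and are hypothetical; the accent stripping shown (decompose, 
then drop combining marks) is one simple approach, not the only one.

```python
import unicodedata

def fold_case(s):
    # Full Unicode case folding, as provided by Python's str.casefold().
    return s.casefold()

def strip_accents(s):
    # Canonically decompose, then drop every combining mark.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def strip_ws(s):
    # Remove all whitespace characters.
    return "".join(s.split())

def fold_all(s):
    # Compose the three foldings; the result is what actually gets compared.
    return strip_ws(strip_accents(fold_case(s)))

foo = "R\u00e9sum\u00e9"
bar = "re sume"

print(foo == bar)                      # everything-sensitive: False
print(fold_all(foo) == fold_all(bar))  # same "==" on folded values: True
```

In both tests the comparison operator is identical; only its arguments differ.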

Now normalization can be arbitrarily complex or different, but ultimately it's 
just a functional mapping or functional dependency (the normalized version is 
the dependent and the non-normalized one is the determinant) and at the end of 
the day the actual comparison or identity tests are the same and simple.

As for collation, if the collation is deterministic and fully-ordered then no 2 
distinct characters compare as 'same', and one will always sort before the 
other.  If 2 characters sort as 'same' then either the collation is just 
partially-ordered, in which case the 2 characters would sort in an arbitrary 
order relative to each other, or otherwise the 2 characters are in fact the same 
character and all occurrences of one can be safely replaced by the other.  No 
matter what your collation is, no 2 characters considered non-identical would 
compare as 'same' unless the collation was just partially ordered.
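
The difference between a partially-ordered and a fully-ordered collation can be 
sketched in Python as follows.  This works on whole strings rather than single 
characters, and the tie-breaker (raw codepoint order) is an arbitrary choice 
made for the illustration.

```python
words = ["resume", "RESUME", "Resume"]

# Partially ordered: the key treats all three as 'same', so their relative
# order is whatever the input order was (Python's sort is stable).
partial_a = sorted(words, key=str.casefold)
partial_b = sorted(reversed(words), key=str.casefold)

# Fully ordered: break ties on the raw string, so distinct values never
# compare as 'same' and the result no longer depends on input order.
def total_key(w):
    return (w.casefold(), w)

total_a = sorted(words, key=total_key)
total_b = sorted(reversed(words), key=total_key)

print(partial_a == partial_b)  # False
print(total_a == total_b)      # True
```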

> See what a mess it's going into?  Larry, can you think of something
> simple?  I haven't been able to.  Unicode solves so few of the problems 
> people think it does.  We've still so much to do, and I don't just
> mean perlers.

AFAIK, Unicode does have an answer for the most important problems.

> Darren>> To summarize, what we really want is something more generic
> Darren>> than case-sensitivity, which is text normalization and text
> Darren>> folding in general, as well as distinctly dealing with
> Darren>> distinctness for representation versus distinctness for mutual
> Darren>> exclusivity.
> 
> I think that you might have to use a Unicode::Collator object, since
> the standard DUCET.  It doesn't help much for actual locales, but it
> does take care of some of things you're concerned with.

Makes sense.

> Two issues:
> 
>   **MAJOR** This is the opposite of small, fast, svelte.
>     minor   You had better use the canonical forms, since
>             you don't want 
> 
>               "e\x{COMBINING DOWN TACK BELOW}\x{COMBINING TILDE}\x{LATIN SMALL LETTER N WITH LEFT HOOK}je\x{COMBINING DOWN TACK BELOW}"
>               "e\x{COMBINING TILDE}\x{COMBINING DOWN TACK BELOW}\x{LATIN SMALL LETTER N WITH LEFT HOOK}je\x{COMBINING DOWN TACK BELOW}"
> 
>             to be different; nor, case-insensitively, for these to differ:
> 
> 	      "EN\x{COMBINING TILDE}E" 
> 	      "e\N{LATIN CAPITAL LETTER N WITH TILDE}e" 

This depends on your abstraction level.  If you're working in terms of graphemes 
then AFAIK those are considered identical.  If in terms of codepoints then not.  
But still, I agree that canonical form use is very helpful and ideal, since 
then you don't need to use the grapheme level and the codepoint level would do 
what we want.  But technically this is an example of what I was saying about 
normalization.  If Perl's "eq" was always codepoint oriented, then people would 
have to say eg "nfc($foo) eq nfc($bar)" to get grapheme-level comparisons.  
But the grapheme abstraction level is generally what you want anyway since 
character data is for humans and humans don't consider the various Unicode 
normal forms as distinct characters; they *display* with exactly the same glyphs.
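
For the two quoted pairs, a codepoint-level comparison of canonical forms can be 
sketched in Python like this.  The helper "nfc" is a thin hypothetical wrapper 
around the standard library; folding case and then composing is a 
simplification of the full Unicode caseless-matching rules, which is enough for 
these particular strings.

```python
import unicodedata

def nfc(s):
    # Canonical composition (Unicode Normalization Form C).
    return unicodedata.normalize("NFC", s)

# Pair 1: the same two combining marks on the first "e", in either order.
# U+031E COMBINING DOWN TACK BELOW, U+0303 COMBINING TILDE,
# U+0272 LATIN SMALL LETTER N WITH LEFT HOOK.
a = "e\u031e\u0303\u0272je\u031e"
b = "e\u0303\u031e\u0272je\u031e"

# Pair 2: decomposed uppercase vs precomposed mixed case.
c = "EN\u0303E"   # 'N' followed by COMBINING TILDE
d = "e\u00d1e"    # LATIN CAPITAL LETTER N WITH TILDE

print(a == b, nfc(a) == nfc(b))
print(c.casefold() == d.casefold(),
      nfc(c.casefold()) == nfc(d.casefold()))
```

In each pair the raw codepoint comparison fails and the comparison of 
normalized values succeeds.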

> Darren>> [This] implies that sensitivity is special whereas sensitivity
> Darren>> should be considered normal, and rather insensitivity should be
> Darren>> considered special.
> 
> I think Darren may be right, because even case-sensitivity is a real problem.

It sure is.

-- Darren Duncan
