Re: Files, Directories, Resources, Operating Systems
From: Richard Hainsworth
Date: November 27, 2008 04:24
Subject: Re: Files, Directories, Resources, Operating Systems
Message ID: 492E9190.4080308@rusrating.ru
Just as a variable name in perl6 must conform to a standard and abide by
a set of constraints, why should file or other resource names be an
exception?
The constraints on variable names in perl6 are very flexible, but there
are some rules that must be enforced for a program to work.
It seems to me that resource (e.g. file) names should also be
constrained so that software portability can be ensured. A reasonably
constructed set of constraints for the perl6 core should deal with most
locale/OS/character set considerations, and where a particular
environment cannot cope, a module will be needed to "eigenmunge"
the names appropriately.
Suppose for the sake of argument we state that resource names in perl6
shall comply with the rules for variable names, and that the sort sequence
of such names is the one defined for Unicode strings.
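
As a rough illustration of that supposition, here is a minimal sketch in Python (not perl6, purely so that it is runnable). The names is_identifier_like and sort_names are hypothetical. The identifier rule is only an approximation of the perl6 one (a letter or underscore first, alphanumerics after, with - or ' allowed between segments), and "the sort sequence defined for Unicode strings" is assumed here to mean plain code point order after NFC normalization.

```python
import re
import unicodedata


def is_identifier_like(name: str) -> bool:
    """Approximate check: does `name` look like a perl6-style identifier?"""
    # Compose first so that e + U+0301 is judged as the single letter e-acute.
    name = unicodedata.normalize("NFC", name)
    # Segments are separated by a single hyphen or apostrophe.
    for segment in re.split(r"[-']", name):
        if not segment:
            return False  # empty name, or a leading/trailing/doubled separator
        if not (segment[0].isalpha() or segment[0] == "_"):
            return False  # each segment must start with a letter or underscore
        if not all(ch.isalnum() or ch == "_" for ch in segment):
            return False  # dots, slashes, spaces and the like are rejected
    return True


def sort_names(names):
    """Assumed ordering: compare NFC forms code point by code point."""
    return sorted(names, key=lambda s: unicodedata.normalize("NFC", s))
```

Under this sketch "my-file" and "r\u00e9sum\u00e9" pass, while "notes.txt" and "my-2" do not, so a literal reading of the rule would exclude dotted extensions unless the rule is relaxed for resources.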
Where software in perl6 is written for a specific domain, eg. Catalan or
Russian, the programmer will know more about the domain and how to deal
with resource names in that locale. This would include sort sequences
and the complexities Tom outlined. Such things would be relegated to OS
/ domain specific modules.
Would this help?
Tom Christiansen wrote:
> In-Reply-To: Message from Darren Duncan <darren@darrenduncan.net>
> of "Wed, 26 Nov 2008 19:34:09 PST." <492E1531.7080401@darrenduncan.net>
>
>
>> Tom Christiansen wrote:
>>
>
>
>>> I believe database folks have been doing the same with character data, but
>>> I'm not up-to-date on the DB world, so maybe we have some metainfo about
>>> the locale to draw on there. Tim?
>>>
>
>
>> AFAIK, modern databases are all strongly typed at least to the point
>> that the values you store in and fetch from them are each explicitly
>> character data or binary data or numbers or what-have-you; and so,
>> when you are dealing with a DBMS in terms of character data, it is
>> explicitly specified somewhere (either locally for the data or
>> globally/hardcoded for the DBMS) that each value of character data
>> belongs to a particular character repertoire and text encoding, and so
>> the DBMS knows what encoding etc the character data is in, or at least
>> it treats it consistently based on what the user said it was when it
>> input the data.
>>
>
> Oh, good then. That's what I'd heard was happening, but wasn't sure since
> I've steered clear of such beasties since before it was true.
>
> I wish our filesystems worked that way. But Andrew said something to me
> last week about Ken and Dennis writing quite pointedly that while you
> *could* use the f/s as a database, that you *shouldn't*. I didn't know
> the reference he was thinking of, so just nodded pensively (=thoughtfully).
>
>
>>> There is ABSOLUTELY NO WAY I've found to tell whether these utf-8
>>> strings should test equal, and when, nor how to order them, without
>>> knowing the locale:
>>>
>>> "RESUME",
>>> "Resume",
>>> "resume",
>>> "Resum\x{e9}",
>>> "r\x{E9}sum\x{E9}",
>>> "r\x{E9}sume\x{301}",
>>> "Re\x{301}sume\x{301}"
>>>
>
>
>>> Case insensitively, in Spanish they should be identical in all regards.
>>> In French, they should be identical but for ties, in which case you
>>> work your way right to left on the diacriticals.
>>>
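
To make the seven strings above concrete, here is a small Python sketch (Python only for runnability; the helper names nfc, folded and distinct are made up for illustration). It counts how many distinct names remain under three locale-free notions of identity: raw code points, NFC normalization, and NFC plus full case folding. It says nothing about which of the remaining classes a particular locale would merge, which is the open question in this thread.

```python
import unicodedata

# The seven spellings from the message, written with Python escapes.
NAMES = [
    "RESUME",
    "Resume",
    "resume",
    "Resum\u00e9",
    "r\u00e9sum\u00e9",
    "r\u00e9sume\u0301",
    "Re\u0301sume\u0301",
]


def nfc(s: str) -> str:
    """Canonical composition: e + U+0301 becomes U+00E9."""
    return unicodedata.normalize("NFC", s)


def folded(s: str) -> str:
    """NFC followed by full Unicode case folding (no locale tailoring)."""
    return nfc(nfc(s).casefold())


def distinct(names, key=lambda s: s) -> int:
    """Number of equivalence classes of `names` under `key`."""
    return len({key(n) for n in names})
```

With these assumptions, distinct(NAMES) is 7, distinct(NAMES, nfc) is 6, and distinct(NAMES, folded) is 3: the classes that survive are "resume", "resum\u00e9" and "r\u00e9sum\u00e9".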
>
>
>> This leads me to talk about my main point about sensitivity etc.
>>
>
>
>> I believe that the most important issues here, those having to do with
>> identity, can be discussed and solved without unduly worrying about
>> matters of collation;
>>
>
> It's funny you should say that, as I could nearly swear that I just showed
> that identity cannot be determined in the examples above without knowing
> about locales. To wit, while all of those sort somewhat differently, even
> case-insensitively, no matter whether you're thinking of a French or a
> Spanish ordering (and what is English's, anyway?), you have a more
> fundamental = vs != scenario which is entirely locale-dependent.
>
> If I can make a "RESUME" file, ought I be able to make a distinct
> "r\x{E9}sum\x{E9}" or "re\x{301}sume\x{301}" file in a case-ignorant
> filesystem? There is no good answer, because we might think it
> reasonable to test
>
> lc(strip_marks($old_fn)) eq lc(strip_marks($new_fn))
>
> The problem is that what is or is not a "mark" varies by locale:
>
> * Castilian doesn't think ~ is a mark; Portuguese does, and
> so if you strip marks, in Castilian you count as the same
> two letters that it deems distinct, but in Portuguese you
> incur no lasting harm.
>
> * Catalan doesn't think ¸ is a mark; French does, and so if you strip
> marks, in Catalan you count as the same two letters that it deems
> distinct, but in French or Portuguese you incur no lasting harm.
>
> * Modern English (usually) decomposes æ into a+e, but OE/AS and
> Icelandic do not.
>
> * Moreover, Icelandic deems é and e to be completely
> different letters altogether. If you strip marks, you
> count as the same letters that that language does not.
> Similarly with ö, which is at the end of their alphabet,
> (like ø in some), and nowhere near o or ó. BTW, those
> are three separate letters, not variants.
>
> * And in OE/AS you could have a long mark on an asc (say "ash" for the
> atomic *letter* æ). If split into a and e and stripped of marks, it
> wouldn't make any sense at all.
>
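
The lc(strip_marks(...)) comparison above can be sketched in Python as follows. This is an assumption about what strip_marks would do (decompose, then drop every code point in a Unicode "Mark" category), not a function from any existing library, and it is deliberately locale-blind, which is exactly what the list above objects to.

```python
import unicodedata


def strip_marks(s: str) -> str:
    """Hypothetical strip_marks: decompose, then drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")  # Mn, Mc, Me
    )
    return unicodedata.normalize("NFC", kept)


def same_name(old_fn: str, new_fn: str) -> bool:
    """Locale-blind analogue of lc(strip_marks($old_fn)) eq lc(strip_marks($new_fn))."""
    return strip_marks(old_fn).lower() == strip_marks(new_fn).lower()
```

Under this sketch same_name("RESUME", "r\u00e9sum\u00e9") is true, n-tilde (U+00F1) collapses to n, c-cedilla (U+00E7) to c, and o-diaeresis (U+00F6) to o, whatever the language thinks of those letters. By contrast U+00E6 (ash) and U+00F8 (o-slash) come through unchanged, because Unicode gives them no decomposition.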
> Case in point: Ælene Frisch, whom many of you doubtless know, insists her
> name be spelt as I have written it. She does not want Aelene Frisch, for
> she considers her forename to have 5 letters in it, not 6. But Unicode
> doesn't give us a title case version of that (did AS?), suggesting it a
> ligature not a digraph.
>
> But if we have a file called "ÆLENE", may we assume it the same in a
> case-insensitive sense to both "aelene" and "ælene"?
>
> I can only go on code-points, because I don't want to deal with ß and SS
> and Ss. Case-folding file systems are just begging for trouble, and I just
> don't know what to do. Think of the 3 Greek sigmata.
>
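
For the case-mapping worries above (the ash, the sharp s, the Greek sigmas), here is a short Python sketch of what the locale-free Unicode case operations do. The helper names case_forms and case_insensitive_equal are illustrative only, and the sketch uses Python's built-in str methods, not anything from Perl.

```python
def case_forms(s: str) -> dict:
    """The three locale-free case operations Python exposes on str."""
    return {"lower": s.lower(), "upper": s.upper(), "casefold": s.casefold()}


def case_insensitive_equal(a: str, b: str) -> bool:
    """Compare by full case folding, with no locale tailoring."""
    return a.casefold() == b.casefold()


# U+00DF (sharp s) grows when uppercased or folded.
sharp_s = case_forms("\u00df")

# Capital sigma, small sigma and final small sigma share one folded form.
sigma_folds = {ch.casefold() for ch in "\u03a3\u03c3\u03c2"}
```

Here sharp_s["upper"] is "SS" and sharp_s["casefold"] is "ss", so one character becomes two. Folding treats "\u00c6LENE" and "\u00e6lene" as equal but not "\u00c6LENE" and "aelene", and all three sigmas fold to U+03C3.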
>
>> identity is a lot more important than collation, as well as a
>> precondition for collation, and collation is a lot more difficult and can
>> be put off.
>>
>
> I agree with everything save "and can be put off". I would like
> you to be right. I should truly wish to be mistaken. And I don't know
> what we have for prior (cough) art.
>
>
>> respect to dealing with a file system, generally it is just identity that
>> matters and collation is a concern that can typically be just tacked on
>> after identity is solved.
>>
>
>
>> That is, with a file system you need to know whether or not a file name
>> you hold will or won't match a file in the system, and matching or not-
>> matching is the main function of an identity.
>>
>
> But you can't match without knowing locales. It's NOT just collation. I'll
> leave Icelandic out of it, but look at the trouble with 0xDF spilling from
> one each to two chars and two bytes in the perl5 regex engine. Then look
> at 0xFF spilling from one char to one char and three bytes there. It's
> just plain horripilating.
>
>
>> Collation criteria is something that can be naturally applied externally
>> to a file system, such as by a user program, and only identity criteria
>> needs to be built-in to the file system.
>>
>
> I don't think you can do identity (case-wise) correctly without regard to
> digraphs and a world of weirdnesses we really wish we didn't have. But you know
> what else I wonder: what existing art *IS* there? It's so hard a problem
> that I wonder if anyone has done a good job at it.
>
> Talking to the standards geeks at Usenix, including Andrew, brought no joy.
> They basically just threw up their hands, and lunch. I really wish I
> could talk to Rob Pike and Udi Manber, my old theory and regex prof, but I
> think they've both drunk the Googlaide now. I know Google strips accents
> willy-nilly and does case-insensitive compares, but I don't know if that's a
> global solution.
>
>
>> So collation doesn't need to be considered in Perl's file-system
>> interface, while identity does; collation can be a layer on top of the
>> core interface that just cares about identity.
>>
>
> That seems a simplified version of reality. Identity isn't what monoglots
> think it is.
>
>
>> If you *know* that the 7 strings are all UTF-8, then locale doesn't have
>> to be considered for equality; just your unicode abstraction level
>> matters, such as if you're defining the values in terms of graphemes vs
>> codepoints vs bytes.
>>
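
The three abstraction levels mentioned above (graphemes, codepoints, bytes) give different answers even for a single string. A minimal Python sketch, with the caveat that the grapheme count is a rough approximation (every non-combining code point is assumed to start a new cluster) and not the full Unicode text segmentation algorithm; the function name lengths is hypothetical.

```python
import unicodedata


def lengths(s: str) -> dict:
    """Length of `s` at three abstraction levels."""
    return {
        # Bytes of the UTF-8 encoding.
        "bytes": len(s.encode("utf-8")),
        # Python strings are sequences of code points.
        "code_points": len(s),
        # Approximation: count code points whose combining class is 0.
        "graphemes": sum(1 for ch in s if unicodedata.combining(ch) == 0),
    }


composed = lengths("r\u00e9sum\u00e9")      # precomposed e-acute twice
mixed = lengths("r\u00e9sume\u0301")        # second accent as U+0301
```

The composed spelling measures 8 bytes, 6 code points and 6 graphemes; the mixed one measures 9 bytes, 7 code points and still 6 graphemes, so equality depends on the level chosen before locale even enters the picture.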
>
> That's not true. é is not the same letter as e in Icelandic.
>
>
>>> See what a mess it's going into? Larry, can you think of something
>>> simple? I haven't been able to. Unicode solves so few of the problems
>>> people think it does. We've still so much to do, and I don't just
>>> mean perlers.
>>>
>
>
>> AFAIK, Unicode does have an answer for the most important problems.
>>
>
>
>>> Darren>> To summarize, what we really want is something more generic
>>> Darren>> than case-sensitivity, which is text normalization and text
>>> Darren>> folding in general, as well as distinctly dealing with
>>> Darren>> distinctness for representation versus distinctness for mutual
>>> Darren>> exclusivity.
>>>
>
>
>>> I think that you might have to use a Unicode::Collator object, since
>>> it uses the standard DUCET. It doesn't help much for actual locales, but it
>>> does take care of some of the things you're concerned with.
>>>
>
>
>> Makes sense.
>>
>
> Yes, I think so too. But it is very expensive in performance. Play with
> my program. Makes you want to cheat.
>
>
>>> Darren>> [This] implies that sensitivity is special whereas sensitivity
>>> Darren>> should be considered normal, and rather insensitivity should
>>> Darren>> be considered special.
>>>
>
>
>>> I think Darren may be right, because even case-sensitivity is a real
>>> problem.
>>>
>
>
>> It sure is.
>>
>
> No kidding. :-(
>
> --tom
>