develooper Front page | perl.perl6.language | Postings from April 2005

Re: Question about list context for String.chars

From: gcomnz
Date: April 11, 2005 12:40
Subject: Re: Question about list context for String.chars
Message ID: fb077977050411124033f8eb9f@mail.gmail.com

I have to say I'm slightly confused too about some scripts, especially
syllabic alphabets. At the same time, I'm pretty clear about CJK,
syllabaries, and alphabets, or at least I hope I'm clear (I guess I'm
about to find out): .chars just returns characters at the right Unicode
level for whatever the string contents require.

"abc".chars would return <a b c>, which I'm guessing would usually be
one byte each.

"日本語".chars would return <日 本 語>, which can probably be expressed in UTF-8?
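
To make the two guesses above concrete, here is a minimal sketch in Python
rather than Perl 6 (Python's str is a sequence of code points, which stands
in here for the proposed list-context .chars). The helper name `describe` is
hypothetical, and nothing here claims to be how Perl 6 will behave.

```python
# Illustration only: split a string into code points and report how many
# bytes each one occupies when encoded as UTF-8.
def describe(s):
    chars = list(s)  # one element per Unicode code point
    utf8_sizes = [len(c.encode("utf-8")) for c in chars]
    return chars, utf8_sizes

print(describe("abc"))     # ASCII letters: one UTF-8 byte each
print(describe("日本語"))  # these three kanji: three UTF-8 bytes each
```

So both strings split into three elements, and both can be expressed in
UTF-8; only the byte count per element differs.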

> Aaron wrote:
> Same here, though I have to admit that I'm slow on this whole Unicode
> thing, so I'm not sure what you mean by "Unicode chars". For example,
> are you expecting to get "f", "f", "i" or "ffi" back when you say
> "ffi".chars? More interestingly, what about all of the Arabic ligatures
> which someone who speaks that language might reasonably expect to get
> back as multiple "chars", but they have their own Unicode codepoint
> (e.g. ﳳ which is "U+FCF3 ARABIC LIGATURE SHADDA WITH DAMMA MEDIAL FORM"
> which you might expect to get "ﹸ", "ﹽ" from)? Any Arabic speakers to
> confirm or deny this behavior of ligatures?

From Apocalypse 5: "Under level 2 Unicode support, a character is
assumed to mean a grapheme, that is, a sequence consisting of a base
character followed by 0 or more combining characters."
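
Read literally, that definition can be sketched as below. This is Python, not
Perl 6, and it assumes "combining character" means a code point with a
non-zero canonical combining class; full grapheme segmentation has more rules
than this. The function name `graphemes` is made up for the example.

```python
import unicodedata

# Rough sketch of "base character followed by 0 or more combining
# characters". Assumption: a combining character is any code point whose
# canonical combining class is non-zero.
def graphemes(text):
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new grapheme
    return clusters

# "e" + COMBINING ACUTE ACCENT is grouped into one grapheme.
print(graphemes("e\u0301a"))
# A precomposed ligature is a single code point with combining class 0,
# so under this reading it stays one "char" rather than being split.
print(graphemes("\ufb03"))   # U+FB03 LATIN SMALL LIGATURE FFI
print(graphemes("\ufcf3"))   # U+FCF3 from Aaron's example
```

Under this reading, the ligatures Aaron asks about would each come back as a
single element; splitting them would need a separate decomposition step.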

Marcus

On 4/11/05, Aaron Sherman <ajs@ajs.com> wrote:
> On Mon, 2005-04-11 at 14:12, Ingo Blechschmidt wrote:
> 
> > gcomnz wrote:
> > > I'm writing a bunch of examples for perl 6 pleac and it seems rather
> > > natural to expect $string.chars to return a list of unicode chars in
> > > list context, however I can't find anything to confirm that. (The
> > > other alternatives being split and unpack.)
> >
> > I like that.
> 
> Same here, though I have to admit that I'm slow on this whole Unicode
> thing, so I'm not sure what you mean by "Unicode chars". For example,
> are you expecting to get "f", "f", "i" or "ffi" back when you say
> "ffi".chars? More interestingly, what about all of the Arabic ligatures
> which someone who speaks that language might reasonably expect to get
> back as multiple "chars", but they have their own Unicode codepoint
> (e.g. ﳳ which is "U+FCF3 ARABIC LIGATURE SHADDA WITH DAMMA MEDIAL FORM"
> which you might expect to get "ﹸ", "ﹽ" from)? Any Arabic speakers to
> confirm or deny this behavior of ligatures?
> 
> Please be aware, I'm talking about ligatures above, NOT special letters
> such as "æ", which are their own letters, and cannot be decomposed into
> "a", "e" without losing information.
> 
> Given Parrot, what happens when you are presented with a Big5 string
> that does not have a strict Unicode equivalent? Does .chars throw an
> exception, or does it rely on the string to know how to "characterify
> itself" according to its vtable?
> 
> --
> Aaron Sherman <ajs@ajs.com>
> Senior Systems Engineer and Toolsmith
> "It's the sound of a satellite saying, 'get me down!'" -Shriekback
> 
>


