develooper Front page | perl.perl6.language | Postings from April 2005

Re: Question about list context for String.chars

Aaron Sherman
April 11, 2005 12:55
On Mon, 2005-04-11 at 15:40, gcomnz wrote:
> I have to say I'm slightly confused too for some languages,
> especially for syllabic alphabets. At the same time, I'm pretty clear
> for CJK, Syllabaries, and alphabets, or at least I hope I'm clear (I
> guess I'm about to find out), .chars just returns the right unicode
> level for whatever the string contents requires.

> "abc".chars  would return <a b c>, which I'm guessing would be
> bytesize usually.

Fair enough.

> "日本語".chars would return <日 本 語>, which can probably be expressed with
> UTF8?

I think you're confusing UTF8 (which can represent ALL Unicode
characters) and "the UTF8 subset which consists of one-byte
representations" (which happens to overlap with 7-bit ASCII).
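
To make that distinction concrete, here is a minimal sketch in Python (not Perl 6, and utf8_widths is a name made up for this illustration). It only relies on the standard str.encode call and reports how many UTF-8 bytes each code point needs:

```python
def utf8_widths(text):
    """Return (character, number of UTF-8 bytes) pairs, one per code point."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# ASCII letters fall in the one-byte subset of UTF-8.
print(utf8_widths("abc"))

# The same encoding represents CJK characters too, using more bytes each.
print(utf8_widths("日本語"))
```

Both strings are valid UTF-8; the only difference is the number of bytes per character, so "can be expressed with UTF8" is true of either one.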

> From Apocalypse 5: "Under level 2 Unicode support, a character
> is assumed to mean a grapheme, that is, a sequence consisting of a
> base character followed by 0 or more combining characters."
> Marcus

Hmmm... that doesn't answer the ligature question clearly though. That
answers for the case of combining diacritical marks:

e.g. <A ̀> vs "À", which is a pre-combined example, but there are (as I
understand it), many valid examples which do not have a pre-combined
representation in Unicode.
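
As a rough illustration of the "base character followed by 0 or more combining characters" definition, here is a small Python sketch. The graphemes helper is hypothetical, and it is a simplification: it treats any code point with a non-zero canonical combining class as a combining character, which is not the full Unicode segmentation algorithm and says nothing about how Perl 6 will actually implement .chars.

```python
import unicodedata


def graphemes(text):
    """Group each base character with the combining marks that follow it.

    Simplification: a code point counts as "combining" when its canonical
    combining class (unicodedata.combining) is non-zero.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new grapheme
    return clusters


decomposed = "A\u0300"     # LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT
precomposed = "\u00c0"     # LATIN CAPITAL LETTER A WITH GRAVE

# Two code points, but a single grapheme under the definition above.
print(len(decomposed), len(graphemes(decomposed)))

# NFC normalization folds the pair into the pre-combined character.
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# "q" + COMBINING TILDE has no pre-combined form, so NFC leaves two code points.
no_precomposed = "q\u0303"
print(len(unicodedata.normalize("NFC", no_precomposed)), len(graphemes(no_precomposed)))
```

The last case is the interesting one: normalization alone cannot reduce it to a single code point, so counting graphemes and counting code points give different answers.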

But not for ligatures:

which are, by definition, actually two or more unique characters which
have a special typographical representation when adjacent. So, they are
a single grapheme, but like I said: certain cultures would be shocked by
a .chars that did not decompose their ligatures (and again, I'm mostly
thinking Arabic, so I'd defer to someone who actually speaks Arabic and
knows how they deal with this).
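
For what a decomposing behaviour could look like, here is a hedged Python sketch. It covers only the case where a ligature is stored as a single compatibility code point (a presentation form); ligatures that exist purely in rendering are already separate characters in the string. The decompose_ligatures name is invented for the example, and NFKD is just one possible choice: it also splits accented letters and folds other compatibility characters, so it is broader than "ligatures only".

```python
import unicodedata


def decompose_ligatures(text):
    """Apply compatibility decomposition (NFKD), which splits ligature
    presentation forms into their component letters (among other things)."""
    return unicodedata.normalize("NFKD", text)


latin_fi = "\ufb01"        # LATIN SMALL LIGATURE FI
arabic_lam_alef = "\ufefb"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

# One code point before decomposition, two after.
print(len(latin_fi), len(decompose_ligatures(latin_fi)))
print(len(arabic_lam_alef), len(decompose_ligatures(arabic_lam_alef)))

# The Arabic ligature becomes LAM (U+0644) followed by ALEF (U+0627).
print([f"U+{ord(ch):04X}" for ch in decompose_ligatures(arabic_lam_alef)])
```

Whether .chars should behave this way for any given script is exactly the open question above; the sketch only shows that the information needed to decompose such code points is available in the Unicode data.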
