character numbers (was Re: use encoding 'utf8' bug for Latin-1 range)

From: Nicholas Clark
Date: February 28, 2008 11:06
Subject: character numbers (was Re: use encoding 'utf8' bug for Latin-1 range)
Message ID: 20080228190639.GM87113@plum.flirble.org

On Wed, Feb 27, 2008 at 11:19:14AM +0100, Juerd Waalboer wrote:

> However if anyone is interested in fixing the problem, then by all means
> please do it right, and make \x mean "character number" again.

Does anything apart from ASCII, 8 bit character sets, and Unicode have a
clear concept of "character number"?

In particular, I seem to remember that until I added the -q flag, the
Encode UCM compiler was merrily issuing warnings that two or more distinct
byte sequences representing the same Chinese glyph mapped to the same
Unicode code point. Which suggests to me that these character representation
schemes ("encodings", but of what?) aren't really reversible one-to-one
mappings of some ordered countable sequence of characters.
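
To make the "not reversible" point concrete, here is a minimal sketch in Python (not Perl's Encode, and not the UCM compiler) that pushes every two-byte sequence through a codec and groups the ones that decode to the same single code point. The names find_collisions and round_trips, the byte ranges, and the choice of "cp932" in the usage lines are my assumptions for illustration; which collisions turn up, if any, depends on the codec tables shipped with your Python.

```python
from collections import defaultdict


def find_collisions(codec, lead_bytes=range(0x81, 0xFD), trail_bytes=range(0x40, 0xFD)):
    """Group two-byte sequences by the single character they decode to.

    Returns only the characters reached by more than one byte sequence.
    The default byte ranges are an assumption suited to Shift_JIS-like codecs.
    """
    by_char = defaultdict(list)
    for lead in lead_bytes:
        for trail in trail_bytes:
            raw = bytes([lead, trail])
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue  # not a valid sequence in this codec
            if len(text) == 1:  # skip pairs that decode as two separate characters
                by_char[text].append(raw)
    return {ch: seqs for ch, seqs in by_char.items() if len(seqs) > 1}


def round_trips(raw, codec):
    """True if decoding and re-encoding gives back exactly the same bytes."""
    try:
        return raw.decode(codec).encode(codec) == raw
    except UnicodeError:
        return False


if __name__ == "__main__":
    for ch, seqs in sorted(find_collisions("cp932").items()):
        marks = ", ".join(
            f"{s.hex()}{'' if round_trips(s, 'cp932') else ' (no round trip)'}" for s in seqs
        )
        print(f"U+{ord(ch):04X}: {marks}")
```

For any character listed, at most one of its byte sequences can survive a decode/encode round trip, because the encoder has to pick a single representation. That is the sense in which such a mapping is not one-to-one.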

And looking at http://en.wikipedia.org/wiki/EUC-JP I never see a *list*
of characters:

    The structure of EUC is based on the ISO-2022 standard, which specifies a
    way to represent character sets containing a maximum of 94 characters, or
    8836 (94²) characters, or 830584 (94³) characters, as sequences of 7-bit
    codes. Only ISO-2022 compliant character sets can have EUC forms. Up to
    four coded character sets (referred to as G0, G1, G2, and G3 or as code
    sets 0, 1, 2, and 3) can be represented with the EUC scheme. G0 is almost
    always an ISO-646 compliant coded character set (e.g. US-ASCII/KS X
    1003/ISO 646:KR in EUC-KR and US-ASCII/the lower half of JIS X 0201 in
    EUC-JP) that is invoked on GL (i.e. with the most significant bit cleared).

And another page links to a *grid* of characters:

http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/CJK/jisx0208-1990.gif
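
As a rough illustration of what that grid means in practice, the sketch below (Python, with hypothetical helper names euc_to_kuten and flatten) recovers the row and cell of a two-byte EUC-JP sequence by subtracting 0xA0 from each byte. The flatten step, which turns a (row, cell) pair into a single integer, is purely my assumption: it is one arbitrary way to linearise a 94 x 94 grid, not a "character number" that the post or the character set defines.

```python
def euc_to_kuten(raw):
    """Return the 1-based (row, cell) grid position of a two-byte EUC-JP sequence."""
    if len(raw) != 2 or not all(0xA1 <= b <= 0xFE for b in raw):
        raise ValueError("expected two bytes in the range 0xA1..0xFE")
    # Each byte carries one grid coordinate, offset by 0xA0.
    return raw[0] - 0xA0, raw[1] - 0xA0


def flatten(row, cell, cells_per_row=94):
    """One arbitrary (assumed) way to turn a grid position into a 0-based integer."""
    return (row - 1) * cells_per_row + (cell - 1)


if __name__ == "__main__":
    raw = "あ".encode("euc_jp")
    row, cell = euc_to_kuten(raw)
    print(raw.hex(), row, cell, flatten(row, cell))
    print(94 ** 2, 94 ** 3)  # grid sizes quoted in the ISO-2022 description
```

Flattening row-major rather than column-major is a convention, not a property of the character set, so two implementations could disagree about the "number" of the same character.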


So I'm not convinced that "character number" is a useful concept outside of
8-bit character sets and Unicode (and its subsets, including ASCII).

So it seems tricky to make \x mean "character number" again, as I can't
see a true meaning for "character number" in these character representation
schemes.

Nicholas Clark
