Jarkko Hietaniemi <jhi@iki.fi> writes:
>I appreciate your detailed analysis, it's certainly more detailed
>than what we have seen over the last few days.
>
>From my viewpoint, however, the situation is as follows:
>
>(1) The current model, both externally and internally,
>    follows what is described by the Camel Mk3.  As the pumpkin
>    I'm somewhat obligated to abide by that, at least that's the
>    first degree approximation.  (Incidentally, the reason I think
>    the Camel is so vague was that when it was written the Unicode
>    model was being ripped to shreds to be rebuilt, in a discussion
>    not unlike the one we are having.)
>
>(2) The basic Unicode support seems to be in a rather good shape now.
>    What I mean by "basic" is that as long as you don't start pulling
>    your hair over this very bytes vs UTF-8 vs characters issue, and
>    just concatenate strings, compare them, take their length, do
>    regexes on them, etc, pretty much everything seems to be working.
>
>Combine (1) and (2) and I see it as "what is broken, so what's there to
>fix" situation, let's call it (3).

Having reviewed things, tried a few, and judging by the mail topics that
recur, here is my stab at (3):

(3.0) We need to spell out somewhere the "logical" (perl visible)
      semantics of strings.
      We need to remove the SvUTF8 and internal stuff from the docs as
      seen by the perl-level user - move it to perlutf8guts or whatever.

(3.1) One true "bug": unpack('C',$str) != ord($str) in some cases,
      despite perlfunc saying

          sub ordinal { unpack("c",$_[0]); } # same as ord()

      As far as I am aware this is the only remaining wart in the ASCII
      world.  (I was not aware of it till this thread started as I am
      pack/unpack phobic.)
      (We can make it do what it does now in the scope of 'use bytes',
      of course.)

(3.2) Encode::* are less elegant than they could be. I, at least, have
      been adding things there without documenting them, and some
      documented things are stubs.
      I suggest I draft a new 'pod', post it here, get agreement and
      then implement it.
      (Anyone else is welcome to draft it if they get there first.)
      In essence, where we have

          $length = from_to($string,'Unicode','foo');
          $length = from_to($string,'foo','Unicode');

      we will have

          $encoded = encode_as('foo',$string);
          $string  = decode_from('foo',$encoded);

      and the trivial cases

          $encoded = encode_as_utf8($string);   # encode_as('utf8',$string)
          $string  = decode_from_utf8($encoded);

      will be special cased.
      The main issue is getting the names right so they "read" correctly
      and don't confuse people.

(3.3) The EBCDIC world has legacy reasons why ord('A') != 0x41.
      We need a formal statement that on EBCDIC platforms we don't use
      Unicode code points but "HybriCode", where U+0000 .. U+00FF
      (0 .. 255) map to IBM-1047 as defined in
      ext/Encode/Encode/cp1047.ucm, but other code points map as
      themselves. _Or_ whatever the formal definition is.
      That is,

          ord('A') == 0xC1, chr(0xC1) eq 'A'

      We also need to define (at the internals level) whether
      utf8_upgrade('A') produces "\x41",SvUTF8_on (as I think it should),
      or if it tries to apply the UTF-8 algorithm to the HybriCode code
      point, or if it uses the UT?? thing intended for EBCDIC.

(3.4) We need some cook-book examples for Content-Length:, LDAP, etc.
      which use the new documented API, e.g.

          my $encoded = encode_as('utf8',$string);
          print $MAIL "Content-Length: ",length($encoded),"\n\n",$encoded;

(3.5) We need to explain the risks of 'use bytes'.
      Personally I will never use it as it stands, but it is in the
      Camel...

--
Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.