Jarkko Hietaniemi <jhi@iki.fi> writes:
>> Given transparency, you do not need such a thing.
>
>Wrong. I am (as you know already) on Ilya's "side" here...
>There are standards and protocols, and other pieces of
>software, out there that *require* producing UTF-8. IIRC LDAP
>is one of those outside bits. Java would be another [*].
>Perl must be able to interface with the outside world.

Quite. But such things will always be one of:

A. XS code (which Ilya has already said has to be SvUTF8 aware), or
B. IO, which has (or should have) its own ways of dealing with this and
   indeed must be able to cope with SVs arriving in either form, or
C. Code that just treats the things as sequences of bytes - in which case
   the bytes themselves can be represented either way ;-)

Case C was Graham's LDAP case. It relied on perl producing the UTF8 encoded
form for 128..255 and then did 'use bytes' to peek at it. It broke when 5.6+
decided to keep 128..255 as 'byte' anyway.
The right way to do this is to export the trivial XS code which does an
upgrade and then turns off the flag. (As current Encode does.)

>
>[*] Though don't get me started on how Java's readUTF8() and writeUTF8()
>do not do real UTF-8 as defined by the RFC :-)
>
>> ord('A') should be the same on all the systems, unless use locale or
>
>It isn't.
>In EBCDIC that produces 0xC1, or 193.
>It might be nice if it did.
>Changing it would break existing code.

We _still_ have not got a definition from the EBCDIC folk of what the
backward compatible version _does_.

>
>Your sentence is essentially saying that utf8-marking is a hint (that
>might be false) that it the string might contain chars above 127,
>instead of the current implementation where it is a guarantee of that.

Hmm, last time I relied on it being more than a hint perl let me down.
What current perl attempts to do is say that the UTF8 bit means there are
chars above _255_ - that is, it tries to turn the bit off and downgrade
when only chars 128..255 remain. But this is expensive to get right.
If I remove the last remaining big char from a 16M string you have to scan
the whole thing to find out...
--
Nick Ing-Simmons