On Fri, Feb 16, 2001 at 08:44:26AM -0600, Jarkko Hietaniemi wrote:
> > Given that you cannot distinguish byte-encoded strings from
> > utf8-encoded strings from Perl code, I fail to see any difference
>
> Yes, you can.  unpack("C*", ...), unless, of course, you intend to
> change that, too.

There should be no operations in the core which expose the internal
representation.  Only modules - like Devel::Peek.  The &|^ disaster
should have taught us this.

> { use bytes; length() }.  unpack("U*") will croak
> if you feed it malformed UTF-8.

There should be no 'use bytes'.

> > Given transparency, you do not need such a thing.
>
> Wrong.  There are standards and protocols, and other pieces of
> software, out there that *require* producing UTF-8.  IIRC LDAP
> is one of those outside bits.  Java would be another [*].
> Perl must be able to interface with the outside world.

Of course, but here we discuss the internal operations, not the I/O.
Each I/O channel (including system calls) needs to be marked with the
translation used.

> > ord('A') should be the same on all the systems, unless use locale or
>
> It isn't.
> In EBCDIC that produces 0xC1, or 193.

As I said, in EBCDIC you have an implicit "use locale" around your
script.

> > somesuch is in effect.  EBCDIC ports may behave as if they have an
>
> This means that you want to impose ISO Latin 1 on everyone in the
> 8-bit world.

Unless 'use locale' is in effect.  This is exactly what we have now.

> > implicit 'use locale' around each script.
>
> 'use locale' has *nothing* to do with this.

You err.

> > [locales are just ways to assign different cultural information to
> > integers (=characters).  As Larry said, Perl should allow one to use
>
> I wish they were -- but they are not.  That's not how they have been
> (very weakly) defined by standards and (badly) implemented by vendors.
> For one thing, they have very little to do with character encodings.

Here I discuss "locales as seen from Perl", not something else.

> > big5 for the internal cultural-info tables instead of unicode.
> > Similarly, 'use locale' just loads a different table into the range
> > 0..255.  {BTW, it may make sense to make the "Unicode 0..255 range"
>
> Sorry, Ilya, that's completely not what happens.

How so?  (Unless you consider collation - which is not "completely
not" either.)

> > A string *must* be marked utf8 if it was utf8-encoded and contained
> > chars above 127.  A string *may* be marked utf8 if it is byte-encoded
> > but does not contain chars above 127.
>
> Your sentence is in opposition with our existing Unicode model and
> implementation, which seems to be working rather nicely, so you must
> have a complete alternative implementation in your backpocket.

Please explain how having a string marked as utf8 and with PVX="a"
"opposes" your model.

> Your sentence is essentially saying that utf8-marking is a hint (that
> might be false) that the string might contain chars above 127,
> instead of the current implementation where it is a guarantee of that.
> Unsurprisingly, I find the current model much cleaner.

Unsurprisingly, I do not.  You need an extra scan on each string
operation to (sometimes) switch off the utf8 bit.  Switching it off
gives no visible semantic changes, and is quite time-consuming.  It
may also significantly slow down (or significantly speed up) the
following operations on this SV - but I would prefer to consider
semantic changes separately from performance issues (especially for
such unclear performance corollaries).
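For concreteness, here is a small sketch of the behaviour under
discussion.  It is written against today's utf8::upgrade /
utf8::is_utf8 / 'use bytes' interfaces, so treat it as an illustration
of the claims above, not of the code base as it stood in 2001:

    use strict;
    use warnings;

    # The same one-character string, \xE9, stored two different ways.
    my $byte_str = "\xE9";
    my $utf8_str = "\xE9";
    utf8::upgrade($utf8_str);   # same chars, utf8-encoded internally

    # Character-level semantics do not depend on the representation:
    print "equal\n" if $byte_str eq $utf8_str;                # equal
    print length($byte_str), " ", length($utf8_str), "\n";    # 1 1

    # ...but 'use bytes' exposes the internal encoding (1 vs 2 octets):
    {
        use bytes;
        print length($byte_str), " ", length($utf8_str), "\n";  # 1 2
    }

    # A pure-ASCII string behaves identically with or without the flag,
    # which is the case argued about above: PVX stays "a" either way.
    my $ascii = "a";
    utf8::upgrade($ascii);
    print "flagged\n" if utf8::is_utf8($ascii);               # flagged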
The "correct" model would use two bits for encoding: "PVX contains a sequence of byte-encoded chars", "PVX contains a sequence of utf8-encoded chars". The strings with only chars 0..127 (in the "canonical" representation) would be marked as both. In the current model the flag is *used* to distinguish things which need some massage when converting to byte-strings and utf8-strings. It does not make a lot of sense to have the "informal meaning" of the flag so distant from "the meaning of the flag when used". IlyaThread Previous | Thread Next