On 2/7/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> However, all that encode/decode overhead would kill the performance
> of these libraries, rendering them far less useful. It would be nice
> if Perl's internal encoding was always, officially UTF-8 -- then
> there wouldn't be a conflict. But I imagine that might be very hard
> to pull off on EBCDIC systems, so maybe it's better this way -- I get
> to choose not to support EBCDIC systems (along with systems that
> don't use IEEE 754 floats, and systems where chars are bigger than a
> byte).

I for one would argue that if we were going to go to a single internal
encoding, UTF-8 would be the wrong one. UTF-16 would be much better. It
would allow us to take advantage of the large amount of UTF-16 code out
there, ranging from DFA regexp engines to other algorithms and
libraries. On Win32 the OS natively does UTF-16, so much of the work
would be done by the OS. I'd bet that this was also a reason why other
languages chose to use UTF-16. In fact I wouldn't be surprised if we
were the only major language using UTF-8 internally at all. I mean,
heck, UTF-8 was a kludge worked out on a napkin to make it possible to
store Unicode filenames in a Unix-style filesystem. (UTF-8 has the
property that no encoding of a high codepoint contains any special
character used by a Unix filesystem.) WTF would we use a kludge as our
primary internal representation when there are better representations
to use? Especially when you consider the performance impact of doing so
(use Unicode and watch the regex engine get much sloooooweeeeeerrrrrrr).

IMO UTF-8 internally makes sense only when you consider that most of
the time stuff is happening in latin-1, or whatever you want to call
the single-byte encoding we use.

> >> I don't care whether $string is a text-string or byte-string, I
> >> just want it to return the same string.
> >
> > Perhaps you should care.
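[The "single-byte most of the time" point can be seen directly with core
`utf8::is_utf8`; a minimal sketch, with a made-up sample string:]

```perl
use strict;
use warnings;

# A string of codepoints <= 0xFF stays in Perl's single-byte
# internal encoding -- the SVf_UTF8 flag is off.
my $s = "caf\x{E9}";                 # e-acute fits in one byte
print utf8::is_utf8($s) ? "utf8\n" : "bytes\n";   # bytes

# Appending a codepoint above 0xFF upgrades the whole string
# to the variable-width utf8 representation.
$s .= "\x{263A}";                    # WHITE SMILING FACE
print utf8::is_utf8($s) ? "utf8\n" : "bytes\n";   # utf8

# Either way, it's the same 5 characters at the language level.
print length($s), "\n";              # 5
```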
> > In a language such as Java, you are forced to care, as byte[] and
> > String are different types. Perl blurs this difference, and lets
> > you believe that you should not need to care.
>
> I agree, Mark. Silent upgrading of bytes to Unicode strings cost me
> a bunch of debugging time when I learned the hard way that you need
> to care. I was writing a serializer that concatenated Unicode
> strings together with packed integers to make sort keys. It never
> occurred to me that such a concat operation would corrupt the packed
> integer, and it took me a long time to hunt down why my sort op was
> failing.

I think this is the real itch. Before utf8 it was fine to think of
strings as "byte buffers", but they aren't byte buffers and never have
been. They are strings, and strings don't contain bytes (whatever
Gerrard thinks :-), they contain characters. \x{} doesn't _ever_
produce a specific byte; it produces a specific _character_ (one that
just happens to be represented by a byte in one of the internal
encodings we use). The correct way to get a specific byte is via
pack(), not via any string escape, as string escapes operate on
characters. The fact that a given perl may internally encode strings
as bytes is irrelevant.

So, IMO what we need is an SV flag that says "this is a byte buffer; it
does not contain characters, and any time it is concatenated with a
string, the string should also be treated as a byte buffer regardless
of its actual state." Thus concatenating a byte buffer with a utf8
string would not upgrade the byte buffer.

The problem here is not our internal encoding, which should be opaque,
but rather our lack of support for an explicitly byte-oriented storage,
and our heritage of treating strings as character buffers even though
they aren't really.

Cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
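[A minimal sketch of the upgrade pitfall Marvin describes; the
codepoint and the integer value here are made up for illustration,
using only core pack() and the Encode module:]

```perl
use strict;
use warnings;
use Encode ();

# A sort key built by concatenating Unicode text with a packed integer.
my $packed = pack("N", 0xDEADBEEF);     # 4 raw bytes, all >= 0x80
my $key    = "\x{263A}" . $packed;      # concat silently upgrades $packed

# At the character level nothing looks wrong:
print length($key), "\n";               # 5 characters

# But the byte-level form has changed: each packed byte >= 0x80 now
# occupies two bytes in the UTF-8 representation, so a byte-wise
# consumer (file, socket, memcmp-style sort) no longer sees the
# original 4 bytes at the end of the key.
my $wire = Encode::encode_utf8($key);
print substr($wire, -4) eq $packed ? "intact\n" : "corrupted\n";  # corrupted
```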