On Wed, Feb 07, 2007 at 10:44:55AM +0100, demerphq wrote:
> On 2/7/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> >
> > However, all that encode/decode overhead would kill the performance
> > of these libraries, rendering them far less useful.  It would be nice
> > if Perl's internal encoding was always, officially UTF-8 -- then
> > there wouldn't be a conflict.  But I imagine that might be very hard
> > to pull off on EBCDIC systems, so maybe it's better this way -- I get
> > to choose not to support EBCDIC systems (along with systems that
> > don't use IEEE 754 floats, and systems where chars are bigger than a
> > byte).
>
> I for one would argue that if we were going to go to a single internal
> encoding that utf8 would be the wrong one. Utf-16 would be much
> better. It would allow us to take advantage of the large amount of
> utf-16 code out there, ranging from DFA regexp engines to other
> algorithms and libraries. On Win32 the OS natively does utf-16 so much
> of the work would be done by the OS. I'd bet that this was also a
> reason why other languages chose to use utf-16. In fact I wouldn't be
> surprised if we were the primary language using utf8 internally at
> all.

Jarkko's view, based on the battle scars from the dragons in the regexp
engine, was that fixed-width 32 bit was better than anything 16 bit
variable width. Doing the latter properly *still* requires dealing with
surrogates.

The *best* solution might well be fixed 7/8/16/32, using the smallest that
fits. But I don't see this coming soon.

> The problem here is not our internal encoding, which should be opaque,
> but rather our lack of support for an explicitly byte oriented storage
> and our heritage of treating strings as character buffers, even though
> they aren't really.

Yes. And the fact that UTF-8 peeked through the cracks left, right and
centre in 5.6.0 really didn't help the opaqueness.

Nicholas Clark
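
P.S. To make the "fixed 7/8/16/32, using the smallest that fits" idea and
the surrogate point a little more concrete, here is a rough sketch in
Python. It is only an illustration, not anything Perl does or is planned
to do; the function names smallest_fixed_width and utf16_code_units are
made up for this example.

```python
def smallest_fixed_width(text: str) -> int:
    """Return the narrowest fixed width (7, 8, 16 or 32 bits) that can
    hold every code point in text. Hypothetical helper, for illustration."""
    highest = max(map(ord, text), default=0)
    if highest < 0x80:       # plain ASCII
        return 7
    if highest < 0x100:      # fits in one byte
        return 8
    if highest < 0x10000:    # fits in one 16-bit unit
        return 16
    return 32                # needs the full 32-bit width


def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 16-bit range
print(smallest_fixed_width("abc"), smallest_fixed_width(clef))
print(len(clef), utf16_code_units(clef))
```

The second print shows one character taking two UTF-16 code units, which
is the surrogate pair that any 16 bit variable-width scheme has to cope
with. With a fixed width chosen per string, the Nth character always sits
at offset N, at the cost of widening the whole string when a single large
code point turns up.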