On Wed, Feb 07, 2007 at 11:52:48AM +0000, Nicholas Clark wrote:
> On Wed, Feb 07, 2007 at 10:44:55AM +0100, demerphq wrote:
> > On 2/7/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> > >
> > > However, all that encode/decode overhead would kill the performance
> > > of these libraries, rendering them far less useful. It would be nice
> > > it Perl's internal encoding was always, officially UTF-8 -- then
> > > there wouldn't be a conflict. But I imagine that might be very hard
> > > to pull off on EBCDIC systems, so maybe it's better this way -- I get
> > > to choose not to support EBCDIC systems (along with systems that
> > > don't use IEEE 754 floats, and systems where chars are bigger than a
> > > byte).
> >
> > I for one would argue that if we were going to go to a single internal
> > encoding that utf8 would be the wrong one. Utf-16 would be much
> > better. It would allow us to take advantage of the large amount of
> > utf-16 code out there, ranging from DFA regexp engines to other
> > algorithms and libraries. On Win32 the OS natively does utf-16 so much
> > of the work would be done by the OS. Id bet that this was also a
> > reason why other languages choose to use utf-16. In fact i wouldnt be
> > surprised if we were the primary language using utf8 internally at
> > all.
>
> Jarkko's view, based on the battle scars from the dragons in the regexp
> engine, was that fixed width 32 bit was better than anything 16 bit variable
> width. Doing the latter properly *still* requires dealing with surrogates.

If you mean easier to implement, then fixed width is much easier, especially
if you try to add it to a regex engine which assumes characters have a fixed
width.

Unfortunately, UTF-32 would be very bad for performance. The problem is that
your strings use much more memory. It is not that I care that much about
memory usage as such, but, for example, when matching a fixed string like
m/aap/ against UTF-8 you just have to do a memory search for the bytes
representing 'aap'; with UTF-32 you can do the same, but you have much more
memory to search through.

Whether UTF-8 or UTF-16 is faster depends on the content of your strings: if
you mostly have ASCII data, UTF-8 will be shorter and thus faster; if you are
dealing with non-Western languages, the UTF-16 encoding will probably be
shorter and thus faster (a small size comparison follows below).

The current regex engine does not work very fast for Unicode, because it was
built on top of the latin1 regex engine. Some things that could greatly
improve the performance of the current regex engine:

- The regex engine does not know whether the regex will be used for latin1 or
  Unicode; knowing that we are going to match UTF-8, we can optimize for that.
- For character classes: a bitmap is only used for codepoints < 256 to see
  whether they match; the bitmap could be extended to cover larger codepoints.
  (When the bitmap does not match, a very expensive swatch lookup is used.)
- For character classes: for the bitmap test the UTF-8 is decoded to a
  codepoint and then looked up. Instead, the codepoints of the character class
  could be encoded to UTF-8 only once, during compilation, and then used in
  the bitmap.
- Counting/tracking codepoints is expensive, so we should not do it when it is
  not required.

In my branch I have already started working on these things (mostly to
separate the Unicode and byte matching, not really optimizing for speed yet),
and I already got more than a 20% speed increase.
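To make the size argument above concrete, here is a minimal stand-alone Perl
sketch (my illustration, not part of the original discussion; the sample
strings are arbitrary, and it assumes a perl whose core Encode module has
UTF-16/UTF-32 support). It prints the encoded byte length of an ASCII string
and of a Japanese string under UTF-8, UTF-16 and UTF-32:

    #!/usr/bin/perl
    # Illustrative only: compare encoded sizes of the same text in
    # UTF-8, UTF-16 and UTF-32 (BE variants, so no BOM is counted).
    use strict;
    use warnings;
    use Encode qw(encode);

    my %sample = (
        'ASCII'    => 'aap noot mies',                              # Dutch primer words
        'Japanese' => "\x{3053}\x{3093}\x{306B}\x{3061}\x{306F}",   # "konnichiwa"
    );

    for my $name (sort keys %sample) {
        my $s = $sample{$name};
        printf "%-8s chars: %2d  UTF-8: %2d  UTF-16: %2d  UTF-32: %2d bytes\n",
            $name, length($s),
            map { length encode($_, $s) } 'UTF-8', 'UTF-16BE', 'UTF-32BE';
    }

For the ASCII sample UTF-8 is the smallest (13 bytes against 26 for UTF-16 and
52 for UTF-32); for the Japanese sample UTF-16 (10 bytes) beats UTF-8 (15
bytes); UTF-32 is the largest in both cases, which is the extra memory the
m/aap/ example above would have to search through.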
Of course, when doing Level 2 or 3 Unicode regular expressions (see
http://unicode.org/reports/tr18/ for the levels), things get a lot more
complicated and thus slower.

> The *best* solution might well be fixed 7/8/16/32, using the smallest that
> fits.

Finding the smallest fixed width and converting your string to it is probably
more overhead than just using the standard encoding. And not knowing during
compilation of the regex what the encoding will be also makes things very
difficult.

Gerard Goossen