John Berthels wrote on 2007-03-28 9:52 (+0100):
> Well, perl goes to some lengths (implicit conversion) for you to be
> able to mix untagged-all-ascii string values and tagged-non-ascii
> transparently in your program.

As Jarkko already mentioned, Perl internally makes a distinction between
latin1 and utf8. BOTH are fully ASCII-compatible, but no special case
exists for strings that are fully ASCII-only.

> Well, I think is_utf8 is poorly named either way (with several years
> of hindsight - I don't think I would have made a better choice at the
> time).

Agreed, but for different reasons. I think it should be called
Internals::internal_encoding_is_utf8_not_latin1, with a user friendly
wrapper called Internals::encoding that returns either "latin1" or
"utf8".

> I don't think that Perl's internal representation for unicode
> strings is guaranteed to be utf8.

Indeed. It can also be latin1. The flag indicates (negated) whether this
is the case.

> The flag more properly means "please treat this as character data

As far as Perl is concerned, ALL strings consist of character data.
Internally-latin1 strings are special because there, bytes and
characters can safely be considered equal.

> And it's the 'special care' bit which can cost performance.

My guess is that the performance costs are mostly associated with utf8
being variable width, which means that you need to scan through the
string to do just about anything.

> [The UTF8 flag is] really a bit of perl's internals which application
> code shouldn't really want to examine or change directly.

Well said.

> > Now, if there is some concern that character-oriented regexes and
> > such are considerably slower for ASCII data than alternatives, and
> > this is a problem and it can't be otherwise dealt with
> I think the unicode regex engine can never be as fast as the
> byte-oriented one.

Unicode versus bytes is a weird comparison. Unicode strings are stored
as bytes too! But it's true that a naive octet matcher is faster.

When you're absolutely sure about the encoding, you're matching literal
strings (no case insensitive stuff), you don't care about character
offsets, and you don't care that the captures might not be correct
character strings, you can sometimes gain some performance by encoding
(e.g. utf8::encode for maximum performance) both the subject and the
regex before matching, and decoding afterwards. But be careful: this can
also have an adverse effect, so always benchmark first.
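As a rough illustration of that encode-first trick, here is a minimal
sketch (the variable names and sample data are made up for this
example): both the subject and a literal pattern are turned into UTF-8
octets, the match runs on bytes, and the capture is decoded back to
characters afterwards. Benchmark it against plain character-level
matching before adopting it.

    # A minimal sketch; $subject and $pattern are hypothetical.
    my $subject = "caf\x{e9} au lait";   # character string
    my $pattern = "caf\x{e9}";           # literal pattern, no /i, no classes

    # Work on copies, so the originals keep their character semantics.
    my ($oct_subject, $oct_pattern) = ($subject, $pattern);
    utf8::encode($oct_subject);          # characters -> UTF-8 octets, in place
    utf8::encode($oct_pattern);

    if ($oct_subject =~ /\Q$oct_pattern\E(.*)/s) {
        my $rest = $1;
        utf8::decode($rest);             # capture: octets -> characters again
        printf "matched; %d characters follow the pattern\n", length $rest;
    }

Note that $1 here holds octets, not characters; forget the utf8::decode
and you have silently traded character strings for byte strings, which
is exactly the caveat about captures above.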
"UTF-8" : "ISO-8859-1"), "\n\n", $body; > Internally, it could then be renamed requires_unicode_engine or > something. Unicode semantics are also needed, when the string is not encoded as utf-8 internally. Don't forget that Unicode and UTF-8 are different things. Regardless of the internal encoding, "x" and "é" are unicode characters, with lots of unicode properties, like that they are lower case alphabetic characters. > But what I really care about is the ability to just tell perl "data > from this source is in this encoding" binmode $source, ":encoding(...)"; Though when your source is different, you may need to write your own wrappers. > "data going to this destination is in this encoding" binmode $destination, ":encoding(...)"; Same caveat. > and get all the nice automagic handling of conversions for me without > paying the unicode engine cost on ascii data. The conversions themselves may need "the unicode engine" to realise that no further action is required for your ASCII data. > while ($data =~ s/<%-(\d+)([^<]*?).*%-\1>/reverse($2)/e) { If you reverse, you need to know where characters end. If the internal encoding is utf-8, knowing where individual characters end, requires scanning through the string. That's one of the reasons that Perl DOES NOT USE utf-8, when latin1 suffices. Here, you forced the issue with _utf8_on. Perl already has this optimized. It's not in the regex engine, but in the very implementation of strings themselves, so that other operations may benefit too. -- korajn salutojn, juerd waalboer: perl hacker <juerd@juerd.nl> <http://juerd.nl/sig> convolution: ict solutions and consultancy <sales@convolution.nl> Ik vertrouw stemcomputers niet. Zie <http://www.wijvertrouwenstemcomputersniet.nl/>.Thread Previous | Thread Next