> Considering that I like to write modern programs that simply use
> Unicode end-to-end as possible, and at least internally, which keeps
> everything simple and compatible, it would be easier for me if the
> meaning of the utf8 flag was updated to officially be the new
> behaviour.

Well, perl goes to some lengths (implicit conversion) for you to be
able to mix untagged-all-ascii string values and tagged-non-ascii
transparently in your program. And you can happily write modern
programs using Unicode end-to-end doing so. Both types of strings
consist of character data.

> I believe that a true utf8 flag should mean that the string contains
> data that is valid utf8, not just that it has utf8 characters outside
> the ASCII range.

Well, I think is_utf8 is poorly named either way (with several years of
hindsight - I don't think I would have made a better choice at the
time). I don't think that Perl's internal representation for unicode
strings is guaranteed to be utf8. The flag more properly means "please
treat this as character data, taking special care to realise that some
of the character values may be > 255". And it's the 'special care' bit
which can cost performance.

> As far as I know, the conceptual purpose of the utf8 flag is to
> indicate whether Perl considers a string to be unambiguous character
> data or binary data which could be ambiguous character data, and thus
> how Perl will treat it by default.

Yes, agreed. And it's really a bit of perl's internals which
application code shouldn't really want to examine or change directly.

[snip example of using is_utf8 to check that a perl value contains
'character data']

Why would your library routine care? It can manipulate the string as a
sequence of characters in either case. It will produce the wrong
results if passed the wrong data, but that will always be true, since
it could be passed wrong data tagged as utf8.
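To make that concrete, here is a minimal sketch of a routine that never
looks at the flag and still behaves the same for both kinds of string.
The helper name count_vowels and the sample text are made up for
illustration; utf8::upgrade and utf8::is_utf8 are the core functions,
used here only to set up and display the two cases.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical library routine: it treats its argument purely as a
# sequence of characters and never inspects the utf8 flag.
sub count_vowels {
    my $str = shift;
    my $n = () = $str =~ /[aeiou]/g;   # count the matches
    return $n;
}

my $plain  = "template text";
my $tagged = "template text";
utf8::upgrade($tagged);    # same characters, internal flag switched on

# The flag differs between the two copies...
my $flag_plain  = utf8::is_utf8($plain)  ? 1 : 0;
my $flag_tagged = utf8::is_utf8($tagged) ? 1 : 0;

# ...but the strings compare equal and the routine gives the same answer.
my $same_string = ($plain eq $tagged) ? 1 : 0;
my $same_result = (count_vowels($plain) == count_vowels($tagged)) ? 1 : 0;

print "flags: $flag_plain/$flag_tagged, ",
      "eq: $same_string, same result: $same_result\n";
```

The only thing that changes between the two calls is which regex
engine path perl takes internally, which is the performance point
rather than a correctness one.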
If your routine wants specific sequences of characters it can check for
those, regardless of the is_utf8ness of the string.

> Now, if there is some concern that character-oriented regexes and
> such are considerably slower for ASCII data than alternatives, and
> this is a problem and it can't be otherwise dealt with

I think the unicode regex engine can never be as fast as the
byte-oriented one. It has more to consider. There's some example code
(vaguely like the sort of templating where I noticed the problem),
which shows unicode running 2-3 times as slow (17s instead of 6s) as
the byte engine.

> we could
> perhaps have an additional flag which has the meaning that I ascribed
> to utf8; eg, is_chars() or is_text() etcetera; but in my mind it
> would be simpler to just leave the meaning of is_utf8 adjusted to
> mean is unambiguous character data.

I'm having trouble thinking of an example where application code might
want to check this. It's part of perl's internals, surely?

> P.S. On a tangent, it would be nice if there was a simple test to
> see if an SV currently considered its numerical or integer or string
> etc component to be the authoritative one, so eg I could just check
> that rather than using looks_like_number or some such more
> complicated solution. Though maybe there is already, perhaps in a
> bundled debugging or some such module, and I haven't found it yet?

I'd rather is_utf8 disappeared from the public API, since it's really
an internal flag and (I think) poorly named. Internally, it could then
be renamed requires_unicode_engine or something.

But what I really care about is the ability to just tell perl "data
from this source is in this encoding", "data going to this destination
is in this encoding" and get all the nice automagic handling of
conversions for me without paying the unicode engine cost on ascii
data.
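For the declaring-the-encoding half of that wish, a rough sketch using
PerlIO :encoding layers looks like the following. It uses in-memory
filehandles so that it is self-contained; the sample data and variable
names are invented for illustration, and it says nothing about the
speed side of the problem.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "data from this source is in this encoding":
# the source holds UTF-8 octets for "caf" plus e-acute, then a newline.
my $utf8_bytes = "caf\xC3\xA9\n";
open my $in, '<:encoding(UTF-8)', \$utf8_bytes
    or die "can't open input: $!";
my $line = <$in>;
close $in;

# Inside the program the value is plain character data.
my $char_count = length $line;    # characters, not octets

# "data going to this destination is in this encoding":
my $latin1_bytes = '';
open my $out, '>:encoding(ISO-8859-1)', \$latin1_bytes
    or die "can't open output: $!";
print {$out} $line;
close $out;

printf "read %d characters, wrote %d octets\n",
    $char_count, length $latin1_bytes;
```

Nothing in the middle of the program needs to look at is_utf8; the
layers do the conversions at the edges.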
regards,
jb

Bench output:

        Rate udata  data
udata  588/s    --  -63%
data  1572/s  167%    --

Code:

#!/usr/bin/perl
use warnings;
use strict;
use Encode;
use Benchmark;

my $data = "";
my $count = 10;
while ($count-- > 0) {
    $data = "<%-$count tag with some text $data $count-%>";
}
my $udata = $data;
Encode::_utf8_on($udata);

my $do_what = shift || "bench";
my $run_count = shift || 10000;

if ($do_what eq 'bench') {
    Benchmark::cmpthese(-20, {
        data  => sub { stress($data); },
        udata => sub { stress($udata); },
    });
} elsif ($do_what eq 'bytes') {
    stress($data) for (1..$run_count);
} elsif ($do_what eq 'chars') {
    stress($udata) for (1..$run_count);
} else {
    die "Don't understand what you wanted me to do: $do_what";
}

sub stress {
    my $data = shift;
    my $oldlen;
    while ($data =~ s/<%-(\d+)([^<]*?).*%-\1>/reverse($2)/e) {
        if ($oldlen) {
            die "didn't match [$data]" unless length $data < $oldlen;
        }
        $oldlen = length $data;
    }
}