develooper Front page | perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

John Berthels
March 28, 2007 05:16
> Considering that I like to write modern programs that simply use
> Unicode end-to-end as possible, and at least internally, which keeps
> everything simple and compatible, it would be easier for me if the
> meaning of the utf8 flag was updated to officially be the new
> behaviour.

Well, perl goes to some lengths (implicit conversion) so that you can
mix untagged all-ASCII string values and tagged non-ASCII ones
transparently in your program. And you can happily write modern
programs using Unicode end-to-end that way. Both types of strings
consist of character data.
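
A minimal sketch of that implicit conversion (the variable names are
made up for the example, and utf8::is_utf8 is used only to peek at the
internal flag, not as something application code should rely on):

```perl
use strict;
use warnings;

# Plain ASCII literal: perl stores it without the internal utf8 flag.
my $ascii = "price: ";
# A character above 255 forces the flagged internal representation.
my $wide = "\x{20AC}100";    # EURO SIGN followed by "100"

# Concatenation upgrades the untagged operand implicitly.
my $joined = $ascii . $wide;

# Both inputs and the result are character strings; only the
# internal storage differs.
my $ascii_flag  = utf8::is_utf8($ascii)  ? 1 : 0;
my $joined_flag = utf8::is_utf8($joined) ? 1 : 0;
my $joined_len  = length $joined;    # counted in characters, not bytes

print "ascii flag=$ascii_flag joined flag=$joined_flag length=$joined_len\n";
```

The length comes out in characters (11 here), whichever representation
each operand started in.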

> I believe that a true utf8 flag should mean that the string contains
> data that is valid utf8, not just that it has utf8 characters outside
> the ASCII range.

Well, I think is_utf8 is poorly named either way (with several years
of hindsight - I don't think I would have made a better choice at the
time). I don't think that Perl's internal representation for unicode
strings is guaranteed to be utf8. The flag more properly means "please
treat this as character data, taking special care to realise that some
of the character values may be > 255". And it's the 'special care' bit
which can cost performance.
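
To make that concrete, here is a small sketch (an illustration only,
using utf8::upgrade to switch the internal storage by hand) in which
the same characters are held both ways and compare as equal:

```perl
use strict;
use warnings;

my $plain    = "caf\xE9";    # four characters, stored one byte each
my $upgraded = $plain;
utf8::upgrade($upgraded);    # same characters, flagged internal storage

# String comparison and length work on characters, so the flag
# makes no difference to either result.
my $same_text   = ($plain eq $upgraded) ? 1 : 0;
my $same_length = (length($plain) == length($upgraded)) ? 1 : 0;
my $flags       = join ",", map { utf8::is_utf8($_) ? 1 : 0 } $plain, $upgraded;

print "eq=$same_text same length=$same_length flags=$flags\n";
```

Only the flag differs between the two copies; the character data is
identical as far as the program can tell.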

> As far as I know, the conceptual purpose of the utf8 flag is to
> indicate whether Perl considers a string to be unambiguous character
> data or binary data which could be ambiguous character data, and thus
> how Perl will treat it by default.

Yes, agreed. And it's really a bit of perl's internals which
application code shouldn't really want to examine or change directly.

[snip example of using is_utf8 to check that a perl value contains
'character data']

Why would your library routine care? It can manipulate the string as a
sequence of characters in either case. It will produce the wrong
results if passed the wrong data, but that will always be true, since
it could be passed wrong data tagged as utf8. If your routine wants
specific sequences of characters it can check for those, regardless of
the is_utf8ness of the string.
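
As a sketch of what I mean (is_simple_identifier is a hypothetical
routine invented for this example, not taken from anyone's library),
a check on the characters themselves behaves the same for tagged and
untagged input:

```perl
use strict;
use warnings;

# Hypothetical validator: accepts only the characters it can handle,
# without ever consulting the utf8 flag.
sub is_simple_identifier {
    my ($text) = @_;
    return $text =~ /\A[A-Za-z_][A-Za-z0-9_]*\z/ ? 1 : 0;
}

my $unflagged = "user_name";
my $flagged   = "user_name";
utf8::upgrade($flagged);          # same characters, flag switched on

my $wide = "user\x{263A}name";    # contains a character the routine rejects

my @results = map { is_simple_identifier($_) } $unflagged, $flagged, $wide;
print "@results\n";
```

The first two inputs are accepted and the third is rejected, purely on
the basis of which characters they contain.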

> Now, if there is some concern that character-oriented regexes and
> such are considerably slower for ASCII data than alternatives, and
> this is a problem and it can't be otherwise dealt with

I think the unicode regex engine can never be as fast as the
byte-oriented one. It has more to consider. The example code below
(vaguely like the sort of templating where I noticed the problem)
shows the unicode engine running 2-3 times as slow as the byte engine
(17s instead of 6s).

> we could
> perhaps have an additional flag which has the meaning that I ascribed
> to utf8; eg, is_chars() or is_text() etcetera; but in my mind it
> would be simpler to just leave the meaning of is_utf8 adjusted to
> mean is unambiguous character data.

I'm having trouble thinking of an example where application code might
want to check this. It's part of perl's internals, surely?

> P.S.  On a tangent, it would be nice if there was a simple test to
> see if an SV currently considered its numerical or integer or string
> etc component to be the authoritative one, so eg I could just check
> that rather than using looks_like_number or some such more
> complicated solution.  Though maybe there is already, perhaps in a
> bundled debugging or some such module, and I haven't found it yet?
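
On the P.S.: one possibility, offered as a rough sketch rather than a
recommendation, is to read the SV flags through the core B module.
The two helper names below are hypothetical, and the flags are only a
snapshot, since perl may set further flags as a value gets used in
numeric or string context:

```perl
use strict;
use warnings;
use B ();

# Hypothetical helpers: report which slots of a scalar perl currently
# regards as valid, by reading the SV flags through the core B module.
sub has_integer_slot {
    my $flags = B::svref_2object(\$_[0])->FLAGS;
    return ($flags & B::SVf_IOK) ? 1 : 0;
}

sub has_string_slot {
    my $flags = B::svref_2object(\$_[0])->FLAGS;
    return ($flags & B::SVf_POK) ? 1 : 0;
}

my $number = 42;      # assigned from a numeric literal
my $string = "42";    # assigned from a string literal

printf "number: integer=%d string=%d\n",
    has_integer_slot($number), has_string_slot($number);
printf "string: integer=%d string=%d\n",
    has_integer_slot($string), has_string_slot($string);
```

Devel::Peek's Dump() shows the same flags in human-readable form if
you just want to look at a value while debugging.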

I'd rather is_utf8 disappeared from the public API, since it's really
an internal flag and (I think) poorly named. Internally, it could then
be renamed requires_unicode_engine or something.

But what I really care about is the ability to just tell perl "data
from this source is in this encoding", "data going to this destination
is in this encoding" and get all the nice automagic handling of
conversions for me without paying the unicode engine cost on ascii
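
The declaring half of that wish is roughly what PerlIO encoding layers
provide. A minimal sketch, using in-memory filehandles so it is
self-contained (the sample bytes and the choice of ISO-8859-1 for the
destination are assumptions made up for the example):

```perl
use strict;
use warnings;

# Bytes as they might arrive from outside: "caf" plus U+00E9 in UTF-8.
my $utf8_bytes = "caf\xC3\xA9\n";

# Declare the encoding of the source once, on the handle.
open my $in, '<:encoding(UTF-8)', \$utf8_bytes or die "open for read: $!";
my $line = <$in>;
close $in;
chomp $line;    # now four characters: c, a, f, U+00E9

# Declare the encoding of the destination the same way.
my $latin1_bytes = '';
open my $out, '>:encoding(ISO-8859-1)', \$latin1_bytes
    or die "open for write: $!";
print {$out} $line;
close $out;

my $chars_read    = length $line;
my $bytes_written = length $latin1_bytes;
print "read $chars_read characters, wrote $bytes_written bytes\n";
```

Everything between the two handles works on characters; the conversions
happen at the edges.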



Bench output:

        Rate udata  data
udata  588/s    --  -63%
data  1572/s  167%    --

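use warnings;
use strict;
use Encode;
use Benchmark;

# Build a nested, template-like test string.
my $data = "";
my $count = 10;
while ($count-- > 0) {
    $data = "<%-$count tag with some text $data $count-%>";
}

# Same characters as $data, but stored with the utf8 flag on.
# Assumption: the listing shows only the plain copy; without an upgrade
# step both strings would take the same path through the regex engine.
my $udata = $data;
utf8::upgrade($udata);

my $do_what = shift || "bench";
my $run_count = shift || 10000;

if ($do_what eq 'bench') {
    Benchmark::cmpthese(-20, {
            data => sub { stress($data); },
            udata => sub { stress($udata); },
    });
}
elsif ($do_what eq 'bytes') {
    stress($data) for (1..$run_count);
}
elsif ($do_what eq 'chars') {
    stress($udata) for (1..$run_count);
}
else {
    die "Don't understand what you wanted me to do: $do_what";
}

# Apply the substitution until it stops matching, checking that each
# successful pass after the first leaves the string shorter.
# Note: the data above closes each tag with "N-%>", whereas this
# pattern looks for "%-N>".
sub stress {
    my $data = shift;
    my $oldlen;
    while ($data =~ s/<%-(\d+)([^<]*?).*%-\1>/reverse($2)/e) {
        if ($oldlen) {
            die "didn't match [$data]" unless length $data < $oldlen;
        }
        $oldlen = length $data;
    }
}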
