perl.perl5.porters | Postings from March 2007

Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)

From: Juerd Waalboer
Date: March 30, 2007 12:03
Subject: Re: the utf8 flag (was Re: [perl #41527] decode_utf8 sets utf8 flag on plain ascii strings)
Message ID: 20070330190319.GS31277@c4.convolution.nl

John Berthels wrote 2007-03-28 9:52 (+0100):
> Well, perl goes to some lengths (implicit conversion) for you to be
> able to mix untagged-all-ascii string values and tagged-non-ascii
> transparently in your program. 

As Jarkko already mentioned, Perl internally makes a distinction between
latin1 and utf8. BOTH are fully ASCII-compatible, but no special case
exists for strings that are fully ASCII-only.
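
A minimal sketch of that point, using only the built-in utf8::upgrade and
utf8::is_utf8 functions. The variable names are made up for illustration.
An all-ASCII string can be stored either way, and the value stays the same:

```perl
use strict;
use warnings;

# Two copies of the same all-ASCII text.
my $plain    = "hello";
my $upgraded = "hello";

# Switch the internal storage of one copy to utf8.
# The string value itself does not change.
utf8::upgrade($upgraded);

my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;
my $same_value    = ($plain eq $upgraded)    ? 1 : 0;

print "flag (plain):    $flag_plain\n";
print "flag (upgraded): $flag_upgraded\n";
print "equal:           $same_value\n";
```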

> Well, I think is_utf8 is poorly named either way (with several years
> of hindsight - I don't think I would have made a better choice at the
> time).

Agreed, but for different reasons. I think it should be called
Internals::internal_encoding_is_utf8_not_latin1, with a user friendly
wrapper called Internals::encoding that returns either "latin1" or
"utf8".

> I don't think that Perl's internal representation for unicode
> strings is guaranteed to be utf8.

Indeed. It can also be latin1. The flag is on when the internal encoding
is utf8, and off when it is latin1.

> The flag more properly means "please treat this as character data

As far as Perl is concerned, ALL strings consist of character data.

Internally-latin1 strings are special because there, bytes and
characters can safely be considered equal.
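
As a rough illustration (a sketch, not something application code should
rely on), the bytes module can be used to peek at the internal size of a
string. It is loaded here without enabling the pragma, for inspection only;
the variable names are made up:

```perl
use strict;
use warnings;
use bytes ();   # makes bytes::length available without enabling the pragma

my $latin1 = "caf\x{e9}";   # stored as latin1 internally
my $utf8   = "caf\x{e9}";
utf8::upgrade($utf8);       # same text, stored as utf8 internally

# Character count versus internal byte count for each storage format.
my @latin1_sizes = (length($latin1), bytes::length($latin1));
my @utf8_sizes   = (length($utf8),   bytes::length($utf8));

print "latin1 storage: $latin1_sizes[0] characters, $latin1_sizes[1] bytes\n";
print "utf8 storage:   $utf8_sizes[0] characters, $utf8_sizes[1] bytes\n";
```

Only with latin1 storage do the two counts agree for non-ASCII text.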

> And it's the 'special care' bit which can cost performance.

My guess is that the performance costs are mostly associated with utf8
being variable width, which means that you need to scan through the
string to do just about anything.
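
To make the scanning concrete, here is a sketch of locating a character in
UTF-8 octets by hand. byte_offset_of_char is a hypothetical helper, not how
perl itself does it, and it assumes well-formed UTF-8:

```perl
use strict;
use warnings;

# Hypothetical helper: byte offset of character number $n (0-based) in a
# string of well-formed UTF-8 octets. Every earlier character has to be
# stepped over, because each one may occupy 1 to 4 octets.
sub byte_offset_of_char {
    my ($octets, $n) = @_;
    my $offset = 0;
    for (1 .. $n) {
        my $lead = ord(substr($octets, $offset, 1));
        $offset += $lead < 0x80 ? 1    # ASCII
                 : $lead < 0xE0 ? 2    # two-octet sequence
                 : $lead < 0xF0 ? 3    # three-octet sequence
                 :                4;   # four-octet sequence
    }
    return $offset;
}

# "a", "é", "€", "z" written out as UTF-8 octets.
my $octets  = "a\xC3\xA9\xE2\x82\xACz";
my @offsets = map { byte_offset_of_char($octets, $_) } 0 .. 3;

print "@offsets\n";
```

With a fixed-width encoding such as latin1, the offset of character $n is
simply $n.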

> [The UTF8 flag is] really a bit of perl's internals which application
> code shouldn't really want to examine or change directly.

Well said.

> >Now, if there is some concern that character-oriented regexes and
> >such are considerably slower for ASCII data than alternatives, and
> >this is a problem and it can't be otherwise dealt with
> I think the unicode regex engine can never be as fast as the
> byte-oriented one.

Unicode versus bytes is a weird comparison. Unicode strings are stored
as bytes too!

But it's true that a naive octet matcher is faster. If you are sure about
the encoding, are matching literal strings only (nothing case
insensitive), and don't care about character offsets or about captures
that might not be correct character strings, you can sometimes gain some
performance by encoding both the subject and the regex before matching
(e.g. with utf8::encode for max performance) and decoding afterwards. But
be careful: this can also have an adverse effect, so always benchmark
first.
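
A sketch of that trick, with made-up sample data. It only shows the
mechanics (encode copies, match, decode what you extract); whether it is
faster for a given workload has to be measured, for example with the
Benchmark module:

```perl
use strict;
use warnings;

my $subject = "na\x{ef}ve caf\x{e9}, 5 \x{20ac}";   # character string
my $needle  = "caf\x{e9}";

# Character-level match for comparison; $-[0] is a character offset here.
$subject =~ /\Q$needle\E/ or die "no character match";
my $char_offset = $-[0];

# Encode copies of both the subject and the pattern text to UTF-8 octets.
my ($subject_octets, $needle_octets) = ($subject, $needle);
utf8::encode($subject_octets);
utf8::encode($needle_octets);

# Octet-level literal match; $-[0] is now a byte offset.
$subject_octets =~ /\Q$needle_octets\E/ or die "no octet match";
my $byte_offset = $-[0];

# Decode whatever is extracted before treating it as text again.
my $hit = substr($subject_octets, $-[0], $+[0] - $-[0]);
utf8::decode($hit);

print "character offset $char_offset, byte offset $byte_offset\n";
print "round trip ok\n" if $hit eq $needle;
```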

> It has more to consider. There's some example code (vaguely like the
> sort of templating where I noticed the problem), which shows unicode
> running 2-3 times as slow (17s instead of 6s) as the byte engine.

I'd like to see and examine that. Templating is a trade that sometimes
allows for naive handling, so there might be room for improvement.

> I'd rather is_utf8 disappeared from the public API, since it's really
> an internal flag and (I think) poorly named. 

There's nothing wrong with an internal thing being part of a public API.
Perl has that everywhere. There is, however, something wrong with people
who access these internals without realising that they are, in fact,
internals, even though the documentation clearly indicates this.

Encode::is_utf8 is very clearly labeled as "[INTERNAL]" in the
documentation. 

The function may certainly be useful sometimes, like when you can output
either latin1 or utf8 and just want to get the data out, without the
performance loss of re-encoding:

    binmode $fh, ":raw";
    print {$fh}
        "Content-Type: text/plain; charset=", 
        (Encode::is_utf8($body) ? "UTF-8" : "ISO-8859-1"),
        "\n\n", $body;

> Internally, it could then be renamed requires_unicode_engine or
> something.

Unicode semantics are also needed when the string is not encoded as
utf-8 internally. Don't forget that Unicode and UTF-8 are different
things.

Regardless of the internal encoding, "x" and "é" are unicode characters,
with lots of unicode properties, like that they are lower case
alphabetic characters.
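
A small sketch of this, assuming a reasonably recent perl. It asks for the
code point and two Unicode properties of "é" under both internal encodings.
The \p{...} property tests are used because they are defined in terms of
Unicode; the variable names are made up:

```perl
use strict;
use warnings;

my $latin1 = "\x{e9}";     # "é", stored as latin1 internally
my $utf8   = "\x{e9}";
utf8::upgrade($utf8);      # "é", stored as utf8 internally

my @report;
for my $char ($latin1, $utf8) {
    push @report, join ' ',
        (utf8::is_utf8($char) ? 'utf8  ' : 'latin1'),
        sprintf('U+%04X', ord($char)),
        ($char =~ /\p{Ll}/         ? 'lowercase'  : 'not-lowercase'),
        ($char =~ /\p{Alphabetic}/ ? 'alphabetic' : 'not-alphabetic');
}

print "$_\n" for @report;
```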

> But what I really care about is the ability to just tell perl "data
> from this source is in this encoding"

    binmode $source, ":encoding(...)";

Though when your source is different, you may need to write your own
wrappers.

> "data going to this destination is in this encoding" 

    binmode $destination, ":encoding(...)";

Same caveat.
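
Put together, a minimal sketch might look like this. In-memory filehandles
stand in for a real source and destination, and the encodings (UTF-8 in,
ISO-8859-1 out) are picked purely for illustration:

```perl
use strict;
use warnings;

# Octets as they might arrive from a source: "café" encoded as UTF-8.
my $input_octets = "caf\xC3\xA9\n";

open my $source, '<', \$input_octets or die "open source: $!";
binmode $source, ':encoding(UTF-8)';       # data from this source is UTF-8
my $text = <$source>;
close $source;
chomp $text;

my $output_octets = '';
open my $destination, '>', \$output_octets or die "open destination: $!";
binmode $destination, ':encoding(ISO-8859-1)';   # destination wants latin1
print {$destination} $text;
close $destination;

printf "%d characters read, %d octets written\n",
    length($text), length($output_octets);
```

In between, $text is an ordinary character string and the program never has
to look at the utf8 flag.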

> and get all the nice automagic handling of conversions for me without
> paying the unicode engine cost on ascii data.

The conversions themselves may need "the unicode engine" to realise that
no further action is required for your ASCII data.

>    while ($data =~ s/<%-(\d+)([^<]*?).*%-\1>/reverse($2)/e) {

If you reverse, you need to know where characters end. If the internal
encoding is utf-8, knowing where individual characters end requires
scanning through the string.
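
A sketch of why the boundaries matter, with made-up sample text: reversing
the characters gives the expected result, while reversing the UTF-8 octets
of the same text leaves something that is no longer valid UTF-8.

```perl
use strict;
use warnings;

my $text = "caf\x{e9} \x{20ac}";        # "café €" as a character string

# Scalar context makes reverse operate on the characters of the string.
my $reversed_text = reverse $text;

# The same text as UTF-8 octets, reversed octet by octet.
my $octets = $text;
utf8::encode($octets);
my $reversed_octets = reverse $octets;

# utf8::decode returns false when the octets are not well-formed UTF-8.
my $still_valid = utf8::decode($reversed_octets) ? 1 : 0;

print "characters reversed correctly\n"
    if $reversed_text eq "\x{20ac} \x{e9}fac";
print "reversed octets still valid UTF-8: $still_valid\n";
```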

That's one of the reasons that Perl DOES NOT USE utf-8 when latin1
suffices. Here, you forced the issue with _utf8_on.

Perl already has this optimized. It's not in the regex engine, but in
the very implementation of strings themselves, so that other operations
may benefit too.
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
