develooper Front page | perl.perl5.porters | Postings from March 2007

Re: perl, the data, and the tf8 flag

Juerd Waalboer
March 31, 2007 09:04
Tels wrote 2007-03-31 12:40 (+0000):
> The "do not mix it" part is where I am currently having problems.
> As far as I can see, there is nothing in Perl that prevents this from
> happening, nor can I enable a warning when it happens.

This is true, but no different from other things that you should keep
track of yourself. Some operations can change the type of a variable,
not just internally, but also conceptually.

* references

$ref++, and it's no longer a ref.

* strings

$string++, and it's no longer a string.

* numbers

"x" on a number very rarely makes sense.

Though this is all visible in your code, because there are different
operators, and they are known to force their type upon the values
(simplified explanation).
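
As a rough illustration of the three cases above (variable names are made
up for this sketch): note that the string is deliberately not purely
alphanumeric, because strings matching /^[a-zA-Z]*[0-9]*$/ get Perl's
magical string increment and do stay strings.

```perl
use strict;
use warnings;
no warnings 'numeric';   # "hello world" is deliberately non-numeric

# A reference: ++ numifies it to its address and adds one.
my $ref = [1, 2, 3];
$ref++;
my $still_ref = ref($ref) ? 1 : 0;    # 0: it is a plain integer now

# A string that is not purely alphanumeric: ++ treats it as the number 0.
my $string = "hello world";
$string++;                            # $string is now 1

# A number: "x" repeats its string form.
my $number   = 5;
my $repeated = $number x 3;           # the string "555"

print "ref? $still_ref, string: $string, repeated: $repeated\n";
# prints "ref? 0, string: 1, repeated: 555"
```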

Text strings and byte strings share a single type, but also a single set
of operators. Indeed, that makes it harder to cope with keeping them
apart.

Some people may like a Hungarian notation for it.

> All you get is at some point corrupted data, or very inefficient code
> (since Perl internally uses UTF-8 while it could use just the raw
> bytes).

If you accidentally mix them, yes. But if you don't, the byte string
won't be upgraded to utf8 (when it is, that is probably a bug that
should be fixed), and your bytestring just lives on exactly like it
would have in Perl 5.005, or 4, or perhaps 1 even.

> It is not confusing to me, but gzip wouldn't actually help when Perl
> helpfully upgrades the gzipped data to utf-8 :)

Perl is helpful when it sees you're using the string as a text string.
It then assumes that it had been latin1 all the time.
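
A small sketch of what such an accidental mix looks like, assuming a
one-byte "binary" value and a one-character text string (both invented
for the example). utf8::is_utf8 is used here only to peek at the
internals for demonstration purposes.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bytes = "\xE9";          # one byte, e.g. read from a binary file
my $text  = "\x{263A}";      # one character that cannot fit in a byte

my $mixed = $bytes . $text;  # the accidental mix

# The original byte string itself is untouched ...
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;   # 0
# ... but the result is a text string of two characters,
my $mixed_len = length $mixed;                    # 2
# and byte 0xE9 is now read as U+00E9 (latin1 e-acute).
my $encoded = encode('UTF-8', $mixed);            # C3 A9 E2 98 BA

printf "%d %d %s\n", $bytes_flag, $mixed_len,
    join ' ', map { sprintf '%02X', $_ } unpack 'C*', $encoded;
# prints "0 2 C3 A9 E2 98 BA"
```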

It would be useful to have magic on a string that enforced
non-upgrading, but only for strings that you want it on.

This would be the bondage part, for when discipline was broken.
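
No such magic exists in the message above. A manual check is about as
close as one can get, and a minimal sketch could look like this.
assert_bytes is a hypothetical helper, not part of Perl or any module.
It only catches characters above 255, so a byte that was silently
reinterpreted as latin1 passes unnoticed.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a downgraded copy of its argument, or dies
# if the value contains a character above 255.
sub assert_bytes {
    my ($s) = @_;                       # work on a copy
    utf8::downgrade($s, 1)              # true second argument: do not die
        or die "not a byte string: contains a wide character\n";
    return $s;
}

my $ok  = assert_bytes("\x1F\x8B");     # fine, returned unchanged
my $err = eval { assert_bytes("\x1F\x8B" . "\x{263A}"); 1 } ? '' : $@;

print length($ok), " bytes ok; error: $err";
# prints "2 bytes ok; error: not a byte string: contains a wide character"
```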

> I know what you mean, but the problem is that you are also proposing that 
> the UTF-8 flag should be hidden from the user. So, how can I "not access 
> the UTF-8 encoded" buffer when I don't know if the buffer I access is UTF-8 
> or not?

Accessing the buffer directly is something that byte operators do, e.g.
vec and unpack("C"). If you never mix your byte strings with text
strings, and use these operators only with byte strings, you can be sure
that the variables won't be UTF8 internally.
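
For instance, a sketch with an invented three-byte value (the first
bytes of a gzip stream), touched only by byte operators:

```perl
use strict;
use warnings;

my $bytes = pack 'C*', 0x1F, 0x8B, 0x08;   # a byte string built from octets

my $first  = unpack 'C', $bytes;           # 31  (0x1F)
my $second = vec($bytes, 1, 8);            # 139 (0x8B)
vec($bytes, 2, 8) = 0x00;                  # overwrite the third byte in place

# Peeking at the internals, for demonstration only: still not UTF8.
my $upgraded = utf8::is_utf8($bytes) ? 1 : 0;

print "$first $second $upgraded ", length($bytes), "\n";
# prints "31 139 0 3"
```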

Note that if you refactor this guideline, the "UTF8" part disappears.

> > > This is costly (scanning for high bit characters to distinguish between
> > > 7bit ascii and 8bit "something else")
> > I'm not aware of Perl scanning for high bit characters in UTF8less
> > strings, or any performance loss caused by that.
> 	use Benchmark;
> 	use Encode qw/decode/;
> 	my $a = 'a' x 100_000_000;      # 7bit utf-8 off
> 	my $b = 'b' x 100_000_000;      # 7bit utf-8 off
> 	my $c = 'c' x 100_000_000;      # 7bit utf-8 flag on
> 	$c = decode('ISO-8859-1', $c);
> 	timethese (-3, {
> 	  'a eq b' => sub { $a eq $b; },
> 	  'a eq c' => sub { $a eq $c; },
> 	  } );
> Benchmark: running a eq b, a eq c for at least 3 CPU seconds...
>    a eq b: 4s (4.72 usr + -0.02 sys = 4.70 CPU) @7218655.96/s (n=33927683)
>    a eq c: 3s (2.80 usr +  0.46 sys = 3.26 CPU) @ 2.76/s (n=9)

Ah, good to know there are more people who don't mind using 100 MB
strings.

I thought you meant implicit scanning, i.e. not caused by manual
decoding, or automatic upgrading.

decode might optimize latin1 or ascii some day. The documentation
already claims that it does that, but it doesn't.

When optimizing, knowledge of the internals can help a great deal. I
stress that you don't need this knowledge for a working program, and
that working with 100 MB strings and then comparing them in a tight loop
is not common. But anyway, a nice optimization is to do utf8::downgrade
on a string that you just decoded from latin1. Then you pay only a
one-time price. Depending on your data, however, a better optimization
may be to utf8::upgrade the other two.
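
Both options can be sketched as follows, with much smaller strings than
in the benchmark and with variable names invented for the example:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $plain   = 'a' x 1000;                        # never decoded
my $decoded = decode('ISO-8859-1', 'a' x 1000);  # decoded from latin1

# Option 1: pay once, right after decoding. This dies if the string
# contains a character above 255, which latin1 input never produces.
utf8::downgrade($decoded);
my $decoded_flag = utf8::is_utf8($decoded) ? 1 : 0;   # 0

# Option 2: upgrade the other side instead.
my $other = 'a' x 1000;
utf8::upgrade($other);
my $other_flag = utf8::is_utf8($other) ? 1 : 0;       # 1

# Either way the strings still compare equal as text.
my $same = ($plain eq $decoded && $decoded eq $other) ? 1 : 0;
print "$decoded_flag $other_flag $same\n";
# prints "0 1 1"
```

Which of the two pays off depends on how many of the strings being
compared already carry the flag.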

Kind regards,

  juerd waalboer:  perl hacker  <>  <>
  convolution:     ict solutions and consultancy <>

I do not trust voting computers.
See <>.
