Tels wrote 2007-03-31 12:40 (+0000):
> The "do not mix it" is the part where I am currently having problems with.
> As far as I can see, there is nothing in Perl that prevents this from
> happening, nor can I enable a warning when it happens.

This is true, but no different from other things that you should keep
track of yourself. Some operations can change the type of a variable,
not just internally, but also conceptually.

* references    $ref++, and it's no longer a ref.
* strings       $string++, and it's no longer a string.
* numbers       "x" on a number very rarely makes sense.

All of this is visible in your code, though, because there are different
operators, and they are known to force their type upon the values
(simplified explanation).

Text strings and byte strings share a single type, but also a single set
of operators. Indeed, that makes it harder to keep them apart. Some
people may like a Hungarian notation for it.

> All you get is at some point corrupted data, or very inefficient code
> (since Perl internally uses UTF-8 while it could use just the raw
> bytes).

If you accidentally mix them, yes. But if you don't, the byte string
won't be upgraded to utf8 (when it is, that is probably a bug that
should be fixed), and your byte string just lives on exactly like it
would have in Perl 5.005, or 4, or perhaps 1 even.

> It is not confusing to me, but gzip wouldn't actually help when Perl
> helpfully upgrades the gzipped data to utf-8 :)

Perl is helpful when it sees you're using the string as a text string.
It then assumes that it had been latin1 all the time.

It would be useful to have magic on a string that enforced
non-upgrading, but only for strings that you want it on. This would be
the bondage part, for when discipline was broken.

> I know what you mean, but the problem is that you are also proposing that
> the UTF-8 flag should be hidden from the user.
> So, how can I "not access the UTF-8 encoded" buffer when I don't know
> if the buffer I access is UTF-8 or not?

Accessing the buffer directly is something that byte operators do, e.g.
vec and unpack("C"). If you never mix your byte strings with text
strings, and use these operators only with byte strings, you can be sure
that the variables won't be UTF8 internally.

Note that if you refactor this guideline, the "UTF8" part disappears.

> > > This is costly (scanning for high bit characters to distinguish between
> > > 7bit ascii and 8bit "something else")
> > I'm not aware of Perl scanning for high bit characters in UTF8less
> > strings, or any performance loss caused by that.
> use Benchmark;
> use Encode qw/decode/;
> my $a = 'a' x 100_000_000; # 7bit utf-8 off
> my $b = 'b' x 100_000_000; # 7bit utf-8 off
> my $c = 'c' x 100_000_000; # 7bit utf-8 flag on
> $c = decode('ISO-8859-1', $c);
> timethese (-3, {
>     'a eq b' => sub { $a eq $b; },
>     'a eq c' => sub { $a eq $c; },
> } );
> Benchmark: running a eq b, a eq c for at least 3 CPU seconds...
>     a eq b: 4s (4.72 usr + -0.02 sys = 4.70 CPU) @7218655.96/s (n=33927683)
>     a eq c: 3s (2.80 usr + 0.46 sys = 3.26 CPU) @ 2.76/s (n=9)

Ah, good to know there are more people who don't mind using 100 MB
strings.

I thought you meant implicit scanning, i.e. scanning not caused by
manual decoding or automatic upgrading.

decode might optimize latin1 or ascii some day. The documentation
already claims that it does that, but it doesn't.

When optimizing, knowledge of the internals can help a great deal. I
stress that you don't need this knowledge for a working program, and
that working with 100 MB strings and then comparing them in a tight loop
is not common.

But anyway, a nice optimization is to do utf8::downgrade on a string
that you just decoded from latin1. Then you pay only a one-time price.
Depending on your data, however, a better optimization may be to
utf8::upgrade the other two.
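A minimal sketch of both options, on small strings rather than 100 MB ones. The sample data ("caf\xE9") and the variable names are made up for illustration; utf8::downgrade, utf8::upgrade and utf8::is_utf8 are the core functions. Whether either option pays off depends on your data and your perl, so treat it as something to benchmark, not as a rule.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: a latin1 byte string (0xE9 is e-acute).
my $bytes = "caf\xE9";

# Decode to a text string. The internal representation of the result
# is up to Encode; in the benchmark above it was upgraded.
my $text = decode('ISO-8859-1', $bytes);
my $flag_after_decode = utf8::is_utf8($text) ? 1 : 0;

# Option 1: pay a one-time price and go back to one byte per character.
# The second argument (FAIL_OK) makes downgrade return false instead of
# dying if the string contains characters above 0xFF.
my $ok = utf8::downgrade($text, 1);
my $flag_after_downgrade = utf8::is_utf8($text) ? 1 : 0;

# Option 2: upgrade the string on the other side of the comparison
# instead, so that both sides share the same internal representation.
my $other = 'a' x 10;
utf8::upgrade($other);
my $other_flag = utf8::is_utf8($other) ? 1 : 0;

# Neither call changes the value of the string, only its storage.
my $same_text  = ($text  eq "caf\xE9") ? 1 : 0;
my $same_other = ($other eq 'a' x 10)  ? 1 : 0;

print "text:  flag after decode=$flag_after_decode, ",
      "after downgrade=$flag_after_downgrade, same value=$same_text\n";
print "other: flag after upgrade=$other_flag, same value=$same_other\n";
```

utf8::is_utf8 is used here only to peek at the internals while optimizing. A working program should not need it.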
-- 
kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.