develooper Front page | perl.perl5.porters | Postings from June 2008

Re: Perl 5.8 and perl 5.10 differences on UTF/Pack things

From: Juerd Waalboer
Date: June 29, 2008 09:24
Subject: Re: Perl 5.8 and perl 5.10 differences on UTF/Pack things
Message ID: 20080629162446.GT27872@c4.convolution.nl

Nicholas Clark skribis 2008-06-18 16:07 (+0100):
> >    open IDX, ">:utf8", "..."
> >         if (utf8::is_utf8($chave))
> >           print IDX pack('a*x',$chave);
> >         }
> > (..)
> > Is there a cleaner way?
> Avoid using 'a' in pack?

The problem is more fundamental: this code relies on Perl's internals,
in three separate ways. This suggests to me that Unicode/UTF-8 support
was bolted onto an existing program without a clear understanding of
the consequences, maybe even as early as the 5.6 era.

1. The :utf8 layer. Use :encoding(utf8) instead. (No big deal, but read
<http://www.perlfoundation.org/perl5/index.cgi?the_utf8_perlio_layer>)

2. is_utf8. Don't use it at all, ever, unless you're building Perl
itself and need something to test it. Pretend that it's called
Internals::SvUTF8, analogous to Internals::SvREADONLY.

3. pack "a" on a text string in 5.8. There was a bug, and it's fixed.
And now we're really in trouble :). Also, this is a violation of
text/binary separation: if you use an encoding layer, you're going to
deal with character strings. However, pack "a" is clearly documented to
handle byte strings.
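
The text/binary separation from point 3 can be sketched as follows. This
is only an illustration: the sample string and the variable names are
made up here and do not come from the original program. It shows a
character string being encoded explicitly before pack "a" touches it.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: a character (text) string of 6 characters.
my $text = "caf\x{e9} \x{263a}";

# Encoding explicitly turns the text into a byte string.
my $bytes = encode("UTF-8", $text);

# pack "a" is applied to the byte string, never to the text string.
my $record = pack("a*x", $bytes);

# Decoding goes back from bytes to text.
my $round_trip = decode("UTF-8", $bytes);

printf "%d characters, %d bytes, %d bytes in the packed record\n",
    length($text), length($bytes), length($record);
```

The character count and the byte count differ, which is exactly why the
two kinds of string should not be mixed.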

To be honest, I don't understand what the code snippet is *supposed* to
do. If you want an all UTF8 file, the :encoding(utf8) (or, if you
insist, :utf8) suffices and then you can just print strings to it. You
don't need to know the internal state, Perl handles the encoding for
you.
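
A minimal sketch of "just print strings to it", assuming an in-memory
filehandle as a stand-in for the real index file. The sample value of
$chave is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample key containing non-ASCII characters (7 characters).
my $chave = "cora\x{e7}\x{e3}o";

# An in-memory filehandle stands in for the real index file.
my $buffer = '';
open my $idx, ">:encoding(utf8)", \$buffer or die "open failed: $!";

# No is_utf8 check and no pack: the layer encodes on output.
print {$idx} $chave;
close $idx or die "close failed: $!";

# $buffer now holds the UTF-8 bytes that the layer produced.
printf "%d characters in, %d bytes out\n", length($chave), length($buffer);
```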

Fixing another archaism,
<http://www.perlfoundation.org/perl5/index.cgi?bareword_uppercase_filehandles>,
here's what I think your code should probably look like:

    use Encode qw(encode_utf8);
    open my $idx, ">", ...;
    print {$idx} pack("a*x", encode_utf8($chave));

Hm, actually, isn't a*x typically written as Z* instead?

    print {$idx} pack("Z*", encode_utf8($chave));

Or is there some subtle difference that I'm unaware of?
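
One way to check this is to compare the two templates directly. The
sketch below uses made-up sample data. As far as I understand the pack
documentation, "a*x" and "Z*" produce the same output when packing, and
the two letters only behave differently when unpacking.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical sample: UTF-8 bytes of a short key.
my $bytes = encode_utf8("chave \x{263a}");

# Packing: both templates append a single null byte.
my $with_ax = pack("a*x", $bytes);
my $with_z  = pack("Z*",  $bytes);
print "pack output is ",
    ($with_ax eq $with_z ? "identical" : "different"), "\n";

# Unpacking: "a*" keeps everything, "Z*" stops at the first null byte.
my $data = "ab\0cd\0";
my ($a_part) = unpack("a*", $data);
my ($z_part) = unpack("Z*", $data);
printf "a* gives %d bytes, Z* gives %d bytes\n",
    length($a_part), length($z_part);
```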

I would probably have written it a bit differently, to avoid a copy of
the entire string in memory. I don't know how large your string is, of
course.

    print {$idx} encode_utf8($chave), "\0";

And in good code you can make it even more memory efficient by encoding
in place, which is perfectly reasonable if this is the last thing
that'll happen to the string.

    utf8::encode($chave);
    print {$idx} $chave, "\0";
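
To make the difference between the two approaches concrete, here is a
small sketch with a hypothetical sample value. encode_utf8 returns a
new byte string and leaves the original alone; utf8::encode rewrites
the variable itself.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical sample key (7 characters).
my $chave = "cora\x{e7}\x{e3}o";

# Copying: $chave stays a character string, a second string is allocated.
my $copy = encode_utf8($chave);
my $chars_before = length($chave);

# In place: after this call $chave itself holds the UTF-8 bytes.
utf8::encode($chave);

printf "%d characters before, %d bytes after\n",
    $chars_before, length($chave);
print "same bytes as the copy\n" if $copy eq $chave;
```

After the in-place call the variable must be treated as a byte string,
so this only makes sense when no later code expects text in it.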

(I hear you think "premature optimization", but it's not. I sometimes deal
with big text strings on virtual machines with limited memory, and when
your strings are measured in megabytes, you had better be conservative
from the start! Unicode support is great, but it's a major source of bad
performance.)

Using an encoding PerlIO layer seems a bit odd here. If you need that
null byte, it's probably not a text file.
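
Treating the file as binary could look like the sketch below. It is an
assumption about what the index file is for, not something the original
code confirms: keys are encoded by hand, written as null-terminated
records to a raw handle, and read back by setting $/ to the null byte.
The in-memory buffer and the sample keys are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical sample keys (character strings).
my @chaves = ("cora\x{e7}\x{e3}o", "chave");

# Write: a raw (binary) handle, explicit encoding, null terminator.
my $buffer = '';
open my $out, ">:raw", \$buffer or die "open failed: $!";
print {$out} encode_utf8($_), "\0" for @chaves;
close $out or die "close failed: $!";

# Read: use the null byte as the record separator, then decode.
open my $in, "<:raw", \$buffer or die "open failed: $!";
my @read_back;
{
    local $/ = "\0";
    while (defined(my $record = <$in>)) {
        chomp $record;                       # strips the trailing "\0"
        push @read_back, decode_utf8($record);
    }
}
close $in;

printf "%d records read back from %d bytes\n",
    scalar(@read_back), length($buffer);
```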

Recommended reading: perlunitut, perlunifaq
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####@juerd.nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales@convolution.nl>
1;