
Re: perl, the data, and the tf8 flag

From:
Juerd Waalboer
Date:
April 1, 2007 14:02
Subject:
Re: perl, the data, and the tf8 flag
Message ID:
20070401210146.GO31277@c4.convolution.nl
Glenn Linderman wrote 2007-03-31 18:44 (-0700):
> You sidestepped comment on what the range is for data stored in 
> multi-bytes format, though... is it [0:2^30-1] or [0:2^32-1], or what?  

I didn't know, but used the following to find out by brute force:

    juerd@lanova:~$ perl -le'$a = 1; while ($a *= 2) { ord(chr $a) == $a or die $a }'
    4294967296 at -e line 1.
    juerd@lanova:~$ perl -le'print ord(chr(4294967295))'
    4294967295

So 32 bits, or in that cryptic format of yours: [0:2^32-1] ;)

> I've read perlunifaq several times trying to figure things out.  I read 
> perlunitut yesterday, when it came up in this discussion.

Great!

> I found perlunifaq quite opaque the first several times I read it.  

It's meant to be read after perlunitut, where the basics are outlined.
The FAQ assumes basic knowledge.

(In an earlier version, they were both in one document.)

> perlunitut seems easier to follow, but didn't answer all my questions 
> either.

The questions that you have asked here may be useful additions to
perlunifaq.

> Again, you say two, but describe 3.... :)  Maybe that is a habit of yours?

You're on to me!

> 1. operations that add characters greater than 255
> 2. joining text strings with byte strings
> 3. byte operations

It's actually still just 1, not 2 or 3: "operations that add characters
greater than 255" is equal to "joining text strings with byte strings",
because anything with characters >255 is definitely a text string.
"joining text strings with byte strings" is again equal to "using text
strings in byte operations", because concatenating with a byte string
can be seen as a byte operation.

> [perlfunc/"pack"]

All of the documentation for pack is, in my humble opinion, a candidate
for a rewrite.
    
I didn't touch pack's documentation when I updated unicode documentation
recently, because I never thought there would be someone convinced that
Perl should treat entire multibyte characters as single bytes/octets.

But, it appears now that bleadperl does do that for other pack template
letters, just not for "C". I think this change is a bad one and should be
reversed, but if it's not reversed, then the special case for "C" is
indeed bad and should be removed.

Personally, I think it's better to warn when unpack is used on a
multibyte string, and promise no specific return value. 


An alternative would be the possibility for having a string type that is
explicitly only for byte buffers.

Let's call that, hypothetically, a "blob". There could be a "blob"
operator that adds this protecting magic to an existing string.

You could say: "blob my $foo". Sounds dwimmy enough.

Of course, it's attractive to add an optional parameter for the encoding
of the binary string. All non-blobs added to the string would be
automatically encoded. This doesn't reliably work in the other
direction, because the bytes may not be part of the text. An
encodingless blob can't ever have UTF8 things added to it, but doesn't
mind accepting latin1 (UTF8less) data.

This all isn't needed for writing stable programs, but it would help
those who feel insecure without a tool to enforce separation.

> I could imagine something like the following being invented as a 
> communication protocol...
> $x = $text_string . "\0" . pack( "template", @params);

No.

A communication protocol needs to define the required byte encoding for
text. There are many encodings, and defaulting to any of them, without
making that part of your specification, is a huge mistake. You think
UTF-8 is standard? Guess again, because many applications use UTF-16 or
UCS-2 instead. And of course, there are still many things that don't
even handle unicode.

You must encode text strings, and you can safely use the result of that
in a byte string or stream. The world doesn't quite DWYM enough to guess
what you meant.

(You already found out, and replied to your own post, but I felt it
would be good if I responded here anyway.)

> # send somewhere, that does the following
> ( $retrieve_text_string, $unpack_me ) = split( "\0", $x );
> @retrieve_params = unpack( "template", $unpack_me );

And then decode.
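
A minimal sketch of that round trip, assuming UTF-8 is the encoding the
protocol specifies. The "N n" template and the sample values are made up
for illustration; they are not from Glenn's post.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical wire format: encoded text, a NUL byte, then packed parameters.
my $text_string = "caf\x{E9} \x{263A}";   # text string (characters)
my @params      = (123456, 42);           # illustrative values

# Sender: encode the text first, so every part being joined is a byte string.
my $x = encode("UTF-8", $text_string) . "\0" . pack("N n", @params);

# Receiver: split at the first NUL only (limit 2), because the packed
# part may itself contain NUL bytes.
my ($text_bytes, $unpack_me) = split /\0/, $x, 2;
my @retrieve_params      = unpack("N n", $unpack_me);
my $retrieve_text_string = decode("UTF-8", $text_bytes);   # and then decode
```

This only works if the text itself never contains U+0000; a length
prefix instead of a NUL separator would avoid that restriction.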

> It seems that would violate your recommendation of keeping things 
> separate, but one needs to avoid two separate communications, for 
> efficiency, eh?  

Absolutely!

> So do you have a recommended practice for this sort of action?

Decoding and encoding.

Remember, "decoding" converts byte strings to text strings, "encoding" 
converts text strings to byte strings. Whenever you do such a
conversion, you need to know which byte encoding was used, or is to be
used on the byte side.

It's okay to use text strings together, and it's okay to use byte
strings together. Just don't mix text strings and byte strings within
the same filehandle or concatenating operator.

And realise that only you, the programmer, can know the difference. Perl
can't help you here.
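
To illustrate what goes wrong when the two are mixed, here is a small
sketch. The sample characters are arbitrary; the point is only that Perl
has no way to tell that $bytes holds encoded data.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "\x{20AC}";                 # text string: one character, EURO SIGN
my $bytes = encode("UTF-8", "\x{E9}");  # byte string: two bytes, 0xC3 0xA9

# Mixing: each byte is treated as a latin1 character (U+00C3, U+00A9),
# so the e-acute silently turns into two unrelated characters.
my $mixed = $text . $bytes;
my @mixed_codepoints = map { ord } split //, $mixed;    # (0x20AC, 0xC3, 0xA9)

# Keeping them separate: decode first, then join text with text.
my $joined = $text . decode("UTF-8", $bytes);
my @joined_codepoints = map { ord } split //, $joined;  # (0x20AC, 0xE9)
```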

> Not sure how it would return undef???  The values are hardly random, 
> either, they come out of the buffer it is handed, right?  

Current stable Perl (5.8.8) uses the internal buffer directly.

Current bleadperl (5.9.5 to be) uses the codepoints, except for the "C"
letter.

I believe that neither is "right", because it simply does not make any
sense to do byte packing or unpacking on text strings. Hence: a warning,
and then return any value you like. I would prefer if the internal
buffer was used, just like in current stable Perl, because otherwise
existing deficient code may start falling apart. But I'm okay with
codepoints (and indeed, then "C" should probably be changed to also use
codepoints), or undef, or random values.
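
Whatever unpack ends up doing with a text string, encoding first gives a
result that does not depend on the perl version. A minimal sketch,
assuming the bytes you want are the UTF-8 encoding of the text:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text_string = "\x{263A}";                     # one character, above 255
my $byte_string = encode("UTF-8", $text_string);  # three bytes

# Unpacking an encoded byte string is well defined: one value per byte.
my @octets = unpack("C*", $byte_string);          # (0xE2, 0x98, 0xBA)
```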

> But if a multi-bytes string is passed to a bytes-expecting template,
> then clearly it won't produce the results you expect...

Exactly.

> OK.  What do you recommend when needing to store a UTF-8 string into a 
> struct?  

My Perl language doesn't have structs. I store strings in scalars, and
those handle both byte data (which could be UTF-8 encoded text) and
unicode data just fine.

> so it truly wants to stay bytes.  But I want to store UTF-8 encoded
> strings into it.

Then add UTF-8 encoded strings to it! That's okay, because *encoded*
strings are byte strings.

    my $encoded_string = encode("UTF-8", $text_string);
    $byte_string .= $encoded_string;  # ok!

You can use "utf8" (no hyphen) instead of "utf-8" (with hyphen) if you
don't care about codepoints that don't exist in Unicode (yet), or
"utf-8" (with hyphen) if you want strictness. For utf8 (no hyphen),
there is the shortcut function encode_utf8.
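
A small sketch of how the two spellings resolve, using Encode's
find_encoding. The sample string is arbitrary and contains only valid
Unicode, so both encodings produce the same bytes for it.

```perl
use strict;
use warnings;
use Encode qw(encode encode_utf8 find_encoding);

# The hyphenated name resolves to the strict encoding, the other to the lax one.
my $strict_name = find_encoding("UTF-8")->name;   # 'utf-8-strict'
my $lax_name    = find_encoding("utf8")->name;    # 'utf8'

my $text_string = "na\x{EF}ve \x{263A}";

# encode_utf8 is the shortcut for the lax "utf8" encoding.
my $shortcut_matches = encode("utf8", $text_string) eq encode_utf8($text_string);

# For valid Unicode text, strict and lax give identical bytes.
my $strict_matches = encode("UTF-8", $text_string) eq encode_utf8($text_string);
```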

> Seems like "use bytes;" is a perfect match for the operations that
> work on the simulated memory.
> Maybe this would be a place where you would agree to make an exception 
> to your above advice? 

No. Making an exception here will only hurt in the long run, because the
internal byte buffer that bytes:: accesses may change encoding over
time, or because of its contents.
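
A minimal sketch of why that buffer is not something to rely on: the
same string, equal by every normal test, shows different bytes to
bytes:: depending on its history. utf8::upgrade stands in here for any
operation that happens to change the internal encoding.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling the pragma

my $string = "\x{E9}";                        # one character, stored as one octet
my $chars_before  = length $string;           # 1
my $octets_before = bytes::length($string);   # 1

utf8::upgrade($string);                       # same string, other internal encoding

my $chars_after  = length $string;            # still 1
my $octets_after = bytes::length($string);    # 2 (internally 0xC3 0xA9)
my $still_equal  = $string eq "\x{E9}";       # true
```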

> So you are saying that Data::Dumper treats strings as text, whether they 
> are text or binary.

No, it uses strings and really doesn't know or care if 8 bit strings are
internally-latin1 text strings, or byte strings. However, if you pull
Dumper's output through an :encoding layer, or through encode(), the
bytes will be assumed to be latin1 text.

Useqq avoids this by outputting the bytes as escapes rather than literal
bytes.
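
A short sketch of the Useqq behaviour, with an arbitrary two-byte
string standing in for binary data. The exact escape style is up to
Data::Dumper; the only property relied on here is that the output is
plain ASCII.

```perl
use strict;
use warnings;
use Data::Dumper;

my $byte_string = "\xE9\xFF";   # arbitrary binary data

local $Data::Dumper::Useqq = 1;
my $dumped = Dumper($byte_string);

# With Useqq the high bytes appear as escapes, so the dump is ASCII only
# and a later encode() or :encoding layer has nothing to reinterpret.
my $ascii_only = $dumped !~ /[^\x00-\x7F]/ ? 1 : 0;
```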

> The problem is that there are two (<- :) *) kinds of data that regexp's 
> can operate on:
> 1) Unicode multi-byte
> 1) ASCII byte
> 1) ASCII multi-byte
> 2) Latin-1 byte
> 1) Latin-1 multi-byte

It's a bit different. 

Regexes, or actually, case insensitivity and predefined character
classes, work on characters (note: that's Perl jargon for
"codepoints"!). 

Codepoints stay the same, regardless of internal re-encoding. The
semantics should also stay the same. 

But they don't. Characters in the non-ascii latin1 range are treated
differently, based on their internal encoding. That sucks, because the
programmer doesn't know the internal encoding, and because the internal
encoding depends on the history of the string.

The easiest examples are \s and \w.

\s matches space, form feed, tab, newline, and carriage return. Except
when the internal encoding happens to be UTF8. Then, it also matches non
breaking space (0xA0).

\w matches A-Z, a-z, 0-9, and underscore. Except when the internal
encoding happens to be UTF8. Then, it also matches accented word
characters like ÿ, Á, ê, ñ, and Ø, and word characters like þ, æ and ð.
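
The difference can be reproduced with a sketch like the following. It
assumes no locale, no "use utf8" and no feature that changes string
semantics is in effect; utf8::upgrade stands in for whatever history
caused the internal re-encoding.

```perl
use strict;
use warnings;

my $native   = "\x{E9}";     # e-acute, stored as a single octet
my $upgraded = "\x{E9}";
utf8::upgrade($upgraded);    # same character, internally UTF8

my $same_string      = $native eq $upgraded ? 1 : 0;   # 1
my $native_is_word   = $native   =~ /\w/ ? 1 : 0;      # 0
my $upgraded_is_word = $upgraded =~ /\w/ ? 1 : 0;      # 1

# The same happens with \s and the non breaking space.
my $nbsp_native   = "\x{A0}";
my $nbsp_upgraded = "\x{A0}";
utf8::upgrade($nbsp_upgraded);
my $native_is_space   = $nbsp_native   =~ /\s/ ? 1 : 0;   # 0
my $upgraded_is_space = $nbsp_upgraded =~ /\s/ ? 1 : 0;   # 1
```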

Because of backwards compatibility, it cannot be fixed without adding
new syntax. These semantics can be described as "ASCII mode" and
"Unicode mode". That's where the suggested flags /a and /u come from.

Note that not all predefined character classes work like this. \p{}, for
example, always uses unicode semantics.
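
A sketch of what the two modes would look like in use. It assumes a perl
that accepts /a and /u as regex modifiers with the meaning suggested
here; as far as I know that holds for 5.14 and later, but on the perls
discussed in this thread the first two matches will not compile.

```perl
use strict;
use warnings;

my $string = "\x{E9}";   # e-acute, stored as a single octet

# Explicit modes: the answer no longer depends on the internal encoding.
my $ascii_before   = $string =~ /\w/a ? 1 : 0;   # 0
my $unicode_before = $string =~ /\w/u ? 1 : 0;   # 1

utf8::upgrade($string);  # change the internal encoding

my $ascii_after   = $string =~ /\w/a ? 1 : 0;    # 0
my $unicode_after = $string =~ /\w/u ? 1 : 0;    # 1

# \p{} uses unicode semantics without needing a flag.
my $is_letter = "\x{E9}" =~ /\p{L}/ ? 1 : 0;     # 1
```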

> * That's 5 kinds of data

The world (not just Perl!) has two kinds of string data, byte strings
and text strings.

Perl has two kinds of string representation: 8 bit octets, and utf8.

There is no one-to-one mapping between the two, but they overlap.

    Your data is:           binary data          text data
    Perl uses:              8 bit                8 bit or utf8.

or, the other way around:

    Perl uses:              8 bit                utf8
    Your data is:           binary or text       text

Whenever you notice that your byte string got the UTF8 flag somehow, you
found a bug in your code (you didn't properly separate text from binary)
or you found a bug in perl (or a module you used).
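
For hunting such bugs, a debugging check along these lines may help. The
helper name is hypothetical and not part of any module. Note that it
only works in one direction: a missing flag does not prove the string is
binary, since text can be stored as 8 bit too.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical debugging helper: true if the string does not carry the
# UTF8 flag, which is what a properly separated byte string should look like.
sub looks_like_clean_byte_string {
    my ($string) = @_;
    return utf8::is_utf8($string) ? 0 : 1;
}

my $byte_string = encode("UTF-8", "\x{263A}");   # encoded, so a byte string
my $polluted    = $byte_string . "\x{263A}";     # accidentally joined with text

my $clean_ok    = looks_like_clean_byte_string($byte_string);   # 1
my $polluted_ok = looks_like_clean_byte_string($polluted);      # 0
```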

Note that Perl has no special treatment for ASCII data! I just call the
"pre-unicode" regex semantics "ASCII mode" because the character classes
only match ASCII with it.

> So the "unicode regexp" problem is really a "Latin-1 bytes regexp"
> problem?  Yes, your /u feature would seem to cure that, then, if that
> is the only problem.

Basically, but it would be nice to have /a too, because the old ASCII \w
was so incredibly widely used, that even with unicode text data, you may
still want to match it. I have, for example, used it for security
reasons: \w was the whitelist for characters in page names. I have now
replaced it with [A-Za-z0-9_] explicitly, because the page name itself
is a text string and for security reasons I don't *want to* support
other characters.
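
A minimal sketch of such a whitelist check. The sub name is made up for
illustration; the character class is the one given above.

```perl
use strict;
use warnings;

# Hypothetical validator: accepts only the explicit ASCII whitelist,
# regardless of how the string happens to be stored internally.
sub is_valid_page_name {
    my ($name) = @_;
    return $name =~ /\A[A-Za-z0-9_]+\z/ ? 1 : 0;
}

my $accented = "caf\x{E9}";
utf8::upgrade($accented);   # would match \w+ here, but not the whitelist

my $plain_ok    = is_valid_page_name("Front_Page2");   # 1
my $accented_ok = is_valid_page_name($accented);       # 0
```

Using \A and \z rather than ^ and $ also rejects a name with a trailing
newline.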
-- 
Kind regards,

  juerd waalboer:  perl hacker  <juerd@juerd.nl>  <http://juerd.nl/sig>
  convolution:     ict solutions and consultancy <sales@convolution.nl>

I don't trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.
