Re: perl, the data, and the utf8 flag

From:
Glenn Linderman
Date:
April 1, 2007 16:06
Subject:
Re: perl, the data, and the utf8 flag
Message ID:
46103A9F.80605@NevCal.com
On approximately 4/1/2007 2:01 PM, came the following characters from 
the keyboard of Juerd Waalboer:
> Glenn Linderman skribis 2007-03-31 18:44 (-0700):
>   
>> You sidestepped comment on what the range is for data stored in 
>> multi-bytes format, though... is it [0:2^30-1] or [0:2^32-1], or what?  
>>     
>
> I didn't know, but used the following to find out by brute force:
>
>     juerd@lanova:~$ perl -le'$a = 1; while ($a *= 2) { ord(chr $a) == $a or die $a }'
>     4294967296 at -e line 1.
>     juerd@lanova:~$ perl -le'print ord(chr(4294967295))'
>     4294967295
>
> So 32 bits, or in that cryptic format of yours: [0:2^32-1] ;)
>   

Thanks!  I thought about doing something like that, but was too busy 
reading, and trying to understand!

And the notation isn't exactly mine; some math professor or researcher 
figured it out :)  Maybe I'm misusing it beyond recognition, though :)


> The questions that you have asked here may be useful additions to
> perlunifaq.
>   

Feel free to use/revise any of it... I won't attempt to impose 
restrictions on reuse of anything in this thread.


>> 1. operations that add characters greater than 255
>> 2. joining text strings with byte strings
>> 3. byte operations
>>     
>
> It's actually still just 1, not 2 or 3: "operations that add characters
> greater than 255" is equal to "joining text strings with byte strings",
> because anything with characters >255 is definitely a text string.
> "joining text strings with byte strings" is again equal to "using text
> strings in byte operations", because concatenating with a byte string
> can be seen as a byte operation.
>   

Well, I can see 1 & 2 as being the same, but if you are truly doing 
"byte operations", then there is no "multi-byte" involved.  But I guess 
the way you worded that was somewhat ambiguous... the way I worded it 
may have changed your intention...

You had 2 & 3 in the same sentence, and were maybe implying that while 
one might think they were doing a byte operation, they were actually 
not, because they were joining text strings with byte strings.


>> [perlfunc/"pack"]
>>     
>
> All of the documentation for pack is, in my humble opinion, a candidate
> for a rewrite. 
>     
> I didn't touch pack's documentation when I updated unicode documentation
> recently, because I never thought there would be someone convinced that
> Perl should treat entire multibyte characters as single bytes/octets.
>   

Well, I wouldn't go so far as to call entire multibyte characters 
"single bytes/octets", but I would go so far as to call them "binary 
values in the range [0:2^32-1] using a variable-length, multi-byte 
encoding", and a string of them a "binary sequence of values in the 
range [0:2^32-1] using a variable-length, multi-byte encoding".

Why?

Because that's what seems to be actually implemented... it isn't 
Unicode, it isn't UTF-8, perl carefully (and confusingly) calls it utf8, 
and others have referred to it as UTF-X, and it is only operations that 
actually assign specific character or codepoint semantics to the values 
that interpret them as characters.  Those operations are Encode.pm, I/O 
encoding/decoding layers, and regexp operations that understand 
case-shifting or character classes.


> But, it appears now that bleadperl does do that for other pack template
> letters, just not for "C". I think this change is a bad one and should be
> reversed, but if it's not reversed, then the special case for "C" is
> indeed bad and should be removed.
>
> Personally, I think it's better to warn when unpack is used on a
> multibyte string, and promise no specific return value. 
>   

Yeah, something in pack/unpack needs to change to remove the 
inconsistency... it would be nice if the documentation also described 
what actually happens, how it can be used/misused, and how accidents can 
be avoided.


> An alternative would be the possibility for having a string type that is
> explicitly only for byte buffers.
>
> Let's call that, hypothetically, a "blob". There could be a "blob"
> operator that adds this protecting magic to an existing string.
>
> You could say: "blob my $foo". Sounds dwimmy enough.
>   

Isn't there a syntax like "my blob $foo"?  Is that restricted to 
objects?  Is a blob an object or not, or could/should it be?

Could an object be created that would embed a "bytes-only string", and 
protect it?  Or is magic really needed?
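
For what it's worth, here's roughly the kind of object I'm imagining -- 
purely a hypothetical sketch (there is no Blob module that I know of), 
using overloading rather than magic to refuse accidental string use:

    package Blob;
    use strict;
    use warnings;
    use Carp qw(croak);
    use overload
        '.'  => sub { croak "refusing to concatenate a Blob into a string" },
        '""' => sub { croak "refusing to stringify a Blob" };

    sub new {
        my ($class, $bytes) = @_;
        croak "a Blob may only hold octets (values 0..255)"
            if $bytes =~ /[^\x00-\xFF]/;
        return bless \$bytes, $class;
    }

    sub bytes { ${ $_[0] } }   # the only way out is an explicit unwrap

    package main;
    my $buf = Blob->new(pack "N", 42);
    my $raw = $buf->bytes;            # deliberate access: fine
    # my $oops = "text" . $buf;       # dies instead of silently upgrading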


More about pack/unpack:
> Current stable Perl (5.8.8) uses the internal buffer directly.
>
> Current bleadperl (5.9.5 to be) uses the codepoints, except for the "C"
> letter.
>
> I believe that neither is "right", because it simply does not make any
> sense to do byte packing or unpacking on text strings. Hence: a warning,
> and then return any value you like. I would prefer if the internal
> buffer was used, just like in current stable Perl, because otherwise
> existing deficient code may start falling apart. But I'm okay with
> codepoints (and indeed, then "C" should probably be changed to also use
> codepoints), or undef, or random values.
>   

OK, so there's a significant difference between stable and blead.  And 
it sounds like it is incompatible, and will break some amount of code.

I tend to think, by definition (per one of the inconsistent pieces of 
pack/unpack documentation), that the buffer produced by pack should be a 
bytes buffer, and the buffer accepted by unpack should be a bytes 
buffer.  Always.  Even when U is used to insert/extract a (sequence of) 
"binary values in the range [0:2^32-1] using a variable-length, 
multi-byte encoding".

And note that as far as I can tell, U doesn't implement Unicode 
semantics in any way... it just uses a variable-length multi-byte binary 
encoding scheme that is also used in the Unicode standard for UTF-8 
encoding.
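
To make that mouthful concrete, a small illustration (behaviour as I 
observe it on 5.8.x; blead may differ, per the above):

    my $s = pack("U", 0x263A);                 # one value, 0x263A
    print length($s), "\n";                    # 1 -- one value in the sequence
    print ord($s) == 0x263A ? "ok\n" : "?\n";  # the value round-trips unchanged
    { use bytes; print length($s), "\n"; }     # 3 -- the variable-length,
                                               #   multi-byte encoding underneath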


>> OK.  What do you recommend when needing to store a UTF-8 string into a 
>> struct?  
>>     
>
> My Perl language doesn't have structs. I store strings in scalars, and
> those handle both byte data (which could be UTF-8 encoded text) and
> unicode data just fine.
>   

Well, that's a cop-out.  My Perl language has structs!  Manipulated with 
pack/unpack.  But in doing so, I surely don't want my struct upgraded or 
downgraded.  But you gave your real answer below, and I also figured it 
out in my followup:


> Then add UTF-8 encoded strings to it! That's okay, because *encoded*
> strings are byte strings.
>
>     my $encoded_string = encode("UTF-8", $text_string);
>     $byte_string .= $encoded_string;  # ok!
>
> You can use "utf8" (no hyphen) instead of "utf-8" (with hyphen) if you
> don't care about codepoints that don't exist in Unicode (yet), or
> "utf-8" (with hyphen) if you want strictness. For utf8 (no hyphen),
> there is the shortcut function encode_utf8.
>
>   
>> Seems like "use bytes;" is a perfect match for the operations that
>> work on the simulated memory.
>> Maybe this would be a place where you would agree to make an exception 
>> to your above advice? 
>>     
>
> No. Making an exception here will only hurt in the long run, because the
> internal byte buffer that bytes:: accesses may change encoding over
> time, or because of its contents.
>   

Gotcha!  (off to recode a bit, but now I know how to en-/decode_utf8 as 
needed).
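
For the archives, roughly what that recoding looks like -- a sketch 
only, the field layout here is made up and not my real struct:

    use Encode qw(encode decode);

    my $name   = "Bj\x{F8}rn";                  # text string
    my $bytes  = encode("UTF-8", $name);        # encode first: pack gets only bytes
    my $struct = pack("N N/a*", 42, $bytes);    # id, then length-prefixed UTF-8 field
    my ($id, $raw) = unpack("N N/a*", $struct); # $raw comes back out as bytes...
    my $text = decode("UTF-8", $raw);           # ...and is decoded back to text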


>> So you are saying that Data::Dumper treats strings as text, whether they 
>> are text or binary.
>>     
>
> No, it uses strings and really doesn't know or care if 8 bit strings are
> internally-latin1 text strings, or byte strings. However, if you pull
> Dumper's output through an :encoding layer, or through encode(), the
> bytes will be assumed to be latin1 text.
>
> Useqq avoids this by outputting the bytes as escapes rather than literal
> bytes.
>   

Useqq avoids performance too, by being pure Perl... on the other hand, I 
don't typically Data::Dumper anything that contains binary gibberish, 
because, after all, it is not very readable, and I like to be able to 
read stuff in files... when possible... if not possible, then I use a 
binary file format, binmode, read it in, and unpack it.
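
(For anyone following along, the Useqq knob is just a package variable:)

    use Data::Dumper;
    $Data::Dumper::Useqq = 1;        # escape non-printables instead of raw bytes
    print Dumper("caf\x{E9}\n");     # $VAR1 = "caf\351\n"; -- plain ASCII output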


>> The problem is that there are two (<- :) *) kinds of data that regexp's 
>> can operate on:
>> 1) Unicode multi-byte
>> 1) ASCII byte
>> 1) ASCII multi-byte
>> 2) Latin-1 byte
>> 1) Latin-1 multi-byte
>>     
>
> It's a bit different. 
>
> Regexes, or actually, case independency and predefined character
> classes, work on characters (note: that's Perl jargon for
> "codepoints"!). 
>
> Codepoints stay the same, regardless of internal re-encoding. The
> semantics should also stay the same. 
>
> But they don't. Characters in the non-ascii latin1 range are treated
> differently, based on their internal encoding. That sucks, because the
> programmer doesn't know the internal encoding, and because the internal
> encoding depends on the history of the string.
>
> The easiest examples are \s and \w.
>
> \s matches space, form feed, tab, newline, and carriage return. Except
> when the internal encoding happens to be UTF8. Then, it also matches non
> breaking space (0xA0).
>
> \w matches A-Z, a-z, 0-9, and underscore. Except when the internal
> encoding happens to be UTF8. Then, it also matches accented word
> characters like ÿ, Á, ê, ñ, and Ø, and word characters like þ, æ and ð.
>
> Because of backwards compatibility, it cannot be fixed without adding
> new syntax. These semantics can be described as "ASCII mode" and
> "Unicode mode". That's where the suggested flags /a and /u come from.
>
> Note that not all predefined character classes work like this. \p{}, for
> example, always uses unicode semantics.
>   

Yes, that is a good, clear exposition of the details.  But I think the 
results are that the four cases marked 1) work, and the case marked 2) 
doesn't; would you agree?  Of course regexp's that avoid character classes 
and case-shifting work in all cases...
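
To state the \w discrepancy in code, for the archives (behaviour as I 
understand it on current perls; \xFF is Latin-1 ÿ):

    my $y = "\xFF";                            # single-byte string
    print $y =~ /\w/ ? "word\n" : "not\n";     # "not"  -- ASCII semantics
    utf8::upgrade($y);                         # same codepoint, multi-bytes storage
    print $y =~ /\w/ ? "word\n" : "not\n";     # "word" -- Unicode semantics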


>> * That's 5 kinds of data
>>     
>
> The world (not just Perl!) has two kinds of string data, byte strings
> and text strings.
>   

Don't forget  "binary sequence of values in the range [0:2^32-1] using a 
variable-length, multi-byte encoding" :)

Oh yes, you call those text strings.  Whether or not they contain 
text...  I prefer to discuss ways of storing data (per above), and 
separately discuss when and why such methods are useful, and when and 
why particular semantics are implied for particular data.

So the reason to use bytes strings is if all your binary values that you 
want to represent fit in bytes; if they are smaller still, there is a 
space/time tradeoff vs using vec or other smaller-number-of-bits 
representations.

So the reason to use multiple bytes is if your number sequences have a 
larger range of values.  Pack/unpack can help with that (see codes n N s 
S q Q).

So the reason to use "binary sequence of values in the range [0:2^32-1] 
using a variable-length, multi-byte encoding", is if most of your data 
is small values, but there are occasional larger values.  This _happens_ 
(by design, of course) to fit Unicode text semantics well, and so if you 
are dealing with sequences of Unicode codepoints, it can be a very 
effective storage technique.

etc.
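
To put some rough numbers on that space tradeoff (byte counts as I work 
them out by hand; measure with "use bytes; length(...)" if in doubt):

    my @vals   = (1, 2, 3, 70_000);    # mostly small values, one large one
    my $fixed  = pack("N*", @vals);    # 16 bytes: always 4 per value
    my $varlen = pack("U*", @vals);    # 7 bytes internally: 1 each for the
                                       #   small values, 4 for 70_000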

Proper terminology, and avoiding mixing the concepts of storage 
technique and semantics, can go a _long_ way towards aiding 
understanding.  Of course, without thinking that through ahead of time, 
you wind up with things like the "utf8 flag", which really doesn't mean 
utf8 at all, but instead means "storage format is multi-bytes".

> Perl has two kinds of string representation: 8 bit octets, and utf8
>
> There is no direct mapping between them, but an overlap.
>
>     Your data is:           binary data          text data
>     Perl uses:              8 bit                8 bit or utf8.
>
> or, the other way around:
>
>     Perl uses:              8 bit                utf8
>     Your data is:           binary or text       text
>
> Whenever you notice that your byte string got the UTF8 flag somehow, you
> found a bug in your code (you didn't properly separate text from binary)
> or you found a bug in perl (or a module you used).
>
> Note that Perl has no special treatment for ASCII data! I just call the
> "pre-unicode" regex semantics "ASCII mode" because the character classes
> only match ASCII with it.
>   

Sorry, that's word-smithing: true as worded, but false in disguise.  I 
think it would be helpful to document somewhere all the places that 
character semantics are actually used in Perl.  I think the list is 
fairly short, and fairly similar for all character representations.  I'm 
not 100% sure I can be 100% complete here, but I'll claim to be, to get 
the peanut gallery that knows better involved, eh?

ASCII semantics are used for:

\L, \l, \U, \u, \Q operations in bytes string constants (no character 
code values > 255)
regexp suboperations when used with byte string parameter:
   case-insensitivity
   character classes \w \W \s \S \b \B
   modifiers /i

Unicode semantics are used for:

\L, \l, \U, \u, \Q operations in multi-bytes string constants (at least 
one character code value > 255)
regexp suboperations when used with multi-bytes string parameter:
   case-insensitivity
   match codes \w \W \s \S \b \B \Z
   modifiers /i /m

Encode.pm and file handles with encoding layers explicitly define the 
character semantics they use, which include ASCII, Unicode, and many 
other encodings.
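
To make the case-shifting rows of those lists concrete (again, \xFF is 
ÿ; behaviour as I understand current perls):

    my $b = "\xFF";                  # bytes string
    printf "%vX\n", uc $b;           # FF  -- ASCII semantics, unchanged
    utf8::upgrade($b);               # same codepoint, multi-bytes storage
    printf "%vX\n", uc $b;           # 178 -- Unicode semantics, ÿ -> Ÿ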

Note: it appears to me that Perl (except for Encode.pm) _never_ applies 
Latin-1 semantics to anything, at present.  But people talk about it, 
because if bytes strings are converted to multi-bytes strings, the result 
is the same as converting Latin-1 character codes to Unicode character 
codes.


>> So the "unicode regexp" problem is really a "Latin-1 bytes regexp"
>> problem?  Yes, your /u feature would seem to cure that, then, if that
>> is the only problem.
>>     
>
> Basically, but it would be nice to have /a too, because the old ASCII \w
> was so incredibly widely used, that even with unicode text data, you may
> still want to match it. I have, for example, used it for security
> reasons: \w was the whitelist for characters in page names. I have now
> replaced it with [A-Za-z0-9_] explicitly, because the page name
> itself is a text string and for security reasons I don't *want to*
> support other characters.
>   

I didn't mean to preclude /a as part of the solution, but as you say, it 
is not actually a necessary part.

I think this feature would be the first implementation of Latin-1 
semantics in Perl, and here you are calling it Unicode instead!  :)  But 
that is reasonable because Latin-1 is a subset of Unicode.


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

