Re: perl, the data, and the tf8 flag

From: Glenn Linderman
Date: April 3, 2007 17:36
Subject: Re: perl, the data, and the tf8 flag
Message ID: 4612F2F1.1000703@NevCal.com

Hi Juerd,

I appreciate all the enlightenment you've given... And Marc, and Yves,
and others.  Although since you all use different terminology, it has
been interesting (and fascinating) gaining an understanding from the
prior discussion between you and Marc.  So that's why I made up my own
terminology... to try to understand it, and to try to bring
understanding to everyone.  I hope we are achieving that to some extent.

To reiterate my understanding, perl supports bytes and multi-bytes...
both of which store sequences of binary values.  "bytes" buffers store
sequences of binary values, all of which are < 256.  "multi-bytes"
buffers store sequences of binary values, at least one of which, at
some point in the buffer's lifetime, exceeded 255, and all of which are
< 2^32.
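
A minimal sketch of that distinction, using the core utf8::is_utf8 and
bytes::length introspection functions.  The variable names are made up
for illustration, and an ASCII (non-EBCDIC) platform is assumed.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

# A "bytes" buffer: every value is < 256.
my $bytes = "\xff";

# A "multi-bytes" buffer: one value exceeds 255.
my $multi = "\xff\x{100}";

# Remove the large value again; the buffer keeps its multi-bytes format.
chop $multi;

my $bytes_flag   = utf8::is_utf8($bytes) ? 1 : 0;
my $multi_flag   = utf8::is_utf8($multi) ? 1 : 0;
my $same_values  = $bytes eq $multi ? 1 : 0;
my $bytes_stored = bytes::length($bytes);   # octets used internally
my $multi_stored = bytes::length($multi);

print "bytes: flag=$bytes_flag stored=$bytes_stored\n";
print "multi: flag=$multi_flag stored=$multi_stored\n";
print "same values: $same_values\n";
```

The two buffers compare equal as sequences of values even though they
are stored differently, which is the "at some point in the buffer's
lifetime" part of the description.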

So any and all discussion of text buffers is an abstraction.  It is not
a useless one: if people need multi-bytes buffers only for Unicode text
data, not for binary data, then the text terminology can fit the mold of
their application nicely.  And all the guidelines you've given result in
a consistent way of thinking about that sort of application, one that
avoids the warts of the current implementation.  This is a good thing,
but it is only a subset of reality.

On the other hand, it seems that calling things text buffers is not
fully enlightening.  If the person wants to use multi-bytes buffers for
binary data (if the mix of binary values is mostly < 256, but sometimes
bigger, doing so can be space efficient), then the text terminology gets
in the way, and can be quite confusing.

And my list of operations for which Perl ascribes character semantics
to its strings is extremely short... so most operations are, in fact,
based on binary data... the discussion of text could (and I think
should, for clarity) be limited to those operations, and the point made
that decoded strings use multi-bytes buffers for storage when needed.

So with that in mind, I respond to the rest of your email, and also the
next one.


On approximately 4/3/2007 5:45 AM, came the following characters from
the keyboard of Juerd Waalboer:
> Glenn Linderman skribis 2007-04-01 16:05 (-0700):
>   
>> and a string of them a "binary sequence of values in the range
>> [0:2^32-1] using a variable-length, multi-byte encoding".
>>     
>
> Sure, but do keep in mind that conceptually, a Unicode text string does
> not have any encoding.
>   

Exactly.  A Unicode text string is an abstract concept.

> It has an encoding internally, of course, because it has to be stored in
> memory in a certain format. Many Windows based tools use UTF-16 or UCS-2
> internally. Many Unix tools use UTF-8 internally. 

Perl finds it convenient to map that abstract concept to its multi-bytes
buffers as a way of implementing storage for Unicode text strings.  I
think we are in violent agreement here.

> Perl is a bit strange,
> because it uses two internal formats for text strings: latin1 and utf8.
>   

I haven't found _any_ operations in Perl (the language) that ascribe
latin1 semantics to any data.  Please point out one or more, if I am
wrong in this.  Encode.pm is, of course, a module dedicated to a large
variety of character set semantics, but that is not part of Perl the
language, although it is an extremely useful part of Perl the
programming environment.  Encode seems to be pretty well described and
understood.

The conversion of bytes to multi-bytes can be interpreted as converting
latin1 to utf8 only because the latin1 byte codes are numerically equal
to unicode codepoints in the range 0-255.  This is a convenient
coincidence, but I posit that Perl only really supports ASCII and
Unicode semantics in any of its current operations.
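
The numeric coincidence can be checked with a short sketch, assuming
the Encode module is available (it ships with perl 5.8 and later).  The
variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();

# All 256 possible byte values, as a bytes buffer.
my $bytes = pack 'C*', 0 .. 255;

# Decode a copy as ISO-8859-1 (latin1) text.
my $copy = $bytes;
my $text = Encode::decode('ISO-8859-1', $copy);

# Count positions where the decoded codepoint differs from the byte value.
my $mismatches = grep { ord(substr($text, $_, 1)) != $_ } 0 .. 255;
my $equal      = $text eq $bytes ? 1 : 0;

print "mismatches=$mismatches equal=$equal\n";
```

Decoding latin1 changes no values at all, so the result is
indistinguishable from simply treating each byte value as a number in
the range 0-255.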

> But "multibyte" and "encoding" aren't relevant, at all, for text
> strings. A text string is a sequence of codepoints ("characters"). Bytes
> are irrelevant until you encode to a byte string.
>
> Perl has a single string type that is used for both byte strings and
> text string, even though theoretically these are mutually incompatible.
> It does this by sticking to an 8 bit encoding as long as possible, in
> other words: until you use it with something that doesn't have this 8
> bit encoding. This is an internal thing that you don't have to know
> about if you separate text and binary values and semantics.
>   

This is an abstraction that can be helpful for some classes of
applications.  If your application fits this mold, and you choose to
think of things this way, that is fine.  And it is fine to teach the
abstraction for others with similar applications.  And likely there is
not any application that could not be implemented using this
abstraction.  But some applications may benefit in time or space or
complexity by using other abstractions that conflict with this
abstraction.  That is OK.

> The two conceptual string types are mutually incompatible, but because
> Perl uses a single type of string for both, it allows combining them. If
> you don't (want to) think about the text/byte separation, you suddenly
> need to learn that some operations work on the internal byte encoding,
> while others work on the conceptual sequence of codepoints
> (called "characters" in Perl jargon).
>   

The two string types are not incompatible... they can be converted, one
to the other, in a well-defined manner, as long as the binary value of
each item in the string is < 256.
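
A minimal sketch of that conversion in both directions, using the core
utf8::upgrade and utf8::downgrade functions (the sample strings are
made up for illustration):

```perl
use strict;
use warnings;

# Start with a bytes buffer and convert it to multi-bytes format.
my $s = "abc\xff";
utf8::upgrade($s);
my $upgraded = utf8::is_utf8($s) ? 1 : 0;

# All values are < 256, so converting back succeeds.
my $ok         = utf8::downgrade($s, 1) ? 1 : 0;   # 1 = fail quietly
my $downgraded = utf8::is_utf8($s) ? 0 : 1;
my $unchanged  = $s eq "abc\xff" ? 1 : 0;

# A value >= 256 cannot be stored in a bytes buffer.
my $wide    = "abc\x{100}";
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "upgraded=$upgraded ok=$ok downgraded=$downgraded ",
      "unchanged=$unchanged wide_ok=$wide_ok\n";
```

Without the second argument, utf8::downgrade dies when a value does not
fit, so the quiet form is used here to show the failure as a return
value.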

> To make things worse, something that used to work on the internal byte
> encoding, works on the conceptual sequence of codepoints now. So you
> also have to remember Perl version numbers.
>   

Could you be specific here?  What used to work on internal byte
encoding, that now works on a sequence of codepoints?

Without specifics, it seems that the newer versions of Perl have simply
extended all operations that used to work on bytes to now work on either
bytes or multi-bytes, except pack/unpack, with template U (maint) or U
and C (blead).

If we fix U, or deprecate it and add the substitute M, and fix C in
blead, then I think this will remain true.

> The need for knowledge of the internals can be fully avoided by keeping
> bytes and text separate. Instead of compiling lists of byte operations
> and text operations, I strongly advise using logic instead: things that
> work with fixed octet boundaries, are byte operations, things that make
> sense with values above 256, can be used for both bytes and text, but
> require separation on the programmer's part.
>   

This is a useful abstraction, but doesn't allow binary usage of
multi-bytes buffers, and sidesteps the warts instead of fixing them.

> Some things remain undefined or hard to logically detect, like
> filenames. The operating system may consider them sequences of bytes, or
> sequences of codepoints, but the user will want to use accented letters
> and probably doesn't care about the internals. With things like this,
> Perl is of little help, and you should either find out what your
> platform does, or err on the safe side. (Heck, there are filesystems
> that don't even support using ":" in a filename.)
>   

Interfacing to the OS requires adhering to its rules, including legal
filenames, etc.  This is true for any OS, and any language, and any form
of character encoding.

>> Because that's what seems to be actually implemented...
>>     
>
> The best advice I can give is to fully ignore the actual implementation
> of text strings. While knowledge of the internals can be used for some
> huge optimizations, it's often outright dangerous to do anything with
> that information if you don't know all the consequences yet.
>   

Again, that can be a useful abstraction for many applications.  Fully
understanding the actual implementation of the two forms of binary
strings, bytes and multi-bytes, can, however, be even more enlightening,
and instead of having a fear of the unknown, you can operate with full
understanding of the known.

> If you keep byte strings far away from text strings, you don't need to
> know the internal implementation. 

I agree, and your abstractions are useful for that approach.

> Decoding and encoding are the only
> correct means of dealing with text in binary data.
>   

I agree that encoding and decoding are a great way of dealing with mixed
text and binary data, particularly when dealing with interfaces that
only understand bytes buffers, not multi-bytes buffers.  I disagree that
it is the only correct means.

>> it isn't Unicode, it isn't UTF-8, perl carefully (and confusingly)
>> calls it utf8, 
>>     
>
> Perl's strings are character strings, consisting of codepoints.
>   

Perl's strings are binary sequences of 0-255, or 0-2^32-1, as needed.
Codepoints are a Unicode thing, and the semantics of code points are
only useful with a small subset of Perl operations.


> Internally, bytes are used in some encoding, but ignore that whenever
> you can (which is almost always).
>   

So a different abstraction would be that character sets are useful for
input and output, and encoding/decoding is necessary for I/O to
particular devices, but the reality is just binary numbers, and
sequences of numbers, and you should ignore character semantics whenever
you can (which is almost always, there being only a few operations that
understand such).

> The only difference with Unicode is that Perl allows using codepoints
> that aren't defined yet, as long as they are with in the 32 bit positive
> number range. For all intents and purposes, it's practical to call text
> strings "unicode strings".
>
> Only INTERNALLY, they may be UTF8 strings.
>   

The only difference with Unicode is that many of its characters require
the use of multi-bytes strings rather than bytes strings in order to
contain the characters.  So the use of multi-bytes strings for Unicode
characters is imperative, unless they are first encoded into an
equivalent bytes string.

>> and others have referred to it as UTF-X
>>     
>
> That's what the perlebcdic manpage does. The difference between
> UTF-EBCDIC and UTF-8 is only relevant on ebcdic platforms. I have always
> ignored ebcdic specifics and will continue to do so.
>
> Many identifiers, both internally and in the introspection API, have
> "utf8" in the name, but referring to utf-x. utf-x is very uncommon, so I
> will call it utf8, just like perl itself does.
>   

OK, thanks for clearing that up too.


>>> You could say: "blob my $foo". Sounds dwimmy enough.
>>>       
>> Isn't there a syntax like "my blob $foo" ?  
>>     
>
> I specifically chose a syntax like binmode's. It might be incredibly
> useful to do this on variables imported from modules.
>   

I didn't pick up on the correlation, thanks for explaining.  I was
unaware "binmode my $foo" was useful syntax.  Perhaps an alternative
is:  my $foo; blob $foo;  and you were just combining the declaration
and the first operation on the new variable.  I don't think that is
possible with binmode, don't you have to open a file and store its
handle in the variable before you can binmode it usefully?


>> Could an object be created that would embed a "bytes-only string", and 
>> protect it?  Or is magic really needed?
>>     
>
> I'd prefer it to use normal scalar strings with magic, because using an
> object very probably has side-effects elsewhere.
>   

I don't fully understand objects and their side effects, I was just
wondering if perhaps they might permit an implementation independent of
the Perl core change that would be necessary in order to use magic.  If
so, then the benefits could be obtained more quickly, via a CPAN module,
or such.

>> OK, so there's a significant difference between stable and blead.  And 
>> it sounds like it is incompatible, and will break some amount of code.
>>     
>
> Only code that breaks the text/byte separation. Code that separates it
> properly, didn't break in older Perls, and won't use the new "fix" in
> newer Perls.
>
> In that respect, this silly change in unpack might help people to
> properly separate string types more than before, because otherwise their
> code isn't compatible with multiple Perl versions :). Still, though, I
> prefer the old (stable) semantics.
>
>   
>> And note that as far as I can tell, U doesn't implement Unicode 
>> semantics in any way... it just uses a variable-length multi-byte binary 
>> encoding scheme that is also used in the Unicode standard for UTF-8 
>> encoding.
>>     
>
> It uses that INTERNALLY. pack "U" and unpack "U" are different from all
> the other (un)pack templates, because they create/split text strings
> ("unicode strings") instead of byte strings.
>
> Note that pack "U" does not create a UTF-8 string. It creates a unicode
> string, a text string.
>
> "UTF-8 string" is short for "UTF-8 encoded string", and any encoded
> string is a byte string.
>
> The U template stands for Unicode, not UTF-8. These unicode strings use
> utf8 internally, as the documentation clearly says. As people constantly
> confuse utf8 with unicode, and think that internals are very important,
> I think "to utf8 internally" should be substituted for "to unicode".
>
> In fact, if I were to rewrite the documentation for pack, I'd mention U*
> as a special case, possibly even with its own =item to stress it :)
>   


As far as I can tell, even from what you have said here, U doesn't
implement Unicode semantics in any way... it just (maint and blead)
causes the result of the pack operation to be multi-bytes instead of
bytes (if first), and otherwise encodes a large number into a sequence
of bytes according to the same rules as Unicode uses for storing large
numbers in bytes for its UTF-8 encoding.

So U is only "misbehaved" if it appears as the first template
character... otherwise it already works like I describe for M ...
packing a large number into a sequence of bytes according to the same
rules as Unicode uses for storing large numbers in bytes for its UTF-8
encoding.

So if one is careful to not use U as the first template character (and
there is a documented workaround of using C0 first), pack/unpack works
consistently, producing (always) bytes buffer results from pack, and
interpreting the parameter as a bytes buffer for unpack.  This is
somewhat better than I first thought...
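
A small sketch of the difference between a leading U and the C0
workaround.  It follows the behaviour documented for pack in perlfunc
on an ASCII platform; the value 0x263A is an arbitrary example, and the
perl you run it on is assumed to follow that documentation.

```perl
use strict;
use warnings;

# U first: the result is a single element holding the large value.
my $u_first = pack('U', 0x263A);

# C0 first: the result is the sequence of bytes that UTF-8 uses
# to store that value.
my $c0_first = pack('C0U', 0x263A);

my @octets = map { sprintf('%02X', ord($_)) } split(//, $c0_first);

printf "U:   length %d, value %X\n", length($u_first), ord($u_first);
printf "C0U: length %d, values %s\n", length($c0_first), "@octets";
```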


>>>> OK.  What do you recommend when needing to store a UTF-8 string into a 
>>>> struct?  
>>>>         
>>> My Perl language doesn't have structs. 
>>>       
>> Well, that's a cop-out.  My Perl language has structs!
>>     
>
> Ah, such "structs" are just binary strings. To encode a text sting to
> UTF8, you can use any of the following:
>
>     1. $bytes = Encode::encode_utf8($text)
>     2. $bytes = Encode::encode('utf8', $text)
>     3. $bytes = Encode::encode('utf-8', $text)   # strictly unicode range
>     4. utf8::encode($string)
>
> Then, $bytes (or $string), can be used with the string templates of
> pack.
>
> The reverse operations work too -- unpack and decode.
>   

Yes, that is a good way to handle it, for many cases.
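
As a sketch of that round trip: the record layout used here (a 16-bit
big-endian length followed by the encoded bytes) is a hypothetical
"struct" chosen only for illustration.

```perl
use strict;
use warnings;
use Encode ();

# A text string containing values both below and above 256.
my $text = "caf\x{e9} \x{263A}";

# Encode to UTF-8 bytes, then pack with a length prefix.
my $bytes  = Encode::encode_utf8($text);
my $record = pack('n/a*', $bytes);

# The reverse: unpack the bytes, then decode them back to text.
my ($raw)     = unpack('n/a*', $record);
my $roundtrip = Encode::decode_utf8($raw);

printf "text length %d, encoded length %d, record length %d\n",
    length($text), length($bytes), length($record);
print "round trip ", ($roundtrip eq $text ? "ok" : "FAILED"), "\n";
```

The length prefix counts encoded bytes, not characters, which is why
the encode step has to come before the pack.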


>> Useqq avoids performance too, by being pure Perl... 
>>     
>
> I use Data::Dumper as a simple debugging tool, and performance is not
> relevant there. If you want to serialize data, and performance is an
> issue, you'll be better off with something else anyway, even without
> Useqq :)
>
> In fact, I didn't even know there was a non-pure Perl version of D::D!
>   

Yeah, that's mostly how I use it too, but sometimes I'll "cache" a
complicated binary format as a "Data::Dumper"d output, for faster access
later.


>> \L, \l, \U, \u, \Q operations in bytes string constants (no character 
>> code values > 255)
>> \L, \l, \U, \u, \Q operations in multi-bytes string constants (at least 
>> one character code value > 255)
>>     
>
> At least one, in the entire history of the string. Once upgraded
> internally, it remains upgraded. Apparently, these things are broken in
> exactly the same way that the regex engine is.
>   

Well, OK, I mentioned them in the case of string _constants_ for which
my statements are correct.  But then there is string _interpolation_
which is a non-constant operation that uses these same operators, and so
that is, indeed, a case where the past history of the interpolated
variables could have an effect on the format of the result buffer.
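
A minimal sketch of that history dependence.  It assumes an ASCII
platform, no locale pragma, and that the unicode_strings feature of
later perls (5.12 and up) is not enabled, since that feature changes
exactly this behaviour.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use locale".

my $bytes = "\xe9";        # value 0xE9 in a bytes buffer
my $multi = "\xe9";
utf8::upgrade($multi);     # same value in a multi-bytes buffer

my $same_before = $bytes eq $multi ? 1 : 0;

# Interpolate both with \U; the result depends on the buffer format.
my $upper_bytes = "\U$bytes";
my $upper_multi = "\U$multi";

printf "before: same=%d\n", $same_before;
printf "after:  bytes=%02X multi=%02X\n",
    ord($upper_bytes), ord($upper_multi);
```

The two variables compare equal before interpolation, yet "\U" leaves
one untouched and maps the other to 0xC9.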

> DAMN. That sucks, because while the regex engine can be fixed by adding
> flags, any fix for these buggers would be incompatible.
>   

Hmm.  Why?  Many fixes would be incompatible, but not a fix such as:

{ use Unicode_semantics; $uppercase_string = "\U$string"; }

Of course, this is _ugly_ but not incompatible.  Maybe someone can make
something prettier...


>> Note: it appears to me that Perl (except for encode.pm) _never_ applies 
>> Latin-1 semantics to anything, at present.
>>     
>
> Some things do the following weird thing:
>
> - If the string is in UTF8 internally, use unicode semantics
> - If not, use ASCII semantics (even though the rest of Perl considers
> non-UTF8 to be latin1, not ascii!)
>
> Operators that do this, are BROKEN. Perl doesn't have an ascii/utf8
> distinction, it has a latin1/utf8 distinction. (Note, this is all
> internals, but when the internals are inconsistent, the user sometimes
> needs to know about the bugs caused by that.)
>   

Could you be explicit about what the "some things" are?  I think regexps
work like you describe.  I guess \L \l \U \u \Q work like you describe.
But these were already on the list (sort of).  Is there anything else?
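
For the regexp case, a short sketch under the same kind of assumptions
(ASCII platform, no locale, no unicode_strings feature, and no /a, /u
or /l pattern modifiers from later perls):

```perl
use strict;
use warnings;

my $bytes = "\xe9";        # value 0xE9 in a bytes buffer
my $multi = "\xe9";
utf8::upgrade($multi);     # same value in a multi-bytes buffer

# The same pattern applied to equal values stored in different formats.
my $bytes_is_word = $bytes =~ /\w/ ? 1 : 0;
my $multi_is_word = $multi =~ /\w/ ? 1 : 0;

print "bytes buffer matches \\w: $bytes_is_word\n";
print "multi buffer matches \\w: $multi_is_word\n";
```

Only the multi-bytes copy is treated as a word character, even though
both strings hold the same single value.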


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
