Front page | perl.perl5.porters |
Postings from April 2007
Re: perl, the data, and the tf8 flag
From: Juerd Waalboer
Date: April 3, 2007 05:46
Subject: Re: perl, the data, and the tf8 flag
Message ID: 20070403124554.GP31277@c4.convolution.nl
Glenn Linderman wrote on 2007-04-01 16:05 (-0700):
> and a string of them a "binary sequence of values in the range
> [0:2^32-1] using a variable-length, multi-byte encoding".
Sure, but do keep in mind that conceptually, a Unicode text string does
not have any encoding.
It has an encoding internally, of course, because it has to be stored in
memory in a certain format. Many Windows based tools use UTF-16 or UCS-2
internally. Many Unix tools use UTF-8 internally. Perl is a bit strange,
because it uses two internal formats for text strings: latin1 and utf8.
But "multibyte" and "encoding" aren't relevant, at all, for text
strings. A text string is a sequence of codepoints ("characters"). Bytes
are irrelevant until you encode to a byte string.
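
A minimal sketch of that distinction, assuming the core Encode module is available. The sample string is made up for illustration: the number of codepoints is a property of the text, while the number of bytes only exists once an encoding has been chosen.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A text string of three codepoints: "e", U+20AC (EURO SIGN), "e".
my $text = "e\x{20AC}e";

# Encoding produces byte strings; their length depends on the encoding.
my $utf8_bytes  = encode('UTF-8',    $text);
my $utf16_bytes = encode('UTF-16LE', $text);

printf "codepoints:     %d\n", length $text;          # 3
printf "UTF-8 bytes:    %d\n", length $utf8_bytes;    # 5
printf "UTF-16LE bytes: %d\n", length $utf16_bytes;   # 6
```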
Perl has a single string type that is used for both byte strings and
text strings, even though theoretically these are mutually incompatible.
It does this by sticking to an 8 bit encoding as long as possible, in
other words: until you use it with something that doesn't have this 8
bit encoding. This is an internal thing that you don't have to know
about if you separate text and binary values and semantics.
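
As an illustration only (normal code should not need to look at this), the following sketch uses utf8::upgrade and utf8::is_utf8 to peek at the internal format. It suggests that the string value stays the same whichever internal format is in use, and that Perl switches formats by itself once a codepoint above 255 gets involved.

```perl
use strict;
use warnings;

my $latin = "caf\x{E9}";   # all codepoints below 256: 8 bit internal format
my $copy  = $latin;

# Force the copy into the utf8 internal format; its value does not change.
utf8::upgrade($copy);

my $same_value  = ($latin eq $copy)                 ? 1 : 0;
my $same_length = (length($latin) == length($copy)) ? 1 : 0;
my $flag_latin  = utf8::is_utf8($latin) ? 1 : 0;
my $flag_copy   = utf8::is_utf8($copy)  ? 1 : 0;

# Appending a codepoint above 255 makes Perl upgrade the result on its own.
my $joined      = $latin . "\x{20AC}";
my $flag_joined = utf8::is_utf8($joined) ? 1 : 0;

print "same value: $same_value, same length: $same_length\n";
print "internal utf8 flag: latin=$flag_latin copy=$flag_copy joined=$flag_joined\n";
```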
The two conceptual string types are mutually incompatible, but because
Perl uses a single type of string for both, it allows combining them. If
you don't (want to) think about the text/byte separation, you suddenly
need to learn that some operations work on the internal byte encoding,
while others work on the conceptual sequence of codepoints
(called "characters" in Perl jargon).
To make things worse, something that used to work on the internal byte
encoding, works on the conceptual sequence of codepoints now. So you
also have to remember Perl version numbers.
The need for knowledge of the internals can be fully avoided by keeping
bytes and text separate. Instead of compiling lists of byte operations
and text operations, I strongly advise using logic instead: things that
work with fixed octet boundaries are byte operations; things that make
sense with values above 255 can be used for both bytes and text, but
require separation on the programmer's part.
Some things remain undefined or hard to logically detect, like
filenames. The operating system may consider them sequences of bytes, or
sequences of codepoints, but the user will want to use accented letters
and probably doesn't care about the internals. With things like this,
Perl is of little help, and you should either find out what your
platform does, or err on the safe side. (Heck, there are filesystems
that don't even support using ":" in a filename.)
> Because that's what seems to be actually implemented...
The best advice I can give is to fully ignore the actual implementation
of text strings. While knowledge of the internals can be used for some
huge optimizations, it's often outright dangerous to do anything with
that information if you don't know all the consequences yet.
If you keep byte strings far away from text strings, you don't need to
know the internal implementation. Decoding and encoding are the only
correct means of dealing with text in binary data.
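
A small sketch of the decode, process, encode pattern, assuming the Encode module and a made-up input. Reversing is used as the example text operation because it gives a different (and broken) result when applied to the undecoded bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: the UTF-8 bytes for "n" followed by e-acute,
# as they might arrive from a file or socket.
my $bytes_in = "n\xC3\xA9";

my $text      = decode('UTF-8', $bytes_in);   # bytes -> text, 2 codepoints
my $reversed  = scalar reverse $text;         # text operation on codepoints
my $bytes_out = encode('UTF-8', $reversed);   # text -> bytes for output

# The same operation on the undecoded bytes splits the multi-byte sequence.
my $wrong = scalar reverse $bytes_in;

printf "text length: %d, byte length: %d\n", length $text, length $bytes_in;
printf "correct output: %s\n", join ' ', map { sprintf '%02X', ord } split //, $bytes_out;
printf "byte-reversed:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $wrong;
```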
> it isn't Unicode, it isn't UTF-8, perl carefully (and confusingly)
> calls it utf8,
Perl's strings are character strings, consisting of codepoints.
Internally, bytes are used in some encoding, but ignore that whenever
you can (which is almost always).
The only difference from Unicode is that Perl allows using codepoints
that aren't defined yet, as long as they are within the 32 bit positive
number range. For all intents and purposes, it's practical to call text
strings "unicode strings".
Only INTERNALLY, they may be UTF8 strings.
> and others have referred to it as UTF-X
That's what the perlebcdic manpage does. The difference between
UTF-EBCDIC and UTF-8 is only relevant on ebcdic platforms. I have always
ignored ebcdic specifics and will continue to do so.
Many identifiers, both internally and in the introspection API, have
"utf8" in the name but actually refer to utf-x. utf-x is very uncommon,
so I will call it utf8, just like perl itself does.
> >You could say: "blob my $foo". Sounds dwimmy enough.
> Isn't there a syntax like "my blob $foo" ?
I specifically chose a syntax like binmode's. It might be incredibly
useful to do this on variables imported from modules.
> Could an object be created that would embed a "bytes-only string", and
> protect it? Or is magic really needed?
I'd prefer it to use normal scalar strings with magic, because using an
object very probably has side-effects elsewhere.
> OK, so there's a significant difference between stable and blead. And
> it sounds like it is incompatible, and will break some amount of code.
Only code that breaks the text/byte separation. Code that separates it
properly didn't break in older Perls, and won't use the new "fix" in
newer Perls.
In that respect, this silly change in unpack might help people to
properly separate string types more than before, because otherwise their
code isn't compatible with multiple Perl versions :). Still, though, I
prefer the old (stable) semantics.
> And note that as far as I can tell, U doesn't implement Unicode
> semantics in any way... it just uses a variable-length multi-byte binary
> encoding scheme that is also used in the Unicode standard for UTF-8
> encoding.
It uses that INTERNALLY. pack "U" and unpack "U" are different from all
the other (un)pack templates, because they create/split text strings
("unicode strings") instead of byte strings.
Note that pack "U" does not create a UTF-8 string. It creates a unicode
string, a text string.
"UTF-8 string" is short for "UTF-8 encoded string", and any encoded
string is a byte string.
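
To make the difference concrete, here is a sketch assuming the Encode module, with codepoints picked arbitrarily. pack "U*" yields a text string of two characters; the UTF-8 byte string only comes into existence after an explicit encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two codepoints: U+263A (WHITE SMILING FACE) and U+00E9 (e-acute).
my $text = pack "U*", 0x263A, 0xE9;

# The result is a text string: length counts characters, ord gives codepoints.
my @codepoints = map { ord } split //, $text;

# A "UTF-8 string" is what you get after encoding: a byte string.
my $bytes = encode('UTF-8', $text);

printf "characters: %d (%s)\n", length $text,
    join ' ', map { sprintf 'U+%04X', $_ } @codepoints;
printf "UTF-8 bytes: %d\n", length $bytes;   # 3 + 2 = 5
```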
The U template stands for Unicode, not UTF-8. These unicode strings use
utf8 internally, as the documentation clearly says. As people constantly
confuse utf8 with unicode, and think that internals are very important,
I think "to utf8 internally" should be replaced by "to unicode".
In fact, if I were to rewrite the documentation for pack, I'd mention U*
as a special case, possibly even with its own =item to stress it :)
> >>OK. What do you recommend when needing to store a UTF-8 string into a
> >>struct?
> >My Perl language doesn't have structs.
> Well, that's a cop-out. My Perl language has structs!
Ah, such "structs" are just binary strings. To encode a text string to
UTF8, you can use any of the following:
1. $bytes = Encode::encode_utf8($text)
2. $bytes = Encode::encode('utf8', $text)
3. $bytes = Encode::encode('utf-8', $text) # strict: Unicode range only
4. utf8::encode($string) # modifies $string in place
Then, $bytes (or $string), can be used with the string templates of
pack.
The reverse operations work too -- unpack and decode.
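
For example, a round trip might look like the sketch below. The record layout (a 16 bit id followed by a length-prefixed name) and the variable names are hypothetical, chosen only to show where encode and decode sit relative to pack and unpack.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $name = "Zo\x{EB}";                      # text string, 3 codepoints

# Text -> bytes before it goes anywhere near a byte template.
my $name_bytes = encode('UTF-8', $name);    # 4 bytes

# Hypothetical layout: 16 bit big-endian id, then a 16 bit length
# followed by that many bytes.
my $record = pack "n n/a*", 42, $name_bytes;

# Reverse: unpack yields bytes, decode turns them back into text.
my ($id, $raw) = unpack "n n/a*", $record;
my $decoded    = decode('UTF-8', $raw);

printf "record is %d bytes, id %d, name has %d characters\n",
    length $record, $id, length $decoded;
```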
> Useqq avoids performance too, by being pure Perl...
I use Data::Dumper as a simple debugging tool, and performance is not
relevant there. If you want to serialize data, and performance is an
issue, you'll be better off with something else anyway, even without
Useqq :)
In fact, I didn't even know there was a non-pure Perl version of D::D!
> \L, \l, \U, \u, \Q operations in bytes string constants (no character
> code values > 255)
> \L, \l, \U, \u, \Q operations in multi-bytes string constants (at least
> one character code value > 255)
At least one, in the entire history of the string. Once upgraded
internally, it remains upgraded. Apparently, these things are broken in
exactly the same way that the regex engine is.
DAMN. That sucks, because while the regex engine can be fixed by adding
flags, any fix for these buggers would be incompatible.
> Note: it appears to me that Perl (except for encode.pm) _never_ applies
> Latin-1 semantics to anything, at present.
Some things do the following weird thing:
- If the string is in UTF8 internally, use unicode semantics
- If not, use ASCII semantics (even though the rest of Perl considers
non-UTF8 to be latin1, not ascii!)
Operators that do this are BROKEN. Perl doesn't have an ascii/utf8
distinction, it has a latin1/utf8 distinction. (Note, this is all
internals, but when the internals are inconsistent, the user sometimes
needs to know about the bugs caused by that.)
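
A sketch of how this inconsistency can show up, assuming a Perl with no locale in effect and without later fixes such as the unicode_strings feature enabled. Two strings that compare equal behave differently under lc and \w, depending only on their internal format.

```perl
use strict;
use warnings;

my $native   = "\xC9";       # capital E-acute, 8 bit internal format
my $upgraded = $native;
utf8::upgrade($upgraded);    # same value, utf8 internal format

my $equal = ($native eq $upgraded) ? 1 : 0;

# With the legacy behaviour described above, the native string gets
# ASCII semantics and the upgraded one gets unicode semantics.
my $lc_native     = lc $native;
my $lc_upgraded   = lc $upgraded;
my $word_native   = ($native   =~ /\w/) ? 1 : 0;
my $word_upgraded = ($upgraded =~ /\w/) ? 1 : 0;

printf "equal: %d\n", $equal;
printf "lc: native=%02X upgraded=%02X\n", ord $lc_native, ord $lc_upgraded;
printf "\\w: native=%d upgraded=%d\n", $word_native, $word_upgraded;
```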
--
Kind regards,
juerd waalboer: perl hacker <juerd@juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <sales@convolution.nl>
I do not trust voting computers.
See <http://www.wijvertrouwenstemcomputersniet.nl/>.