Re: perl, the data, and the tf8 flag
From: Juerd Waalboer
April 1, 2007 16:28
Message ID: 20070401232756.GQ31277@c4.convolution.nl
demerphq wrote 2007-04-02 0:26 (+0200):
> This makes a certain amount of sense if you assume that
> strings can (apparently) randomly change from octet encoding to utf8
No, it does not happen randomly. It only happens when confronted with
(1). characters above 255. These are NEVER encountered in binary data, so
this is not a problem.
(2). strings that are internally prepared to handle characters above 255.
They became this way because of (1) or an explicit text-only operation.
This is only a problem in broken code.
> my $s=pack 'N',12345678;
> $s.=chr(256); # upgrade $s to utf8 by catting on a unicode codepoint
That does not fall under "randomly" but under "characters above 255".
Once you add such a character to your string, any binary operation,
such as unpack "N", makes no sense at all.
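
To make the upgrade concrete, here is a minimal sketch built on the quoted example. It only inspects the string with utf8::is_utf8 and utf8::downgrade; it deliberately does not call unpack on the upgraded string, since what that returns is exactly what this thread is arguing about.

```perl
use strict;
use warnings;

my $s = pack 'N', 12345678;          # four octets: 00 BC 61 4E
printf "before: length=%d utf8=%s\n",
    length($s), utf8::is_utf8($s) ? 'yes' : 'no';

$s .= chr(256);                      # a character above 255 forces the upgrade
printf "after:  length=%d utf8=%s\n",
    length($s), utf8::is_utf8($s) ? 'yes' : 'no';

# The string can no longer be represented as plain octets: with FAIL_OK
# set, utf8::downgrade returns false instead of dying.
my $copy = $s;
my $can_downgrade = utf8::downgrade($copy, 1) ? 'yes' : 'no';
print "downgrade possible: $can_downgrade\n";
```

The string now has five characters, the last of which is not a byte, so it has stopped being binary data.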
> So 'N' works with codepoints, not with bytes. Apparently this holds
> true for most of the pack template formats. HOWEVER, it doesn't apply
> to the pattern 'C' (and if I understand his recent posts this is what
> Marc was objecting to recently) which reads bytes.
If we choose to keep this behaviour, indeed the C pattern should change
too. But I think it is suboptimal to keep this behaviour, and suggest
that the previous change be reversed.
> Which to me says that almost any use of 'C' as an unpack template in
> Perl 5.9.x and later will be totally wrong.
In fact, any use of C as an unpack template, on an internally UTF8
encoded string, is always already wrong. This is fairly irrelevant to
the rest of the discussion, though. Just wanted to point it out.
> My feeling is that Marc's suggestion about making 'C' an alias for
> 'U' and introducing a new template char for what 'C' does currently (O
> for octet maybe) is the right thing to do. (...)
If unpack for non-U template letters uses codepoints, then it would not
make sense to have U. I see the fact that we DO have U as proof that
they, who implemented this in the past, thought that using codepoints
for byte operations would be wrong.
> To repeat, my feeling is that any use of the 'C' template in Perl
> 5.9.x and later will be totally incorrect and error-prone.
While that may be bad indeed, I believe that the change that has already
been applied is more dangerous.
The change assumes that it makes sense to use unpack on strings with the
UTF8 flag set. While I deny this, let's assume for a moment that it
does. If it does make sense, there must be people doing it already,
either on purpose or accidentally (I think only the latter). Every
single program that does that will BREAK once they upgrade their perl
from current stable to current blead, because semantics changed.
I feel that changing unpack from operating on bytes to operating on
characters is theoretically unnecessary, theoretically wrong, and will
cause even more problems for people who haven't managed to keep text
data and binary data separate. By reverting the change, backwards
compatibility is guaranteed, and the big, complex paragraphs that explain
the backwards incompatibility can be dropped from perldelta.
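
For code that has to behave the same under either semantics, one option is to make sure the string really is an octet string before unpacking it. The following is only a sketch under that assumption: unpack_octets is a hypothetical helper name, not something perl provides.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper (not part of Perl): unpack binary data only after
# making sure the string is a plain octet string.
sub unpack_octets {
    my ($template, $data) = @_;   # $data is a copy; the caller's string is untouched

    # With FAIL_OK set, utf8::downgrade returns false instead of dying
    # when the string holds a character above 255.
    utf8::downgrade($data, 1)
        or croak "unpack_octets: string contains characters above 255";

    return unpack $template, $data;
}

my $bin      = pack 'N', 12345678;
my $upgraded = $bin;
utf8::upgrade($upgraded);         # same characters, UTF8 flag now set

my ($n) = unpack_octets('N', $upgraded);
print "$n\n";                     # prints 12345678
```

A string that merely carries the UTF8 flag is downgraded and unpacked as usual, while a string that actually contains a character above 255 is rejected instead of being silently misread.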
Instead of using codepoints, I suggest a different course:
1. Revert the change, to ensure backwards compatibility (admittedly, for
2. Warn when the template contains both U and byte-specific letters (and
that's any letter except U).
3. When the template contains byte-specific letters, and the string
unpack will operate on has the UTF8 flag set, emit a warning (always,
not just when there are codepoints >255) and operate on the internal
octets, ignoring that it may be the result of UTF8 encoding (see point
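
The warnings in points 2 and 3 can be approximated in user space. This is a rough sketch, not a patch to unpack itself: checked_unpack and its messages are made up for illustration, the template scan is deliberately naive (it ignores template comments and groups), and utf8::encode is used on the assumption that the internal octets are UTF-8, which does not hold on EBCDIC platforms.

```perl
use strict;
use warnings;

# Hypothetical user-space wrapper; the name and the messages are invented.
sub checked_unpack {
    my ($template, $data) = @_;

    # Keep only the template letters (drop counts, modifiers, whitespace).
    (my $letters = $template) =~ s/[^A-Za-z]//g;
    my $has_u     = $letters =~ /U/;
    my $has_bytes = $letters =~ /[^U]/;

    # Point 2: U mixed with byte-specific letters.
    warn "checked_unpack: template mixes U with byte-specific letters\n"
        if $has_u && $has_bytes;

    # Point 3: byte-specific letters on a UTF8-flagged string.
    if ($has_bytes && utf8::is_utf8($data)) {
        warn "checked_unpack: byte template on UTF8-flagged string\n";
        utf8::encode($data);      # expose the UTF-8 octets as a plain byte string
    }

    return unpack $template, $data;
}

my $bin = pack 'N', 12345678;     # octets 00 BC 61 4E
my $up  = $bin;
utf8::upgrade($up);               # internally 00 C2 BC 61 4E

my ($plain) = checked_unpack('N', $bin);   # no warning
my ($odd)   = checked_unpack('N', $up);    # warns, reads octets 00 C2 BC 61
printf "%d %d\n", $plain, $odd;            # prints 12345678 12762209
```

The second result shows why the warning matters: the same four characters yield a different number once the internal representation has changed.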
(Actually, I think the U template is a mistake. While unpack "U*" and
pack "U*" are great as list operators like ord and chr respectively,
unicode data doesn't fit in the functionality of (un)pack at all,
because pack/unpack has always been specifically for bit and byte
packing. It is way too late to remove U now, but perhaps "U*" can be
special-cased, and every other use of U deprecated. Just thinking out
loud, now, by the way.)
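
A small sketch of what is meant by "U*" acting as a list version of ord and chr; the sample string is arbitrary.

```perl
use strict;
use warnings;

my $text = "\x{263A}abc";                            # one character above 255, three ASCII

my @from_unpack = unpack 'U*', $text;                # code points via unpack
my @from_ord    = map { ord($_) } split //, $text;   # the same list via ord

my $via_pack = pack 'U*', @from_unpack;              # back to a string via pack
my $via_chr  = join '', map { chr($_) } @from_ord;   # the same string via chr

my $ok = ($via_pack eq $text && $via_chr eq $text) ? 'round trip ok' : 'mismatch';
print "@from_unpack\n";                              # prints 9786 97 98 99
print "$ok\n";
```

Nothing here packs bits or bytes; it is purely a conversion between characters and code point numbers.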
(the rest is just nit picking; feel free to ignore.)
> If you were Icelandic you'd probably want that funky o with a strike
> through it.
Icelandic uses ö (ouml) instead of ø (oslash).
The funky latin1 word characters for Icelandic are þ (thorn), æ (aelig)
and ð (eth). And it also has non-funky accented characters.
> If you were French you'd want all the nice accented vowels and the c
> circumflex and stuff.
C cedilla :)
juerd waalboer: perl hacker <email@example.com> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <firstname.lastname@example.org>
I do not trust voting computers.