From: demerphq
Date: April 1, 2007 15:26
Subject: Re: perl, the data, and the utf8 flag
Message ID: 9b18b3110704011526h201d471ch7d5febe1a5b15309@mail.gmail.com
On 3/31/07, Glenn Linderman <perl@nevcal.com> wrote:
> A) pack -- it is not clear to me how this operation could produce
> anything except bytes for the packed buffer parameter, regardless of
> other parameters supplied.
This sounds reasonable to me.
> B) unpack -- it is not clear to me how this operation could successfully
> process a multi-bytes buffer parameter, except by first downgrading it,
> if it contains no values > 255, since all the operations on it are
> defined in terms of unpacking bytes.
This is sort of what happens with unpack *in blead*. Except that instead
of downgrading, it treats codepoints as being bytes by doing mod 256 on
their values. This makes a certain amount of sense if you assume that
strings can (apparently) randomly change from octet encoding to utf8
encoding. For instance:
my $s=pack 'N',12345678;
$s.=chr(256); # upgrade $s to utf8 by catting on a unicode codepoint
chop $s; # lose the catted codepoint, encoding remains utf8
print unpack 'N',$s; # prints 12345678
So 'N' works with codepoints, not with bytes. Apparently this holds
true for most of the pack template formats. HOWEVER, it doesn't apply
to the pattern 'C' (and if I understand his recent posts this is what
Marc was objecting to recently), which reads bytes.
So to expand on the previous example:
my $s=pack 'N',123456789;
print "octect encoded \$s=", join(", ",unpack "C*",$s),"\n";
print "octect unpack 'N' = ",unpack('N',$s),"\n";
print "octect unpack 'CN' = ",join(", ",unpack 'CCCN',$s."aaaa"),"\n";
$s.=chr(256); # upgrade $s to utf8 by catting on a unicode codepoint
chop $s; # lose the catted codepoint, encoding remains utf8
print "utf8 encoded \$s=", join(", ",unpack "C*",$s),"\n";
print "utf8 unpack 'N' = ",unpack('N',$s),"\n";
print "octect unpack 'CN' = ",join(", ",unpack 'CCCN',$s."aaaa"),"\n";
which outputs:
octect encoded $s=7, 91, 205, 21
octect unpack 'N' = 123456789
octect unpack 'CN' = 7, 91, 205, 358703457
utf8 encoded $s=7, 91, 195, 141, 21
utf8 unpack 'N' = 123456789
Malformed UTF-8 character (unexpected continuation byte 0x8d, with
no preceding start byte) in unpack at irk.pl line 11.
octect unpack 'CN' = 7, 91, 195, 1401185
Which to me says that almost any use of 'C' as an unpack template in
Perl 5.9.x and later will be totally wrong. My feeling is that Marc's
suggestion about making 'C' an alias for 'U' and introducing a new
template char for what 'C' does currently ('O' for octet, maybe) is the
right thing to do, with a warning whenever the value read by 'U' is
larger than 255 (and so would have to be modded down to a byte).
To repeat, my feeling is that any use of the 'C' template in Perl
5.9.x and later will be totally incorrect and error-prone. (And to
emphasize the point I'm cc'ing Rafael on this mail.)
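To show the difference between the two templates another way, here is a
minimal sketch (my own, assuming the blead semantics demonstrated above,
where 'C' reads the internal bytes and 'U' reads codepoints):
my $s = "caf\x{e9}";             # four characters; U+00E9 fits in one byte
utf8::upgrade($s);               # same characters, utf8 internal encoding
# 'U' reads codepoints, 'C' reads the bytes of the internal
# representation, so they disagree once the string is upgraded.
print join(", ", unpack("U*", $s)), "\n";  # 99, 97, 102, 233
print join(", ", unpack("C*", $s)), "\n";  # 99, 97, 102, 195, 169 (in blead)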
> G) regular expressions -- lots of reference is made to regular
> expressions being broken, or at least different, for multi-byte stuff.
> I fail to see why regular expressions are so hard to deal with. Of
> course, I haven't implemented a regular expression engine, and so some
> of my naive ideas may result in horrible performance, but it seems that
> multi-byte regular expression stuff already has horrible performance, so
> maybe my ideas aren't any worse, just different. Or maybe they are worse.
>
> Firstly, regular expressions deal in "characters", not bytes, or
> multi-byte sequences.
I don't know where this meme comes from. It's just not true. Regular
expressions don't give a toss about characters at all. The only time
"character" interpretations come in is when you use named classes like
\w or \d, or when you do case-insensitive matches.
The former is a problem because the question "what constitutes a word
character" is a semantic feature of a language using a particular
encoding, or, in the case of generic encodings (like Unicode), of the
properties of that encoding.
So for instance, in the "normal" case (specifically US/English use of
latin_1) \w is analogous to [a-zA-Z0-9_]. However, if you were German
you would probably want GERMAN-SHARP-ESS, U-WITH-UMLAUT, etc. to be
included in \w. If you were Icelandic you'd probably want that funky o
with a strike through it. If you were French you'd want all the nice
accented vowels and the c with a cedilla and stuff.
The only way you get these things in "octet" encoded text is by using
"use locale" and having your locale appropriately configured. Of
course, this will make your regexes very slow, as the way we deal with
this stuff is less than brilliant.
Alternatively you could use Unicode, but Unicode as a general-purpose
encoding doesn't do logic like "what does a person from culture X think
is a word char"; it does logic like "\w will be any character that any
person from any culture or language or script might call a word
character". So in Unicode the number of characters in \w is predictably
large, and the number of characters in \d will probably melt your
brain.
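To make the latin_1 vs Unicode split concrete, a minimal sketch (my own
example; assumes a recent 5.8-ish perl and no "use locale"):
my $octets = chr(0xDF);          # LATIN SMALL LETTER SHARP S as one byte
my $chars  = chr(0xDF);
utf8::upgrade($chars);           # same codepoint, utf8-flagged
# The byte string gets the default (ASCII-ish) \w, the utf8 string gets
# the Unicode \w, so the same character matches in one and not the other.
print $octets =~ /\w/ ? "octets: word char\n" : "octets: not a word char\n";
print $chars  =~ /\w/ ? "utf8:   word char\n" : "utf8:   not a word char\n";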
Case insensitivity is the other place where you will see differences.
The languages that most people on this list speak as a mother tongue
have "uppercase" and "lowercase". Well, it turns out that there are
languages that have an additional case (titlecase) and that the
commonly understood rules for doing a case-insensitive match won't
work. For instance, a naive assumption would be that to do a
case-insensitive match you would either uppercase or lowercase all of
the characters in both strings and then proceed from there. Well, this
won't work with Greek, say, and in fact it won't work with German either.
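A minimal sketch of why the naive approach breaks for Greek (my own
example; capital sigma has two lowercase forms, and lc() only ever
gives you the medial one):
use utf8;                        # the Greek literals below are UTF-8
my $upper = "ΟΔΟΣ";              # "road", uppercased
my $lower = "οδος";              # natural lowercase spelling, final sigma
# lc() maps every Σ to medial σ, never to final ς, so lowercasing both
# sides still leaves the strings unequal.
print lc($upper) eq $lower ? "equal\n" : "not equal\n";   # not equal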
So to do case-insensitive matching in Unicode you need to do
"foldcase" matching, which means you convert each sequence into a
normalized folded version and then compare those. Where this gets
tricky is that in some languages, German for example, the folded
version of a particular letter is in fact more than one letter. So the
foldcase of GERMAN-SHARP-ESS aka \x{DF} aka ß is 'ss'. The uppercase
of the letter is ß, and unsurprisingly so is the lowercase.
Now where this gets really annoying is that \x{DF} is the ONLY letter
in Unicode that is in latin_1 that has a multi-character foldcase
representation, yet at the same time Perl has never considered \x{DF}
to match 'ss' in latin_1.
So if you have a string that contains \x{DF}, you'll find it will match
'ss' case-insensitively if the string is in Unicode, but not if it's in
latin_1.
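A minimal sketch of that discrepancy (my own example, assuming the
behaviour described above):
my $octets = chr(0xDF);          # ß stored as a single latin_1 byte
my $string = chr(0xDF);
utf8::upgrade($string);          # same character, utf8 internal encoding
# Only the utf8-flagged string gets the multi-character fold ß -> 'ss'.
print "latin_1: ", ($octets =~ /ss/i ? "matches 'ss'" : "no match"), "\n";
print "utf8:    ", ($string =~ /ss/i ? "matches 'ss'" : "no match"), "\n";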
Anyway, hope this clarifies things a bit.
Cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"