Re: The "Unicode Bug"
From: Aristotle Pagaltzis
Date: October 17, 2011 16:07
Subject: Re: The "Unicode Bug"
Message ID: 20111017230715.GA16512@klangraum.plasmasturm.org
* Vladimir V. Perepelitsa <inthrax@gmail.com> [2011-10-18 00:00]:
> upgrade(encode_utf8) isn't correct. You upgrade from utf-8 sequence
> using latin1 charset.
It is perfectly correct. The byte buffer changes, but the string means
the same thing, because the UTF8 flag changes along with it.
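To make that concrete, here is a minimal sketch, assuming a stock perl with the core Encode module; the variable names are mine and purely illustrative. It compares the encoded string before and after utf8::upgrade:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# U+00B2 encoded as UTF-8: a two-byte string, UTF8 flag off.
my $bytes = encode_utf8("\xB2");

# A copy whose internal representation is upgraded (UTF8 flag on).
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# Compare the two at the string level, then look at the flag.
my $same_string = ($bytes eq $upgraded)    ? 1 : 0;
my $flag_before = utf8::is_utf8($bytes)    ? 1 : 0;
my $flag_after  = utf8::is_utf8($upgraded) ? 1 : 0;

printf "length: %d vs %d\n", length($bytes), length($upgraded);
printf "equal: %d, flag: %d vs %d\n", $same_string, $flag_before, $flag_after;
```

Both strings should report length 2 and compare equal; only the flag differs.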
> I.e. you do: decode 'latin1' => encode utf8 => "\xb2";
>
> You take char U+00B2, which is represented as "\xb2". Ok, in perl it
> is folded into single byte, it's ugly, but ok, I won't look into
> internal storage.
It’s not ugly. It’s a perfectly reasonable way to store U+B2.
> Then you encode it into utf8. you got a byte array, that consists of
> 2 bytes 0xc2 0xb2. It's also ok.
Right. Now we have a string that represents a sequence of bytes.
> then you upgrade it, which is equivalent to decode latin1 (see perldoc
> utf8) I.e. you mean, that your string consists of 2 latin1 characters
> 0xc2 and 0xb2, which, in turn, are represented by 4 byte utf8
> sequence.
Yes, you get C3 82 C2 B2 *PLUS* the UTF8 flag turned on. *BECAUSE* the
flag is on, what this string means is its UTF-8 decoding, i.e.
"\xC2\xB2". It does *NOT* mean "\xC3\x82\xC2\xB2", and if your code
behaves as if that were what it meant, then your code is broken.
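A rough sketch of the difference between what the string means and what sits in its buffer. The helper hex_chars is hypothetical, and Encode::_utf8_off is an internal function that I use here only to peek at the buffer of a copy, not something ordinary code should call:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: format each character of a string as two hex digits.
sub hex_chars {
    my ($str) = @_;
    return join " ", map { sprintf("%02X", ord($_)) } split(//, $str);
}

my $upgraded = encode_utf8("\xB2");   # characters C2 B2, UTF8 flag off
utf8::upgrade($upgraded);             # same characters, UTF8 flag on

# What the string means: its characters.
my $meaning = hex_chars($upgraded);

# Peek at the raw buffer by switching the flag off on a copy.
# Inspection only; this deliberately misreads the buffer.
my $raw = $upgraded;
Encode::_utf8_off($raw);
my $buffer = hex_chars($raw);

print "meaning: $meaning\n";
print "buffer:  $buffer\n";
```

The first line should show C2 B2 and the second C3 82 C2 B2. Code that sees the second value when it asks for the string's contents is ignoring the flag.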
> Ok, perl says, that utf8(\x{c2}\x{b2}) eq bytes(0xc2 0xb2) /* because
> of backward compatibility to latin1 */
No, by design.
> but that doesn't mean, that xml parser should treat 2 utf8 characters
> U+00C2 and U+00B2 as a single char U+00B2.
Whoa now, how did we go from 4 bytes to 1 character?
Do you mean your parser gets 4 bytes + UTF8=on as input and produces
a 1-character string as output?
If so, then yes, that is exactly what the parser should be doing. The
buffer holds 4 bytes, C3 82 C2 B2, but the UTF8 flag is on, so you have
to decode them first, and when you do that you get C2 B2. Presumably
your parser is specified to accept bytes, so these values are byte
values, and the parser then has to decode them according to the XML
encoding, which is usually UTF-8. So you decode C2 B2 by the UTF-8
algorithm and you get the Unicode character U+B2.
So yes. You started with 4 bytes, and you ended up with 1 character.
That is exactly correct.
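The two steps can be sketched as follows. decode_xml_bytes is a hypothetical stand-in for a parser's input stage, and it assumes the document encoding is UTF-8 instead of reading it from the XML declaration:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode);

# Hypothetical stand-in for a parser's input step: accepts a byte string
# in either internal representation and returns the decoded text.
sub decode_xml_bytes {
    my ($input) = @_;
    my $octets = $input;
    # Step 1: normalise the internal representation. This dies if the
    # string holds characters above 0xFF, i.e. is not a byte string.
    utf8::downgrade($octets);
    # Step 2: decode the bytes per the (assumed) XML encoding.
    return decode('UTF-8', $octets);
}

my $input = encode_utf8("\xB2");
utf8::upgrade($input);               # 4-byte buffer, UTF8 flag on

my $text = decode_xml_bytes($input);
printf "%d character(s), U+%04X\n", length($text), ord($text);
```

The explicit utf8::downgrade is there only to make the first step visible; the point is that the result is the same whether or not the input was upgraded.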
Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>