On 20 December 2010 22:52, Eric Brine <ikegami@adaelis.com> wrote:
> On Mon, Dec 20, 2010 at 1:13 PM, demerphq <demerphq@gmail.com> wrote:
>>
>> And the point is that certain codepoints are illegal in general, and
>> so can be treated as essentially mapping to themselves.
>>
>> Others are legal ONLY in UTF-16,
>
> You are calling both the encoded form and the decoded form "code point", and
> you are using them interchangeably. I can't respond to your post if you
> don't clear that up.

I don't think I am. The two halves of a surrogate pair are themselves
codepoints. When the pair is interpreted correctly it produces a
different codepoint.

>> So for an example. Consider we have the codepoint U+10400 which case
>> folds to U+10428. When represented in UTF-16 the codepoint U+10400
>> ends up as the surrogate pair U+D801,U+DC00. Now, if somebody naively
>> converts this UTF-16 sequence to UTF-8 by converting codepoint by
>> codepoint, the end result will be that OUR code does NOT see codepoint
>> U+10400,
>
> No, the result will be a warning when you decode the bad UTF-8 you produced
> this way. We all agree the decoder should warn (with an option to disable)
> when it sees invalid UTF-8 or Unicode.

What do you mean "no"? Are you saying that we will see the correct
codepoint U+10400? Because I can assure you that our code will NOT.

Yes, I agree we should warn when we write UTF-8 that contains surrogate
codepoints. However, I also think we should warn when we try to lc() or
uc() a string containing them, as we WILL NOT DO IT CORRECTLY.

There is no room for argument, debate, or personal opinion on the latter
assertion. It is a fact.

Cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
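
To make the U+10400 example above concrete, here is a minimal sketch of the arithmetic. It is written in Python rather than Perl and is not taken from the thread; the helpers `to_surrogates` and `from_surrogates` are hypothetical names introduced only for illustration. It shows the Unicode-level behaviour (surrogates have no case mappings) and says nothing about how Perl's lc() is implemented internally.

```python
# Sketch only: Python stands in for Perl to show the Unicode arithmetic.
# to_surrogates / from_surrogates are illustrative helpers, not a library API.

def to_surrogates(cp):
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(hi, lo):
    """Combine a UTF-16 surrogate pair back into a single code point."""
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

hi, lo = to_surrogates(0x10400)   # (0xD801, 0xDC00)

# Correct decoding: the string holds the single code point U+10400.
correct = chr(from_surrogates(hi, lo))

# Naive conversion: each surrogate is carried over as its own code point.
naive = chr(hi) + chr(lo)

# Lowercasing finds U+10428 only when the pair was decoded properly.
print([hex(ord(c)) for c in correct.lower()])  # ['0x10428']
print([hex(ord(c)) for c in naive.lower()])    # ['0xd801', '0xdc00']
```

The naively converted string is two codepoints long and passes through lowercasing unchanged, which is the silent failure described above: nothing goes wrong loudly, the case mapping simply never happens.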