On Sat, Nov 27, 2010 at 07:01:57AM -0700, Tom Christiansen wrote:
> Nick wrote:
>
> >>> (WRONG in the general case. It feels like an awful lot of end-user
> >>> code to deal with encodings is heuristics and bodgery, rather than
> >>> actual understanding)
>
> >> Very true, and a source of perpetual annoyance. But it's a separate
> >> issue, isn't it?
>
> > Not in my mind. Finding the need to resort to flipping the internal
> > flag for UTF-8 is a red flag that the proper conversion layer isn't
> > implemented, because the flow of data hasn't been thought about.
>
> It does leave a code-smell, doesn't it? I've always been uncomfy
> with it, but I don't know what else to do. Could you please tell
> me how I *should* then be writing the unless test and block at
> the bottom of this code snippet:

What is this code trying to do? It's not obvious to me.

>     for my $codepoint ( $first_codepoint .. $last_codepoint ) {
>
>         # gaggy UTF-16 surrogates are invalid UTF-8 code points
>         next if $codepoint >= 0xD800 && $codepoint <= 0xDFFF;
>
>         # from utf8.c in perl src; must avoid fatals in 5.10
>         next if $codepoint >= 0xFDD0 && $codepoint <= 0xFDEF;
>
>         # both FFFE and FFFF are "not characters" in any plane
>         next if 0xFFFE == ($codepoint & 0xFFFE);
>
>         # see "Unicode non-character %s is illegal for interchange" in perldiag(1)
>         $_ = do { no warnings "utf8"; chr($codepoint) };
>
>         # fixes "the Unicode bug"
>         unless (utf8::is_utf8($_)) {
>             $_ = decode("iso-8859-1", $_);
>         }

And (unless I'm missing something) the code as-is *isn't* flipping the internal
flag, so there's no way it can leave internal structures in an inconsistent
state.

Nicholas Clark