Andrew McNaughton <andrew@tki.org.nz> writes:
>
>I'm rather concerned by what's happening with the utf-8 implementation.  By
>trying to modify existing functions, while retaining compatibility with
>existing code, the semantics are getting muddled,

_were_ muddled back in perl5.6.0 - a lot of discussion and patches have
happened since then.

>and I expect this to
>lead to a host of security problems.  It is important that utf-8 text
>should be cleanly utf-8.  as soon as character sequences which are not
>valid utf-8 start being processed by utf-8 text handlers, the ambiguities
>will lead to a great many validation and security issues.

>I do understand
>that this is difficult territory, but the only way to get through is with
>a clean and consistent data model.

We think we have one now - or rather (for backward compatibility and
efficiency reasons) two:

 A. iso8859-1 bytes 0..255
 B. UTF-8 encoded UNICODE characters.

But those are _supposed_ to be only exposed to the C code of the
internals and XS modules. Perl code sees sequences of characters.

It gets messy when IO gets involved - and sorting that out is what I
should be doing rather than getting all defensive here ...

>In my view introducing a perl specific
>text encoding scheme which behaves like utf-8 sometimes, but not at other
>times is a serious mistake.

Possibly, but the need to graft UNICODE support into perl5 without breaking
existing iso8859-1/binary bashing applications forces the issue.

-- 
Nick Ing-Simmons
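
As an illustration of the two internal representations (A and B) described in the message, here is a minimal sketch. It assumes a Perl recent enough (5.8 or later) to provide utf8::upgrade and utf8::is_utf8, which postdate the perl5.6.0 discussed here; the variable names are made up for the example.

```perl
use strict;
use warnings;

# A string holding the characters c, a, f, e-acute (code point 0xE9).
# Built from a \x escape below 0x100, it is stored one byte per
# character (representation A).
my $bytes_form = "caf\xe9";

# Copy it and ask Perl to switch the internal storage to UTF-8
# (representation B). utf8::upgrade changes the storage only,
# not the characters the string contains.
my $utf8_form = $bytes_form;
utf8::upgrade($utf8_form);

# The internal flag differs between the two copies ...
printf "flag A: %d, flag B: %d\n",
    (utf8::is_utf8($bytes_form) ? 1 : 0),
    (utf8::is_utf8($utf8_form)  ? 1 : 0);

# ... but Perl code sees the same sequence of characters.
printf "equal: %s, lengths: %d and %d\n",
    ($bytes_form eq $utf8_form ? "yes" : "no"),
    length($bytes_form), length($utf8_form);
```

The point of the sketch is only that string comparison and length operate on characters, so the storage difference stays out of sight until the data crosses an IO or XS boundary.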