Quoth rgarciasuarez@gmail.com ("Rafael Garcia-Suarez"):
>
> For those two cases we can:
> * Add a pragma that says "in this block, apply Unicode semantics".
>   Additionally, we can add a regexp flag qr//u, that says "this
>   regexp matches with Unicode semantics". (I'm thinking out loud
>   here) (Also, probably any regexp that uses \p should be considered
>   "in Unicode mode")
> * Drop relying on the SvUTF8 flag to choose whether Unicode semantics
>   should be applied. Big change, not backwards compatible, but IMO
>   needed for sanity.

++

> But sometimes we want perl to magically switch between Unicode and
> non-Unicode semantics depending on the data it's handling. Does that
> mean that we need to add a new kind of data to perl, "Unicode SV" ?
> Will that solve problems ? What problems will this create ?

This seems sane to me. While we're there we can make the new type (QV?
UPV?) a wchar_t* instead of a utf8-encoded char*. That way we get
autoconversion as needed, with caching. We also get the ability to
declare 'all my 8-bit strings are in $encoding' rather than being fixed
to ISO8859-1. (This must be a *different* option from the one that says
'my source code is in $encoding', though they could default to the same
thing. I don't know what to do about literal strings: reencode them,
probably.)

A lot of care will be needed to get all the cases right. For instance,
what happens when a (POK, UPOK) SV is string-compared with a (POK) SV?
I think the right answer is:

 - by default, if any argument of a string operation is UPOK then all
   of them are upgraded to UPOK and the operation occurs on SvUPV; if
   all are !UPOK then they are all upgraded to POK and the operation
   occurs on SvPV. (This assumes all numbers can be represented in the
   current character set :).)

   This 'upgrade' may in fact be a 'downgrade' by current SvUTF8
   terminology, from UPOK->POK, in which case any characters that
   can't be encoded elicit a warning. Ideally all of Encode's options
   should be applicable.

   What to do about chr/ord/"\x", especially given that some encodings
   have more than 256 characters, I'm not sure. I suspect the current
   'assume numbers <256 are byte values and go in SvPV, and numbers
   >255 are Unicode codepoints and go in SvUPV' is a decent
   compromise, *given that users can ask for sane semantics if they
   want them*.

 - under 'use bytes', all string operations upgrade all SVs to POK,
   'upgrade' as above. chr stuffs literal bytes into SvPV.

 - under 'use unicode', all string operations upgrade all SVs to UPOK,
   and chr takes a Unicode codepoint and returns a string that is UPOK
   only. This means that the numbers passed to chr mean different
   things under 'unicode' and 'bytes'. This is a feature :).

 - regexes know which of SvPV and SvUPV they should be matching
   against. I think we need two new flags, /u and /U (or maybe /b),
   with the default being bytes if use-bytes, unicode if use-unicode,
   and guess if neither.

 - 'use locale' can probably be made to work again, if it is only
   applied to SvPV and never to SvUPV. 'use locale' should probably
   imply 'use bytes', and set the current encoding.

This would at least allow users to specify that they understand
Unicode and want consistent semantics, without losing the ability to
manipulate binary data.

Ben

--
We do not stop playing because we grow old;
we grow old because we stop playing.                    ben@morrow.me.uk
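
For anyone who wants to poke at the status quo being discussed, here is
a minimal sketch of today's SvUTF8-dependent behaviour and of the
existing chr() boundary (assuming a stock perl without 'use locale' and
without any of the pragmas proposed above; utf8::upgrade,
utf8::downgrade and utf8::is_utf8 are the existing core utf8
functions, nothing new):

    use strict;
    use warnings;

    # Today the semantics of a string operation follow the SvUTF8 flag
    # of the scalar, not the characters it contains.
    my $bytes = "\xe9";          # LATIN SMALL LETTER E WITH ACUTE, one byte
    my $wide  = "\xe9";
    utf8::upgrade($wide);        # same character, but SvUTF8 is now set

    printf "uc on byte string: %vX\n", uc $bytes;   # E9 - not uppercased
    printf "uc on utf8 string: %vX\n", uc $wide;    # C9 - uppercased

    # The chr() compromise mentioned above: ordinals below 256 produce
    # a byte string, ordinals above 255 force a UTF8 (wide) string.
    printf "chr(0xE9)  UTF8-flagged? %s\n",
        utf8::is_utf8(chr 0xE9)  ? "yes" : "no";    # no
    printf "chr(0x100) UTF8-flagged? %s\n",
        utf8::is_utf8(chr 0x100) ? "yes" : "no";    # yes

    # Downgrading a string that holds a character with no single-byte
    # representation fails; this is roughly the UPOK->POK 'upgrade'
    # case in which unencodable characters should elicit a warning.
    my $s = "caf\xe9 \x{263a}";
    utf8::downgrade($s, 1)
        or warn "\\x{263A} has no byte representation\n";

Under the scheme sketched in the mail, the first pair would stop
disagreeing: both scalars hold the same one character, so they would
compare and case-map the same way regardless of internal
representation.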
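
And since the mail says "ideally all of Encode's options should be
applicable" to the failing UPOK->POK case, the CHECK/fallback arguments
that Encode::encode already takes give an idea of what that menu looks
like today (again only a sketch of current, stock Encode behaviour, not
of anything proposed):

    use strict;
    use warnings;
    use Encode ();

    my $s = "caf\xe9 \x{263a}";   # e-acute fits in Latin-1, U+263A does not

    # Default fallback (FB_DEFAULT): unrepresentable characters are
    # replaced with the substitution character '?'.
    my $lenient = Encode::encode("iso-8859-1", $s);     # "caf\xe9 ?"

    # FB_CROAK: die instead of substituting.  encode() with a non-zero
    # CHECK may modify its source, hence the copy.
    my $copy = $s;
    my $strict = eval { Encode::encode("iso-8859-1", $copy, Encode::FB_CROAK) };
    warn "strict encode failed: $@" if $@;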