2008/5/21 Ben Morrow <ben@morrow.me.uk>:
> OK, new proposal; this is how I have thought Unicode *ought* to work in
> Perl for some time. It's entirely possible there's some serious flaw
> with it, of course... I was (assuming you were) intending UPOK to
> represent a new entry in the SV, so struct xpv would become something
> like
>
>     struct xpv {
>         char *      xpv_pv;     /* byte string */
>         STRLEN      xpv_cur;
>         STRLEN      xpv_len;
>
>         wchar_t *   xpv_upv;    /* Unicode string */

(I'm under the impression that wchar_t is not portable and not suitable
for storing Unicode, since its size is implementation-defined. However,
we could use UTF-16 or UTF-32 here, if that's more convenient.)

>         STRLEN      xpv_ucur;
>         STRLEN      xpv_ulen;
>     };
>
> (or perhaps xpv would stay as-is, and we'd have an xpvupv like that).
> POK says the PV slot is valid, UPOK says the UPV slot is valid, and you
> can have both valid at once so you don't have to keep converting a given
> string between bytes and characters. You have to keep track of which
> representation is canonical, of course, exactly as with string<->number
> conversions.
>
> Then you have two forms of (say) 'eq', each of which sv_upgrades its
> arguments to the appropriate type; except that (for compatibility) if
> you specify neither 'bytes' nor 'unicode' it guesses which you wanted. A
> new quote-like qu// would be useful, but it would just be a shortcut for
> do { use unicode; qq// }; similarly, a qb// would be useful to get
> binary strings when under 'unicode'.
>
>> Here's my position:
>> - to deal with encodings, use Encode.
>> - no encoding-aware strings in core perl. (of course, you can still
>>   use magic, ties, etc. to add behaviour)
>
> I agree here: while strings that knew what encoding they started out as
> sound like a cool idea, I suspect it would quickly become unmanageable.
>
>> - the "Unicodeness" of a string would be independent of its SvUTF8 flag.
>>   It will just indicate that <some list of perl built-ins> must apply
>>   Unicode semantics when dealing with it.
>
> This feels wrong, to me. Perl has always had polymorphic values and
> monomorphic operators; allowing the string to choose which version of
> the operator it gets seems like going the other way. In an ideal world,
> I would advocate a new set of operators: ueq, ult, u., usubstr, and so
> on; since this is obviously impractical, a pragma to choose which 'eq'
> you want seems like the way to go.

I now tend to agree with this. Moreover, I now tend to agree with
Juerd: always apply Unicode semantics by default, and provide
alternative ops (or a pragma) to get latin-1 semantics. That makes
UPOK strings rather pointless now. Some way to mark PVs as "binary"
and not upgradeable to SvUTF8 would be handy, though.

[snipping a lot of thoughtful stuff]
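
For readers following the thread, here is a minimal C sketch of the dual-slot idea quoted above: one value carries a byte slot and a Unicode slot, with a flag per slot saying whether it is valid. Every name in it (toy_sv, TOY_POK, TOY_UPOK, toy_from_bytes, toy_upv, toy_free) is hypothetical and is not part of perl's real internals. It assumes uint32_t code points in place of wchar_t, and assumes the byte slot is interpreted as Latin-1 when the Unicode slot is filled in.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flags modelled on the POK/UPOK idea; not perl's real flags. */
enum { TOY_POK = 1u << 0, TOY_UPOK = 1u << 1 };

typedef struct {
    unsigned  flags;  /* which slots are currently valid */
    char     *pv;     /* byte string */
    size_t    cur;    /* length in bytes */
    uint32_t *upv;    /* Unicode string as code points */
    size_t    ucur;   /* length in characters */
} toy_sv;

/* Create a value whose only valid (canonical) form is the byte string. */
static toy_sv toy_from_bytes(const char *bytes, size_t n)
{
    toy_sv sv = {0};
    sv.pv = malloc(n ? n : 1);
    if (sv.pv == NULL)
        abort();
    memcpy(sv.pv, bytes, n);
    sv.cur = n;
    sv.flags = TOY_POK;
    return sv;
}

/* Return the Unicode slot, filling it on first use and caching it.
 * Assumption: bytes are Latin-1, so each byte maps to one code point. */
static const uint32_t *toy_upv(toy_sv *sv)
{
    if (!(sv->flags & TOY_UPOK)) {
        assert(sv->flags & TOY_POK);  /* need some valid slot to convert from */
        sv->upv = malloc((sv->cur ? sv->cur : 1) * sizeof *sv->upv);
        if (sv->upv == NULL)
            abort();
        for (size_t i = 0; i < sv->cur; i++)
            sv->upv[i] = (unsigned char)sv->pv[i];
        sv->ucur = sv->cur;
        sv->flags |= TOY_UPOK;        /* both slots are now valid at once */
    }
    return sv->upv;
}

/* Release both slots and reset the value. */
static void toy_free(toy_sv *sv)
{
    free(sv->pv);
    free(sv->upv);
    memset(sv, 0, sizeof *sv);
}
```

A character-level 'eq' would call toy_upv() on both operands before comparing, while a byte-level 'eq' would read pv directly. The sketch leaves out the part that makes the scheme hard: clearing the stale flag whenever one slot is modified, which is the "which representation is canonical" bookkeeping mentioned above.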