Rafael Garcia-Suarez wrote 2008-05-20 10:55 (+0200):
> 2008/5/20 Marc Lehmann <schmorp@schmorp.de>:
> > Unfortunately, perl doesn't really handle it that way. regexes for example
> > treat the same number on the perl level differently depending on how its
> > encoded internally.
> > And this is a problem.
> You could add uc/lc to the list.

If you're looking for a list, the Unicode::Semantics documentation has one.
It's probably not complete, but it's a starting point.

    * uc, lc, ucfirst, lcfirst, \U, \L, \u, \l
    * \d, \s, \w, \D, \S, \W
    * /.../i, (?i:...)
    * /[[:posix:]]/

> Now, at the perl language level, I think the problem we have is that
> we sometimes want uc, lc or //i to have Unicode semantics, and sometimes
> not. (other operations here ?)

Er, why "sometimes not"? Why would you uppercase something that's not text?

I suggest that we keep the possibility to uppercase only the ASCII character
range, and call that ASCII::uc(), while the normal uc() is made Unicode
compliant regardless of the PV's state. Maybe this should even be called
Unicode::uc(), and uc() should "default" to Unicode, with "use ASCII
qw(uc);" and "use Unicode qw(uc);" as ways to override the default.

> For those two cases we can:
> * Add a pragma that says "in this block, apply Unicode semantics".

There are three ways of dealing with text data in Perl:

1. The text is Unicode (i.e. uc("aä") eq "AÄ")
2. The text is ASCII (i.e. uc("aä") eq "Aä")
3. It's determined by the UTF8 flag.

It is now widely agreed that 3 is wrong. However, many parts of perl still
use option 3 today.

I suggest there also be a "use feature" called unicode_by_default that does
nothing more than load the new pragma to enable Unicode semantics. That way
"use v5.12;" would include the pragma, so you can't forget to request a
specific behaviour.

> Additionally, we can add a regexp flag qr//u, that says "this
> regexp matches with Unicode semantics".
> (I'm thinking out loud here)

I have suggested /u(nicode) and /a(scii) before. These are "needed" in
addition to the pragma because of qr//: there must be a way to stringify the
lexically selected behavior so that it survives the end of the lexical scope.

> * Drop relying on the SvUTF8 flag to choose whether Unicode semantics
> should be applied. Big change, not backwards compatible, but IMO
> needed for sanity.

Yes! However, there's also a way to

> But sometimes we want perl to magically switch between Unicode and
> non-Unicode semantics depending on the data it's handling.

No, we don't want this to happen MAGICALLY. Or at least I really do not want
Perl to do that. This is one place where DWIM heuristics simply cannot work.

> Does that mean that we need to add a new kind of data to perl,
> "Unicode SV" ? Will that solve problems ? What problems will this
> create ?

Indeed, there could be a way to indicate "I intend this string to be a byte
string". I have a module in the works, called BLOB.pm, that makes this very
easy. I'll try to release it really soon so you can have a look. Because of
the way BLOB works, it could probably be used by XS and core code too.

BLOB assumes that everything is text until explicitly marked as binary.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####@juerd.nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales@convolution.nl>
1;