Rafael Garcia-Suarez wrote 2008-05-20 10:55 (+0200):
> 2008/5/20 Marc Lehmann <schmorp@schmorp.de>:
> > Unfortunately, perl doesn't really handle it that way. regexes for example
> > treat the same number on the perl level differently depending on how it's
> > encoded internally.
> > And this is a problem.
> You could add uc/lc to the list.
If you're looking for a list, the Unicode::Semantics documentation has
one. It's probably not complete, but it's a starting point.
* uc, lc, ucfirst, lcfirst, \U, \L, \u, \l
* \d, \s, \w, \D, \S, \W
* /.../i, (?i:...)
* /[[:posix:]]/
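To make the effect concrete, here is a minimal sketch using \w from the
list above. It assumes a perl running with its default semantics (no
pragma or regexp flag that changes them); the variable names are only
for illustration.

```perl
use strict;
use warnings;

# "\xe4" is LATIN SMALL LETTER A WITH DIAERESIS, stored as a single byte.
my $bytes = "\xe4";

# The same character, upgraded to the internal UTF-8 representation.
my $upgraded = "\xe4";
utf8::upgrade($upgraded);

# At the language level the two strings are equal ...
my $same = $bytes eq $upgraded ? 1 : 0;

# ... yet \w is answered per internal encoding, not per character.
my $w_bytes    = $bytes    =~ /\w/ ? 1 : 0;
my $w_upgraded = $upgraded =~ /\w/ ? 1 : 0;

printf "eq: %d, \\w on bytes: %d, \\w on upgraded: %d\n",
    $same, $w_bytes, $w_upgraded;
```

Two strings that compare equal with eq should not be able to give
different answers here; that is the whole problem in a few lines.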
> Now, at the perl language level, I think the problem we have is that
> we sometimes want uc, lc or //i to have Unicode semantics, and sometimes
> not. (other operations here ?)
Er, why "sometimes not"?
Why would you uppercase something that's not text?
I suggest that we keep the possibility to uppercase only the ASCII
character range, and call that ASCII::uc(), while the normal uc() is
made Unicode compliant regardless of the PV's state.
Maybe this should even be called Unicode::uc(), and uc() should
"default" to Unicode, with "use ASCII qw(uc);" and "use Unicode qw(uc);"
as ways to override the default.
> For those two cases we can:
> * Add a pragma that says "in this block, apply Unicode semantics".
There are three ways of dealing with text data in Perl:
1. The text is Unicode (e.g. uc("aä") eq "AÄ")
2. The text is ASCII (e.g. uc("aä") eq "Aä")
3. It's determined by the UTF8 flag.
It is now widely agreed that option 3 is wrong. However, many parts of
perl still use option 3.
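A rough sketch of the three options, for comparison. unicode_uc() and
ascii_uc() are hypothetical helpers written for this mail, not existing
functions, and the third case assumes a perl with its default semantics.

```perl
use strict;
use warnings;

# Option 1 (hypothetical helper): always Unicode semantics. Upgrading a
# private copy makes uc() apply Unicode case mapping, whatever the
# caller's internal representation was.
sub unicode_uc {
    my ($s) = @_;
    utf8::upgrade($s);
    return uc $s;
}

# Option 2 (hypothetical helper): ASCII-only semantics. Only a-z change.
sub ascii_uc {
    my ($s) = @_;
    (my $copy = $s) =~ tr/a-z/A-Z/;
    return $copy;
}

my $text = "a\xe4";              # "aä", stored as bytes

my $uni = unicode_uc($text);     # "A\xc4"
my $asc = ascii_uc($text);       # "A\xe4"

# Option 3: plain uc(), whose result follows the UTF8 flag.
my $flag_off = uc $text;
my $up = $text;
utf8::upgrade($up);
my $flag_on = uc $up;

printf "unicode_uc ok: %d, ascii_uc ok: %d, uc() agrees with itself: %d\n",
    ($uni eq "A\xc4" ? 1 : 0),
    ($asc eq "A\xe4" ? 1 : 0),
    ($flag_off eq $flag_on ? 1 : 0);
```

Options 1 and 2 give the same answer for the same characters every
time. Option 3 is the only one where the caller cannot tell from the
string's contents what will happen.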
I suggest that there also be a "use feature" called unicode_by_default
that does no more than load the new pragma to enable Unicode semantics.
That way "use v5.12;" would include the pragma, so you cannot forget to
request the behaviour you want.
> Additionally, we can add a regexp flag qr//u, that says "this
> regexp matches with Unicode semantics". (I'm thinking out loud
> here)
I have suggested /u(nicode), /a(scii) before. These are "needed" in
addition to the pragma, because of qr//: there must be a way to
stringify the lexically selected behavior so it survives the end of the
lexical scope.
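The mechanism already exists for today's flags, so a small sketch with
/i standing in for the proposed /u or /a may help. The exact stringified
form differs between perl versions, so the code only prints it.

```perl
use strict;
use warnings;

my $re;
{
    # Imagine a lexically scoped pragma in effect only inside this block.
    # /i stands in for the proposed /u or /a flag.
    $re = qr/text/i;
}

# Outside the block, the flag still travels with the compiled regexp ...
my $direct = "TEXT" =~ $re ? 1 : 0;

# ... and with its stringified form, which embeds the flag as (?...i...:).
my $string  = "$re";
my $rebuilt = "plain TEXT" =~ /plain $string/ ? 1 : 0;

print "stringified: $string\n";
print "direct: $direct, rebuilt: $rebuilt\n";
```

A pragma alone has nowhere to live once the qr// object is stringified
or interpolated elsewhere, which is why the semantics need a flag
letter of their own.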
> * Drop relying on the SvUTF8 flag to choose whether Unicode semantics
> should be applied. Big change, not backwards compatible, but IMO
> needed for sanity.
Yes!
However, there's also a way to
> But sometimes we want perl to magically switch between Unicode and
> non-Unicode semantics depending on the data it's handling.
No, we don't want this to happen MAGICALLY. Or at least I really do not
want Perl to do that. This is one place where DWIM heuristics simply
cannot work.
> Does that mean that we need to add a new kind of data to perl,
> "Unicode SV" ? Will that solve problems ? What problems will this
> create ?
Indeed there could be a way to indicate "I intend this string to be a
byte string". I have a module, called BLOB.pm, in the works that makes
this very easy. I'll try to release it really soon so you can have a
look.
Because of the way BLOB works, it could probably be used by XS and core
code too. BLOB assumes that everything is text until explicitly marked
as binary.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,
Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;