2008/5/20 Marc Lehmann <schmorp@schmorp.de>:
> Unfortunately, perl doesn't really handle it that way. regexes for example
> treat the same number on the perl level differently depending on how it's
> encoded internally.
>
> And this is a problem. You could add uc/lc to the list.
> I think the pragma already exists, namely "use locale".
>
> If I "use locale" in my program, I would expect perl to apply the current
> locale to any strings, in regexes or elsewhere (to the extent possible).
>
> If I don't "use locale", then I would expect regexes to interpret my strings
> as unicode, regardless of the utf-8 flag, which I can't see in my source.
> (the "surprising" behaviour).
>
> Regarding filenames, this is very easy on unix: all filenames are
> interpreted as octet strings, no specific encoding (perl cannot know the
> encoding of filenames on unix), so the functions all have to downgrade,
> and if that fails, we have a bug (filenames are not locale-dependent
> on unix, they are simply octet strings where only "/" and \000 are
> interpreted).
>
> (if it does not fail, it might still be a bug, but we cannot detect this).
>
> I know "use locale" has weird side effects, but it basically boils down to
> what perluniintro calls "native 8-bit encoding" (fortunately, it is not
> even limited to 8-bit).
>
> even if there were need for a new pragma, I wouldn't call it
> "compatibility", because both behaviours are useful. The difference is
> that I can control which interpretation is applied to my strings and do
> not have to rely on an invisible flag on my scalars.
>
> But then, "locale" maps exactly on the concept of "native encoding",
> because my unix process might run in a locale using koi8-r, and then I
> would want a way to take advantage of the locale w.r.t. interpreting my
> koi8-r data. (do not get confused by the mention of POSIX in the locale
> manpage, locales are an ISO-C thing and ought to exist on windows as
> well.

I think we need a *new* pragma.
I don't want to mix locales and Unicode. Their purposes are different,
and they come from different worlds.

A locale is mostly intended to indicate which language a string is in,
and to apply language-specific rules to it. (UTF8 locales just indicate
that the strings returned by or passed to the C locale API are encoded
in UTF8 instead of latin1 or anything else.) For example, under a
Turkish locale, you'll get different rules for uppercasing "i".

Unicode is a different matter. Now Unicode *also* specifies rules for
collation and casing, and special rules for some languages. Those
special rules are not used in Perl (as far as I know). (But we need a
way to implement them in the language.)

Now, at the perl language level, I think the problem we have is that we
sometimes want uc, lc or //i to have Unicode semantics, and sometimes
not. (other operations here ?)

For those two cases we can:

* Add a pragma that says "in this block, apply Unicode semantics".
  Additionally, we can add a regexp flag qr//u, that says "this regexp
  matches with Unicode semantics". (I'm thinking out loud here) (Also,
  probably any regexp that uses \p should be considered "in Unicode
  mode")

* Drop relying on the SvUTF8 flag to choose whether Unicode semantics
  should be applied. Big change, not backwards compatible, but IMO
  needed for sanity.

But sometimes we want perl to magically switch between Unicode and
non-Unicode semantics depending on the data it's handling. Does that
mean that we need to add a new kind of data to perl, "Unicode SV" ? Will
that solve problems ? What problems will this create ?
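
To make the flag dependence discussed above concrete, here is a minimal sketch. It assumes a plain script with no "use locale". The last two checks are not from this thread: they rely on the `unicode_strings` feature and the `/u` regexp flag that later perl releases (5.12 and 5.14) added, which look close to the pragma and flag floated above. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# "\xE9" is e-acute. $bytes keeps the one-byte internal form (UTF-8 flag
# off); $upgraded holds the same character with the flag switched on.
my $bytes    = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);

# At the Perl level the two strings compare equal ...
my $equal = ($bytes eq $upgraded) ? 1 : 0;

# ... but uc and \w give different answers depending on the invisible flag.
my $uc_bytes      = sprintf "%vX", uc $bytes;        # E9 (left unchanged)
my $uc_upgraded   = sprintf "%vX", uc $upgraded;     # C9
my $word_bytes    = ($bytes    =~ /\w/) ? 1 : 0;     # 0
my $word_upgraded = ($upgraded =~ /\w/) ? 1 : 0;     # 1

# Assumption: perl 5.14 or later. A lexically scoped feature and a regexp
# flag request Unicode semantics explicitly, whatever the internal flag is.
my $uc_scoped = do {
    use feature 'unicode_strings';
    sprintf "%vX", uc $bytes;                        # C9
};
my $word_u = ($bytes =~ /\w/u) ? 1 : 0;              # 1

print "equal=$equal uc: $uc_bytes vs $uc_upgraded, ",
      "\\w: $word_bytes vs $word_upgraded, ",
      "explicit: $uc_scoped $word_u\n";
```

The point of the sketch is only that the first four results depend on how the scalar happens to be stored, while the last two depend on what the source code asks for.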