Re: Regex bracket char class range limiting
From: Karl Williamson
Date: June 24, 2014 02:23
Subject: Re: Regex bracket char class range limiting
Message ID: 53A8E0FE.5040407@khwilliamson.com

On 05/15/2014 10:46 AM, Abigail wrote:
> On Thu, May 15, 2014 at 12:20:53AM -0400, shawn wilson wrote:
>> On May 14, 2014 4:33 PM, "Abigail" <abigail@abigail.be> wrote:
>>>
>>> On Wed, May 14, 2014 at 03:28:14PM -0400, shawn wilson wrote:
>>
>>>
>>> I do not know which characters are between Z and a, but at least it's easy
>>> to see it includes all the ASCII letters, in either case. Replace the
>>> 'A' and 'z' with their hex escapes, and I would have no clue.
>>>
>>
>> But it includes more than letters. The space between that contains [\]^_`.
>> I'm not sure if that adds anything noticeable to code but it is a side
>> effect of doing [A-z].
>
> Yeah, that's what I said. I don't know on top of my head which characters
> they are, but I know that [A-z] includes all upper case letters, all lower
> case letters, and then some.
>
> If you write it as [\x41-\x7A], I don't even know it includes A-Z and a-z;
> at least not until I've looked up which characters \x41 and \x7A are.
> And then I still don't know which characters between Z and a are included.
>
> All I'm saying is that your quoted suggestion from rjbs (that [A-z] should
> warn, suggesting it should be written as [\x41-\x7A]) doesn't seem to
> be any improvement to me. [\x41-\x7A] is still the same character class
> as [A-z], but it's even less clear which characters are included.
>
>> The point is that if it warns, maybe it'd fix mistakes and make people
>> realize lowers don't come right after uppers. And if you want to match
>> between groups like that, you know what you're doing or are looking at a
>> chart and are commenting the hell out of what's going through your head.
>
> I'm always very wary to change existing constructs that are really old
> (in this case, over 20 years old, not counting the time it existed
> before Perl), to have them warn on just because "it may make people
> fix mistakes".
>
> How often would such a warning trigger, and how often would it rightfully
> trigger?
>
> I don't recall ever seeing [A-z]; I do recall seeing [X-Y], with both X and Y
> being \W char where it was done intentionally.
>
> Now, if [] were a new construct, I'd say, allow [X-Y] only if X, Y are both
> lower case letters, or both upper case letters, or both digits, and then
> only if both X and Y are from the same Unicode block.
>
>
>
> Abigail
>

Having had time to think about this, what I'd like to see is:

1) No changes to anything with []. I agree with Abigail.

2) Two new warnings in the experimental (?[]) construct, which I aim to
   make a better []:
   a) if a character is printable ASCII, warn if it is expressed as a
      non-literal sequence, much like this warning we already have:
          "\c{" is more clearly written as ";"
   b) if the endpoints of a range are literals, warn unless both are in
      the same one of three classes: 0-9, A-Z, a-z.

3) Provide a way for someone to say that, in a given scope, they want
   the (?[]) behavior for regular []. This has already been requested in
   this thread. Something like the (ugly) "use re '(?[])'". Better
   syntactic sugar welcome.
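
To make the rule in 2b concrete, here is a minimal sketch of the check
it describes. The helper name range_is_unambiguous is hypothetical and
exists only for illustration. It is not part of Perl or of the (?[])
implementation, and a real warning would live in the regex compiler
rather than in user code.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not a Perl API):
# returns true when both literal endpoints of a range fall in the same
# one of the three classes named in 2b: 0-9, A-Z, a-z.
sub range_is_unambiguous {
    my ($lo, $hi) = @_;
    for my $class (qr/\A[0-9]\z/, qr/\A[A-Z]\z/, qr/\A[a-z]\z/) {
        return 1 if $lo =~ $class && $hi =~ $class;
    }
    return 0;
}

# Report, for a few sample ranges, whether the proposed warning would fire.
for my $range (['A', 'Z'], ['a', 'f'], ['0', '9'], ['A', 'z'], ['0', 'Z']) {
    my ($lo, $hi) = @$range;
    printf "[%s-%s] %s\n", $lo, $hi,
        range_is_unambiguous($lo, $hi) ? "ok" : "would warn";
}
```

Under this sketch, [A-Z], [a-f] and [0-9] pass, while [A-z] and [0-Z]
are flagged because their endpoints come from different classes.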

This would provide an optional path for people to get better error
checking of their bracketed character classes, without any backwards
compatibility issues for non-experimental code.

It would encourage people to use an instantly recognizable character
instead of a hex or octal escape for those characters where that is
universally possible.

And it would encourage restricting the use of ranges to the cases that
are likewise unambiguous.
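
As an illustration of why a range like [A-z] is not one of those
cases, this short sketch lists what it matches beyond the letters. It
assumes an ASCII platform; the code points differ on EBCDIC.

```perl
use strict;
use warnings;

# Collect every ASCII character that [A-z] matches but [A-Za-z] does not.
my @extras = grep { /\A[A-z]\z/ && !/\A[A-Za-z]\z/ }
             map  { chr($_) } 0 .. 127;

# On an ASCII platform these are the six characters 0x5B through 0x60.
print join(' ', @extras), "\n";    # prints: [ \ ] ^ _ `
```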

But the alphabetic ranges are unambiguous only because of the
long-standing convention that restricts the sets to the ASCII letters.
We can't really extend this beyond ASCII. Which characters should be in
a given range is a language-by-language (locale-by-locale, really)
issue. It differs between French and German, or any number of Western
European languages. The Cyrillic alphabet could be done for modern
Russian, but I don't know about Belarusian, which also uses Cyrillic,
and so on.

The one unambiguous case (that I know about) of non-ASCII ranges would
be ranges of other decimal digits given as literals in the character
class (under 'use utf8'), like ০-৯ (0-9 in Bengali).
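
A small sketch of such a digit range in an ordinary bracketed class,
assuming the source file is saved as UTF-8. The sample strings are made
up for illustration.

```perl
use strict;
use warnings;
use utf8;                               # the source contains literal non-ASCII characters
binmode(STDOUT, ':encoding(UTF-8)');    # so the printed strings are encoded as UTF-8

# The Bengali digits occupy ten consecutive code points, U+09E6 to
# U+09EF, so the literal range covers exactly the ten digits.
my $bengali_digits = qr/\A[০-৯]+\z/;

for my $s ("২০১৪", "2014", "২০1৪") {
    printf "%s: %s\n", $s, $s =~ $bengali_digits ? "match" : "no match";
}
```

Only the first string matches; the other two contain ASCII digits,
which lie outside the range.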