Re: Regex bracket char class range limiting

From: Karl Williamson
Date: June 24, 2014 02:23
Subject: Re: Regex bracket char class range limiting
Message ID: 53A8E0FE.5040407@khwilliamson.com

On 05/15/2014 10:46 AM, Abigail wrote:
> On Thu, May 15, 2014 at 12:20:53AM -0400, shawn wilson wrote:
>> On May 14, 2014 4:33 PM, "Abigail" <abigail@abigail.be> wrote:
>>>
>>> On Wed, May 14, 2014 at 03:28:14PM -0400, shawn wilson wrote:
>>
>>>
>>> I do not know which characters are between Z and a, but at least it's easy
>>> to see it includes all the ASCII letters, in either case. Replace the
>>> 'A' and 'z' with their hex escapes, and I would have no clue.
>>>
>>
>> But it includes more than letters. The gap between Z and a contains [\]^_`.
>> I'm not sure if that adds anything noticeable to code but it is a side
>> effect of doing [A-z].
>
> Yeah, that's what I said. I don't know off the top of my head which
> characters they are, but I know that [A-z] includes all upper case
> letters, all lower case letters, and then some.
>
> If you write it as [\x41-\x7A], I don't even know it includes A-Z and a-z;
> at least not until I've looked up which characters \x41 and \x7A are.
> And then I still don't know which characters between Z and a are included.
>
> All I'm saying is that your quoted suggestion from rjbs (that [A-z] should
> warn, suggesting it should be written as [\x41-\x7A]) doesn't seem to
> be any improvement to me. [\x41-\x7A] is still the same character class
> as [A-z], but it's even less clear which characters are included.
>
>> The point is that if it warns, maybe it'd fix mistakes and make people
>> realize lowers don't come right after uppers. And if you want to match
>> between groups like that, you know what you're doing or are looking at a
>> chart and are commenting the hell out of what's going through your head.
>
> I'm always very wary of changing existing constructs that are really old
> (in this case, over 20 years old, not counting the time it existed
> before Perl) so that they warn just because "it may make people
> fix mistakes".
>
> How often would such a warning trigger, and how often would it rightfully
> trigger?
>
> I don't recall ever seeing [A-z]; I do recall seeing [X-Y], with both X
> and Y being \W characters, where it was done intentionally.
>
> Now, if [] were a new construct, I'd say, allow [X-Y] only if X, Y are both
> lower case letters, or both upper case letters, or both digits, and then
> only if both X and Y are from the same Unicode block.
>
>
>
> Abigail
>

Having had time to think about this, what I'd like to see is

1) No changes to anything with [].  I agree with Abigail.

2) Two new warnings in the experimental (?[]) construct, which I aim to
make a better []:
    a) if a character is printable ASCII, warn if it is expressed as a
non-literal sequence, much like this warning we already have:
          "\c{" is more clearly written as ";"
    b) if the endpoints of a range are literals, warn unless both are in
the same one of three classes: 0-9, A-Z, a-z.

3) Provide a way for someone to say that, in a given scope, they want
the (?[]) behavior for regular [].  This has already been requested in
this thread.  Something like the (ugly) "use re '(?[])'".  Better
syntactic sugar welcome.

This would provide an optional path for people to get better error 
checking of their bracketed character classes, without having any 
backwards compatibility issues with non-experimental code.

It would encourage people to use an instantly recognizable character 
instead of a hex or octal escape for those characters where that is 
universally possible.
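
To make the idea behind warning (a) concrete, here is a minimal sketch.
It is written in Python purely for illustration and is not how Perl
implements or would implement the check. The helper name
suggest_literals is hypothetical, it only recognizes two-digit \xHH
escapes, it treats 0x21 through 0x7E as "printable" (an assumption;
space is left out), and it does not handle an escaped backslash before
the x.

```python
import re

# Hypothetical helper, not part of Perl: scan the body of a bracketed
# character class for \xHH escapes that denote printable ASCII.
HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def suggest_literals(class_body):
    """Return (escape, literal) pairs for printable-ASCII hex escapes."""
    suggestions = []
    for match in HEX_ESCAPE.finditer(class_body):
        code = int(match.group(1), 16)
        # Assumption: "printable" means 0x21..0x7E, i.e. excluding space.
        if 0x21 <= code <= 0x7E:
            suggestions.append((match.group(0), chr(code)))
    return suggestions


# The class body from earlier in the thread: [\x41-\x7A]
for escape, literal in suggest_literals(r"\x41-\x7A"):
    print(f'"{escape}" is more clearly written as "{literal}"')
```

A real check would also have to decide what to suggest for characters
that are special inside a class, such as ] or ^ or \, where the literal
itself needs escaping.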

And it would encourage restricting the use of ranges to the cases that
are likewise unambiguous.
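
The endpoint rule in 2b can be sketched as follows. Again this is
Python for illustration only, under the assumption that both endpoints
are single literal characters; range_is_unambiguous and range_members
are hypothetical names, not anything in Perl.

```python
import string

# The three classes named in proposal 2b, as sets of ASCII characters.
RANGE_CLASSES = (
    set(string.digits),
    set(string.ascii_uppercase),
    set(string.ascii_lowercase),
)


def range_is_unambiguous(low, high):
    """True if both literal endpoints fall in the same one of the classes."""
    return any(low in cls and high in cls for cls in RANGE_CLASSES)


def range_members(low, high):
    """All characters covered by the range low-high, by code point."""
    return [chr(cp) for cp in range(ord(low), ord(high) + 1)]


print(range_is_unambiguous("A", "Z"))  # both endpoints are upper case
print(range_is_unambiguous("A", "z"))  # endpoints in different classes

# The non-letters that [A-z] picks up between Z and a.
extras = [c for c in range_members("A", "z") if not c.isalpha()]
print("".join(extras))
```

Under this rule [A-Z], [a-f] and [0-7] pass, while [A-z] and [0-Z]
would draw the warning.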

But the alphabetic ranges are unambiguous only by the long-standing
convention that restricts the sets to the ASCII ones.  We can't really
extend this beyond ASCII.  Which characters should be in a given range
is a language by language (locale by locale, really) issue.  It differs
from French to German, or any number of Western European languages.  The
Cyrillic alphabet could be done for modern Russian, but I don't know
about Belorussian, which also uses Cyrillic, etc. etc.

The one unambiguous case (that I know about) of non-ASCII ranges would
be ranges of other decimal digits given as literals in the character
class (under 'use utf8'), like ০-৯ (0-9 in Bengali).
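
One way to test for that digit case, sketched in Python for
illustration only: treat a range as unambiguous when every code point
in it is a decimal digit and the digit values count up by one from the
first endpoint. The function name is hypothetical, and the rule itself
is just one possible reading of the idea, not a Perl behavior.

```python
import unicodedata


def is_decimal_digit_range(low, high):
    """True if low-high covers only consecutive decimal digit values."""
    values = [
        unicodedata.decimal(chr(cp), -1)  # -1 marks "not a decimal digit"
        for cp in range(ord(low), ord(high) + 1)
    ]
    if not values or values[0] < 0:
        return False
    # Each code point must carry the next digit value in sequence.
    return values == list(range(values[0], values[0] + len(values)))


BENGALI_ZERO = "\u09E6"  # BENGALI DIGIT ZERO
BENGALI_NINE = "\u09EF"  # BENGALI DIGIT NINE

print(is_decimal_digit_range(BENGALI_ZERO, BENGALI_NINE))
print(is_decimal_digit_range("0", "9"))
print(is_decimal_digit_range("0", BENGALI_NINE))
```

The last call spans from ASCII 0 to Bengali 9 and is rejected, because
the range between them covers many code points that are not digits.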

