perl.perl5.porters | Postings from November 2003

5.8.1 perlre man page: [:punct:] vs. \p{IsPunct}

From: David Graff
Date: November 2, 2003 09:00
Subject: 5.8.1 perlre man page: [:punct:] vs. \p{IsPunct}
Message ID: 200311021700.hA2H0OL2022694@unagi.cis.upenn.edu

I just happened to notice that the perlre man page describes the
POSIX "[:punct:]" character class as being equivalent to the Unicode
"\p{IsPunct}" character class.

I haven't tried to track down the respective standards documents for
POSIX and Unicode to see whether these classes are _supposed_ to be
equivalent over the printable ASCII character set, but when I test them
in Perl 5.8.1, they are _not_ equivalent, as the following snippet will
demonstrate:

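# Test each printable ASCII character against both classes.
for my $x ( 0x20 .. 0x7e ) {
    $_ = chr( $x );
    my $res = ( /[[:punct:]]/ ) ? "matches  :punct:" : "is not a :punct:";
    $res .= ( /\p{IsPunct}/ ) ? " matches  {IsPunct}" : " fails on {IsPunct}";
    # Report only characters matched by at least one of the two classes.
    printf( " 0x%x (%3d.) %s %s\n", $x, $x, $_, $res ) if ( $res =~ /matches/ );
}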

The differences involve these nine characters:  $ + < = > ^ ` | ~
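
To list only the characters on which the two classes disagree, rather than every character matched by either one, a variant along these lines should work. This is a minimal sketch: the helper name punct_differences is made up for illustration, and the result may depend on the Perl version in use.

```perl
use strict;
use warnings;

# Hypothetical helper: return the printable ASCII characters that
# exactly one of [[:punct:]] and \p{IsPunct} matches.
sub punct_differences {
    my @diff;
    for my $code ( 0x20 .. 0x7e ) {
        my $ch      = chr $code;
        my $posix   = $ch =~ /[[:punct:]]/ ? 1 : 0;
        my $unicode = $ch =~ /\p{IsPunct}/ ? 1 : 0;
        push @diff, $ch if $posix != $unicode;
    }
    return @diff;
}

# Print the disagreeing characters on one line, separated by spaces.
print join( ' ', punct_differences() ), "\n";
```

Each character in the output is one that a regex written with one class would treat differently from a regex written with the other.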

Except for the back-tick (`), I wouldn't be surprised if POSIX and 
Unicode are supposed to differ on these points, so maybe it's just a 
matter of fixing the perlre man page.  (I'm not sure yet what the 
behavior of [:punct:] is supposed to be on non-ASCII punctuation 
characters in Unicode -- maybe the man page should clarify this too.)
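
One way to probe the non-ASCII question empirically is to run both classes over a few sample code points. This is only a sketch: the four code points and the helper name classify are chosen here for illustration, and what [:punct:] reports for them may vary with the Perl version and with how the string is encoded internally, so no particular output is claimed.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Sample non-ASCII code points, picked purely as an illustration.
my %samples = (
    0x00A1 => 'INVERTED EXCLAMATION MARK',
    0x00BF => 'INVERTED QUESTION MARK',
    0x2014 => 'EM DASH',
    0x20AC => 'EURO SIGN',
);

# Hypothetical helper: report whether each class matches the code point.
sub classify {
    my ($code) = @_;
    my $ch = chr $code;
    return {
        posix   => ( $ch =~ /[[:punct:]]/ ? 1 : 0 ),
        unicode => ( $ch =~ /\p{IsPunct}/ ? 1 : 0 ),
    };
}

for my $code ( sort { $a <=> $b } keys %samples ) {
    my $r = classify($code);
    printf "U+%04X %-28s [:punct:]=%d \\p{IsPunct}=%d\n",
        $code, $samples{$code}, $r->{posix}, $r->{unicode};
}
```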

	Dave Graff


