Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400
From: Tom Christiansen
Date: November 12, 2008 17:44
Subject: Re: PATCH [perl #59342] chr(0400) =~ /\400/ fails for >= 400
Message ID: 12210.1226540605@chthon
Glenn Linderman <perl@NevCal.com> wrote:
| On approximately 11/12/2008 7:03 AM, came the following characters
| from the keyboard of Rafael Garcia-Suarez:
«« However, I see some value in still allowing [\000-\377] character »»
«« ranges, for example. Do we really want to deprecate that as well? »»
«« This doesn't seem necessary. »»
| The [below] items could be added to the language immediately, during the
| deprecation cycle for \nnn octal notation, giving people an extremely
| simple way to convert their octal constants: inside of strings/regices,
| insert o after \ and wrap the digits with {}; outside of strings/regices,
| insert o after leading 0.
I have been obliged to change my Perl code in three and *only* three
instances over the last TWENTY-ONE YEARS:
1) When log() became a keyword, the inverse of exp(). If parens
had been mandatory on function calls of no arguments, a very wise
practice, this wouldn't have been a problem. This was in 1988,
or perhaps 1989.
2) When perl5 made arrays interpolate in "@strings" unconditionally.
This was in 1994. This was the right thing to do.
3) When perl5.010 finally blew away $* (and $#). This too was the
right thing to do. This was early this year, 2008, and it
was in the following singleton program, written before /m existed:
#!/usr/local/bin/perl
$/ = '';
while (<>) {
    #$* = 1;
    s/^-- ?$//m if eof;
    s/^[-+]{2}\w+$//m if eof;
    next unless split(/\n/);
    $max = 0;
    #$* = 0;
    for (@_) {
        1 while s/\t+/' 'x (length($&) * 8 - length($`) % 8)/e;
        $max = ($max > length) ? $max : length;
    }
    $edge = "+" . "-" x ($max+2) . "+\n";
    print $edge;
    for (@_) { printf "| %-${max}s |\n", $_; }
    print $edge, "\n";
}
I find the notion of rendering illegal the existing octal syntax of "\33"
an *EXTRAÖRDINARILY* bad idea, a position I am prepared to defend at
laborious length--and, if necessary, appeal to the Decider-in-Chief, who's
always done everything possible to *NOT* break others' code without *VERY*
*STRONG* reason. I submit that that very high bar has *NOT* been met; far
from it. I'm rather hoping I shan't have to do any of that, but I certainly
shall if I must.
There's no reason at all to delete it, because regexes have \g{1} now, and
strings need never be written "\333" if you mean "\33" . "3".
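To make the string case concrete, here is a minimal sketch of how an octal escape swallows up to three digits, and how concatenation sidesteps the ambiguity. The variable names are illustrative only and do not come from the thread.

```perl
use strict;
use warnings;

# "\333" is a single character: octal 333 is decimal 219.
my $one_char = "\333";

# "\33" . "3" is ESC (octal 33, decimal 27) followed by the digit "3".
my $esc_then_three = "\33" . "3";

printf "%d character(s), first ord %d\n",
    length($one_char), ord($one_char);
printf "%d character(s), first ord %d\n",
    length($esc_then_three), ord($esc_then_three);
```

The first line reports 1 character with ord 219; the second reports 2 characters starting with ord 27.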
There is GREAT reason *not* to delete it, as the quantity of code you would
see casually rendered illegal is incomprehensibly large, with the work
involved in updating code, databases, config files, and educating
programmers and users incalculably great. To add insult to injury, this
work you would see thrust upon others, not taken on yourself.
There is nothing fundamentally broken here, as there was for $*. This is
trying to create a language where it is impossible to "think bad thoughts".
One cannot succeed at that.
| I personally see no value in octal notation now that Unicode uses hex,
    ^^^^^^^^^^                                                 ^^^^^^^^
Good to see the prefatory warning that this is your *personal* view. :-)
vvvvvvvv
As for "Unicode using hex", me, I've always thought of it as using bits.
Rather, I think of the various standards specifying code points in the
U+XXXXXX notation to mean code point at that hexadecimal number. Not
the same thing at all. That's why I always write
use 5.010;    # say() and the (_) prototype need perl 5.10
sub uchar(_) { pack( "U*", shift() ) }
because that way all of these
say "chr $_ is " => uchar for 181, 223, 231, 240, 241, 254;
say "chr $_ is " => uchar for 0265,0337,0347,0360,0361,0376;
say "chr $_ is " => uchar for 0xb5,0xdf,0xe7,0xf0,0xf1,0xfe;
say "chr $_ is " => uchar for 0b10110101,0b11011111,0b11100111,
0b11110000,0b11110001,0b11111110;
correctly say:
chr 181 is µ
chr 223 is ß
chr 231 is ç
chr 240 is ð
chr 241 is ñ
chr 254 is þ
and similarly
say "uc ", uchar, " is ", uc uchar
for 181, 0xDF, 0347, 3*2**4*5, 0361, 0b11111110;
says
uc µ is M
uc ß is SS
uc ç is Ç
uc ð is Ð
uc ñ is Ñ
uc þ is Þ
Because I'd be really annoyed if
sub uchar(_) { pack( "U*", hex shift() ) }
say "chr $_ has ord " => ord uchar for 181, 223, 231, 240, 241, 254;
were giving me answers like:
chr 181 has ord 385
chr 223 has ord 547
chr 231 has ord 561
chr 240 has ord 576
chr 241 has ord 577
chr 254 has ord 596
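The distinction between a number and the notation used to write it can be sketched as follows. This is a small illustration using only the built-in hex() and oct() functions; the variable names are made up for the example.

```perl
use strict;
use warnings;

# The same integer written four ways. Perl stores one number,
# not the notation it was spelled in.
my @same = (181, 0265, 0xb5, 0b10110101);
print join(",", @same), "\n";

# hex() and oct() convert *strings* of digits, so handing hex() the
# decimal digits "181" reinterprets them as base 16.
my $misread  = hex("181");          # 0x181, which is 385
my $from_oct = oct("265");          # 181
my $from_hex = oct("0xb5");         # oct() also honours a 0x prefix
my $from_bin = oct("0b10110101");   # ... and a 0b prefix
print "$misread $from_oct $from_hex $from_bin\n";
```

The first line printed is 181 four times over; the second is 385 followed by 181 three times, matching the "chr 181 has ord 385" mishap above.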
| and most programmers are familiar with it. [···] I daresay that hex
| is about the second thing most programmers learn, these days. "This
| is a computer... this is hexadecimal numbering system... there are
| lots of computer languages..."
Hm, ok. If you say so. Hadn't noticed it myself.
| Another approach would be to change the escape from \nnn to
| \o{nnnnn...} [···] The {} provide explicit delimiters, so octal
| numbers could then achieve parity with hex in the range of numbers
| available. If people think octal is still worth supporting, this looks
| like a better syntax to support it wholeheartedly.
That's not needed, unless you really want to promote octal for
Unicode strings. In a pattern, \g{1} now handles the situation
you're talking about. For DQ-strings, one can always avoid it.
Type "man ascii"; note that the table given first is octal.
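For the pattern case, a short sketch of why \g{1} removes the ambiguity between a backreference and an octal escape. It assumes perl 5.10 or later (where \g{N} exists) and a pattern with a single capture group; the variable names are illustrative.

```perl
use strict;
use warnings;

# /(a)\g{1}1/ means: capture "a", match it again, then a literal "1".
my $braced = "aa1" =~ /(a)\g{1}1/ ? "match" : "no match";

# With only one capture group, \11 is not backreference 11; it is read
# as the octal escape for a tab, so it does not match "aa1" ...
my $bare = "aa1" =~ /(a)\11/ ? "match" : "no match";

# ... but it does match "a" followed by a tab.
my $tab = "a\t" =~ /(a)\11/ ? "match" : "no match";

print "$braced / $bare / $tab\n";
```

The braces delimit the group number explicitly, so nothing that follows can be absorbed into it.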
| Python 3.0 has moved to 0onnnnn for its octal integers (zero oh digit-
| sequence) after concluding that leading zeros alone are just too
| problematical, so the "o" indicator has a precedent (albeit recent) in
| addition to reasonably intuitively meaning octal to anyone that
| understands the hexadecimal notation and has ever heard of octal. The
| 0o syntax could also be added to Perl integer constants outside of
| strings/regices.
My only trouble with the 0o notation is on fonts without crossed 0's,
and its gratuitous superfluousness.
--tom
--
+------------------------------------------------------------+
|               SINGULAR               PLURAL                |
+-------------+----------------------------------------------+
| NOMINATIVE  | magnus rex             magni reges           |
| VOCATIVE    | magne rex              magni reges           |
| GENITIVE    | magni regis            magnorum regum        |
| ACCUSATIVE  | magnum regem           magnos reges          |
| DATIVE      | magno regi             magnis regibus        |
| ABLATIVE    | magno rege             magnis regibus        |
| LOCATIVE    | magni regi (or rege)   magnis regibus        |
+-------------+----------------------------------------------+
% man ascii
ASCII(7) OpenBSD Reference Manual ASCII(7)
NAME
ascii - octal, hexadecimal and decimal ASCII character sets
DESCRIPTION
The octal set:
000 nul  001 soh  002 stx  003 etx  004 eot  005 enq  006 ack  007 bel
010 bs   011 ht   012 nl   013 vt   014 np   015 cr   016 so   017 si
020 dle  021 dc1  022 dc2  023 dc3  024 dc4  025 nak  026 syn  027 etb
030 can  031 em   032 sub  033 esc  034 fs   035 gs   036 rs   037 us
040 sp   041 !    042 "    043 #    044 $    045 %    046 &    047 '
050 (    051 )    052 *    053 +    054 ,    055 -    056 .    057 /
060 0    061 1    062 2    063 3    064 4    065 5    066 6    067 7
070 8    071 9    072 :    073 ;    074 <    075 =    076 >    077 ?
100 @    101 A    102 B    103 C    104 D    105 E    106 F    107 G
110 H    111 I    112 J    113 K    114 L    115 M    116 N    117 O
120 P    121 Q    122 R    123 S    124 T    125 U    126 V    127 W
130 X    131 Y    132 Z    133 [    134 \    135 ]    136 ^    137 _
140 `    141 a    142 b    143 c    144 d    145 e    146 f    147 g
150 h    151 i    152 j    153 k    154 l    155 m    156 n    157 o
160 p    161 q    162 r    163 s    164 t    165 u    166 v    167 w
170 x    171 y    172 z    173 {    174 |    175 }    176 ~    177 del
The hexadecimal set:
00 nul  01 soh  02 stx  03 etx  04 eot  05 enq  06 ack  07 bel
08 bs   09 ht   0a nl   0b vt   0c np   0d cr   0e so   0f si
10 dle  11 dc1  12 dc2  13 dc3  14 dc4  15 nak  16 syn  17 etb
18 can  19 em   1a sub  1b esc  1c fs   1d gs   1e rs   1f us
20 sp   21 !    22 "    23 #    24 $    25 %    26 &    27 '
28 (    29 )    2a *    2b +    2c ,    2d -    2e .    2f /
30 0    31 1    32 2    33 3    34 4    35 5    36 6    37 7
38 8    39 9    3a :    3b ;    3c <    3d =    3e >    3f ?
40 @    41 A    42 B    43 C    44 D    45 E    46 F    47 G
48 H    49 I    4a J    4b K    4c L    4d M    4e N    4f O
50 P    51 Q    52 R    53 S    54 T    55 U    56 V    57 W
58 X    59 Y    5a Z    5b [    5c \    5d ]    5e ^    5f _
60 `    61 a    62 b    63 c    64 d    65 e    66 f    67 g
68 h    69 i    6a j    6b k    6c l    6d m    6e n    6f o
70 p    71 q    72 r    73 s    74 t    75 u    76 v    77 w
78 x    79 y    7a z    7b {    7c |    7d }    7e ~    7f del
The decimal set:
  0 nul    1 soh    2 stx    3 etx    4 eot    5 enq    6 ack    7 bel
  8 bs     9 ht    10 nl    11 vt    12 np    13 cr    14 so    15 si
 16 dle   17 dc1   18 dc2   19 dc3   20 dc4   21 nak   22 syn   23 etb
 24 can   25 em    26 sub   27 esc   28 fs    29 gs    30 rs    31 us
 32 sp    33 !     34 "     35 #     36 $     37 %     38 &     39 '
 40 (     41 )     42 *     43 +     44 ,     45 -     46 .     47 /
 48 0     49 1     50 2     51 3     52 4     53 5     54 6     55 7
 56 8     57 9     58 :     59 ;     60 <     61 =     62 >     63 ?
 64 @     65 A     66 B     67 C     68 D     69 E     70 F     71 G
 72 H     73 I     74 J     75 K     76 L     77 M     78 N     79 O
 80 P     81 Q     82 R     83 S     84 T     85 U     86 V     87 W
 88 X     89 Y     90 Z     91 [     92 \     93 ]     94 ^     95 _
 96 `     97 a     98 b     99 c    100 d    101 e    102 f    103 g
104 h    105 i    106 j    107 k    108 l    109 m    110 n    111 o
112 p    113 q    114 r    115 s    116 t    117 u    118 v    119 w
120 x    121 y    122 z    123 {    124 |    125 }    126 ~    127 del
FILES
/usr/share/misc/ascii
HISTORY
An ascii manual page appeared in Version 2 AT&T UNIX.
OpenBSD 4.4 May 31, 2007 2