develooper Front page | perl.perl5.porters | Postings from June 2009

RFC: \p{Cntrl}

From: karl williamson
Date: June 8, 2009 12:39
Subject: RFC: \p{Cntrl}
Message ID: 4A2D68DD.2070801@khwilliamson.com

Guess what the major category C stands for in Unicode?








It stands for 'Other', and means characters (code points) that don't fit 
into anything else.

Perl, on the other hand, thinks it stands for control, and so \p{Cntrl}
in Perl is a synonym for the assigned characters in \p{C}.

Unicode does have a category meaning control, and that is Cc.  Starting
in release 4.1, it added 'cntrl' as a synonym for Cc.  Because the
shortcut for \p{gc: cntrl} would be \p{cntrl}, this conflicts with
Perl's \p{cntrl}.

The bottom line is that in Perl cntrl is composed of all the assigned
code points that don't fit into anything else, namely Cc, Co (private
use), Cs (surrogates), and Cf (format characters), whereas in Unicode it
is just Cc.
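
As a rough illustration of the difference in size between the two
definitions, the following Python sketch tallies the subcategories of C
using the standard unicodedata module.  It is not Perl's implementation;
the names tally_other_categories, unicode_cntrl and perl_cntrl are
made up for this example, and the Cf count depends on the Unicode
version bundled with the interpreter, so it will not necessarily match
the figures quoted in this message.

```python
import unicodedata
from collections import Counter


def tally_other_categories():
    """Count code points in each subcategory of the major category C."""
    counts = Counter()
    for cp in range(0x110000):
        gc = unicodedata.category(chr(cp))
        if gc.startswith("C"):
            counts[gc] += 1
    return counts


counts = tally_other_categories()

# Unicode's cntrl: the Cc category alone (C0 controls, DEL, C1 controls).
unicode_cntrl = counts["Cc"]

# Perl's cntrl as described in this message: every *assigned* code point
# in C, i.e. everything except Cn (unassigned).
perl_cntrl = sum(counts[gc] for gc in ("Cc", "Cf", "Co", "Cs"))

for gc in ("Cc", "Cf", "Co", "Cs", "Cn"):
    print(gc, counts[gc])
print("Unicode cntrl:", unicode_cntrl)
print("Perl cntrl (per this message):", perl_cntrl)
```

The bulk of the difference comes from the private use and surrogate
ranges; the format characters are the small part that is most likely to
matter in practice.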

I don't know whether to change the perl definition to match Unicode's or
not.  One could argue that the surrogates and maybe the private use code
points have little or no utility in perl, and so removing them from 
perl's definition wouldn't matter.  But there are 139 format characters
in the Perl definition (the lowest one being the soft hyphen, U+00AD)
that aren't in the Unicode one.  All but 36 of these characters are
either deprecated or strongly discouraged from use.
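
The "all but 36" figure appears to be consistent with counting the Tag
characters (U+E0001 and U+E0020..U+E007F) as discouraged and
U+206A..U+206F as deprecated.  That grouping is an assumption read off
the list in this message, not something the message states outright.
A minimal Python sketch of the arithmetic, with the total of 139 taken
from the message rather than computed:

```python
import unicodedata

# Ranges read off the list in this message; treating exactly these as the
# "deprecated or discouraged" set is an assumption of this sketch.
TAG_RANGES = [(0xE0001, 0xE0001), (0xE0020, 0xE007F)]
DEPRECATED_RANGE = (0x206A, 0x206F)
TOTAL_FORMAT = 139  # figure quoted in the message, not recomputed here


def count_cf(first, last):
    """Count code points in [first, last] whose general category is Cf."""
    return sum(
        1
        for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)) == "Cf"
    )


tags = sum(count_cf(first, last) for first, last in TAG_RANGES)
deprecated = count_cf(*DEPRECATED_RANGE)
remaining = TOTAL_FORMAT - tags - deprecated

print(tags, deprecated, remaining)  # 97 6 36
```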

I don't know if it's worth bringing Perl's definition into line with
Unicode's or not.  Based on a previous RFC, the Unicode definition will 
be accessible through \p{gc: cntrl}.  Does anyone have an opinion?

Below is a list of the 139 format characters.  Note that use of the Tag
characters is strongly discouraged:

Other, Format (139 entries)
U+00AD (SOFT HYPHEN [discretionary hyphen]); NOTE: commonly abbreviated 
as SHY; COMMENT: Latin-1 punctuation and symbols.  Based on ISO/IEC 
8859-1 (aka Latin-1).
U+0600 (ARABIC NUMBER SIGN); COMMENT: Subtending marks
U+0601 (ARABIC SIGN SANAH); COMMENT: Subtending marks
U+0602 (ARABIC FOOTNOTE MARKER); COMMENT: Subtending marks
U+0603 (ARABIC SIGN SAFHA); COMMENT: Subtending marks
U+06DD (ARABIC END OF AYAH); COMMENT: Koranic annotation signs
U+070F (SYRIAC ABBREVIATION MARK [SAM]); NOTE: marks the beginning of a 
Syriac abbreviation; COMMENT: Syriac format control character
U+17B4 (KHMER VOWEL INHERENT AQ); COMMENT: Inherent vowels.  These are 
for phonetic transcription to distinguish Indic language inherent vowels 
from Khmer inherent vowels. These characters are included solely for 
compatibility with particular applications; their use in other contexts 
is discouraged.
U+17B5 (KHMER VOWEL INHERENT AA); COMMENT: Inherent vowels.  These are 
for phonetic transcription to distinguish Indic language inherent vowels 
from Khmer inherent vowels. These characters are included solely for 
compatibility with particular applications; their use in other contexts 
is discouraged.
U+200B (ZERO WIDTH SPACE); NOTE: commonly abbreviated ZWSP, this 
character is intended for line break control; it has no width, but its 
presence between two characters does not prevent increased letter 
spacing in justification; COMMENT: Spaces
U+200C (ZERO WIDTH NON-JOINER); NOTE: commonly abbreviated ZWNJ; 
COMMENT: Format characters
U+200D (ZERO WIDTH JOINER); NOTE: commonly abbreviated ZWJ; COMMENT: 
Format characters
U+200E (LEFT-TO-RIGHT MARK); NOTE: commonly abbreviated LRM; COMMENT: 
Format characters
U+200F (RIGHT-TO-LEFT MARK); BIDI: Right-to-Left; NOTE: commonly 
abbreviated RLM; COMMENT: Format characters
U+202A (LEFT-TO-RIGHT EMBEDDING); NOTE: commonly abbreviated LRE; 
COMMENT: Format characters
U+202B (RIGHT-TO-LEFT EMBEDDING); BIDI: Right-to-Left Embedding; NOTE: 
commonly abbreviated RLE; COMMENT: Format characters
U+202C (POP DIRECTIONAL FORMATTING); NOTE: commonly abbreviated PDF; 
COMMENT: Format characters
U+202D (LEFT-TO-RIGHT OVERRIDE); NOTE: commonly abbreviated LRO; 
COMMENT: Format characters
U+202E (RIGHT-TO-LEFT OVERRIDE); BIDI: Right-to-Left Override; NOTE: 
commonly abbreviated RLO; COMMENT: Format characters
U+2060 (WORD JOINER); NOTE: commonly abbreviated WJ, a zero width 
non-breaking space (only), intended for disambiguation of functions for 
byte order mark; COMMENT: Format character
U+2061 (FUNCTION APPLICATION); NOTE: contiguity operator indicating 
application of a function; COMMENT: Invisible operators
U+2062 (INVISIBLE TIMES); NOTE: contiguity operator indicating 
multiplication; COMMENT: Invisible operators
U+2063 (INVISIBLE SEPARATOR [invisible comma]); NOTE: contiguity 
operator indicating that adjacent mathematical symbols form a list, e.g. 
when no visible comma is used between multiple indices; COMMENT: 
Invisible operators
U+2064 (INVISIBLE PLUS); NOTE: contiguity operator indicating addition; 
COMMENT: Invisible operators
U+206A (INHIBIT SYMMETRIC SWAPPING); COMMENT: Deprecated
U+206B (ACTIVATE SYMMETRIC SWAPPING); COMMENT: Deprecated
U+206C (INHIBIT ARABIC FORM SHAPING); COMMENT: Deprecated
U+206D (ACTIVATE ARABIC FORM SHAPING); COMMENT: Deprecated
U+206E (NATIONAL DIGIT SHAPES); COMMENT: Deprecated
U+206F (NOMINAL DIGIT SHAPES); COMMENT: Deprecated
U+FEFF (ZERO WIDTH NO-BREAK SPACE [BYTE ORDER MARK (BOM), ZWNBSP]); 
NOTE: may be used to detect byte order by contrast with the noncharacter 
code point U+FFFE; use as an indication of non-breaking is deprecated; 
see U+2060 instead; COMMENT: Special
U+FFF9 (INTERLINEAR ANNOTATION ANCHOR); NOTE: marks start of annotated 
text; COMMENT: Interlinear annotation.  Used internally for Japanese 
Ruby (furigana), etc.
U+FFFA (INTERLINEAR ANNOTATION SEPARATOR); NOTE: marks start of 
annotating character(s); COMMENT: Interlinear annotation.  Used 
internally for Japanese Ruby (furigana), etc.
U+FFFB (INTERLINEAR ANNOTATION TERMINATOR); NOTE: marks end of 
annotation block; COMMENT: Interlinear annotation.  Used internally for 
Japanese Ruby (furigana), etc.
U+1D173 (MUSICAL SYMBOL BEGIN BEAM); COMMENT: Beams and slurs
U+1D174 (MUSICAL SYMBOL END BEAM); COMMENT: Beams and slurs
U+1D175 (MUSICAL SYMBOL BEGIN TIE); COMMENT: Beams and slurs
U+1D176 (MUSICAL SYMBOL END TIE); COMMENT: Beams and slurs
U+1D177 (MUSICAL SYMBOL BEGIN SLUR); COMMENT: Beams and slurs
U+1D178 (MUSICAL SYMBOL END SLUR); COMMENT: Beams and slurs
U+1D179 (MUSICAL SYMBOL BEGIN PHRASE); COMMENT: Beams and slurs
U+1D17A (MUSICAL SYMBOL END PHRASE); COMMENT: Beams and slurs
U+E0001 (LANGUAGE TAG); COMMENT: Tag identifiers
U+E0020 (TAG SPACE); COMMENT: Tag components
U+E0021 (TAG EXCLAMATION MARK); COMMENT: Tag components
U+E0022 (TAG QUOTATION MARK); COMMENT: Tag components
U+E0023 (TAG NUMBER SIGN); COMMENT: Tag components
U+E0024 (TAG DOLLAR SIGN); COMMENT: Tag components
U+E0025 (TAG PERCENT SIGN); COMMENT: Tag components
U+E0026 (TAG AMPERSAND); COMMENT: Tag components
U+E0027 (TAG APOSTROPHE); COMMENT: Tag components
U+E0028 (TAG LEFT PARENTHESIS); COMMENT: Tag components
U+E0029 (TAG RIGHT PARENTHESIS); COMMENT: Tag components
U+E002A (TAG ASTERISK); COMMENT: Tag components
U+E002B (TAG PLUS SIGN); COMMENT: Tag components
U+E002C (TAG COMMA); COMMENT: Tag components
U+E002D (TAG HYPHEN-MINUS); COMMENT: Tag components
U+E002E (TAG FULL STOP); COMMENT: Tag components
U+E002F (TAG SOLIDUS); COMMENT: Tag components
U+E0030 (TAG DIGIT ZERO); COMMENT: Tag components
U+E0031 (TAG DIGIT ONE); COMMENT: Tag components
U+E0032 (TAG DIGIT TWO); COMMENT: Tag components
U+E0033 (TAG DIGIT THREE); COMMENT: Tag components
U+E0034 (TAG DIGIT FOUR); COMMENT: Tag components
U+E0035 (TAG DIGIT FIVE); COMMENT: Tag components
U+E0036 (TAG DIGIT SIX); COMMENT: Tag components
U+E0037 (TAG DIGIT SEVEN); COMMENT: Tag components
U+E0038 (TAG DIGIT EIGHT); COMMENT: Tag components
U+E0039 (TAG DIGIT NINE); COMMENT: Tag components
U+E003A (TAG COLON); COMMENT: Tag components
U+E003B (TAG SEMICOLON); COMMENT: Tag components
U+E003C (TAG LESS-THAN SIGN); COMMENT: Tag components
U+E003D (TAG EQUALS SIGN); COMMENT: Tag components
U+E003E (TAG GREATER-THAN SIGN); COMMENT: Tag components
U+E003F (TAG QUESTION MARK); COMMENT: Tag components
U+E0040 (TAG COMMERCIAL AT); COMMENT: Tag components
U+E0041 (TAG LATIN CAPITAL LETTER A); COMMENT: Tag components
U+E0042 (TAG LATIN CAPITAL LETTER B); COMMENT: Tag components
U+E0043 (TAG LATIN CAPITAL LETTER C); COMMENT: Tag components
U+E0044 (TAG LATIN CAPITAL LETTER D); COMMENT: Tag components
U+E0045 (TAG LATIN CAPITAL LETTER E); COMMENT: Tag components
U+E0046 (TAG LATIN CAPITAL LETTER F); COMMENT: Tag components
U+E0047 (TAG LATIN CAPITAL LETTER G); COMMENT: Tag components
U+E0048 (TAG LATIN CAPITAL LETTER H); COMMENT: Tag components
U+E0049 (TAG LATIN CAPITAL LETTER I); COMMENT: Tag components
U+E004A (TAG LATIN CAPITAL LETTER J); COMMENT: Tag components
U+E004B (TAG LATIN CAPITAL LETTER K); COMMENT: Tag components
U+E004C (TAG LATIN CAPITAL LETTER L); COMMENT: Tag components
U+E004D (TAG LATIN CAPITAL LETTER M); COMMENT: Tag components
U+E004E (TAG LATIN CAPITAL LETTER N); COMMENT: Tag components
U+E004F (TAG LATIN CAPITAL LETTER O); COMMENT: Tag components
U+E0050 (TAG LATIN CAPITAL LETTER P); COMMENT: Tag components
U+E0051 (TAG LATIN CAPITAL LETTER Q); COMMENT: Tag components
U+E0052 (TAG LATIN CAPITAL LETTER R); COMMENT: Tag components
U+E0053 (TAG LATIN CAPITAL LETTER S); COMMENT: Tag components
U+E0054 (TAG LATIN CAPITAL LETTER T); COMMENT: Tag components
U+E0055 (TAG LATIN CAPITAL LETTER U); COMMENT: Tag components
U+E0056 (TAG LATIN CAPITAL LETTER V); COMMENT: Tag components
U+E0057 (TAG LATIN CAPITAL LETTER W); COMMENT: Tag components
U+E0058 (TAG LATIN CAPITAL LETTER X); COMMENT: Tag components
U+E0059 (TAG LATIN CAPITAL LETTER Y); COMMENT: Tag components
U+E005A (TAG LATIN CAPITAL LETTER Z); COMMENT: Tag components
U+E005B (TAG LEFT SQUARE BRACKET); COMMENT: Tag components
U+E005C (TAG REVERSE SOLIDUS); COMMENT: Tag components
U+E005D (TAG RIGHT SQUARE BRACKET); COMMENT: Tag components
U+E005E (TAG CIRCUMFLEX ACCENT); COMMENT: Tag components
U+E005F (TAG LOW LINE); COMMENT: Tag components
U+E0060 (TAG GRAVE ACCENT); COMMENT: Tag components
U+E0061 (TAG LATIN SMALL LETTER A); COMMENT: Tag components
U+E0062 (TAG LATIN SMALL LETTER B); COMMENT: Tag components
U+E0063 (TAG LATIN SMALL LETTER C); COMMENT: Tag components
U+E0064 (TAG LATIN SMALL LETTER D); COMMENT: Tag components
U+E0065 (TAG LATIN SMALL LETTER E); COMMENT: Tag components
U+E0066 (TAG LATIN SMALL LETTER F); COMMENT: Tag components
U+E0067 (TAG LATIN SMALL LETTER G); COMMENT: Tag components
U+E0068 (TAG LATIN SMALL LETTER H); COMMENT: Tag components
U+E0069 (TAG LATIN SMALL LETTER I); COMMENT: Tag components
U+E006A (TAG LATIN SMALL LETTER J); COMMENT: Tag components
U+E006B (TAG LATIN SMALL LETTER K); COMMENT: Tag components
U+E006C (TAG LATIN SMALL LETTER L); COMMENT: Tag components
U+E006D (TAG LATIN SMALL LETTER M); COMMENT: Tag components
U+E006E (TAG LATIN SMALL LETTER N); COMMENT: Tag components
U+E006F (TAG LATIN SMALL LETTER O); COMMENT: Tag components
U+E0070 (TAG LATIN SMALL LETTER P); COMMENT: Tag components
U+E0071 (TAG LATIN SMALL LETTER Q); COMMENT: Tag components
U+E0072 (TAG LATIN SMALL LETTER R); COMMENT: Tag components
U+E0073 (TAG LATIN SMALL LETTER S); COMMENT: Tag components
U+E0074 (TAG LATIN SMALL LETTER T); COMMENT: Tag components
U+E0075 (TAG LATIN SMALL LETTER U); COMMENT: Tag components
U+E0076 (TAG LATIN SMALL LETTER V); COMMENT: Tag components
U+E0077 (TAG LATIN SMALL LETTER W); COMMENT: Tag components
U+E0078 (TAG LATIN SMALL LETTER X); COMMENT: Tag components
U+E0079 (TAG LATIN SMALL LETTER Y); COMMENT: Tag components
U+E007A (TAG LATIN SMALL LETTER Z); COMMENT: Tag components
U+E007B (TAG LEFT CURLY BRACKET); COMMENT: Tag components
U+E007C (TAG VERTICAL LINE); COMMENT: Tag components
U+E007D (TAG RIGHT CURLY BRACKET); COMMENT: Tag components
U+E007E (TAG TILDE); COMMENT: Tag components
U+E007F (CANCEL TAG); COMMENT: Tag components

