RFC: \p{Cntrl}
From: karl williamson
Date: June 8, 2009 12:39
Subject: RFC: \p{Cntrl}
Message ID: 4A2D68DD.2070801@khwilliamson.com
Guess what the major category C stands for in Unicode?
It stands for 'Other', and means characters (code points) that don't fit
into anything else.
Perl, on the other hand, thinks it stands for control, and so \p{Cntrl} in
Perl is a synonym for the assigned characters in \p{C}.
Unicode does have a category meaning control, and that is Cc. Starting
in release 4.1, they added 'cntrl' as a synonym for Cc. Because the
shortcut for \p{gc: cntrl} would be \p{cntrl}, this conflicts with the
Perl \p{cntrl}.
The bottom line is that in Perl cntrl is composed of all the assigned
code points that don't fit into anything else, namely: Cc, Co (private
use), Cs (surrogates), and Cf (format characters), whereas in Unicode it
is just Cc.
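To make the difference concrete, here is a minimal sketch that spells out both definitions with the individual subcategory properties, which should behave the same regardless of what a given perl makes \p{Cntrl} mean. The sub name cntrl_membership and the two pattern variables are made up for illustration.

```perl
use strict;
use warnings;

# The union described above: Cc, Cf, Co and Cs together.
my $described_perl_cntrl = qr/\A[\p{Cc}\p{Cf}\p{Co}\p{Cs}]\z/;

# The Unicode meaning of cntrl: just Cc.
my $unicode_cntrl = qr/\A\p{Cc}\z/;

# Hypothetical helper: returns (in_described_perl_set, in_unicode_set)
# as 1/0 flags for a single code point.
sub cntrl_membership {
    my ($cp) = @_;
    my $ch = chr $cp;
    return ( $ch =~ $described_perl_cntrl ? 1 : 0,
             $ch =~ $unicode_cntrl        ? 1 : 0 );
}

# BELL (Cc), SOFT HYPHEN (Cf), first private use code point (Co), 'A' (Lu)
for my $cp (0x07, 0xAD, 0xE000, 0x41) {
    my ($perl, $unicode) = cntrl_membership($cp);
    printf "U+%04X  described-perl=%d  unicode=%d\n", $cp, $perl, $unicode;
}
```

Only the first of these four code points is in both sets; the soft hyphen and the private use code point are in the larger set only, and 'A' is in neither.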
I don't know whether to change the Perl definition to match Unicode's or
not. One could argue that the surrogates and maybe the private use code
points have little or no utility in Perl, and so removing them from
Perl's definition wouldn't matter. But there are 139 format characters
in the Perl definition (the lowest one being the soft hyphen, U+00AD)
that aren't in the Unicode one. All but 36 of these characters are
either deprecated or strongly discouraged.
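One reading that reproduces the figure of 36 is to subtract the Tag characters and the six entries marked "Deprecated" in the list in this message. That reading is an assumption on my part, not something stated outright; the variable names are illustrative.

```perl
use strict;
use warnings;

my $total = 139;    # format characters listed in this message

# Tag characters: U+E0001 plus the run U+E0020..U+E007F
my $tag_chars = 1 + (0xE007F - 0xE0020 + 1);

# Entries whose comment is "Deprecated": U+206A..U+206F
my $deprecated = 0x206F - 0x206A + 1;

my $remaining = $total - $tag_chars - $deprecated;

printf "tags=%d deprecated=%d remaining=%d\n",
    $tag_chars, $deprecated, $remaining;
# prints "tags=97 deprecated=6 remaining=36"
```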
I don't know if it's worth bringing Perl's definition into line with
Unicode's or not. Based on a previous RFC, the Unicode definition will
be accessible through \p{gc: cntrl}. Does anyone have an opinion?
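As a rough illustration of the compound form, the sketch below asks for the Unicode definition explicitly. It assumes a perl recent enough to accept \p{gc=...} property names; the variable names are made up.

```perl
use strict;
use warnings;

# Assumption: this perl understands the compound gc=... property form.
my $bel = chr 0x07;    # BELL, General_Category Cc
my $shy = chr 0xAD;    # SOFT HYPHEN, General_Category Cf

my $bel_is_cntrl = $bel =~ /\A\p{gc=cntrl}\z/ ? 1 : 0;
my $shy_is_cntrl = $shy =~ /\A\p{gc=cntrl}\z/ ? 1 : 0;
my $shy_is_cf    = $shy =~ /\A\p{gc=Cf}\z/    ? 1 : 0;

print "BELL in gc=cntrl:        $bel_is_cntrl\n";
print "SOFT HYPHEN in gc=cntrl: $shy_is_cntrl\n";
print "SOFT HYPHEN in gc=Cf:    $shy_is_cf\n";
```

Written this way, the test does not depend on which meaning the bare \p{Cntrl} shortcut ends up having.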
Below is a list of the 139 format characters. Use of the Tag characters
is strongly discouraged:
Other, Format (139 entries)
U+00AD (SOFT HYPHEN [discretionary hyphen]); NOTE: commonly abbreviated
as SHY; COMMENT: Latin-1 punctuation and symbols. Based on ISO/IEC
8859-1 (aka Latin-1).
U+0600 (ARABIC NUMBER SIGN); COMMENT: Subtending marks
U+0601 (ARABIC SIGN SANAH); COMMENT: Subtending marks
U+0602 (ARABIC FOOTNOTE MARKER); COMMENT: Subtending marks
U+0603 (ARABIC SIGN SAFHA); COMMENT: Subtending marks
U+06DD (ARABIC END OF AYAH); COMMENT: Koranic annotation signs
U+070F (SYRIAC ABBREVIATION MARK [SAM]); NOTE: marks the beginning of a
Syriac abbreviation; COMMENT: Syriac format control character
U+17B4 (KHMER VOWEL INHERENT AQ); COMMENT: Inherent vowels. These are
for phonetic transcription to distinguish Indic language inherent vowels
from Khmer inherent vowels. These characters are included solely for
compatibility with particular applications; their use in other contexts
is discouraged.
U+17B5 (KHMER VOWEL INHERENT AA); COMMENT: Inherent vowels. These are
for phonetic transcription to distinguish Indic language inherent vowels
from Khmer inherent vowels. These characters are included solely for
compatibility with particular applications; their use in other contexts
is discouraged.
U+200B (ZERO WIDTH SPACE); NOTE: commonly abbreviated ZWSP, this
character is intended for line break control; it has no width, but its
presence between two characters does not prevent increased letter
spacing in justification; COMMENT: Spaces
U+200C (ZERO WIDTH NON-JOINER); NOTE: commonly abbreviated ZWNJ;
COMMENT: Format characters
U+200D (ZERO WIDTH JOINER); NOTE: commonly abbreviated ZWJ; COMMENT:
Format characters
U+200E (LEFT-TO-RIGHT MARK); NOTE: commonly abbreviated LRM; COMMENT:
Format characters
U+200F (RIGHT-TO-LEFT MARK); BIDI: Right-to-Left; NOTE: commonly
abbreviated RLM; COMMENT: Format characters
U+202A (LEFT-TO-RIGHT EMBEDDING); NOTE: commonly abbreviated LRE;
COMMENT: Format characters
U+202B (RIGHT-TO-LEFT EMBEDDING); BIDI: Right-to-Left Embedding; NOTE:
commonly abbreviated RLE; COMMENT: Format characters
U+202C (POP DIRECTIONAL FORMATTING); NOTE: commonly abbreviated PDF;
COMMENT: Format characters
U+202D (LEFT-TO-RIGHT OVERRIDE); NOTE: commonly abbreviated LRO;
COMMENT: Format characters
U+202E (RIGHT-TO-LEFT OVERRIDE); BIDI: Right-to-Left Override; NOTE:
commonly abbreviated RLO; COMMENT: Format characters
U+2060 (WORD JOINER); NOTE: commonly abbreviated WJ, a zero width
non-breaking space (only), intended for disambiguation of functions for
byte order mark; COMMENT: Format character
U+2061 (FUNCTION APPLICATION); NOTE: contiguity operator indicating
application of a function; COMMENT: Invisible operators
U+2062 (INVISIBLE TIMES); NOTE: contiguity operator indicating
multiplication; COMMENT: Invisible operators
U+2063 (INVISIBLE SEPARATOR [invisible comma]); NOTE: contiguity
operator indicating that adjacent mathematical symbols form a list, e.g.
when no visible comma is used between multiple indices; COMMENT:
Invisible operators
U+2064 (INVISIBLE PLUS); NOTE: contiguity operator indicating addition;
COMMENT: Invisible operators
U+206A (INHIBIT SYMMETRIC SWAPPING); COMMENT: Deprecated
U+206B (ACTIVATE SYMMETRIC SWAPPING); COMMENT: Deprecated
U+206C (INHIBIT ARABIC FORM SHAPING); COMMENT: Deprecated
U+206D (ACTIVATE ARABIC FORM SHAPING); COMMENT: Deprecated
U+206E (NATIONAL DIGIT SHAPES); COMMENT: Deprecated
U+206F (NOMINAL DIGIT SHAPES); COMMENT: Deprecated
U+FEFF (ZERO WIDTH NO-BREAK SPACE [BYTE ORDER MARK (BOM), ZWNBSP]);
NOTE: may be used to detect byte order by contrast with the noncharacter
code point FFFE; use as an indication of non-breaking is deprecated;
see U+2060 instead; COMMENT: Special
U+FFF9 (INTERLINEAR ANNOTATION ANCHOR); NOTE: marks start of annotated
text; COMMENT: Interlinear annotation. Used internally for Japanese
Ruby (furigana), etc.
U+FFFA (INTERLINEAR ANNOTATION SEPARATOR); NOTE: marks start of
annotating character(s); COMMENT: Interlinear annotation. Used
internally for Japanese Ruby (furigana), etc.
U+FFFB (INTERLINEAR ANNOTATION TERMINATOR); NOTE: marks end of
annotation block; COMMENT: Interlinear annotation. Used internally for
Japanese Ruby (furigana), etc.
U+1D173 (MUSICAL SYMBOL BEGIN BEAM); COMMENT: Beams and slurs
U+1D174 (MUSICAL SYMBOL END BEAM); COMMENT: Beams and slurs
U+1D175 (MUSICAL SYMBOL BEGIN TIE); COMMENT: Beams and slurs
U+1D176 (MUSICAL SYMBOL END TIE); COMMENT: Beams and slurs
U+1D177 (MUSICAL SYMBOL BEGIN SLUR); COMMENT: Beams and slurs
U+1D178 (MUSICAL SYMBOL END SLUR); COMMENT: Beams and slurs
U+1D179 (MUSICAL SYMBOL BEGIN PHRASE); COMMENT: Beams and slurs
U+1D17A (MUSICAL SYMBOL END PHRASE); COMMENT: Beams and slurs
U+E0001 (LANGUAGE TAG); COMMENT: Tag identifiers
U+E0020 (TAG SPACE); COMMENT: Tag components
U+E0021 (TAG EXCLAMATION MARK); COMMENT: Tag components
U+E0022 (TAG QUOTATION MARK); COMMENT: Tag components
U+E0023 (TAG NUMBER SIGN); COMMENT: Tag components
U+E0024 (TAG DOLLAR SIGN); COMMENT: Tag components
U+E0025 (TAG PERCENT SIGN); COMMENT: Tag components
U+E0026 (TAG AMPERSAND); COMMENT: Tag components
U+E0027 (TAG APOSTROPHE); COMMENT: Tag components
U+E0028 (TAG LEFT PARENTHESIS); COMMENT: Tag components
U+E0029 (TAG RIGHT PARENTHESIS); COMMENT: Tag components
U+E002A (TAG ASTERISK); COMMENT: Tag components
U+E002B (TAG PLUS SIGN); COMMENT: Tag components
U+E002C (TAG COMMA); COMMENT: Tag components
U+E002D (TAG HYPHEN-MINUS); COMMENT: Tag components
U+E002E (TAG FULL STOP); COMMENT: Tag components
U+E002F (TAG SOLIDUS); COMMENT: Tag components
U+E0030 (TAG DIGIT ZERO); COMMENT: Tag components
U+E0031 (TAG DIGIT ONE); COMMENT: Tag components
U+E0032 (TAG DIGIT TWO); COMMENT: Tag components
U+E0033 (TAG DIGIT THREE); COMMENT: Tag components
U+E0034 (TAG DIGIT FOUR); COMMENT: Tag components
U+E0035 (TAG DIGIT FIVE); COMMENT: Tag components
U+E0036 (TAG DIGIT SIX); COMMENT: Tag components
U+E0037 (TAG DIGIT SEVEN); COMMENT: Tag components
U+E0038 (TAG DIGIT EIGHT); COMMENT: Tag components
U+E0039 (TAG DIGIT NINE); COMMENT: Tag components
U+E003A (TAG COLON); COMMENT: Tag components
U+E003B (TAG SEMICOLON); COMMENT: Tag components
U+E003C (TAG LESS-THAN SIGN); COMMENT: Tag components
U+E003D (TAG EQUALS SIGN); COMMENT: Tag components
U+E003E (TAG GREATER-THAN SIGN); COMMENT: Tag components
U+E003F (TAG QUESTION MARK); COMMENT: Tag components
U+E0040 (TAG COMMERCIAL AT); COMMENT: Tag components
U+E0041 (TAG LATIN CAPITAL LETTER A); COMMENT: Tag components
U+E0042 (TAG LATIN CAPITAL LETTER B); COMMENT: Tag components
U+E0043 (TAG LATIN CAPITAL LETTER C); COMMENT: Tag components
U+E0044 (TAG LATIN CAPITAL LETTER D); COMMENT: Tag components
U+E0045 (TAG LATIN CAPITAL LETTER E); COMMENT: Tag components
U+E0046 (TAG LATIN CAPITAL LETTER F); COMMENT: Tag components
U+E0047 (TAG LATIN CAPITAL LETTER G); COMMENT: Tag components
U+E0048 (TAG LATIN CAPITAL LETTER H); COMMENT: Tag components
U+E0049 (TAG LATIN CAPITAL LETTER I); COMMENT: Tag components
U+E004A (TAG LATIN CAPITAL LETTER J); COMMENT: Tag components
U+E004B (TAG LATIN CAPITAL LETTER K); COMMENT: Tag components
U+E004C (TAG LATIN CAPITAL LETTER L); COMMENT: Tag components
U+E004D (TAG LATIN CAPITAL LETTER M); COMMENT: Tag components
U+E004E (TAG LATIN CAPITAL LETTER N); COMMENT: Tag components
U+E004F (TAG LATIN CAPITAL LETTER O); COMMENT: Tag components
U+E0050 (TAG LATIN CAPITAL LETTER P); COMMENT: Tag components
U+E0051 (TAG LATIN CAPITAL LETTER Q); COMMENT: Tag components
U+E0052 (TAG LATIN CAPITAL LETTER R); COMMENT: Tag components
U+E0053 (TAG LATIN CAPITAL LETTER S); COMMENT: Tag components
U+E0054 (TAG LATIN CAPITAL LETTER T); COMMENT: Tag components
U+E0055 (TAG LATIN CAPITAL LETTER U); COMMENT: Tag components
U+E0056 (TAG LATIN CAPITAL LETTER V); COMMENT: Tag components
U+E0057 (TAG LATIN CAPITAL LETTER W); COMMENT: Tag components
U+E0058 (TAG LATIN CAPITAL LETTER X); COMMENT: Tag components
U+E0059 (TAG LATIN CAPITAL LETTER Y); COMMENT: Tag components
U+E005A (TAG LATIN CAPITAL LETTER Z); COMMENT: Tag components
U+E005B (TAG LEFT SQUARE BRACKET); COMMENT: Tag components
U+E005C (TAG REVERSE SOLIDUS); COMMENT: Tag components
U+E005D (TAG RIGHT SQUARE BRACKET); COMMENT: Tag components
U+E005E (TAG CIRCUMFLEX ACCENT); COMMENT: Tag components
U+E005F (TAG LOW LINE); COMMENT: Tag components
U+E0060 (TAG GRAVE ACCENT); COMMENT: Tag components
U+E0061 (TAG LATIN SMALL LETTER A); COMMENT: Tag components
U+E0062 (TAG LATIN SMALL LETTER B); COMMENT: Tag components
U+E0063 (TAG LATIN SMALL LETTER C); COMMENT: Tag components
U+E0064 (TAG LATIN SMALL LETTER D); COMMENT: Tag components
U+E0065 (TAG LATIN SMALL LETTER E); COMMENT: Tag components
U+E0066 (TAG LATIN SMALL LETTER F); COMMENT: Tag components
U+E0067 (TAG LATIN SMALL LETTER G); COMMENT: Tag components
U+E0068 (TAG LATIN SMALL LETTER H); COMMENT: Tag components
U+E0069 (TAG LATIN SMALL LETTER I); COMMENT: Tag components
U+E006A (TAG LATIN SMALL LETTER J); COMMENT: Tag components
U+E006B (TAG LATIN SMALL LETTER K); COMMENT: Tag components
U+E006C (TAG LATIN SMALL LETTER L); COMMENT: Tag components
U+E006D (TAG LATIN SMALL LETTER M); COMMENT: Tag components
U+E006E (TAG LATIN SMALL LETTER N); COMMENT: Tag components
U+E006F (TAG LATIN SMALL LETTER O); COMMENT: Tag components
U+E0070 (TAG LATIN SMALL LETTER P); COMMENT: Tag components
U+E0071 (TAG LATIN SMALL LETTER Q); COMMENT: Tag components
U+E0072 (TAG LATIN SMALL LETTER R); COMMENT: Tag components
U+E0073 (TAG LATIN SMALL LETTER S); COMMENT: Tag components
U+E0074 (TAG LATIN SMALL LETTER T); COMMENT: Tag components
U+E0075 (TAG LATIN SMALL LETTER U); COMMENT: Tag components
U+E0076 (TAG LATIN SMALL LETTER V); COMMENT: Tag components
U+E0077 (TAG LATIN SMALL LETTER W); COMMENT: Tag components
U+E0078 (TAG LATIN SMALL LETTER X); COMMENT: Tag components
U+E0079 (TAG LATIN SMALL LETTER Y); COMMENT: Tag components
U+E007A (TAG LATIN SMALL LETTER Z); COMMENT: Tag components
U+E007B (TAG LEFT CURLY BRACKET); COMMENT: Tag components
U+E007C (TAG VERTICAL LINE); COMMENT: Tag components
U+E007D (TAG RIGHT CURLY BRACKET); COMMENT: Tag components
U+E007E (TAG TILDE); COMMENT: Tag components
U+E007F (CANCEL TAG); COMMENT: Tag components
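A list like the one above can be regenerated, minus the annotations, by walking the whole code space and testing each code point against \p{Cf}. This is only a sketch; the sub name format_code_points is made up, and the count it reports depends on the Unicode version the running perl was built with, so it need not come out at 139.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every code point whose General_Category
# is Cf according to the running perl's Unicode tables.
sub format_code_points {
    my @found;
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # surrogates are Cs, never Cf
        push @found, $cp if chr($cp) =~ /\A\p{Cf}\z/;
    }
    return @found;
}

my @cf = format_code_points();
printf "%d format characters; lowest is U+%04X\n", scalar @cf, $cf[0];
printf "U+%04X\n", $_ for @cf;
```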