Front page | perl.perl5.porters |
Postings from November 2009
PATCH #69018; revamped mktables
From:
karl williamson
Date:
November 21, 2009 23:07
Subject:
PATCH #69018; revamped mktables
Message ID:
4B08E32E.6060007@khwilliamson.com
A revised mktables is available at
git://github.com/khwilliamson/perl.git (the branch is called mktables).
It fixes the minor bug #69018, concerning accepting the erroneous
\p{Script=InGreek}, and perhaps other bugs; I need to look. It also
fixes a number of things for which I have not bothered to write bug
reports; many of them have been aired on the p5p list over the last
several months.
SKIP: personal_narrative {
This is close to a complete rewrite of mktables. I did not set out to
do this, but as the work progressed, I discovered more and more things
wrong. Having really looked into the Unicode history now, it appears to
me that when the original mktables was written, it was not clear what
direction Unicode would go in, and it went in a different direction than
anticipated. It took me quite a while to understand the distinction
between some data structures that were muddled. After that, the code I
was writing got clearer. I learned a lot about Unicode; at some point I
came to the shocking realization, when wondering why in tarnation did
the code do that?, that I had come to know more about the Unicode
standard than some of the patchers did. The old version's tables are
mostly correct, but there are a number of problems with them, some
subtle, some not so. The combining class table is so wrong that it
could easily be the butt of jokes; I've thought of a few myself.
Another problem was that many of the newer Unicode tables are unreadable
by the old mktables without extensive munging of them.
} end SKIP
SKIP: design goals {
I eventually gave up trying to fit into the existing mktables, and just
rewrote things. I've tried to make this work on auto pilot, so that new
Unicode releases will require a minimum of fuss. pod and .t files are
generated from the data, so that other things don't have to be patched
to keep up. I've also added much more input validation, so that if a
new enum value is added to a field, we will know, instead of blindly
ignoring it. There are more goals, but it's getting late, and hard for
me to think.
} end SKIP
The last part of this email includes text about the changes, intended
for perldelta. I'm not sure who's supposed to patch that. In addition
to those, here are the other things that have changed, but aren't
notable enough to mention externally. Hence, the most important changes
are after the line of ### in this email.
I've changed the main Makefile to call this differently (besides the
parameters telling it where to put the pod and test files). mktables
need not actually run very often. The inputs are pretty static, just
like Encode's. And the dependency is on far more files than Makefile
knows about. This patch fixes the bug in mktables wherein it wrongly
calculated whether it should run or not, so I've removed the -w option
from the call. When called without -w, mktables will check and do
nothing, quickly, if nothing is out-of-date. If people don't trust that, we
could change it so the apparent critical dependencies are still known to
Makefile, and to use the -w option to force mktables to run when those
dependencies trigger it, and then to have an unconditional call to
mktables as well, without the -w, so that it can check its own
dependency list. That said, mktables is currently running too often;
there is something in Makefile that is removing some of these output
files when I don't think it should; I haven't had a chance to
investigate this.
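As an illustration of the kind of out-of-date check described above,
here is a minimal sketch. It is not the code in mktables; the function
name needs_rebuild and the plain mtime comparison are assumptions made
only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from mktables: returns true if any
# output file is missing or older than the newest input file.
sub needs_rebuild {
    my ($inputs, $outputs) = @_;

    my $newest_input = 0;
    for my $file (@$inputs) {
        my $mtime = (stat $file)[9];
        return 1 unless defined $mtime;    # missing input: let the run report it
        $newest_input = $mtime if $mtime > $newest_input;
    }

    for my $file (@$outputs) {
        my $mtime = (stat $file)[9];
        return 1 if !defined $mtime || $mtime < $newest_input;
    }

    return 0;    # everything is up to date; nothing to do
}
```

A caller would pass the full list of input files, including the ones
Makefile does not know about, and skip all work when the result is
false.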
The files that are generated for case mapping and folding continue to
have two parts, the regular part and a hash for special cases. It turns
out that a number of the special case entries could be handled just as
well using the regular method; so they have been moved there.
All duplicate files have been eliminated. That means that if two
properties match the same exact set of code points, one file serves
both. This was not so much to save disk space, as to save memory, as
the same swash can now serve multiple properties.
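The idea of one file serving several properties can be sketched as
follows. This is only an illustration, not the mktables
implementation; the property names, ranges, and file names are made up
for the example.

```perl
use strict;
use warnings;

# Hypothetical data: property name => list of [start, end] ranges.
my %ranges_for = (
    Alpha_demo      => [ [ 0x41, 0x5A ], [ 0x61, 0x7A ] ],
    Alphabetic_demo => [ [ 0x41, 0x5A ], [ 0x61, 0x7A ] ],
    Digit_demo      => [ [ 0x30, 0x39 ] ],
);

my (%file_for_content, %file_for_property);
for my $property (sort keys %ranges_for) {

    # Serialize the ranges into a canonical string usable as a hash key.
    my $content = join "\n",
        map { sprintf "%04X\t%04X", @$_ } @{ $ranges_for{$property} };

    # The first property with this content names the file; any later
    # property with identical content reuses that file.
    $file_for_content{$content} //= "$property.pl";
    $file_for_property{$property} = $file_for_content{$content};
}

printf "%-16s => %s\n", $_, $file_for_property{$_}
    for sort keys %file_for_property;
```

Here the two properties with identical ranges end up pointing at the
same file, so only two files would be written for three properties.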
There are new options to mktables:
-globlist is used to attempt to process all .txt files in the
directory structure. The ones it doesn't know how to handle are
processed assuming that they follow the typical .txt syntax.
-P dir tells mktables to create perluniprops.pod in dir. Makefile
has been changed so this goes in the standard pod directory.
-T path tells mktables to create a .t file as 'path'. Makefile has
been changed so this goes into t/re/uniprops.t
-p tells mktables to give progress information as it works.
-c tells mktables to not output range counts in the .pl files
it generates. These are by default output as comments; I have found
them helpful for debugging, and they don't add much disk space.
Canonical.pl and Exact.pl have been replaced by Heavy.pl, which allows
for more straightforward code in utf8_heavy.pl.
There are several new features which lay the groundwork for fixing
charnames to know about all code points and named sequences; \X to match
more correctly; and to allow other tables such as To/Digit.pl to be read
by the Perl core.
I removed a test from re/pat_advanced which relied on the old erroneous
definition of \w, which included superscripts as part of a word; and
changed another test in regexp_unicode_prop.t, again because a property
definition changed.
I have compared the outputs of this version and the previous and am
confident that all the differences are correct.
I tried to be scrupulous about using File::Spec, but tested this only on
Linux and Windows boxes, so there may be mistakes that should be smoked out.
I tried running perlcritic on this, but it crashed, apparently at an
innocuous place. I did mostly use the Perl Best Practices.
The rest of this is text intended to be suitable for perldelta. NOTE
that this includes some anticipated documentation changes that haven't
been submitted yet.
######################################################
Perl can now handle every Unicode character property. A new pod,
perluniprops, lists all available non-Unihan character properties. By
default the Unihan properties and certain others (deprecated and Unicode
internal-only ones) are not exposed. See below for more details on
these; there is also a section in the pod listing them, and why they are
not exposed.
Perl now fully supports the Unicode compound-style of using '=' and ':'
in writing regular expressions: \p{property=value} and
\p{property:value} (both of which mean the same thing).
Perl now supports fully the Unicode loose matching rules for text
between the braces in \p{...} constructs. In addition, Perl also allows
underscores between digits of numbers.
All the Unicode-defined synonyms for properties and property values are
now accepted.
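A small sketch of the compound forms, loose matching, and synonyms
described above. The choice of Script=Greek and of the test character
is just for illustration.

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}";    # GREEK SMALL LETTER ALPHA

my @forms = (
    '\p{Script=Greek}',     # compound form with '='
    '\p{Script:Greek}',     # compound form with ':'
    '\p{script=greek}',     # loose matching: case is ignored
    '\p{Sc=Grek}',          # Unicode short synonyms for both parts
);

my %matched;
for my $form (@forms) {
    $matched{$form} = $alpha =~ /$form/ ? 1 : 0;
    print "$form: ", ( $matched{$form} ? "matches" : "does not match" ), "\n";
}
```

All four spellings are expected to name the same property value, so
each should match the Greek letter and none should match a Latin one.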
\p{...} matches using the Canonical_Combining_Class property were
completely broken in previous Perls. This is now fixed.
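For example, the following sketch checks a couple of combining marks
against their combining classes. The characters chosen are only
illustrations; both the value name and the numeric class are used
because Unicode lists each as an alias of the other.

```perl
use strict;
use warnings;

my $acute     = "\x{301}";    # COMBINING ACUTE ACCENT, ccc = 230 (Above)
my $dot_below = "\x{323}";    # COMBINING DOT BELOW,    ccc = 220 (Below)

my $acute_above_by_name   = $acute     =~ /\p{Canonical_Combining_Class=Above}/ ? 1 : 0;
my $acute_above_by_number = $acute     =~ /\p{ccc=230}/                         ? 1 : 0;
my $dot_is_above          = $dot_below =~ /\p{ccc=Above}/                       ? 1 : 0;
my $dot_is_below          = $dot_below =~ /\p{ccc=Below}/                       ? 1 : 0;
my $letter_is_class_zero  = 'a'        =~ /\p{ccc=0}/                           ? 1 : 0;

print "acute is Above (by name):   $acute_above_by_name\n";
print "acute is Above (by number): $acute_above_by_number\n";
print "dot below is Above:         $dot_is_above\n";
print "dot below is Below:         $dot_is_below\n";
print "'a' has class 0:            $letter_is_class_zero\n";
```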
In previous Perls, the Unicode Decomposition_Type=Compat property and a
Perl extension had the same name, which led to neither matching all the
correct values (with more than 100 mistakes in one, and several thousand
in the other). The Perl extension has now been renamed to be
Decomposition_Type=Noncanonical (short: dt=noncanon). It has the same
meaning as was previously intended, namely the union of all the
non-canonical Decomposition types, with Unicode Compat being just one of
those.
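The difference between the two properties can be seen with a sketch
like this one. The sample characters are chosen for illustration: one
has a <compat> decomposition, one a <super> decomposition, and one a
canonical decomposition.

```perl
use strict;
use warnings;

my %sample = (
    'U+FB01 fi ligature (<compat>)'    => "\x{FB01}",
    'U+00B2 superscript two (<super>)' => "\x{B2}",
    'U+00E9 e acute (canonical)'       => "\x{E9}",
);

my %result;
for my $label ( sort keys %sample ) {
    my $char = $sample{$label};
    $result{$label} = {
        compat    => ( $char =~ /\p{Decomposition_Type=Compat}/       ? 1 : 0 ),
        noncanon  => ( $char =~ /\p{Decomposition_Type=Noncanonical}/ ? 1 : 0 ),
        canonical => ( $char =~ /\p{dt=canonical}/                    ? 1 : 0 ),
    };
    printf "%-34s compat=%d noncanon=%d canonical=%d\n", $label,
        @{ $result{$label} }{qw(compat noncanon canonical)};
}
```

The superscript is the interesting case: it is non-canonical, but its
Unicode decomposition type is Super rather than Compat, so only the
renamed Perl extension should match it.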
\p{Uppercase} and \p{Lowercase} have been brought into line with the
Unicode definitions. This means they each match a few more characters
than previously.
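As an illustration, the Roman numerals are among the characters that
Unicode defines as Uppercase or Lowercase even though their General
Category is not Lu or Ll. The sample characters below are my choice
for the sketch.

```perl
use strict;
use warnings;

my $roman_upper = "\x{2160}";    # ROMAN NUMERAL ONE (gc=Nl)
my $roman_lower = "\x{2170}";    # SMALL ROMAN NUMERAL ONE (gc=Nl)

my $upper_by_property = $roman_upper =~ /\p{Uppercase}/ ? 1 : 0;
my $upper_by_category = $roman_upper =~ /\p{Lu}/        ? 1 : 0;
my $lower_by_property = $roman_lower =~ /\p{Lowercase}/ ? 1 : 0;
my $lower_by_category = $roman_lower =~ /\p{Ll}/        ? 1 : 0;

print "U+2160: Uppercase=$upper_by_property Lu=$upper_by_category\n";
print "U+2170: Lowercase=$lower_by_property Ll=$lower_by_category\n";
```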
\p{Cntrl} now matches the same characters as \p{Control}. This means it
no longer will match Private Use (gc=co), Surrogates (gc=cs), nor Format
(gc=cf) code points. The Format code points represent the biggest
possible problem. All but 36 of them are either officially deprecated
or strongly discouraged from being used. Of those 36, likely the most
widely used are the soft hyphen (U+00AD), and BOM, ZWSP, ZWNJ, WJ, and
similar, plus Bi-directional controls.
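A short sketch of the new behavior, using a few sample code points
chosen for illustration:

```perl
use strict;
use warnings;

my %sample = (
    'TAB (gc=Cc)'                => "\t",
    'SOFT HYPHEN (gc=Cf)'        => "\x{AD}",
    'ZWSP U+200B (gc=Cf)'        => "\x{200B}",
    'U+E000 private use (gc=Co)' => "\x{E000}",
);

my ( %is_cntrl, %is_control );
for my $label ( sort keys %sample ) {
    $is_cntrl{$label}   = $sample{$label} =~ /\p{Cntrl}/   ? 1 : 0;
    $is_control{$label} = $sample{$label} =~ /\p{Control}/ ? 1 : 0;
    printf "%-28s Cntrl=%d Control=%d\n", $label,
        $is_cntrl{$label}, $is_control{$label};
}
```

Code that relied on \p{Cntrl} to catch the Format characters would
need to test for \p{Format} (gc=cf) explicitly.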
\p{Alpha} now matches the same characters as \p{Alphabetic}. The Perl
definition included a number of things that aren't really alpha (all
marks), while omitting many that were. The Unicode definition is
clearly better, so we are switching to it. As a direct consequence, the
definitions of \p{Alnum} and \p{Word} which depend on Alpha also change.
\p{Word} also now doesn't match certain characters it wasn't supposed
to, such as fractions.
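The following sketch compares \p{Alpha} with \p{Alphabetic} on a few
sample characters, and checks a fraction against \p{Word}. The samples
are my own choice and are not an exhaustive test.

```perl
use strict;
use warnings;

my %sample = (
    'LATIN SMALL LETTER A'    => 'a',
    'DIGIT ONE'               => '1',
    'ROMAN NUMERAL ONE'       => "\x{2160}",
    'COMBINING ACUTE ACCENT'  => "\x{301}",
    'VULGAR FRACTION ONE HALF' => "\x{BD}",
);

my ( %alpha, %alphabetic, %word );
for my $label ( sort keys %sample ) {
    my $char = $sample{$label};
    $alpha{$label}      = $char =~ /\p{Alpha}/      ? 1 : 0;
    $alphabetic{$label} = $char =~ /\p{Alphabetic}/ ? 1 : 0;
    $word{$label}       = $char =~ /\p{Word}/       ? 1 : 0;
    printf "%-26s Alpha=%d Alphabetic=%d Word=%d\n", $label,
        $alpha{$label}, $alphabetic{$label}, $word{$label};
}
```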
\p{Print} no longer matches the line control characters: tab, lf, cr,
ff, vt, and nel. This brings it in line with the documentation.
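A quick check of the line control characters listed above, written as
a sketch:

```perl
use strict;
use warnings;

my %control = (
    tab => "\t",
    lf  => "\n",
    cr  => "\r",
    ff  => "\f",
    vt  => "\x{0B}",
    nel => "\x{85}",
);

my %is_print;
for my $name ( sort keys %control ) {
    $is_print{$name} = $control{$name} =~ /\p{Print}/ ? 1 : 0;
    print "$name: Print=$is_print{$name}\n";
}

# For contrast, an ordinary space and a letter are printable.
my $space_is_print  = ' ' =~ /\p{Print}/ ? 1 : 0;
my $letter_is_print = 'a' =~ /\p{Print}/ ? 1 : 0;
print "space: Print=$space_is_print\n";
print "a: Print=$letter_is_print\n";
```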
\p{Decomposition_Type=Canonical} now includes the Hangul syllables.
The Numeric type property has been extended to include the Unihan
characters.
There is a new Perl extension, the 'Present_In', or simply 'In',
property. It is an extension of the Unicode Age property: \p{In=5.0}
matches any code point whose usage had been determined as of Unicode
version 5.0, whereas \p{Age=5.0} matches only code points added in 5.0.
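The distinction can be sketched with three sample code points, chosen
here for illustration by the Unicode version that introduced them. The
last one assumes a Perl whose Unicode data is at least version 6.1;
on an older Perl it is simply unassigned, which gives the same result.

```perl
use strict;
use warnings;

my %sample = (
    'U+0041 (added in 1.1)'  => "A",
    'U+07C0 (added in 5.0)'  => "\x{7C0}",      # NKO DIGIT ZERO
    'U+1F600 (added in 6.1)' => "\x{1F600}",    # GRINNING FACE
);

my %result;
for my $label ( sort keys %sample ) {
    my $char = $sample{$label};
    $result{$label} = {
        in  => ( $char =~ /\p{In=5.0}/  ? 1 : 0 ),
        age => ( $char =~ /\p{Age=5.0}/ ? 1 : 0 ),
    };
    printf "%-24s In=5.0: %d  Age=5.0: %d\n", $label,
        $result{$label}{in}, $result{$label}{age};
}
```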
A number of properties did not have the correct values for unassigned
code points. This is now fixed. The affected properties are
Bidi_Class, East_Asian_Width, Joining_Type, Decomposition_Type,
Hangul_Syllable_Type, Numeric_Type, and Line_Break.
The Default_Ignorable_Code_Point, ID_Continue, and ID_Start properties
have been updated to their current definitions.
Certain properties that are supposed to be Unicode internal-only were
erroneously exposed by previous Perls. Use of these in regular
expressions will now generate a deprecated warning message, if those
warnings are enabled. The properties are: Other_Alphabetic,
Other_Default_Ignorable_Code_Point, Other_Grapheme_Extend,
Other_ID_Continue, Other_ID_Start, Other_Lowercase, Other_Math, and
Other_Uppercase.
An installation can now fairly easily change Perl to operate on any
Unicode release. Perl is shipped with the latest official release, but
an installation can now download any prior release, and Perl will work
with that. Instructions are in perlunicode.pod.
An installation can now fairly easily change which Unicode properties
Perl understands. As mentioned above, certain properties are by default
turned off. These include all the Unihan properties (which should be
accessible via the CPAN module Unicode::Unihan) and any deprecated or
Unicode internal-only property that Perl has never exposed.
The files in the To directory are now more clearly marked as being
stable, directly usable by applications. New hash entries in them give
the format of the normal entries which allows for easier machine
parsing. Perl can generate files in this directory for any property,
though most are suppressed. An installation can choose to change which
get written. Instructions are in perluniprops.pod.