Front page | perl.perl5.porters |
Postings from July 2005
RFC: index core pods with X<>
From: Ivan Tubert-Brohman
July 26, 2005 11:52
Message ID: 42E6867F.email@example.com
Let's use the X<> POD formatting code for indexing the Perl core
documentation. This will allow easier searching of the documentation,
and can be incorporated into tools like perldoc or third-party websites.
One common problem with using the perl docs is figuring out where to
look for something (in perlop, perlre, perlfunc...?).
It is easy to search for a function in perlfunc using perldoc -f, or to
search questions from the faq using perldoc -q. But if one wants to
search for variables, operators, random topics, etc. there's really no
easy way to do it (there is perltoc.pod, but it's huge and usually not
very helpful for this purpose).
Some websites such as search.cpan.org and perldoc.perl.org provide
search engines for the perl documentation; however, search engines tend
to have a couple of limitations:
1) they usually don't handle "line noise" ;-) very well. That is,
searching for things like $*.
2) even if they are able to search for such things, they tend to return
too many results and have no way of telling which place is the best
"canonical definition" for a term.
This document proposes addressing this problem with a hand-curated index
to the perl documentation.
The little-known (or at least little-used) X<> code is described in perlpod:
"X<topic name>" -- an index entry
This is ignored by most formatters, but some may use it for
building indexes. It always renders as empty-string. Example:
"X<absolutizing relative URLs>"
Currently it is used in only *one* place in the perl documentation:
pod/perlfunc.pod uses it for the "-X" filetest operators.
Recently I proposed a patch to Pod::Perldoc to allow users to search for
perl variables in perlvar with perldoc -a, similar to the use of perldoc
-f for functions. Now I have a more ambitious plan: allowing users to
search for arbitrary keywords, with, let's say, perldoc -k. The keywords
would come from an index created by extracting all X<> terms from the
documentation.
The details of how perldoc would handle the proposed -k switch can be
discussed later. This document will focus on the basic infrastructure;
that is, the conventions for the use of X<> in the documentation. The
reason for keeping the discussion separate is that the two problems are
largely independent; it is expected that other documentation tools and
websites can also benefit from using the X<> data.
I. Placement of the X<> entries
First, a definition. By "scope", I mean the part of the document that is
deemed relevant to an index entry, and that may be extracted and shown
in isolation by a processing or display tool. For example, perldoc -f
considers the scope of a function to end at the beginning of the next
=item, or at the end of the enclosing =over.
The X<> entries should be added at the end of a command or textblock
paragraph (verbatim paragraphs are excluded). The scope of the index
entry starts at the beginning of the paragraph to which it was attached;
the end of the scope depends on the paragraph type:
1) if the X<> is at the end of a textblock, the scope is that paragraph
and zero or more verbatim paragraphs immediately following it.
2) if the X<> is at the end of a command paragraph, it depends on the
type of command:
* =head1, =head2, etc.: The scope ends right before the next
heading with equal or higher level. That is, a =head1 ends
at the next =head1, and a =head2 ends at the next =head2 or
=head1.
* =item: the scope ends right before the next =item, or the =back
that terminates the containing list. Note: "empty" items are
not counted for terminating scopes, to allow for cases where
multiple =items head a block of text. For example,

    =item function X<function>

    =item otherfunction X<otherfunction>

    C<function> and C<otherfunction> do the same thing,
    even if they have different names...

    =item lemonade

Here the scope of the X<function> and X<otherfunction> entries starts
with "=item function", and ends right before "=item lemonade".
3) other command paragraphs, such as =back, =over, =begin, =end, and
=for should not be used for attaching X<> entries.
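
The scope rules above can be sketched in a few lines. This is only an illustration, written in Python for brevity, and is not part of any existing tool. The helper names (split_paragraphs, kind, scope) are made up for this sketch, and it assumes that paragraphs are separated by blank lines and that =over/=back pairs are properly nested.

```python
import re

def split_paragraphs(pod):
    """Split POD text into paragraphs, which are separated by blank lines."""
    return [p for p in re.split(r"\n(?:[ \t]*\n)+", pod.strip("\n")) if p.strip()]

def kind(par):
    """Classify a paragraph: a command name, 'verbatim', or 'text'."""
    if par.startswith("="):
        return par.split()[0][1:]          # "head1", "item", "over", "back", ...
    if par[0] in " \t":
        return "verbatim"
    return "text"

def head_level(par):
    """Return N for a =headN paragraph, or None for anything else."""
    m = re.fullmatch(r"head(\d)", kind(par))
    return int(m.group(1)) if m else None

def scope(pars, i):
    """Return (start, end) paragraph indexes (end exclusive) for an X<>
    entry attached to paragraph i, following the rules proposed above."""
    n, k = len(pars), kind(pars[i])
    if k == "text":
        # Rule 1: the textblock plus any verbatim paragraphs right after it.
        end = i + 1
        while end < n and kind(pars[end]) == "verbatim":
            end += 1
        return i, end
    level = head_level(pars[i])
    if level is not None:
        # Rule 2a: up to the next heading of equal or higher level.
        end = i + 1
        while end < n and not (head_level(pars[end]) or 99) <= level:
            end += 1
        return i, end
    if k == "item":
        # Rule 2b: a run of adjacent =items shares one scope ("empty" items).
        start = i
        while start > 0 and kind(pars[start - 1]) == "item":
            start -= 1
        end = i + 1
        while end < n and kind(pars[end]) == "item":
            end += 1
        depth = 0                           # nesting of inner =over lists
        while end < n:
            ke = kind(pars[end])
            if ke == "over":
                depth += 1
            elif ke == "back":
                if depth == 0:
                    break
                depth -= 1
            elif ke == "item" and depth == 0:
                break
            end += 1
        return start, end
    # Rule 3: X<> should not be attached to other command paragraphs.
    raise ValueError("no scope defined for =" + k)

POD = """=over

=item function X<function>

=item otherfunction X<otherfunction>

C<function> and C<otherfunction> do the same thing,
even if they have different names...

=item lemonade

Something else.

=back
"""

pars = split_paragraphs(POD)
print(scope(pars, 1), scope(pars, 2))   # both entries share one scope
```

Both X<function> and X<otherfunction> get the same scope here, starting at "=item function" and ending right before "=item lemonade", as described above.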
II. Content of the X<> entries
* It should contain plain text without further formatting codes (with
the possible exception of E<>).
* It should be considered case-insensitive.
* Non-word characters are allowed, so one can list things like operators
and special variables.
* Use of synonyms is encouraged, to make things easier to find.
* To be consistent, words should be normalized to the singular whenever
possible. For example, use X<operator> instead of X<operators>.
* The use of a comma in an index entry has a special meaning: it
separates levels of hierarchy (or namespaces), as a way of classifying
entries in more specific ways. For example, "X<operator, logical>", or
"X<operator, logical, xor>". This information may be used by processing
programs to arrange the entries, or for listing results when a user
searches for a namespace that contains several entries.
* There's no limitation as to the number of times that a given entry can
appear in a document or collection of documents. That is, it is not an
error to have X<whatever> appear twice in the same file.
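
As a rough illustration of how a processing program might apply these conventions, the sketch below (in Python, purely as an example) folds case and splits an entry on commas into hierarchy levels. The function names parse_entry and by_namespace are invented here. Normalizing to the singular is left to whoever writes the entries, and the handling of an entry that consists only of commas is an assumption, since the conventions above do not cover it.

```python
from collections import defaultdict

def parse_entry(text):
    """Turn the text of an X<> entry into a tuple of hierarchy levels.

    Entries are case-insensitive, and a comma separates levels, so
    "Operator, Logical, XOR" becomes ("operator", "logical", "xor").
    """
    parts = tuple(p.strip().casefold() for p in text.split(",") if p.strip())
    # Assumption: an entry made only of commas is kept as a literal term.
    return parts or (text.strip().casefold(),)

def by_namespace(entries):
    """Group parsed entries under their top-level namespace."""
    tree = defaultdict(list)
    for entry in entries:
        key = parse_entry(entry)
        tree[key[0]].append(key)
    return {top: sorted(keys) for top, keys in tree.items()}

entries = ["operator, logical", "Operator, Logical, XOR", "operator", "$*"]
index = by_namespace(entries)
for top in sorted(index):
    print(top, index[top])
```

Searching for the namespace "operator" can then list every entry filed under it, while a non-word entry such as $* is kept as a single level.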
As an initial example of an indexed pod file, I'm attaching a patch for
perlop.pod that adds X<> entries ("perlop-index.diff"). This patch was
prepared as a quick example, so it should not be considered definitive.
The attached program "podindex.pl" is a simple example of how one can
extract the index entries from pod files to generate a sorted index file.
The sample output from running ./podindex.pl perlop.pod perlfunc.pod is
attached as "index.txt". This index file contains the filename and line
number for each entry, and it could be easily used for lookups by means
of the Search::Dict module, or loaded into a database.
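
For readers without the attachments, here is a minimal sketch of the same idea. It is not the attached podindex.pl; it is written in Python for illustration, and the file names and POD text in the example are invented. It assumes the simple single-bracket form X<...> with no nested formatting codes, treats any line that starts with whitespace as verbatim (a simplification), and uses a binary search over the sorted entries in the spirit of a Search::Dict lookup.

```python
import bisect
import re

# Simple form only: X<...> with no nested angle brackets inside.
X_CODE = re.compile(r"X<([^<>]*)>")

def index_entries(files):
    """Build a sorted list of (term, filename, line number) tuples.

    `files` maps a file name to its POD text.
    """
    rows = []
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if line[:1] in (" ", "\t"):
                continue                    # skip verbatim lines
            for m in X_CODE.finditer(line):
                rows.append((m.group(1).strip().casefold(), name, lineno))
    return sorted(rows)

def lookup(rows, term):
    """Find every row for `term` (case-insensitive) by binary search."""
    key = term.casefold()
    pos = bisect.bisect_left(rows, (key,))  # first row with term >= key
    hits = []
    for row in rows[pos:]:
        if row[0] != key:
            break
        hits.append(row)
    return hits

# Invented sample input, not the real contents of any core pod.
files = {
    "a.pod": "=head2 Logical Xor X<operator, logical, xor> X<xor>\n\n"
             "Some text.\n\n    X<ignored> in verbatim\n",
    "b.pod": "=item xor X<XOR>\n",
}

rows = index_entries(files)
for term, name, lineno in lookup(rows, "XOR"):
    print(term, name, lineno)
```

Because the rows are sorted, the same list could be written out one entry per line and searched on disk instead of in memory.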
PLAN OF ACTION
Perl comes with over 100 files in the pod/ directory, totaling over
100,000 lines of POD. Obviously, indexing all of it by hand is a very
large task, so the question arises as to who will do it. If people agree
that this is a good idea and are willing to apply the patches, I could
lead the project, and hope to attract volunteers. In the worst case
(no one else is willing to help), I believe that even if I can't index
*all* of the pods, a partial index is better than no index at all. I
would start with the documents that I consider more important, such as
perlop, perlsub, perlre, perlobj, etc. Documents such as perldelta* and
the faqs probably don't need indexing that much.
Initially, this project will focus solely on the files in the pod/
directory. In the long, long term we can also consider indexing the pods
that come with the core modules.
Ivan Tubert-Brohman <firstname.lastname@example.org>