
RFC: index core pods with X<>

From:
Ivan Tubert-Brohman
Date:
July 26, 2005 11:52
Subject:
RFC: index core pods with X<>
Message ID:
42E6867F.4030206@cpan.org
SYNOPSIS

Let's use the X<> POD formatting code for indexing the Perl core 
documentation. This will allow easier searching of the documentation, 
and can be incorporated into tools like perldoc or third-party websites.


REASON

One common problem with using the perl docs is figuring out where to 
look for something (in perlop, perlre, perlfunc...?).

It is easy to search for a function in perlfunc using perldoc -f, or to 
search questions from the faq using perldoc -q. But if one wants to 
search for variables, operators, random topics, etc. there's really no 
easy way to do it (there is perltoc.pod, but it's huge and usually not 
very helpful for this purpose).

Some websites such as search.cpan.org and perldoc.perl.org provide 
search engines for the perl documentation; however, search engines tend 
to have a couple of limitations:

1) they usually don't handle "line noise" ;-) very well. That is, 
searching for things like $*.

2) even if they are able to search for such things, they tend to return 
too many results and have no way of telling which place is the best 
"canonical definition" for a term.

This document proposes addressing this problem with a hand-curated index 
to the perl documentation.


BACKGROUND

The little-known (or at least little-used) X<> code is described in perlpod:

  "X<topic name>" -- an index entry
    This is ignored by most formatters, but some may use it for
    building indexes.  It always renders as empty-string.  Example:
    "X<absolutizing relative URLs>"

Currently it is used in only *one* place in the perl documentation: 
pod/perlfunc.pod uses it for the "-X" filetest operators.

Recently I proposed a patch to Pod::Perldoc to allow users to search for 
perl variables in perlvar with perldoc -a, similar to the use of perldoc 
-f for functions. Now I have a more ambitious plan: allowing users to 
search for arbitrary keywords, with, let's say, perldoc -k. The keywords 
would come from an index created by extracting all X<> terms from the 
documentation.

The details of how perldoc would handle the proposed -k switch can be 
discussed later. This document will focus on the basic infrastructure; 
that is, the conventions for the use of X<> in the documentation. The 
reason for keeping the discussion separate is that the two problems are 
largely independent; it is expected that other documentation tools and 
websites can also benefit from using the X<> data.


PROPOSED CONVENTIONS

I. Placement of the X<> entries

First, a definition. By "scope", I mean the part of the document that is 
deemed relevant to an index entry, and that may be extracted and shown 
in isolation by a processing or display tool. For example, perldoc -f 
considers the scope of a function to end at the beginning of the next 
=item, or at the end of the enclosing =over.

The X<> entries should be added at the end of a command or textblock 
paragraph (verbatim paragraphs are excluded). The scope of the index 
entry starts at the beginning of the paragraph to which it was attached; 
the end of the scope depends on the type of paragraph:

1) if the X<> is at the end of a textblock, the scope is that paragraph 
and zero or more verbatim paragraphs immediately following it.

2) if the X<> is at the end of a command paragraph, it depends on the 
type of command:

   * =head1, =head2, etc.: The scope ends right before the next
     heading with equal or higher level. That is, a =head1 ends
     at the next =head1, and a =head2 ends at the next =head2 or
     =head1.

   * =item: the scope ends right before the next =item, or the =back
     that terminates the containing list. Note: "empty" items are
     not counted for terminating scopes, to allow for cases where
     multiple =items head a block of text. For example,

       =item function
       X<function>
       X<otherfunction>

       =item otherfunction

       C<function> and C<otherfunction> do the same thing,
       even if they have different names...

       =item lemonade

Here the scope of the X<function> and X<otherfunction> entries starts 
with "=item function", and ends right before "=item lemonade".

3) other command paragraphs, such as =back, =over, =begin, =end, and 
=for should not be used for attaching X<> entries.
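
The scope rules above could be sketched roughly as follows. This is only an illustration, not part of the proposal or of any existing tool: the name scope_end is hypothetical, and the sketch assumes the document has already been split into paragraphs (one string per paragraph, blank lines removed) and that verbatim paragraphs are recognized by leading whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: given a reference to a list of POD paragraphs and
# the index of a paragraph carrying an X<> entry, return the index of the
# last paragraph in that entry's scope (or nothing if the paragraph is a
# command that should not carry index entries).
sub scope_end {
    my ($paras, $start) = @_;
    my $first = $paras->[$start];
    my $last  = $#$paras;

    if ($first =~ /^=head(\d)/) {
        my $level = $1;
        for my $i ($start + 1 .. $last) {
            # Stop right before the next heading of equal or higher level.
            return $i - 1 if $paras->[$i] =~ /^=head(\d)/ && $1 <= $level;
        }
        return $last;
    }

    if ($first =~ /^=item\b/) {
        my ($depth, $seen_body) = (0, 0);
        for my $i ($start + 1 .. $last) {
            my $p = $paras->[$i];
            if ($p =~ /^=over\b/) {
                $depth++;               # entering a nested list
                $seen_body = 1;
            }
            elsif ($p =~ /^=back\b/) {
                return $i - 1 if $depth == 0;   # end of containing list
                $depth--;
            }
            elsif ($p =~ /^=item\b/ && $depth == 0) {
                # Consecutive "empty" items head the same block of text,
                # so an =item only ends the scope once a body was seen.
                return $i - 1 if $seen_body;
            }
            else {
                $seen_body = 1;
            }
        }
        return $last;
    }

    return if $first =~ /^=/;   # =over, =back, =begin, =end, =for, ...

    # Textblock: the paragraph plus any verbatim paragraphs right after it.
    my $end = $start;
    $end++ while $end < $last && $paras->[$end + 1] =~ /^[ \t]/;
    return $end;
}

# The =item example from above, as a list of paragraphs.
my @paras = (
    "=item function\nX<function>\nX<otherfunction>",
    "=item otherfunction",
    "C<function> and C<otherfunction> do the same thing,\n"
      . "even if they have different names...",
    "=item lemonade",
);
print scope_end(\@paras, 0), "\n";   # prints 2
```

The printed index 2 is the paragraph right before "=item lemonade", matching the scope described for the X<function> and X<otherfunction> entries.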


II. Content of the X<> entry.

* It should contain plain text without further formatting codes (with 
the possible exception of E<>).

* It should be considered case-insensitive.

* Non-word characters are allowed, so one can list things like operators 
and special variables.

* Use of synonyms is encouraged, to make things easier to find.

* To be consistent, words should be normalized to the singular whenever 
possible. For example, use X<operator> instead of X<operators>.

* The use of a comma in an index entry has a special meaning: it 
separates levels of hierarchy (or namespaces), as a way of classifying 
entries in more specific ways. For example, "X<operator, logical>", or 
"X<operator, logical, xor>". This information may be used by processing 
programs to arrange the entries, or for listing results when a user 
searches for a namespace that contains several entries.

* There's no limitation as to the number of times that a given entry can 
appear in a document or collection of documents. That is, it is not an 
error to have X<whatever> appear twice in the same file.
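
As a rough illustration of how a processing program might apply the content conventions above, here is a minimal sketch. The function name parse_index_entry is hypothetical, and the sketch assumes the text inside X<> has already been extracted and contains no formatting codes.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the raw text of an X<> entry into a list of
# hierarchy levels, following the conventions listed above.
sub parse_index_entry {
    my ($raw) = @_;
    my @levels;
    # A comma separates levels of hierarchy (namespaces).
    for my $part (split /,/, $raw) {
        $part =~ s/^\s+//;              # trim leading whitespace
        $part =~ s/\s+$//;              # trim trailing whitespace
        next unless length $part;       # skip empty levels
        push @levels, lc $part;         # entries are case-insensitive
    }
    return @levels;
}

my @levels = parse_index_entry('Operator, Logical , xor');
print join(' > ', @levels), "\n";   # prints "operator > logical > xor"
```

Note that an entry consisting only of a comma (for example, one meant for the comma operator) yields no levels in this sketch; the conventions above do not say how such an entry should be written.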


EXAMPLE

As an initial example of an indexed pod file, I'm attaching a patch for 
perlop.pod that adds X<> entries ("perlop-index.diff"). This patch was 
prepared as a quick example, so it should not be considered definitive.

The attached program "podindex.pl" is a simple example of how one can 
extract the index entries from pod files to generate a sorted index file.

The sample output from running ./podindex.pl perlop.pod perlfunc.pod is 
attached as "index.txt". This index file contains the filename and line 
number for each entry, and it could be easily used for lookups by means 
of the Search::Dict module, or loaded into a database.
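
For readers without the attachments, the following is a minimal sketch of the general idea. It is not the attached podindex.pl: the function name extract_index_entries, the tab-separated output layout, and the sample POD text are all made up for illustration. It assumes entries are written as X<text> or X<< text >> with no nested formatting codes, and it treats any paragraph whose first line is indented as verbatim.

```perl
use strict;
use warnings;

# Hypothetical helper: collect X<> entries from the lines of one POD file,
# remembering the file name and line number of each entry.
sub extract_index_entries {
    my ($file, @lines) = @_;
    my @entries;
    my ($at_para_start, $in_verbatim) = (1, 0);
    my $lineno = 0;
    for my $line (@lines) {
        $lineno++;
        if ($line =~ /^\s*$/) {              # a blank line ends a paragraph
            ($at_para_start, $in_verbatim) = (1, 0);
            next;
        }
        # A paragraph whose first line is indented is verbatim: skip it.
        $in_verbatim = 1 if $at_para_start && $line =~ /^[ \t]/;
        $at_para_start = 0;
        next if $in_verbatim;
        # Match both the X<< text >> form and the plain X<text> form.
        while ($line =~ /X(?:<<\s+(.+?)\s+>>|<([^<>]*)>)/g) {
            my $text = defined $1 ? $1 : $2;
            push @entries,
              { entry => lc $text, file => $file, line => $lineno };
        }
    }
    return @entries;
}

# Made-up sample input; real use would read the lines from a pod file.
my @pod = (
    "=head2 Exclusive Or X<operator, logical, xor> X<xor>\n",
    "\n",
    "Binary C<xor> returns the exclusive-OR of its operands.\n",
    "\n",
    "    print 'X<not indexed>';\n",
);

# Print a sorted index, one "entry<TAB>file<TAB>line" record per line.
for my $e (sort { $a->{entry} cmp $b->{entry} }
           extract_index_entries('sample.pod', @pod)) {
    print join("\t", @$e{qw(entry file line)}), "\n";
}
```

Keeping the output sorted by entry is what makes a binary-search lookup, such as the one Search::Dict performs on a sorted file, possible.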


PLAN OF ACTION

Perl comes with over 100 files in the pod/ directory, totaling over 
100,000 lines of POD. Obviously, indexing all of it by hand is a very 
large task, so the question arises as to who will do it. If people agree 
that this is a good idea and are willing to apply the patches, I could 
lead the project and hope to attract volunteers. In the worst case 
(no one else is willing to help), I believe that even if I can't index 
*all* of the pods, a partial index is better than no index at all. I 
would start with the documents that I consider more important, such as 
perlop, perlsub, perlre, perlobj, etc. Documents such as perldelta* and 
the faqs probably don't need indexing that much.

Initially, this project will focus solely on the files in the pod/ 
directory. In the long, long term we can also consider indexing the pods 
that come with the core modules.


AUTHOR

Ivan Tubert-Brohman <itub@cpan.org>
