Request For Advice: The Good Indexing Method

From: John Indra
Date: February 11, 2001 18:00
Subject: Request For Advice: The Good Indexing Method
Message ID: 20010212084438.A2875@office.naver.co.id

Hi all...

First of all, please forgive me if this is the wrong group for this
subject. But it has to do with both Perl and the web, so I think I will
take a shot.

I am trying to build my own search engine, from scratch, with Perl.
Currently I have finished building a robot, using libwww of course.

1. If I want to conform to this: "Build a friendly robot, don't run on other
web servers, just walk", I set my user agent to hit the remote web server
with a delay of 1 minute, and then my robot is very slow (only one hit per
minute). What is the best and most efficient way to make my robot still
conform to the standard but perform better (for example, by making several
requests in parallel each minute)?

2. After my robot finishes crawling the web, I need to build an index.
Currently what I have in mind is to use a B-tree. So, after the robot
finishes its job, my indexer will start stripping stopwords from the
documents and maybe do some word stemming. Well, that's what I currently
have in mind. Now I am rather confused about which structure is best for
storing the index information. I am using the standard Perl module DB_File.
Advice is very welcome.
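
To make point 1 concrete: the delay is usually understood as applying per
remote host, so one possible approach is to keep a separate queue for each
host and take at most one URL per host in each round. The sketch below only
does the scheduling and makes no network requests. The helper names
(host_of, build_queues, ready_urls), the example URLs, and the 60-second
delay are all assumptions made for illustration, not part of libwww.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the host out of an http(s) URL with a simple
# regex (a real robot would more likely use a proper URL parser).
sub host_of {
    my ($url) = @_;
    return $url =~ m{^https?://([^/:]+)}i ? lc $1 : undef;
}

# Group pending URLs into one FIFO queue per host.
sub build_queues {
    my (@urls) = @_;
    my %queue;
    for my $url (@urls) {
        my $host = host_of($url);
        next unless defined $host;
        push @{ $queue{$host} }, $url;
    }
    return \%queue;
}

# Return every URL that may be fetched at time $now: at most one per host,
# and only for hosts whose next allowed time has been reached.
sub ready_urls {
    my ($queue, $next_ok, $now, $delay) = @_;
    my @batch;
    for my $host (sort keys %$queue) {
        next unless @{ $queue->{$host} };
        next if ($next_ok->{$host} // 0) > $now;
        push @batch, shift @{ $queue->{$host} };
        $next_ok->{$host} = $now + $delay;
    }
    return @batch;
}

my $queue = build_queues(
    'http://a.example/1', 'http://a.example/2',
    'http://b.example/1', 'http://c.example/1',
);
my %next_ok;
my @first  = ready_urls($queue, \%next_ok, 0,  60);  # one URL per host
my @second = ready_urls($queue, \%next_ok, 30, 60);  # too early for a.example
my @third  = ready_urls($queue, \%next_ok, 60, 60);  # a.example is due again
print scalar(@first), " ", scalar(@second), " ", scalar(@third), "\n";
```

Each batch contains URLs for different hosts only, so it could be handed to
whatever parallel fetching mechanism the robot uses while every single host
still sees no more than one hit per delay period.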

My knowledge of data structures and computer science is weak, so if you
can, please point me to some URLs to read in case I have to deal with
complex data structure manipulation.
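
As a starting point for the index in point 2, a common structure is an
inverted index: a map from each word to the list of documents that contain
it. The sketch below is a minimal version under several assumptions: the
stopword list is a tiny placeholder, stemming is left out, document ids are
plain numbers, and the helper names (tokens, index_doc, lookup) are made up
for illustration. Values are kept as flat strings so that the same hash
could later be tied to a DB_File B-tree, which stores only strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny illustrative stopword list; a real one would be much longer.
my %stop = map { $_ => 1 } qw(a an and the of to in is);

# Split text into lowercase words and drop stopwords.
sub tokens {
    my ($text) = @_;
    return grep { !$stop{$_} } map { lc $_ } $text =~ /(\w+)/g;
}

# Add one document to the index. Each value is a flat string of
# "docid:count" pairs separated by spaces, so ids must not contain
# a colon or whitespace.
sub index_doc {
    my ($index, $doc_id, $text) = @_;
    my %count;
    $count{$_}++ for tokens($text);
    for my $word (keys %count) {
        my $entry = "$doc_id:$count{$word}";
        $index->{$word} = defined $index->{$word}
            ? "$index->{$word} $entry"
            : $entry;
    }
}

# Return the ids of the documents that contain $word.
sub lookup {
    my ($index, $word) = @_;
    my $postings = $index->{ lc $word };
    return () unless defined $postings;
    return map { (split /:/)[0] } split ' ', $postings;
}

# A plain in-memory hash here. With "use DB_File; use Fcntl;" it could be
# tied to a B-tree file instead (the file name is only an example):
#   tie my %index, 'DB_File', 'index.db', O_RDWR|O_CREAT, 0666, $DB_BTREE;
my %index;
index_doc(\%index, 1, 'The robot walks the web');
index_doc(\%index, 2, 'A friendly robot is a slow robot');

print join(',', lookup(\%index, 'robot')), "\n";
print join(',', lookup(\%index, 'web')), "\n";
```

Appending to a string on every update is simple but gets slow for very
common words, so a larger index would probably need to build the posting
lists in batches before writing them out.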

Thanks...

/john

