
Re: Request For Advice: The Good Indexing Method

From:
Oyku Gencay
Date:
February 13, 2001 14:39
Subject:
Re: Request For Advice: The Good Indexing Method
Message ID:
002501c0960d$e63247d0$1401a8c0@oyku.net
Hi John,

First of all, Perl and LWP are a very good choice for developing a search
engine. But judging from what you have written, it will take almost forever
to crawl with your settings. If you intend to be *polite*, you will need a
lot of bandwidth, and it helps to use the current search engines to *guide*
your robot.
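To make that concrete: the usual trick is to stay polite *per host* while keeping the robot busy overall, by interleaving the crawl queue across many hosts. A rough sketch (the host names are just placeholders):

```perl
use strict;
use warnings;
use URI;

# Round-robin scheduler: group URLs by host, then emit one URL
# per host per pass, so no single server is hit back-to-back.
sub interleave_by_host {
    my @urls = @_;
    my %by_host;
    push @{ $by_host{ URI->new($_)->host } }, $_ for @urls;
    my @order;
    while (%by_host) {
        for my $host (sort keys %by_host) {
            push @order, shift @{ $by_host{$host} };
            delete $by_host{$host} unless @{ $by_host{$host} };
        }
    }
    return @order;
}

my @order = interleave_by_host(
    'http://a.example/1', 'http://a.example/2',
    'http://b.example/1', 'http://c.example/1',
);
print "$_\n" for @order;
```

On top of an ordering like this, LWP::RobotUA can enforce the actual robots.txt rules and the per-host delay (its delay() is in minutes) for you.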

There is a general misconception in search engine development: the crawler
is at most 10% of the whole system. I don't know how large you intend to
be, but of the indexing, storing and searching parts, the bottleneck is
storage. It determines search performance and scalability.
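Since you are already using DB_File, here is a toy sketch of an inverted index tied to a B-tree, just to show the shape of the data (the file name, the one-word stopword list and the documents are made up):

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

# Toy inverted index: word => space-separated list of doc ids,
# stored in a B-tree so keys come back in sorted order.
my %index;
tie %index, 'DB_File', 'index.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
    or die "tie failed: $!";

my %docs = (
    1 => 'the quick brown fox',
    2 => 'the lazy dog',
);

while (my ($id, $text) = each %docs) {
    for my $word (split ' ', lc $text) {
        next if $word eq 'the';    # toy stopword list
        $index{$word} = join ' ', split(' ', $index{$word} // ''), $id;
    }
}

# AND query: intersect the posting lists of two words.
my %in_quick = map { $_ => 1 } split ' ', $index{quick} // '';
my @both = grep { $in_quick{$_} } split ' ', $index{fox} // '';
print "quick AND fox: @both\n";    # doc 1
untie %index;
```

A real system would store compressed posting lists, not space-joined strings, but the tie interface and the B-tree access pattern are the same.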

There are lots of flavors of indexing and storage schemas, each favoring
certain types of searches. You should identify your needs first. Do not
simply say "I want multiple-keyword AND searches and phrase search" -- the
requirements sometimes affect even the crawler design. This is actually the
main reason why you cannot build a search engine with out-of-the-box
software, even if you have several hundred thousand bucks.
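For example, phrase search alone already changes the index layout: a plain word-to-documents map can answer AND queries, but a phrase query needs word positions too. A toy illustration (the two documents are made up):

```perl
use strict;
use warnings;

my %docs = (1 => 'the quick fox jumps', 2 => 'fox quick');

# Positional index: word => { doc => [positions] }
my %index;
while (my ($id, $text) = each %docs) {
    my @words = split ' ', $text;
    push @{ $index{ $words[$_] }{$id} }, $_ for 0 .. $#words;
}

# Phrase match: $w1 at position p and $w2 at p+1, in the same doc.
sub phrase_docs {
    my ($w1, $w2) = @_;
    my @hits;
    for my $doc (keys %{ $index{$w1} || {} }) {
        my %p2 = map { $_ => 1 } @{ $index{$w2}{$doc} || [] };
        push @hits, $doc
            if grep { $p2{ $_ + 1 } } @{ $index{$w1}{$doc} };
    }
    return sort @hits;
}

print "phrase 'quick fox': @{[ phrase_docs('quick', 'fox') ]}\n";  # doc 1
```

Storing positions multiplies the index size, which is exactly the kind of requirement-driven storage decision I mean.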

Actually, what you need to build a search engine is in-depth knowledge of
data structures. Search engines have been a magic art because there is not
much publicly available source material, and search engine companies will
not reveal their (very simple) secrets. But let me give you a clue: go and
search for the WWW conference proceedings; you'll find invaluable
information there. Also, if you know Java, check out www.lucene.org. It's
an open-source, 100% Java search engine developed by the chief architect of
Excite.

I've been developing commercial large-scale search engines for more than
two years, and the bottom line is:
1. Perl is a good choice
2. The secret is the storage :)

Hope this helps.

Oyku Gencay

----- Original Message -----
From: John Indra <john@office.naver.co.id>
To: <libwww@perl.org>
Sent: Monday, February 12, 2001 3:44 AM
Subject: Request For Advice: The Good Indexing Method


> Hi all...
>
> First of all, please forgive me if this is the wrong group to talk about
> this subject. But, this has something to do with Perl, and the web, so I
> think I will take the shot.
>
> I am trying to build my own search engine, from scratch, with Perl.
> Currently I have finished building a robot, using libwww of course.
>
> 1. If I want to conform to this: "Build a friendly robot, don't run on
> other web servers, just walk", and I set my user agent to hit the remote
> web server with delay = 1 minute, then my robot is very slow (only one
> hit per minute). What is the best and most efficient way to make my
> robot still conform to the standard but get better performance (do
> parallel requests per minute)?
>
> 2. After my robot finishes crawling the web, I need to build an index.
> Currently what I have in mind is to use a B-tree algorithm. So, after
> the robot finishes its job, my indexer will start chopping stopwords
> from the documents, and maybe do some word stemming. Well, that's what I
> currently have in mind. Now I am rather confused about what structure is
> best for storing the index information. I am using the standard Perl
> module DB_File. Advice is very welcome.
>
> I have weak knowledge of data structures and computer science, so if you
> can, please point me to some URLs to read in case I am faced with
> complex data structure manipulation.
>
> Thanks...
>
> /john




nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About