develooper Front page | perl.libwww | Postings from February 2001

Re: Request For Advice: The Good Indexing Method

February 14, 2001 00:20
- I would go with MySQL for storage. It would take care of the indexing job for a small and simple search engine (unless you want to crawl one billion web sites).
- When crawling the web with LWP, you will notice that secure web sites (https://) cause errors unless you have OpenSSL and Crypt::SSLeay installed on your server. You can of course just ignore URLs with https://, but this is not a full solution, because a plain http:// URL can still redirect to an https:// one.
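
To make the storage suggestion concrete, here is a minimal sketch of a word-to-document table that lets the database do the indexing. It is written in Python, with the standard-library sqlite3 module standing in for MySQL, so the table name `posting` and the functions `build_index` and `search_all` are illustrative assumptions, not part of any existing tool. The same table layout and GROUP BY query should carry over to MySQL through Perl's DBI with minor syntax changes.

```python
import sqlite3

def build_index(conn, docs):
    """Store one (word, doc_id) row per distinct word in each document."""
    # The primary key doubles as the search index maintained by the database.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posting ("
        "word TEXT NOT NULL, doc_id INTEGER NOT NULL, "
        "PRIMARY KEY (word, doc_id))"
    )
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            # SQLite spelling; MySQL writes this as INSERT IGNORE.
            conn.execute(
                "INSERT OR IGNORE INTO posting (word, doc_id) VALUES (?, ?)",
                (word, doc_id),
            )
    conn.commit()

def search_all(conn, words):
    """Return ids of documents that contain every given word (boolean AND)."""
    words = sorted({w.lower() for w in words})
    if not words:
        return []
    marks = ",".join("?" * len(words))
    rows = conn.execute(
        f"SELECT doc_id FROM posting WHERE word IN ({marks}) "
        "GROUP BY doc_id HAVING COUNT(*) = ? ORDER BY doc_id",
        (*words, len(words)),
    )
    return [row[0] for row in rows]

# Hypothetical sample documents, held in an in-memory database.
conn = sqlite3.connect(":memory:")
build_index(conn, {1: "perl web robot", 2: "perl search engine", 3: "web search"})
print(search_all(conn, ["perl", "search"]))
```

For a small engine this keeps all B-tree handling inside the database. Whether it scales far enough depends on the number of documents and on the kinds of queries needed.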

*********** REPLY SEPARATOR  ***********

On 14.02.2001 at 00:39 Oyku Gencay wrote:

>Hi John,
>First of all, it is a very good choice to use Perl and LWP for developing a
>search engine. Referring to what you have written it will take almost
>to crawl with your settings. If you intend to be *polite* you will need a
>large bandwidth and be able to use current search engines to *guide* your
>Generally there is a misconception about search engine development. The
>crawler part is at most 10% of the whole system. I don't know how
>large you are intending to be but the bottleneck between indexing, storing
>and searching parts is the storage part. It determines the search
>performance and scalability.
>There are lots of flavors of indexing and storage schemas, each favoring
>different kinds of searches. You should initially identify your needs. Do not simply say "I
>want multiple keyword with AND boolean searches and phrase search" The
>requirement sometimes affects the crawler design. Actually this is the main
>reason why you cannot build a search engine with out-of-the-box software,
>even if you pay several 100K bucks.
>Actually, what you must know to build a search engine is in-depth data
>structure knowledge. The search engine has been a magic art because there
>is not much publicly available source, and search engine companies would not
>reveal their (very simple) secrets. But let me give you a clue. Go and
>search for WWW conference proceedings, you'll find invaluable information.
>Also if you know java, check out its an open source 100%
>search engine developed by chief architect of Excite.
>I've been developing commercial large-scale search engines for more than 2
>years, and the bottom line is
>1. Perl is a good choice
>2. The secret is the storage :)
>Hope this helps.
>Oyku Gencay
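
The point that query requirements shape the index can be illustrated with a small sketch. An AND search needs only the set of documents per word, while a phrase search also needs word positions, so the second requirement forces a richer structure. The Python below is an illustrative assumption (the function names and sample documents are made up), not the design of any particular engine.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} so phrases can be checked."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def and_search(index, words):
    """Documents containing every word; positions are not needed here."""
    doc_sets = [set(index.get(w.lower(), {})) for w in words]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

def phrase_search(index, phrase):
    """Documents where the words occur consecutively, in order."""
    words = phrase.lower().split()
    hits = []
    for doc_id in and_search(index, words):
        starts = index[words[0]][doc_id]
        # A start position matches if word i sits exactly i places later.
        if any(
            all(p + i in index[w][doc_id] for i, w in enumerate(words))
            for p in starts
        ):
            hits.append(doc_id)
    return hits

# Hypothetical sample documents.
docs = {
    1: "good indexing method",
    2: "method for good indexing",
    3: "indexing is good",
}
index = build_positional_index(docs)
print(and_search(index, ["good", "indexing"]))
print(phrase_search(index, "good indexing"))
```

If only AND queries were ever required, the position lists could be dropped and the index would be much smaller, which is the kind of trade-off the requirements decide.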
>----- Original Message -----
>From: John Indra <>
>To: <>
>Sent: Monday, February 12, 2001 3:44 AM
>Subject: Request For Advice: The Good Indexing Method
>> Hi all...
>> First of all, please forgive me if this is the wrong group to talk about
>> this subject. But, this has something to do with Perl, and the web, so I
>> think I will take the shot.
>> I am trying to build my own search engine, from scratch, with Perl.
>> Currently I have finished building a robot, using libwww of course.
>> 1. If I want to conform to this: "Build a friendly robot, don't run on
>> web servers, just walk", I set my user agent to hit the remote web server
>> using delay = 1 minute, then my robot is very slow (only one hit per
>> minute). What is the best and most efficient way to make my robot still conform
>> to the standard but have better performance (can do parallel requests per
>> minute)?
>> 2. After my robot finishes crawling the web, I need to build an index.
>> Currently what I have in mind is to use a B-Tree algorithm. So, after the
>> robot finishes its job, my indexer will start chopping stopwords from
>> the document, do some word stemming maybe. Well, that's what I currently have
>> in mind. Now I am rather confused about what structure is best to store the
>> index information. I am using the standard Perl module DB_File. Advice is
>> welcome.
>> I have weak knowledge of data structures and computer science, so if you
>> can, please give me some URLs to read in case I am faced with
>> complex data structure manipulation.
>> Thanks...
>> /john
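
On the first question, one common way to stay polite without crawling at one page per minute overall is to apply the delay per host rather than globally, so requests to different hosts can overlap. The sketch below only computes such a schedule and fetches nothing. It is written in Python for brevity, and the function name `schedule`, the 60-second default, and the example URLs are assumptions for illustration.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit

def schedule(urls, delay=60.0):
    """Return (time_in_seconds, url) pairs in fetch order.

    Each host is hit at most once per `delay` seconds, but different
    hosts may be fetched at the same time.
    """
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).hostname or ""].append(url)

    # Min-heap of (earliest allowed time, host); every host starts at 0.
    ready = [(0.0, host) for host in queues]
    heapq.heapify(ready)

    plan = []
    while ready:
        when, host = heapq.heappop(ready)
        plan.append((when, queues[host].popleft()))
        if queues[host]:
            # The same host may not be contacted again until the delay passes.
            heapq.heappush(ready, (when + delay, host))
    return plan

# Hypothetical URLs on two hosts.
plan = schedule(
    ["http://a.example/1", "http://a.example/2", "http://b.example/1"],
    delay=60.0,
)
for when, url in plan:
    print(when, url)
```

With many distinct hosts in the queue, throughput grows with the number of hosts while each individual server still sees at most one request per delay period.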
