develooper Front page | perl.pep | Postings from November 2017

Re: Any advice for a searchable web archiver ?

From: Bron Gondwana
Date: November 20, 2017 14:08
Subject: Re: Any advice for a searchable web archiver ?
Message ID: 1511144942.4024437.1177949680.0FF66812@webmail.messagingengine.com

On Mon, 20 Nov 2017, at 07:52, Marc Chantreux wrote:
> Hello,

Hi Marc

> As the sympa community (http://www.sympa.org) recently grown, we are
> thinking about revamping the whole UI and we would like to have
> a new web archiver based on:
> 
> * no default frontend but exposing the API through REST, websockets or
>   whatever.
> * maximizing the interactions between Sympa and CPAN
> * trying to avoid other dynamic language or jvm dependency
>   (or considering them as temporary solutions)
> * being JMAP friendly (we bet on it to become a very healthy
>   community)

I'm glad to see that you're interested in JMAP :)  We're also betting
very heavily on it at FastMail as I'm sure you're aware!

> My first idea was to use notmuch, PEP modules and Dancer on top of
> maildirs then i discover Dezi (inactive since 2015) and the use of
> Lucy (also used by the very active librecat project).
> 
> I know Dezi is a general search engine but i hope that taking care of
> a good email support for it than reinvent the wheel.
> 
> Those are lot of things to look for if i want to have a clear opinion
> on a good strategy. Any advice would be warmly welcome.

We're using Xapian as part of Cyrus IMAP, and it's quite useful for
what we're doing, though I'm sure any search engine would be fine.
There are some pitfalls to look out for. For example, if you naively
index everything, headers included, a search for "references" is going
to match the References: header and return quite a lot of messages.
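
To illustrate the header pitfall, here is a rough Python sketch (not
how Cyrus does it) that hands the indexer only the values of a few
chosen headers plus the text/plain body, so header names such as
"References" never become search terms. The function name
indexable_fields and the INDEXED_HEADERS list are assumptions made up
for this example.

```python
from email import message_from_string

# Hypothetical selection of headers worth searching; adjust to taste.
INDEXED_HEADERS = ("from", "to", "cc", "subject")


def indexable_fields(raw):
    """Return {field: text} holding only the parts of a message to index."""
    msg = message_from_string(raw)
    fields = {}
    for name in INDEXED_HEADERS:
        values = msg.get_all(name, [])
        if values:
            # Index the header values only, never the header names.
            fields[name] = " ".join(str(v) for v in values)

    body_parts = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            body_parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to UTF-8.
            body_parts.append(payload.decode("utf-8", errors="replace"))
    fields["body"] = "\n".join(body_parts)
    return fields
```

Keeping the fields separate also lets a search be restricted to, say,
the subject, if the search engine supports per-field terms.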

Another problem with naive indexing is that Maildir allows message file
names to change as flags are added/removed, and you'll want your indexer
to avoid reindexing them every time.  I expect you might already have a
data structure that handles that though.
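
As a minimal sketch of one way to handle the renames, assuming the
usual Maildir convention that flags live in a ":2,<flags>" suffix: key
each message on the part of the file name before the colon, which
stays the same when flags change or the file moves from new/ to cur/.
The names maildir_key and SeenIndex are hypothetical.

```python
import os


def maildir_key(path):
    """Stable identity of a Maildir message: file name minus the ':2,<flags>' suffix."""
    return os.path.basename(path).partition(":")[0]


class SeenIndex:
    """Remembers which messages were indexed, regardless of later renames."""

    def __init__(self):
        self._paths = {}  # stable key -> last known path

    def needs_indexing(self, path):
        key = maildir_key(path)
        known = key in self._paths
        self._paths[key] = path  # remember where the file lives now
        return not known
```

A real indexer would persist this mapping and also notice deleted
files; this only shows the keying idea.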

In terms of search usefulness, most of our customers love the stemming
support, but it does have some exciting issues around languages and
diacritics, and an inability to match on anything other than word
prefixes - so you can't match partial strings inside a word.  That may
or may not be an issue for your use case.
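
The prefix-only limitation is easier to see with a toy example. The
sketch below is not Xapian's implementation; it assumes a simple
sorted term list, where a prefix lookup is a binary search followed by
a short scan, while a substring inside a word has no such entry point.
The fold helper strips diacritics, which is itself a language-dependent
choice (it is not correct for every language). All names here are
made up for illustration.

```python
import bisect
import unicodedata


def fold(term):
    """Lowercase and strip combining marks, so 'Café' and 'cafe' share a term."""
    decomposed = unicodedata.normalize("NFKD", term.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def prefix_matches(sorted_terms, prefix):
    """All terms starting with prefix, via binary search on a sorted term list."""
    prefix = fold(prefix)
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches


words = ["Archive", "archiver", "archiving", "search", "Café"]
terms = sorted({fold(w) for w in words})
```

With this list, prefix_matches(terms, "archiv") finds the three
archive-related terms, while prefix_matches(terms, "chive") finds
nothing, because no term starts with "chive".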

I don't know Dezi or Lucy, so I don't have a strong opinion there.

Regards,

Bron.

--
  Bron Gondwana
  brong@fastmail.fm


