Revision as of 06:39, 28 November 2009

Points of Contact

Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary

Fedora needs a search engine. ^[1]

Requirements

Crawl the web sites (wiki and non-wiki)
Search the web sites (wiki and non-wiki)
Java, if any, must be the GCJ/OpenJDK versions in RHEL5; Sun/IBM/BEA Java is not acceptable

Preferences

Python-based
Programmable keywords to have control over what pages get displayed for certain keywords
XML or library interface so other applications can use it
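
The "programmable keywords" preference can be pictured as a small curated table consulted before the engine's own ranking. The following is only a sketch of the idea, not a feature of any candidate engine: the `PINNED` table, the `apply_pinned` function, and the example.org URLs are all hypothetical.

```python
from typing import Dict, List

# Hypothetical curated table: keyword -> pages that should be shown first.
PINNED: Dict[str, List[str]] = {
    "download": ["https://example.org/get"],
    "docs": ["https://example.org/docs"],
}


def apply_pinned(query: str, engine_results: List[str],
                 pinned: Dict[str, List[str]] = PINNED) -> List[str]:
    """Return results with curated pages for matching keywords placed first."""
    ordered: List[str] = []
    # Curated pages come first, in the order the keywords appear in the query.
    for word in query.lower().split():
        for url in pinned.get(word, []):
            if url not in ordered:
                ordered.append(url)
    # The engine's own results follow, skipping anything already pinned.
    for url in engine_results:
        if url not in ordered:
            ordered.append(url)
    return ordered
```

For a query such as "Download Fedora", the curated download page would be listed ahead of whatever the engine returns, without appearing twice.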

Project Plan

Investigate and evaluate existing open source search engines
Select candidate software
Create public test instances of candidate software
Test for functionality, performance, and impact (re-evaluate, if necessary)
Create capacity and deployment plans
Deploy

Resources Needed

Public Test for testing candidate software
Permanent home(s) for deployment
- Web server(s)
- Database server(s) (maybe)

Software Investigation and Evaluation

In Progress

CLucene ^[2]

C++ port of Lucene

in Fedora already

described as beta by the developers

DataparkSearch ^[3]

written in C

Egothor ^[4]

written in Java

Ferret ^[5]

Ruby port of Lucene

KinoSearch and Ferret intend to merge as Lucy ^[6]

Indri ^[7]

written in C/C++

Isearch ^[8]

written in C++

KinoSearch ^[9] - akistler examined

Description

Perl/C port of Lucene

in Fedora already

the maintainer, Rectangular Research, appears to be a single developer, who considers KinoSearch to be alpha software

KinoSearch and Ferret intend to merge as Lucy ^[6]

Evaluation

KinoSearch is a search engine library with a sample indexer and search page rather than a fully functional application. It stores indices in Berkeley DB files with JSON interfaces. It allows custom-designed indices, including categories (exact match), which fulfills the "programmable keywords" requirement. Each index of a document source is a single write-once collection of files (BDB and JSON) in a unique directory. Rerunning the indexer creates a new directory, which obsoletes the old directory if all the old documents are included; the old directory then needs to be cleaned up. Postings can, however, be deleted from an index. It is also possible to index only the new documents, but that is not efficient.

Requirements

buildrequires

gcc
(EPEL) perl-Module-Build

requires

(EPEL) perl-JSON-XS

Problem: Desires 1.53, but EPEL has 1.43

Note: http://web.archive.org/web/20071122035408/search.cpan.org/src/MLEHMANN/JSON-XS-1.53/

Note: works with 1.43, anyway

(EPEL) perl-Lingua-Stem-Snowball
(EPEL) perl-Lingua-StopWords
(EPEL) perl-Parse-RecDescent

sample indexer reads files from the file system and requires

(EPEL) perl-HTML-Tree

sample cgi search script requires

(CPAN) Data::Pageset (which requires Data::Page)
(EPEL) perl-Test-Exception
(EPEL) perl-Class-Accessor-Chained

Lucene ^[10] - akistler examined

Description

written in Java, but ported to others ^[11]

in Fedora already

PyLucene ^[12] is a Python wrapper around Java Lucene

Evaluation

Search engine library meant to be integrated into applications.

Requirements

buildrequires (based on 1.4.3-f7)

ant
ant-junit
java-1.4.2-gcj-compat-devel
javacc
jpackage-utils
junit
make

requires (based on 1.4.3-f7)

java-1.4.2-gcj-compat

mnoGoSearch ^[13] - Allen investigating

written in C

Namazu ^[14] - Huzaifa investigating

written in Perl

in Fedora already

Nutch ^[15]

written in Java
based on Lucene

Solr ^[16] - akistler examined

Description

written in Java

based on Lucene

Evaluation

The documentation describes installing Sun Java to run Solr, but OpenJDK 1.5 or later is fine. Solr needs a Java servlet container in which to run. It comes with Jetty, but other containers (e.g., Tomcat) should work as well. It currently supports only UTF-8 characters. In essence, Solr provides an HTTP admin GUI for a search engine that uses a superset of the Lucene query syntax. The schema is very flexible. Setup is done almost entirely through XML files. Applications can query the servlet port and get XML or JSON responses.
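
As an illustration of how an application might query the servlet port, the sketch below builds a request URL for Solr's `select` handler and reads the documents out of a JSON response. The base URL (`http://localhost:8983/solr`, the port used by the bundled Jetty example), the helper names, and the sample response body are assumptions made for illustration; no request is actually sent.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a URL for Solr's select handler asking for a JSON response."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return f"{base_url}/select?{params}"


def extract_docs(response_text: str) -> list:
    """Return the list of matching documents from a Solr JSON response."""
    payload = json.loads(response_text)
    return payload["response"]["docs"]


# Hand-written sample in the shape of a Solr JSON response (not real output).
SAMPLE_RESPONSE = json.dumps({
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"id": "wiki/Infrastructure", "title": "Infrastructure"}],
    },
})

if __name__ == "__main__":
    print(build_select_url("http://localhost:8983/solr", "fedora search"))
    for doc in extract_docs(SAMPLE_RESPONSE):
        print(doc["id"], doc["title"])
```

Fetching the URL with any HTTP client and passing the body to `extract_docs` would complete the round trip; the field names in each document depend on the schema defined for the deployment.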

Requirements

buildrequires

ant (note that ant currently pulls in java-gcj-compat, too, but it appears not to be a problem)
ant-junit
java-1.6.0-openjdk-devel
junit

requires

java-1.6.0-openjdk

Swish-e ^[17] - akistler examined

Description

written in C

Note: Swish++ is a rewrite in C++ (not evaluated here)

Evaluation

Search engine with a built-in web crawler, a built-in file system crawler, and an interface for an external crawler. The distribution includes sample search pages which use the Perl API. There is also a C API. The index is not customizable, but does include a facility for including metawords (exact match) and the path in the index for each document. The documentation acknowledges that the software only supports ASCII, but some MBCS may also work.
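
The external crawler interface can be sketched as follows. To the best of our understanding, Swish-e's "prog" input method reads documents from a program's standard output, each preceded by headers such as `Path-Name` and `Content-Length` (in bytes) and a blank line; that format should be checked against the Swish-e documentation before relying on it. The function name and the example URL are hypothetical.

```python
from typing import Optional


def swish_prog_document(path_name: str, content: str,
                        last_mtime: Optional[float] = None) -> bytes:
    """Format one document the way an external crawler would emit it."""
    body = content.encode("utf-8")
    headers = [
        f"Path-Name: {path_name}",
        # Content-Length counts bytes, not characters.
        f"Content-Length: {len(body)}",
    ]
    if last_mtime is not None:
        headers.append(f"Last-Mtime: {int(last_mtime)}")
    # A blank line separates the headers from the document body.
    return ("\n".join(headers) + "\n\n").encode("utf-8") + body


if __name__ == "__main__":
    import sys
    sys.stdout.buffer.write(
        swish_prog_document("http://example.org/", "<html>hi</html>"))
```

A crawler written this way would write one such block per fetched page, leaving the indexing itself to Swish-e.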

Requirements

buildrequires

gcc
make
libxml2-devel
zlib-devel

requires

libxml2
zlib
perl-libwww-perl (for the built-in spider)
others as desired to index documents (pdf, etc.)

Terrier (TERabyte RetrIEveR) ^[18]

written in Java

Xapian ^[19] - akistler examined

Description

written in C++

bindings to Python, Ruby, and Perl XS

Omega is a CGI application that provides a Xapian front-end for indexing and searching

xapian-core, xapian-bindings, and perl-Search-Xapian in Fedora already; xapian-omega is not

additional bindings to PHP, Java, and more (?)

Omega provides glue scripts for ht://Dig, mbox files, and perl DBI

Flax ^[20] is another search engine built on top of Xapian and CherryPy

Evaluation

Xapian is a search engine library. Omega adds functionality on top of Xapian. The Xapian database is very flexible, supporting an entirely user-designed schema. Usage through Omega loses very little, if any, of that flexibility. Omega requires the database to be named "default." Database columns are of type field or index. Fields are stored verbatim (e.g., URL, date, MIME type, keywords). Indices are input as blocks of text or other content to be indexed, but not stored (e.g., the corpus of a file or web page). The Omega scriptindex utility can possibly be combined with an external web crawler (e.g., Swish-e's spider.pl) for HTML.
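
One way an external crawler's output could be handed to scriptindex is through a plain-text dump. The sketch below assumes the input format described in the Omega documentation as we understand it: one `name=value` line per field, continuation lines of a multi-line value starting with `=`, and records separated by a blank line. The field names `url` and `body` and both helper functions are hypothetical and would have to match the index script actually used.

```python
from typing import Dict, Iterable


def format_record(fields: Dict[str, str]) -> str:
    """Render one record as name=value lines for a scriptindex input file."""
    lines = []
    for name, value in fields.items():
        first, *rest = str(value).split("\n")
        lines.append(f"{name}={first}")
        # Continuation lines of a multi-line value start with '='.
        lines.extend(f"={cont}" for cont in rest)
    return "\n".join(lines) + "\n"


def format_dump(records: Iterable[Dict[str, str]]) -> str:
    """Join records with a blank line between them."""
    return "\n".join(format_record(record) for record in records)


if __name__ == "__main__":
    pages = [
        {"url": "http://example.org/", "body": "line one\nline two"},
        {"url": "http://example.org/about", "body": "about page"},
    ]
    print(format_dump(pages), end="")
```

Here `url` would be declared as a field and `body` as an index in the index script, matching the field and index column types described above.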

Requirements

xapian-core buildrequires

gcc gcc-c++
make
zlib-devel

xapian-bindings buildrequires (not including gcc gcc-c++ make)

python python-devel
ruby ruby-devel
xapian-core-devel

perl-Search-Xapian buildrequires (not including gcc gcc-c++ make)

perl
xapian-core-devel

xapian-omega buildrequires (not including gcc gcc-c++ make)

libtool
xapian-core-devel

xapian-core requires

<tbd>

xapian-bindings requires

<tbd>

perl-Search-Xapian requires

<tbd>

xapian-omega requires

httpd
perl-DBI
xapian-core-libs

Zettair ^[21]

written in C

Not Suitable

Gonzui ^[22]

written in Ruby
specializes in source code search
not actively maintained

Grub ^[23]

written in C#

Heritrix ^[24]

written in Java
archives content rather than indexing it

ht://Dig ^[25]

written in C++
not actively maintained

HtdigSearch ^[26]

It's just a MediaWiki plugin, not suitable for searching non-wiki sites

MWSearch ^[27]

Requires EzMwLucene (Java, not desirable) to be running on the servers to be searched
EzMwLucene is wiki-only, therefore MWSearch is wiki-only

OpenFTS ^[28]

written in Perl or TCL on top of PostgreSQL
Python interface available
not actively maintained

Plucene ^[29]

Perl port of Lucene
not actively maintained

RigorousSearch ^[30]

Crawls the MediaWiki database, not the web site

Doesn't work for non-MediaWiki web sites, including any non-wiki web site

Sphinx ^[31]

written in C++

designed to index SQL tables, not web pages

SphinxSearch ^[32]

written in C++

MediaWiki plug-in, so it's wiki-only

YaCy ^[33] - huzaifa examined

written in Java, but requires Sun Java
well maintained
support for peer search engine database exchanges
customized search parameters
fast indexing and web interface for querying the back-end db

Comparison by Requirements

Engine Name	Source Language	Integrated Web Crawler	Integrated Web Front-End	Programmable Categories	Application Interface
DataparkSearch	C
Egothor	Java
Ferret	Ruby
Indri	C/C++
Isearch	C++
KinoSearch	Perl/C	No (sample file crawler included)	No (sample included)	Yes	Yes (BDB/JSON)
Lucene	Java (GCJ)	No	No	Yes	Yes
mnoGoSearch	C
Namazu	Perl
Nutch	Java
Solr	Java (OpenJDK)	No	No (admin GUI only)	Yes	Yes
Swish-e	C/Perl	Yes (Perl)	No (sample included, but has problems)	No (but can search on META tags)	Yes (Perl and C APIs)
Terrier	Java
Xapian	C++	No (Combine with Swish-e?)	Yes (Omega CGI)	Yes	Yes (C++, Perl, Python, Ruby)
Zettair	C

Public Testing

<tbd>

Deployment Plan

<tbd>

References

[1] "Fedora Search Engine". Infrastructure Trac. https://fedorahosted.org/fedora-infrastructure/ticket/1055.
[2] "CLucene". CLucene Project. http://sourceforge.net/projects/clucene/.
[3] "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/.
[4] "Egothor". Egothor. http://www.egothor.org/.
[5] "Ferret". David Balmain. http://ferret.davebalmain.com/.
[6] "Lucy". Apache Software Foundation. http://lucene.apache.org/lucy/.
[7] "Indri". The Lemur Project. http://www.lemurproject.org/indri/.
[8] "Isearch". Isite. http://isite.awcubed.com/.
[9] "KinoSearch". Rectangular Research. http://www.rectangular.com/kinosearch/.
[10] "Lucene". Apache Software Foundation. http://lucene.apache.org/.
[11] "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations.
[12] "PyLucene". Apache Software Foundation. http://lucene.apache.org/pylucene/.
[13] "mnoGoSearch". LavTech. http://www.mnogosearch.org/.
[14] "Namazu". Namazu Project. http://www.namazu.org/.
[15] "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/.
[16] "Solr". Apache Software Foundation. http://lucene.apache.org/solr/.
[17] "Swish-e". Swish-e. http://swish-e.org/.
[18] "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/.
[19] "Xapian". Xapian Project. http://xapian.org/.
[20] "Flax". Flax. http://www.flax.co.uk/products.shtml.
[21] "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/.
[22] "Gonzui". SourceForge. http://gonzui.sourceforge.net/.
[23] "Grub". Wikia, Inc. http://grub.org/.
[24] "Heritrix". Internet Archive. http://crawler.archive.org/.
[25] "ht://Dig". The ht://Dig Group. http://www.htdig.org/.
[26] "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch.
[27] "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch.
[28] "OpenFTS". XWare. http://www.astronet.ru/xware/#fts.
[29] "Plucene". CPAN. http://search.cpan.org/~tmtm/Plucene-1.25.
[30] "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch.
[31] "Sphinx". Sphinx Technologies. http://sphinxsearch.com/.
[32] "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch.
[33] "YaCy". Karlsruhe Institute of Technology. http://yacy.net/.


Infrastructure/Search: Difference between revisions