From Fedora Project Wiki
Revision as of 20:46, 29 October 2009

Points of Contact

Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary

Fedora needs a search engine[1]

Requirements

  • Crawl the web sites (wiki and non-wiki)
  • Search the web sites (wiki and non-wiki)
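The crawl requirement can be illustrated with a minimal sketch of the link-discovery step, assuming pages are plain HTML. The helper names (`LinkCollector`, `same_site_links`) and the host names used with them are hypothetical and do not come from any of the candidate engines; a real crawler would also need fetching, politeness delays, and robots.txt handling.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute link targets from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def same_site_links(base_url, html, allowed_hosts):
    """Return de-duplicated links whose host is in allowed_hosts (an assumed whitelist)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    kept = []
    for link in collector.links:
        if urlparse(link).hostname in allowed_hosts and link not in kept:
            kept.append(link)
    return kept
```

Listing both the wiki host and the non-wiki hosts in `allowed_hosts` would let one crawl cover both kinds of site while ignoring external links.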

Preferences

  • Python-based
    Note: Other languages are permitted, but software written in Java must run on the GCJ/OpenJDK versions shipped in RHEL5. Sun/IBM/BEA Java is not acceptable.
  • Programmable keywords, to control which pages are displayed for certain keywords
  • XML or library interface so other applications can use it
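One way to picture the last two preferences together is the sketch below. It assumes "programmable keywords" means a table that pins chosen pages to the top of the results for certain query words, and that the XML interface returns a flat list of hits. The keyword table, the URLs, and the XML element names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword table: each keyword maps to pages that should be
# listed first. The URLs are placeholders, not real Fedora pages.
PINNED = {
    "install": ["https://example.org/wiki/Install"],
    "download": ["https://example.org/get"],
}


def apply_pins(query, engine_results, pinned=PINNED):
    """Return results with the pinned pages for any query keyword listed first."""
    ordered = []
    for word in query.lower().split():
        for url in pinned.get(word, []):
            if url not in ordered:
                ordered.append(url)
    # Append the engine's own results, skipping anything already pinned.
    for url in engine_results:
        if url not in ordered:
            ordered.append(url)
    return ordered


def to_xml(query, urls):
    """Serialize a result list as a small XML document (element names are assumed)."""
    root = ET.Element("results", {"query": query})
    for rank, url in enumerate(urls, start=1):
        hit = ET.SubElement(root, "hit", {"rank": str(rank)})
        hit.text = url
    return ET.tostring(root, encoding="unicode")
```

Other applications could then call `to_xml(query, apply_pins(query, hits))` and parse the output, whichever engine produced `hits`.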

Project Plan

  1. Investigate and evaluate existing open source search engines
  2. Select candidate software
  3. Create public test instance of candidate software
  4. Test for functionality, performance, and impact (re-evaluate, if necessary)
  5. Create capacity and deployment plans
  6. Deploy

Resources Needed

  • Public Test for testing candidate software
  • Permanent home(s) for deployment
    • Web server(s)
    • Database server(s)

Software Investigation and Evaluation

In Progress

  • DataparkSearch [2]
    written in C
  • Egothor [3]
    written in Java
  • Ferret [4]
    Ruby port of Lucene
    KinoSearch and Ferret intend to merge as Lucy [5]
  • Indri [6]
    written in C/C++
  • Isearch [7]
    written in C++
  • KinoSearch [8] - Allen investigating
    • Description
      Perl/C port of Lucene
      The maintainer, Rectangular Research, appears to be just one person, who considers KinoSearch to be alpha software
      KinoSearch and Ferret intend to merge as Lucy [5]
    • Evaluation
      KinoSearch is a search engine library with a sample indexer and search page rather than a fully functional application. It stores indices in Berkeley DB files with JSON interfaces. It allows custom-designed indices, including categories (exact match), to fulfill the "programmable keywords" requirement. Each document index on each document source is a single write-once file collection (BDB and JSON) in a unique directory. Rerunning the indexer creates a new directory, which obsoletes the old directory if all the old documents are included; the old directory then needs to be cleaned up. Postings can, however, be deleted from an index. It is also possible to index only the new documents, but that is not efficient.
    • Requirements
      buildrequires
        • gcc
        • (EPEL) perl-Module-Build
      requires
        • (EPEL) perl-JSON-XS
          Problem: KinoSearch asks for version 1.53, but EPEL has 1.43
          Note: 1.53 is archived at http://web.archive.org/web/20071122035408/search.cpan.org/src/MLEHMANN/JSON-XS-1.53/
          Note: KinoSearch works with 1.43 anyway
        • (EPEL) perl-Lingua-Stem-Snowball
        • (EPEL) perl-Lingua-StopWords
        • (EPEL) perl-Parse-RecDescent
      The sample indexer reads files from the file system and requires
        • (EPEL) perl-HTML-Tree
      The sample CGI search script requires
        • (CPAN) Data::Pageset (which requires Data::Page)
        • (EPEL) perl-Test-Exception
        • (EPEL) perl-Class-Accessor-Chained
  • Lucene [9]
    written in Java, but ported to other languages [10]
  • mnoGoSearch [11]
    written in C
  • Namazu [12] - Huzaifa investigating
    written in Perl
  • Nutch [13]
    • written in Java
    • based on Lucene
  • Swish-e [14]
    written in C
    Swish++ is a rewrite in C++
  • Terrier (TERabyte RetrIEveR) [15]
    written in Java
  • Xapian [16]
    written in C++
    Bindings to Python, Perl, PHP, Ruby, Java, and more
    Omega (built in) provides a more complete search engine experience on top of core Xapian
    The built-in web crawler is a script that interfaces with ht://Dig
    Flax [17] is another search engine built on top of Xapian and CherryPy
  • Zettair [18]
    written in C
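The KinoSearch evaluation notes that each full run of the indexer writes a new index directory and leaves the old one to be cleaned up. A minimal sketch of that cleanup follows, assuming every index lives in its own subdirectory of a single root and that the most recently modified directory is the current one. The directory names and the `prune_old_indexes` helper are hypothetical and are not part of KinoSearch. It is only safe after a run that reindexed all the old documents.

```python
import os
import shutil
import tempfile
from pathlib import Path


def prune_old_indexes(index_root, keep=1):
    """Delete all but the `keep` most recently modified index directories."""
    root = Path(index_root)
    dirs = sorted(
        (p for p in root.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    removed = []
    for old in dirs[keep:]:
        # Assumption: an older directory is fully superseded by the newest one.
        shutil.rmtree(old)
        removed.append(old.name)
    return removed


def demo():
    """Build three fake index directories and prune all but the newest."""
    with tempfile.TemporaryDirectory() as tmp:
        for age, name in enumerate(["index-a", "index-b", "index-c"]):
            d = Path(tmp) / name
            d.mkdir()
            # Give each directory a distinct modification time; index-c is newest.
            os.utime(d, (1_000_000 + age, 1_000_000 + age))
        removed = prune_old_indexes(tmp, keep=1)
        remaining = sorted(p.name for p in Path(tmp).iterdir())
    return sorted(removed), remaining
```

Keeping more than one directory (`keep=2`, for example) would leave a fallback index in place while a new one is being validated.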

Not Suitable

  • Grub [19]
    written in C#
  • Heritrix [20]
    • written in Java
    • archives content rather than indexing it
  • ht://Dig [21]
    • written in C++
    • not actively maintained
  • HtdigSearch [22]
    It's just a MediaWiki plugin, not suitable for searching non-wiki sites
  • MWSearch [23]
    • Requires EzMwLucene (Java, not desirable) to be running on the servers to be searched
    • EzMwLucene is wiki-only, therefore MWSearch is wiki-only
  • OpenFTS [24]
    • written in Perl or TCL on top of PostgreSQL
    • Python interface available
    • not actively maintained
  • Plucene [25]
    • Perl port of Lucene
    • not actively maintained
  • RigorousSearch [26]
    Crawls the MediaWiki database, not the web site
    Doesn't work for non-MediaWiki web sites, including any non-wiki web site
  • Sphinx [27]
    written in C++
    designed to index SQL tables, not web pages
  • SphinxSearch [28]
    written in C++
    MediaWiki plug-in, so it's wiki-only
  • YaCy [29] - Examined by Huzaifa
    • written in Java, but requires Sun Java
    • well maintained
    • supports database exchanges between peer search engines
    • customized search parameters
    • fast indexing, and a web interface for querying the back-end database

Public Testing

<tbd>

Deployment Plan

<tbd>

References

  1. "Fedora Search Engine". Infrastructure/Tickets. https://fedorahosted.org/fedora-infrastructure/ticket/1055. 
  2. "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/. 
  3. "Egothor". Egothor. http://www.egothor.org/. 
  4. "Ferret". David Balmain. http://ferret.davebalmain.com/. 
  5. "Lucy". Apache Software Foundation. http://lucene.apache.org/lucy/. 
  6. "Indri". The Lemur Project. http://www.lemurproject.org/indri/. 
  7. "Isearch". Isite. http://isite.awcubed.com/. 
  8. "KinoSearch". Rectangular Research. http://www.rectangular.com/kinosearch/. 
  9. "Lucene". Apache Software Foundation. http://lucene.apache.org/. 
  10. "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations. 
  11. "mnoGoSearch". LavTech. http://www.mnogosearch.org/. 
  12. "Namazu". Namazu Project. http://www.namazu.org/. 
  13. "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/. 
  14. "Swish-e". Swish-e. http://swish-e.org/. 
  15. "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/. 
  16. "Xapian". Xapian Project. http://xapian.org/. 
  17. "Flax". Flax. http://www.flax.co.uk/products.shtml. 
  18. "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/. 
  19. "Grub". Wikia, Inc.. http://grub.org/. 
  20. "Heritrix". Internet Archive. http://crawler.archive.org/. 
  21. "ht://Dig". The ht://Dig Group. http://www.htdig.org/. 
  22. "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch. 
  23. "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch. 
  24. "OpenFTS". SourceForge. http://openfts.sourceforge.net/. 
  25. "Plucene". CPAN. http://search.cpan.org/~tmtm/Plucene-1.25. 
  26. "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch. 
  27. "Sphinx". Sphinx Technologies. http://sphinxsearch.com/. 
  28. "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch. 
  29. "YaCy". Karlsruhe Institute of Technology. http://yacy.net/.