From Fedora Project Wiki
(→‎Software Investigation and Evaluation: Added references for additional engines; Sorted Perl & Ruby to the top)

Revision as of 22:39, 12 October 2009

Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Name: Keiran Smith
Fedora Account Name: affix
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary: Fedora needs a search engine[1]

Requirements:

  • Crawl the web sites (wiki and non-wiki)
  • Search the web sites (wiki and non-wiki)
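
As a rough illustration of the two requirements above, the sketch below walks a set of pages by following links and then answers queries from an inverted index. It is not taken from any of the candidate engines. The `pages` dictionary (URL to HTML text) is a hypothetical in-memory stand-in for fetching real wiki and non-wiki pages, and the names `PageParser`, `crawl`, and `search` are made up for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class PageParser(HTMLParser):
    """Collect lowercase words and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every <a href="..."> element.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def crawl(pages, start):
    """Breadth-first walk over `pages` (hypothetical url -> html dict).

    Returns an inverted index mapping each word to the set of URLs
    on which it appears. Pages not reachable from `start` are skipped.
    """
    index = defaultdict(set)
    queue, seen = [start], {start}
    while queue:
        url = queue.pop(0)
        parser = PageParser()
        parser.feed(pages[url])
        for word in parser.words:
            index[word].add(url)
        for link in parser.links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, query):
    """Return the sorted URLs that contain every term of the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)
```

A real deployment would fetch pages over HTTP, respect robots.txt, and rank results. The sketch only shows that one index can serve wiki and non-wiki pages alike, as long as the crawler reaches them through links.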

Preferences:

  • Python-based (no Java)
  • Programmable keywords to have control over what pages get displayed for certain keywords
  • XML or library interface so other applications can use it
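
The last two preferences could look roughly like the sketch below: a keyword table that pins chosen pages ahead of the ordinary results, and a small XML serialization that other applications could consume. Everything here is an assumption made for illustration: the `PINNED` table, the example.org URLs, the function names, and the XML element names do not come from any candidate engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword table: query term -> pages to pin at the top.
PINNED = {
    "download": ["https://example.org/get-fedora"],
}


def apply_pins(query, organic, pinned=PINNED):
    """Put pinned pages for any query term ahead of the organic results."""
    results = []
    for term in query.lower().split():
        for url in pinned.get(term, []):
            if url not in results:
                results.append(url)
    # Append the engine's own results, skipping anything already pinned.
    for url in organic:
        if url not in results:
            results.append(url)
    return results


def to_xml(query, urls):
    """Serialize a result list as a small XML document (made-up schema)."""
    root = ET.Element("results", {"query": query})
    for rank, url in enumerate(urls, start=1):
        ET.SubElement(root, "result", {"rank": str(rank)}).text = url
    return ET.tostring(root, encoding="unicode")
```

Keeping the pinning step separate from the engine means the keyword table could be maintained without touching the index, whichever candidate is eventually selected.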

Project plan (Detailed):

  1. Investigate and evaluate existing open source search engines
  2. Select candidate software
  3. Create public test instance of candidate software
  4. Test for functionality, performance, and impact (re-evaluate, if necessary)
  5. Create capacity and deployment plans
  6. Deploy

Specific resources needed

  • Public Test for testing candidate software
  • Permanent home(s) for deployment
    • Web server(s)
    • Database server(s)

Software Investigation and Evaluation

  • HtdigSearch [2]
    • Huzaifa (in progress)
  • SphinxSearch [3]
    • Huzaifa (in progress)
  • Ferret
    • Ruby port of Lucene
  • Gonzui [4] (specializes in source code search)
    • written in Ruby
    • not actively maintained
  • KinoSearch
    • Perl port of Lucene
  • Namazu [5]
    • written in Perl
  • OpenFTS [6]
    • Not suitable
    • written in Perl or TCL on top of PostgreSQL
    • Python interface available
    • not actively maintained
  • Plucene
    • Perl port of Lucene
  • DataparkSearch [7]
    • Not suitable
    • written in C
  • Egothor [8]
    • Not suitable
    • written in Java
  • Grub [9]
    • Not suitable
    • written in C#
  • ht://Dig [10]
    • Not suitable
    • written in C++
    • not actively maintained
  • Indri [11]
    • Not suitable
    • written in C/C++
  • Isearch [12]
    • Not suitable
    • written in C++
  • Lucene [13]
    • Not suitable
    • originally in Java, ported to other languages
    • Perl ports are Plucene and KinoSearch; Ruby port is Ferret
    • see Lucene Implementations [14]
  • mnoGoSearch [15]
    • Not suitable
    • written in C
  • MWSearch [16]
    • Not suitable
    • Requires EzMwLucene (Java, not desirable) to be running on the servers to be searched
    • EzMwLucene is wiki-only, therefore MWSearch is wiki-only
  • Nutch [17]
    • Not suitable
    • written in Java
    • based on Lucene
  • RigorousSearch [18]
    • Not suitable
    • Crawls the MediaWiki database, not the web site, so it cannot index any non-MediaWiki site, wiki or otherwise.
  • Sphinx [19]
    • Not suitable
    • written in C++
  • Swish-e [20]
    • Not suitable
    • written in C
    • Swish++ is a rewrite in C++
  • Terrier (TERabyte RetrIEveR) [21]
    • Not suitable
    • written in Java
  • Xapian [22]
    • Not suitable
    • written in C++
  • YaCy [23]
    • Not suitable
    • written in Java
  • Zettair [24]
    • Not suitable
    • written in C

  1. "Fedora Search Engine". Infrastructure/Tickets. https://fedorahosted.org/fedora-infrastructure/ticket/1055. 
  2. "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch. 
  3. "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch. 
  4. "Gonzui". SourceForge. http://gonzui.sourceforge.net/. 
  5. "Namazu". Namazu Project. http://www.namazu.org/. 
  6. "OpenFTS". SourceForge. http://openfts.sourceforge.net/. 
  7. "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/. 
  8. "Egothor". Egothor. http://www.egothor.org/. 
  9. "Grub". Wikia, Inc. http://grub.org/. 
  10. "ht://Dig". The ht://Dig Group. http://www.htdig.org/. 
  11. "Indri". The Lemur Project. http://www.lemurproject.org/indri/. 
  12. "Isearch". Isite. http://isite.awcubed.com/. 
  13. "Lucene". Apache Software Foundation. http://lucene.apache.org/. 
  14. "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations. 
  15. "mnoGoSearch". LavTech. http://www.mnogosearch.org/. 
  16. "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch. 
  17. "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/. 
  18. "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch. 
  19. "Sphinx". Sphinx Technologies. http://sphinxsearch.com/. 
  20. "Swish-e". Swish-e. http://swish-e.org/. 
  21. "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/. 
  22. "Xapian". Xapian Project. http://xapian.org/. 
  23. "YaCy". Karlsruhe Institute of Technology. http://yacy.net/. 
  24. "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/. 

Public Testing

<tbd>

Deployment Plan

<tbd>

References