From Fedora Project Wiki
Revision as of 03:42, 27 November 2009
Points of Contact
Project Sponsor
Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath
Secondary Contact info
Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure
Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure
Project Info
Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13
Description/Summary
Fedora needs a search engine [1]
Requirements
- Crawl the web sites (wiki and non-wiki)
- Search the web sites (wiki and non-wiki)
- Java, if any, must be the GCJ/OpenJDK versions in RHEL5; Sun/IBM/BEA Java is not acceptable
Preferences
- Python-based
- Programmable keywords to have control over what pages get displayed for certain keywords
- XML or library interface so other applications can use it
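As a rough illustration of the "programmable keywords" preference above, the following sketch promotes editor-chosen pages for certain query words ahead of whatever the search engine returns. The keyword table, the example.org URLs, and the function name `apply_pinned` are all hypothetical; none of the candidate engines is assumed to work this way internally.

```python
# Hypothetical keyword table: query word -> pages that should be shown first.
# The words and URLs are illustrative assumptions, not real Fedora configuration.
PINNED = {
    "download": ["https://example.org/get-fedora"],
    "join": ["https://example.org/wiki/Join"],
}


def apply_pinned(query, engine_results, pinned=PINNED):
    """Return the results with pinned pages for any matching keyword placed first."""
    promoted = []
    for word in query.lower().split():
        for url in pinned.get(word, []):
            if url not in promoted:
                promoted.append(url)
    # Keep the engine's own ranking for everything that was not promoted.
    rest = [url for url in engine_results if url not in promoted]
    return promoted + rest
```

Keeping the table outside the engine would let the same keywords apply regardless of which candidate is selected, although an engine with native category support may make such a wrapper unnecessary.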
Project Plan
- Investigate and evaluate existing open source search engines
- Select candidate software
- Create public test instances of candidate software
- Test for functionality, performance, and impact (re-evaluate, if necessary)
- Create capacity and deployment plans
- Deploy
Resources Needed
- Public Test for testing candidate software
- Permanent home(s) for deployment
- Web server(s)
- Database server(s) (maybe)
Software Investigation and Evaluation
In Progress
- CLucene [2]
- C++ port of Lucene
- in Fedora already
- described as beta by the developers
- DataparkSearch [3]
- written in C
- Egothor [4]
- written in Java
- Ferret [5]
- Ruby port of Lucene
- KinoSearch and Ferret intend to merge as Lucy [6]
- Indri [7]
- written in C/C++
- Isearch [8]
- written in C++
- KinoSearch [9] - akistler examined
- Description
- Perl/C port of Lucene
- in Fedora already
- maintainer Rectangular Research appears to be a single developer, who considers KinoSearch to be alpha software
- KinoSearch and Ferret intend to merge as Lucy [6]
- Evaluation
- Search engine library with a sample indexer and search page, rather than a fully functional application. Stores indices in Berkeley DB files with JSON interfaces. Allows custom-designed indices, including categories (exact match), which fulfills the "programmable keywords" preference. Each index of a document source is a single write-once collection of files (BDB and JSON) in its own directory. Rerunning the indexer creates a new directory, which makes the old directory obsolete if all the old documents are included; the old directory then needs to be cleaned up. Postings can, however, be deleted from an index. It is also possible to index only the new documents, but that is not efficient.
- Requirements
- buildrequires
- gcc
- (EPEL) perl-Module-Build
- requires
- (EPEL) perl-JSON-XS
- Problem: wants version 1.53, but EPEL has 1.43
- Note: http://web.archive.org/web/20071122035408/search.cpan.org/src/MLEHMANN/JSON-XS-1.53/
- Note: works with 1.43 anyway
- (EPEL) perl-Lingua-Stem-Snowball
- (EPEL) perl-Lingua-StopWords
- (EPEL) perl-Parse-RecDescent
- sample indexer reads files from the file system and requires
- (EPEL) perl-HTML-Tree
- sample cgi search script requires
- (CPAN) Data::Pageset (which requires Data::Page)
- (EPEL) perl-Test-Exception
- (EPEL) perl-Class-Accessor-Chained
- Lucene [10] - akistler examined
- Description
- written in Java, but ported to others [11]
- in Fedora already
- PyLucene [12] is a Python wrapper around Java Lucene
- Evaluation
- Search engine library meant to be integrated into applications.
- Requirements
- buildrequires (based on 1.4.3-f7)
- ant
- ant-junit
- java-1.4.2-gcj-compat-devel
- javacc
- jpackage-utils
- junit
- make
- requires (based on 1.4.3-f7)
- java-1.4.2-gcj-compat
- mnoGoSearch [13]
- written in C
- Namazu [14] - Huzaifa investigating
- written in Perl
- in Fedora already
- Nutch [15]
- written in Java
- based on Lucene
- Solr [16] - akistler examined
- Description
- written in Java
- based on Lucene
- Evaluation
- The documentation describes installing Sun Java to run Solr, but OpenJDK 1.5 or later is fine. Solr needs a Java servlet container in which to run. It comes with Jetty, but other containers (e.g., Tomcat) should work as well. It currently supports only UTF-8 characters. In essence, Solr provides an HTTP admin GUI for a search engine that uses a superset of the Lucene query syntax. The schema is very flexible. Setup is done almost entirely through XML files. Applications can query the servlet port and get XML or JSON responses.
- Requirements
- buildrequires
- ant (note that ant currently pulls in java-gcj-compat, too, but it appears not to be a problem)
- ant-junit
- java-1.6.0-openjdk-devel
- junit
- requires
- java-1.6.0-openjdk
- Swish-e [17] - akistler examined
- Description
- written in C
- Note: Swish++ is a rewrite in C++ (not evaluated here)
- Evaluation
- Search engine with a built-in web crawler, a built-in file system crawler, and an interface for an external crawler. The distribution includes sample search pages which use the Perl API. There is also a C API. The index is not customizable, but it can store metawords (exact match) and the path for each document. The documentation acknowledges that the software only supports ASCII, but some multi-byte character sets (MBCS) may also work.
- Requirements
- buildrequires
- gcc
- make
- libxml2-devel
- zlib-devel
- requires
- libxml2
- zlib
- perl-libwww-perl (for the built-in spider)
- others as desired to index documents (pdf, etc.)
- Terrier (TERabyte RetrIEveR) [18]
- written in Java
- Xapian [19] - Allen investigating
- Description
- written in C++
- bindings to Python, Ruby, and Perl XS
- Omega is a CGI application that provides a Xapian front-end for indexing and searching
- xapian-core, xapian-bindings, and perl-Search-Xapian in Fedora already; xapian-omega is not
- additional bindings to PHP, Java, and more (?)
- built-in web crawler is a script that interfaces with ht://Dig (?)
- Flax [20] is another search engine built on top of Xapian and CherryPy
- Evaluation
- <tbd>
- Requirements
- xapian-core buildrequires
- gcc gcc-c++
- make
- zlib-devel
- xapian-bindings buildrequires (not including gcc gcc-c++ make)
- python python-devel
- ruby ruby-devel
- xapian-core-devel
- perl-Search-Xapian buildrequires (not including gcc gcc-c++ make)
- perl
- xapian-core-devel
- xapian-omega buildrequires (not including gcc gcc-c++ make)
- libtool
- xapian-core-devel
- Zettair [21]
- written in C
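The Solr evaluation above notes that applications can query the servlet port and get XML or JSON responses. A minimal Python sketch of that interaction follows. It assumes Solr's standard `select` handler with the `wt=json` response writer; the base URL (Jetty's default port 8983 on localhost) and the helper names are illustrative assumptions, not a tested Fedora deployment.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, query, rows=10):
    """Build a URL for Solr's select handler that asks for a JSON response."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return f"{base_url}/select?{params}"


def parse_docs(body):
    """Extract the hit count and the matching documents from a Solr JSON reply."""
    response = json.loads(body)["response"]
    return response["numFound"], response["docs"]


def search(base_url, query, rows=10):
    """Query a running Solr instance (needs network access to the servlet port)."""
    with urlopen(build_select_url(base_url, query, rows)) as reply:
        return parse_docs(reply.read().decode("utf-8"))
```

For example, `search("http://localhost:8983/solr", "yum update")` would return the number of hits and up to ten documents, with fields as defined by whatever schema the instance uses.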
Not Suitable
- Gonzui [22]
- written in Ruby
- specializes in source code search
- not actively maintained
- Grub [23]
- written in C#
- Heritrix [24]
- written in Java
- archives content rather than indexing it
- ht://Dig [25]
- written in C++
- not actively maintained
- HtdigSearch [26]
- It's just a MediaWiki plugin, not suitable for searching non-wiki sites
- MWSearch [27]
- Requires EzMwLucene (Java, not desirable) to be running on the servers to be searched
- EzMwLucene is wiki-only, therefore MWSearch is wiki-only
- OpenFTS [28]
- written in Perl or Tcl on top of PostgreSQL
- Python interface available
- not actively maintained
- Plucene [29]
- Perl port of Lucene
- not actively maintained
- RigorousSearch [30]
- Crawls the MediaWiki database, not the web site
- Doesn't work for non-MediaWiki web sites, including any non-wiki web site
- Sphinx [31]
- written in C++
- designed to index SQL tables, not web pages
- SphinxSearch [32]
- written in C++
- MediaWiki plug-in, so it's wiki-only
- YaCy [33] - huzaifa examined
- written in Java, but requires Sun Java
- well maintained
- support for peer search engine database exchanges
- customized search parameters
- fast indexing and web interface for querying the back-end db
Comparison by Requirements
Engine Name | Source Language | Integrated Web Crawler | Integrated Web Front-End | Programmable Categories | Application Interface |
---|---|---|---|---|---|
DataparkSearch | C | | | | |
Egothor | Java | | | | |
Ferret | Ruby | | | | |
Indri | C/C++ | | | | |
Isearch | C++ | | | | |
KinoSearch | Perl/C | No (sample file crawler included) | No (sample included) | Yes | Yes (BDB/JSON) |
Lucene | Java (GCJ) | No | No | Yes | Yes |
mnoGoSearch | C | | | | |
Namazu | Perl | | | | |
Nutch | Java | | | | |
Solr | Java (OpenJDK) | No | No (admin GUI only) | Yes | Yes |
Swish-e | C/Perl | Yes (Perl) | No (sample included, but has problems) | No (but can search on META tags) | Yes (Perl and C APIs) |
Terrier | Java | | | | |
Xapian | C++ | | | | |
Zettair | C | | | | |
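To narrow the field once more rows are filled in, the comparison can be treated as data and filtered by requirement. The sketch below covers only the four engines evaluated so far and reduces each table cell to a plain yes/no, which drops the parenthetical notes (for example, "No (sample included)" becomes `False`). The class, field, and function names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    web_crawler: bool
    web_front_end: bool
    programmable_categories: bool
    application_interface: bool


# Only the engines whose rows are filled in above; notes are reduced to booleans.
EVALUATED = [
    Engine("KinoSearch", False, False, True, True),
    Engine("Lucene", False, False, True, True),
    Engine("Solr", False, False, True, True),
    Engine("Swish-e", True, False, False, True),
]


def candidates(engines, *required):
    """Names of engines for which every attribute named in `required` is True."""
    return [e.name for e in engines if all(getattr(e, attr) for attr in required)]


print(candidates(EVALUATED, "programmable_categories", "application_interface"))
print(candidates(EVALUATED, "web_crawler"))
```

With the data above, the first call lists KinoSearch, Lucene, and Solr, and the second lists only Swish-e, which shows that none of the engines evaluated so far covers every column on its own.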
Public Testing
<tbd>
Deployment Plan
<tbd>
References
- [1] "Fedora Search Engine". Infrastructure Trac. https://fedorahosted.org/fedora-infrastructure/ticket/1055
- [2] "CLucene". CLucene Project. http://sourceforge.net/projects/clucene/
- [3] "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/
- [4] "Egothor". Egothor. http://www.egothor.org/
- [5] "Ferret". David Balmain. http://ferret.davebalmain.com/
- [6] "Lucy". Apache Software Foundation. http://lucene.apache.org/lucy/
- [7] "Indri". The Lemur Project. http://www.lemurproject.org/indri/
- [8] "Isearch". Isite. http://isite.awcubed.com/
- [9] "KinoSearch". Rectangular Research. http://www.rectangular.com/kinosearch/
- [10] "Lucene". Apache Software Foundation. http://lucene.apache.org/
- [11] "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations
- [12] "PyLucene". Apache Software Foundation. http://lucene.apache.org/pylucene/
- [13] "mnoGoSearch". LavTech. http://www.mnogosearch.org/
- [14] "Namazu". Namazu Project. http://www.namazu.org/
- [15] "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/
- [16] "Solr". Apache Software Foundation. http://lucene.apache.org/solr/
- [17] "Swish-e". Swish-e. http://swish-e.org/
- [18] "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/
- [19] "Xapian". Xapian Project. http://xapian.org/
- [20] "Flax". Flax. http://www.flax.co.uk/products.shtml
- [21] "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/
- [22] "Gonzui". SourceForge. http://gonzui.sourceforge.net/
- [23] "Grub". Wikia, Inc. http://grub.org/
- [24] "Heritrix". Internet Archive. http://crawler.archive.org/
- [25] "ht://Dig". The ht://Dig Group. http://www.htdig.org/
- [26] "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch
- [27] "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch
- [28] "OpenFTS". XWare. http://www.astronet.ru/xware/#fts
- [29] "Plucene". CPAN. http://search.cpan.org/~tmtm/Plucene-1.25
- [30] "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch
- [31] "Sphinx". Sphinx Technologies. http://sphinxsearch.com/
- [32] "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch
- [33] "YaCy". Karlsruhe Institute of Technology. http://yacy.net/