Revision as of 22:23, 22 October 2009
Points of Contact
Project Sponsor
Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath
Secondary Contact info
Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure
Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure
Name: Keiran Smith
Fedora Account Name: affix
Group: Infrastructure
Project Info
Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13
Description/Summary
Fedora needs a search engine [1].
Requirements
- Crawl the web sites (wiki and non-wiki)
- Search the web sites (wiki and non-wiki)
Preferences
- Python-based (no Java)
- Programmable keywords to have control over what pages get displayed for certain keywords
- XML or library interface so other applications can use it
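The "programmable keywords" preference above could be met in several ways. The following is only a minimal sketch of one of them, assuming an administrator-maintained table that maps keywords to pages that should be shown first. The names `PINNED` and `apply_pins` and the example.org URLs are hypothetical and do not come from any of the engines evaluated below.

```python
# Hypothetical sketch: the table, function name, and URLs are illustrative only.
from typing import Dict, List

# Assumed admin-maintained table: lowercase keyword -> pages to show first.
PINNED: Dict[str, List[str]] = {
    "download": ["https://example.org/get-fedora"],
    "install": ["https://example.org/docs/install-guide"],
}


def apply_pins(query: str, engine_results: List[str],
               pinned: Dict[str, List[str]] = PINNED) -> List[str]:
    """Return results with the pinned pages for any query keyword placed first."""
    promoted: List[str] = []
    for word in query.lower().split():
        for url in pinned.get(word, []):
            if url not in promoted:
                promoted.append(url)
    # Keep the engine's own ordering for everything that was not promoted.
    rest = [url for url in engine_results if url not in promoted]
    return promoted + rest
```

A wrapper like this sits outside the search engine, so it would work with any candidate that exposes an XML or library interface.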
Project Plan
- Investigate and evaluate existing open source search engines
- Select candidate software
- Create public test instance of candidate software
- Test for functionality, performance, and impact (re-evaluate, if necessary)
- Create capacity and deployment plans
- Deploy
Resources Needed
- Public Test for testing candidate software
- Permanent home(s) for deployment
- Web server(s)
- Database server(s)
Software Investigation and Evaluation
In Progress
- DataparkSearch [2]
- written in C
- Indri [3]
- written in C/C++
- Isearch [4]
- written in C++
- KinoSearch [5] - Allen investigating
- Perl/C port of Lucene
- Evaluation
- Search engine library with a sample indexer and search page, rather than a fully functional application. Stores indices in Berkeley DB files with JSON format specifications. Allows custom-designed indices. Each document source is indexed as a single write-once index, which is a directory. Rerunning the indexer creates a new directory, which obsoletes the old directory if all of the old documents are included; the old directory then needs to be cleaned up. Postings can, however, be deleted from an index. It is also possible to index only the new documents, but that is not efficient.
- Requirements
- buildrequires
- gcc
- (EPEL) perl-Module-Build
- requires
- (EPEL) perl-JSON-XS
- Problem: KinoSearch asks for version 1.53, but EPEL has 1.43
- Note: http://web.archive.org/web/20071122035408/search.cpan.org/src/MLEHMANN/JSON-XS-1.53/
- Note: works with 1.43, anyway
- (EPEL) perl-Lingua-Stem-Snowball
- (EPEL) perl-Lingua-StopWords
- (EPEL) perl-Parse-RecDescent
- sample indexer reads files from the file system and requires
- (EPEL) perl-HTML-Tree
- sample cgi search script requires
- (CPAN) Data::Pageset (which requires Data::Page)
- (EPEL) perl-Test-Exception
- (EPEL) perl-Class-Accessor-Chained
- mnoGoSearch [6]
- written in C
- Namazu [7] - Huzaifa investigating
- written in Perl
- Swish-e [8]
- written in C
- Swish++ is a rewrite in C++
- Xapian [9]
- written in C++
- Bindings to Python, Perl, PHP, Ruby, Java, and more
- It looks to me like we need a web crawler built on top of this. Searching via Google turned up Flax [10].
- YaCy [11]
- written in Java
- Zettair [12]
- written in C
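As a rough illustration of the write-once index directories noted in the KinoSearch evaluation above, the sketch below models which old directories become obsolete after the indexer is rerun. It is not the KinoSearch API: the function name, the directory names, and the idea of tracking document identifiers per directory are assumptions made only for this example.

```python
# Hypothetical model of write-once index directories; not the KinoSearch API.
from typing import Dict, List, Set


def obsolete_directories(runs: List[str],
                         contents: Dict[str, Set[str]]) -> List[str]:
    """Return directories whose documents are all included in a later run.

    runs: index directory names, oldest first.
    contents: directory name -> set of document identifiers it indexes.
    """
    obsolete: List[str] = []
    for i, old in enumerate(runs):
        # An old directory can be cleaned up only if some single later
        # directory re-indexed every document it holds.
        if any(contents[old] <= contents[newer] for newer in runs[i + 1:]):
            obsolete.append(old)
    return obsolete
```

In this model, a run that indexes only new documents never obsoletes anything, so directories accumulate until a full reindex is done.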
Not Suitable
- Egothor [13]
- written in Java
- Grub [14]
- written in C#
- Heritrix [15]
- written in Java
- archives content rather than simply indexing it
- ht://Dig [16]
- written in C++
- not actively maintained
- HtdigSearch [17]
- It's just a MediaWiki plugin, not suitable for searching non-wiki sites
- Lucene [18]
- written in Java, but ported to other languages [19]
- MWSearch [20]
- Requires EzMwLucene (Java, not desirable) to be running on the servers to be searched
- EzMwLucene is wiki-only, therefore MWSearch is wiki-only
- Nutch [21]
- written in Java
- based on Lucene
- OpenFTS [22]
- written in Perl or Tcl on top of PostgreSQL
- Python interface available
- not actively maintained
- Plucene [23]
- Perl port of Lucene
- not actively maintained
- RigorousSearch [24]
- Crawls the MediaWiki database, not the web site
- Doesn't work for non-MediaWiki web sites, including any non-wiki web site
- Sphinx [25]
- written in C++
- designed to index SQL tables, not web pages
- SphinxSearch [26]
- written in C++
- Wiki-only (?)
- Terrier (TERabyte RetrIEveR) [27]
- written in Java
Public Testing
<tbd>
Deployment Plan
<tbd>
References
- [1] "Fedora Search Engine". Infrastructure/Tickets. https://fedorahosted.org/fedora-infrastructure/ticket/1055.
- [2] "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/.
- [3] "Indri". The Lemur Project. http://www.lemurproject.org/indri/.
- [4] "Isearch". Isite. http://isite.awcubed.com/.
- [5] "KinoSearch". Rectangular Research. http://www.rectangular.com/kinosearch/.
- [6] "mnoGoSearch". LavTech. http://www.mnogosearch.org/.
- [7] "Namazu". Namazu Project. http://www.namazu.org/.
- [8] "Swish-e". Swish-e. http://swish-e.org/.
- [9] "Xapian". Xapian Project. http://xapian.org/.
- [10] "Flax". Flax. http://www.flax.co.uk/products.shtml.
- [11] "YaCy". Karlsruhe Institute of Technology. http://yacy.net/.
- [12] "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/.
- [13] "Egothor". Egothor. http://www.egothor.org/.
- [14] "Grub". Wikia, Inc. http://grub.org/.
- [15] "Heritrix". Internet Archive. http://crawler.archive.org/.
- [16] "ht://Dig". The ht://Dig Group. http://www.htdig.org/.
- [17] "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch.
- [18] "Lucene". Apache Software Foundation. http://lucene.apache.org/.
- [19] "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations.
- [20] "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch.
- [21] "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/.
- [22] "OpenFTS". SourceForge. http://openfts.sourceforge.net/.
- [23] "Plucene". CPAN. http://search.cpan.org/~tmtm/Plucene-1.25.
- [24] "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch.
- [25] "Sphinx". Sphinx Technologies. http://sphinxsearch.com/.
- [26] "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch.
- [27] "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/.