Latest revision as of 09:05, 1 February 2011

Points of Contact

Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary

Fedora needs a search engine^[1]

Requirements

Crawl the web sites (wiki and non-wiki)
Search the web sites (wiki and non-wiki)
Java, if any, must be the GCJ/OpenJDK versions in RHEL5; Sun/IBM/BEA Java is not acceptable

Preferences

Python-based
Programmable keywords to have control over what pages get displayed for certain keywords
XML or library interface so other applications can use it
Support multiple-language text (i.e., Unicode)

Project Plan

Investigate and evaluate existing open source search engines
Select candidate software
Create public test instances of candidate software
Test for functionality, performance, and impact (re-evaluate, if necessary)
Create capacity and deployment plans
Deploy

Resources Needed

Public Test for testing candidate software
Permanent home(s) for deployment
- Web server(s)
- Database server(s) (maybe)

Software Investigation and Evaluation

Comparison by Requirements

Engine Name	Source Language	Integrated Crawler	Integrated Search Tool	Programmable Categories	Application Interface
DataparkSearch	C	Yes	Yes	Yes (Tags)	Yes (Native C API)
Egothor	Java
Ferret	Ruby
Indri	C/C++
KinoSearch	Perl/C	No (sample file crawler included)	No (sample included)	Yes	Yes (BDB/JSON)
mnoGoSearch	C	Yes	Yes	Yes (Tags and Hierarchical categories)	Yes (Native C API)
Nutch	Java	Yes (OpenJDK command line)	Yes (Tomcat servlet)	No	Yes (Java)
Swish-e	C/Perl	Yes (Perl)	Sort Of (sample included, but has problems)	No (but can search on META tags)	Yes (Perl and C APIs)
Xapian	C++	Sort Of (combined Omega with custom Perl)	Yes (rudimentary Omega CGI)	Yes	Yes (C++, Perl, Python, Ruby)

In Progress

DataparkSearch ^[2] - akistler examined
- Description
written in C

forked from mnoGoSearch in 2003 when mnoGoSearch went semi-commercial

Indices are stored in a database; Supported databases include MySQL and PostgreSQL (SQLite not advertised, but listed in documentation)

Support for all SBCS and some DBCS

Search tool supports Booleans (AND, OR, NOT, NEAR)
- Evaluation
  - Evaluated after mnoGoSearch.
  - Database modes are single and multi, like mnoGoSearch, but also include hashed modes, which do not support string and substring searches, and cached mode, which stores only URI indices in the database and keeps word data in disk files through an additional daemon. The major difference between the two appears to be that DataparkSearch offers cached mode but lacks mnoGoSearch's blob mode.
  - As written, the setup expects the application process to log in to the database as the schema owner; with additional manual steps, however, it can be made to work as a user other than the schema owner. The call handler setup and custom stored procedure language definition present in mnoGoSearch are commented out in the setup of DataparkSearch, so PostgreSQL superuser is not required, as written. (Their presence in mnoGoSearch is questionable, anyway.) In general, DataparkSearch does appear to be a more slowly developed version of mnoGoSearch.
  - Multiple character set support is not the default, but is specified explicitly before compilation.
  - Bugs in create.multi.sql
    - 185: ERROR: relation "cachedchk2" already exists
    - 186: ERROR: column "url_id" does not exist
    Caused by duplicate CREATE statements in the script. Comment out one set to resolve the bug.
  - Bugs in drop.multi.sql
    - 16: ERROR: sequence "url_rec_id_seq" does not exist
    - 17: ERROR: sequence "categories_rec_id_seq" does not exist
    - 18: ERROR: sequence "qtrack_rec_id_seq" does not exist
    - 19: ERROR: sequence "server" does not exist
    Caused by extraneous DROP statements. Comment them out or ignore them. They're harmless.
  - Extended search mode appears broken. Only the first result is returned. All other results are lost. This might be only an error in the search form, in which case it can be easily debugged and fixed.
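
The duplicate CREATE statements reported above for create.multi.sql can be located before the script is run. The following is a minimal Python sketch, not part of DataparkSearch; the function name and the sample SQL are hypothetical, and the pattern only covers simple CREATE TABLE, INDEX, and SEQUENCE statements.

```python
import re
from collections import Counter

# Matches the object type and name in statements such as
# "CREATE TABLE cachedchk2 (" or "CREATE UNIQUE INDEX some_idx ON ...".
CREATE_RE = re.compile(
    r'^\s*CREATE\s+(?:UNIQUE\s+)?(TABLE|INDEX|SEQUENCE)\s+"?(\w+)',
    re.IGNORECASE | re.MULTILINE,
)


def duplicate_creates(sql_text):
    """Return sorted (type, name) pairs that are created more than once."""
    counts = Counter(
        (kind.upper(), name.lower()) for kind, name in CREATE_RE.findall(sql_text)
    )
    return sorted(key for key, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical excerpt, not the real contents of create.multi.sql.
    sample = """
    CREATE TABLE cachedchk (url_id int4, chk int4);
    CREATE TABLE cachedchk2 (url_id int4, chk int4);
    CREATE TABLE cachedchk2 (url_id int4, chk int4);
    """
    for kind, name in duplicate_creates(sample):
        print(f"{kind} {name} is created more than once")
```

Running it against the real script would be a matter of reading the file and passing its text to duplicate_creates; the reported names then show which set of statements to comment out.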
- Requirements
buildrequires
- gcc make
- postgresql-devel (for PostgreSQL support)
- zlib-devel
requires
- postgresql-libs (for PostgreSQL support)
- httpd
- zlib
- others as desired to index documents (pdf, etc.)
- Setup Notes
  - The tarball was compiled into /opt/dpsearch, although it should be possible to create an RPM for more conventional locations (/bin, /etc, /sbin, etc.)
  - The code compiles and runs fine on x86_64 (compare to mnoGoSearch)
  - Create database user (dp/search) different from database owner (dbowner/dbowner) different from superuser (postgres)
  - Run create.multi.sql as dbowner, then grant privileges to dp
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE cachedchk TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE cachedchk2 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE categories TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE cookies TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE crossdict TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict10 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict11 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict12 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict16 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict2 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict3 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict32 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict4 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict5 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict6 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict7 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict8 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict9 TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE links TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qinfo TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qtrack TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE robots TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE server TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE srvinfo TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE storedchk TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE url TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE urlinfo TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE categories_rec_id_seq TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qtrack_rec_id_seq TO dp;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE url_rec_id_seq TO dp;
  - cp /opt/dpsearch/etc/indexer.conf-dist /opt/dpsearch/etc/indexer.conf
  chmod and chgrp to protect passwords!
  - vi /opt/dpsearch/etc/indexer.conf
  set DBAddr, including mode=multi
  
  set LocalCharset to UTF-8 for full Unicode support
  
  set MaxHops to the maximum depth (default is 256), if desired/required
  
  add CrawlDelay, if desired/required
  
  add "Server Disallow http://fedoraproject.org/w/" and "Server Disallow http://fedoraproject.org/wikiold/"
  
  set Server to http://fedoraproject.org/ (the URI to be crawled and indexed)
  - cp /opt/dpsearch/etc/stopwords.conf-dist /opt/dpsearch/etc/stopwords.conf
  Note that there are no Chinese stopwords, so don't uncomment the line to configure them
  - cp /opt/dpsearch/etc/langmap.conf-dist /opt/dpsearch/etc/langmap.conf
  - vi /opt/dpsearch/etc/langmap.conf
    - Comment out the lines for ca.latin1.lit.lm, ga.latin1.lit.lm, ko.utf8.lit.lm, and pt-pt.latin1.lm, because the files for them don't exist
    - Uncomment the lines for ja.euc-jp.lm, ja.sjis.lm, ta.tscii.lm, zh.big5.lm, and zh.gb2312.lm, because the files for them do exist
  - cp /opt/dpsearch/etc/sections.conf-dist /opt/dpsearch/etc/sections.conf
  - cp /opt/dpsearch/etc/search.htm-dist /opt/dpsearch/etc/search.htm (htm name must not change; compare to mnoGoSearch)
  chmod and chgrp to protect passwords!
  - vi /opt/dpsearch/etc/search.htm
  set DBAddr and LocalCharset to correspond to /opt/dpsearch/etc/indexer.conf
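
The long GRANT list in the setup notes above is one statement repeated per object, so it can be generated rather than typed. This is a sketch only: the helper name is hypothetical, the object names are copied from the list above, and the role name dp is the one used in these notes.

```python
def grant_statements(objects, role,
                     privileges=("SELECT", "INSERT", "UPDATE", "DELETE")):
    """Build one GRANT statement per table or sequence for the given role."""
    privs = ",".join(privileges)
    return [f"GRANT {privs} ON TABLE {name} TO {role};" for name in objects]


# Tables and sequences named in the GRANT list of the setup notes.
DPSEARCH_OBJECTS = [
    "cachedchk", "cachedchk2", "categories", "cookies", "crossdict",
    "dict", "dict10", "dict11", "dict12", "dict16", "dict2", "dict3",
    "dict32", "dict4", "dict5", "dict6", "dict7", "dict8", "dict9",
    "links", "qinfo", "qtrack", "robots", "server", "srvinfo",
    "storedchk", "url", "urlinfo",
    "categories_rec_id_seq", "qtrack_rec_id_seq", "url_rec_id_seq",
]

if __name__ == "__main__":
    print("\n".join(grant_statements(DPSEARCH_OBJECTS, "dp")))
```

The printed statements can be saved to a file and run as dbowner after create.multi.sql.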

Egothor ^[3]
- Description
written in Java

Ferret ^[4]
- Description
Ruby port of Lucene

KinoSearch and Ferret intend to merge as Lucy ^[5]

Indri ^[6]
- Description
Written in C/C++

Indexes text and HTML, can be extended with custom parsers

Provides Java, PHP, and C++ APIs and, if desired, a SOAP interface

Search tool is a local Java application or a PHP CGI application

KinoSearch ^[7] - akistler examined
- Description
Perl/C port of Lucene

in Fedora already

maintainer Rectangular Research appears to be just one guy, who considers KinoSearch to be alpha software

KinoSearch and Ferret intend to merge as Lucy ^[5]
- Evaluation
Search engine library with sample indexer and search page rather than fully-functional application. Stores indices in Berkeley DB files with JSON interfaces. Allows custom-designed indices, including categories (exact match) to fulfill "programmable keywords" requirement. Each document index on each document source is a single write-once file collection (BDB and JSON) in a unique directory. Rerunning the indexer creates a new directory, obsoleting the old directory if all the old documents are included. The old directory then needs to be cleaned up. Postings can, however, be deleted from an index. Additionally, only the new documents can be indexed, but that's not efficient.
- Requirements
buildrequires
- gcc
- (EPEL) perl-Module-Build
requires
- (EPEL) perl-JSON-XS
Problem: Desires 1.53, but EPEL has 1.43
Note: http://web.archive.org/web/20071122035408/search.cpan.org/src/MLEHMANN/JSON-XS-1.53/

Note: works with 1.43, anyway
- (EPEL) perl-Lingua-Stem-Snowball
- (EPEL) perl-Lingua-StopWords
- (EPEL) perl-Parse-RecDescent
sample indexer reads files from the file system and requires
- (EPEL) perl-HTML-Tree
sample cgi search script requires
- (CPAN) Data::Pageset (which requires Data::Page)
- (EPEL) perl-Test-Exception
- (EPEL) perl-Class-Accessor-Chained

mnoGoSearch ^[8] - akistler examined
- Description
written in C

UNIX/Linux source code is GPL; Windows binaries are commercial, likely based on the GPL UNIX/Linux code, and lag a few versions behind

Indices are stored in a database; Supported databases include MySQL, PostgreSQL, and SQLite (among others)

HTTP, FTP, and NNTP crawling

C, PHP, and Perl APIs (advertised, apparently only C API really included)

SBCS and most MBCS supported, including most eastern Asian languages
- Evaluation
  - The supplied install.pl script generates a configure command, but does not support SQLite.
  - Adding --with-sqlite3 to the generated command adds SQLite support. An empty database must be created manually. A URI in the indexer.conf file specifies the location of the database. According to the documentation, sqlite:/path/to/db/file should work, but doesn't. According to the message boards on mnoGoSearch.org, sqlite://localhost/path/to/db/file should work, but doesn't.
  - No problems compiling with PostgreSQL support. Configure and Makefile may need significant rewriting to make good RPMs. There is no support for cross-compiling to 32-bit architectures from 64-bit machines.
  - As written, the setup expects the application process to log in to the database as a PostgreSQL superuser and schema owner; with additional manual steps, however, it can be made to work as neither the superuser nor the schema owner. Several schemata are available (single, multi, and blob). LavTech recommends blob for sites indexing more than 50k documents. The crawler is very flexible with quite a complex configuration file. The CGI search page also has nice features for "advanced" searching, although it can be customized to suit each site. Tags are labels configured within the crawler, usually by URI server component. Categories are numerical hierarchies, up to 6 levels deep, also specified in the crawler configuration.
  - Bugs in pgsql/drop.blob.sql
    1. drop function clean_srvinfo(); (the () is omitted, but needs to be included)
    2. DROP LANGUAGE plpgsql; (missing)
    3. DROP FUNCTION plpgsql_call_handler(); (missing, has to be run twice, once for postgres, once for dbowner?)
      Does this definition in postgres (dangerously) shortcut anything inherent for other DBs on the same server?
- Requirements
buildrequires
- gcc make
- sqlite-devel (for SQLite support)
- postgresql-devel (for PostgreSQL support)
- zlib-devel
requires
- sqlite (for SQLite support)
- postgresql-libs (for PostgreSQL support)
- httpd
- zlib
- others as desired to index documents (pdf, etc.)
- Setup Notes
  - The tarball was compiled into /opt/mnoGoSearch, although it should be possible to create an RPM for more conventional locations (/bin, /etc, /sbin, etc.)
  - RPM for Fedora currently awaiting review
  - On x86_64 architecture, the x86_64 binary fails when indexing a crawl with the error "indexer[21272]: PQexecPrepared: ERROR: incorrect binary data format in bind parameter 2." The tarball refuses to cross-compile to 32-bit architecture, despite tweaking the ./configure options. Compiling on a 32-bit machine and moving the binaries to the 64-bit machine works.
  - Create database user (mno/search) different from database owner (dbowner/dbowner) different from superuser (postgres)
  - /opt/mnoGoSearch/sbin/indexer -Ecreate, to create the tables, etc., and /opt/mnoGoSearch/sbin/indexer -Edrop, to drop tables, etc., just runs a script from /opt/mnoGoSearch/share/<db-type>/, but needs to be run as postgres
  - Run create.blob.sql as postgres, change owners to dbowner, and grant privileges to mno
  ALTER TABLE bdict OWNER TO dbowner;
  
  ALTER TABLE bdicti OWNER TO dbowner;
  
  ALTER TABLE categories OWNER TO dbowner;
  
  ALTER TABLE crossdict OWNER TO dbowner;
  
  ALTER TABLE dict OWNER TO dbowner;
  
  ALTER TABLE links OWNER TO dbowner;
  
  ALTER TABLE qcache OWNER TO dbowner;
  
  ALTER TABLE qinfo OWNER TO dbowner;
  
  ALTER TABLE qtrack OWNER TO dbowner;
  
  ALTER TABLE server OWNER TO dbowner;
  
  ALTER TABLE srvinfo OWNER TO dbowner;
  
  ALTER TABLE url OWNER TO dbowner;
  
  ALTER TABLE urlinfo OWNER TO dbowner;
  
  ALTER TABLE wrdstat OWNER TO dbowner;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE bdict TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE bdicti TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE categories TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE crossdict TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE links TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qcache TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qinfo TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qtrack TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE server TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE srvinfo TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE url TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE urlinfo TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE wrdstat TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE categories_rec_id_seq TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qtrack_rec_id_seq TO mno;
  
  GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE url_rec_id_seq TO mno;
  - cp /opt/mnoGoSearch/etc/indexer.conf-dist /opt/mnoGoSearch/etc/indexer.conf
  chmod and chgrp to protect passwords!
  - vi /opt/mnoGoSearch/etc/indexer.conf
  set DBAddr, including mode=blob for > 50k documents
  
  set LocalCharset to UTF-8 for full Unicode support
  
  set MaxHops to the maximum depth (default is 256), if desired/required
  
  add CrawlDelay, if desired/required
  
  add "Server Disallow http://fedoraproject.org/w/" and "Server Disallow http://fedoraproject.org/wikiold/"
  
  set Server to http://fedoraproject.org/ (the URI to be crawled and indexed)
  - cp /opt/mnoGoSearch/etc/search.htm-dist /opt/mnoGoSearch/etc/mnoGoSearch.htm (htm name must match cgi name below)
  chmod and chgrp to protect passwords!
  - vi /opt/mnoGoSearch/etc/mnoGoSearch.htm
  set DBAddr and LocalCharset to correspond to /opt/mnoGoSearch/etc/indexer.conf
  - cp /opt/mnoGoSearch/bin/search.cgi /var/www/cgi-bin/mnoGoSearch
  - To crawl the sites configured, /opt/mnoGoSearch/sbin/indexer
  - To index the data collected, /opt/mnoGoSearch/sbin/indexer -Eblob
  - To display database statistics, /opt/mnoGoSearch/sbin/indexer -S
  - To clear the database, /opt/mnoGoSearch/sbin/indexer -C
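
The indexer invocations listed above lend themselves to a small wrapper, for example for a scheduled refresh. This is a rough sketch under the assumption that the binary lives at the path used in these notes; the function and step names are hypothetical, and only the flags shown above are used.

```python
import subprocess

INDEXER = "/opt/mnoGoSearch/sbin/indexer"

# Flags exactly as listed in the setup notes.
STEPS = {
    "crawl": [],           # crawl the configured sites
    "index": ["-Eblob"],   # index the collected data (blob mode)
    "stats": ["-S"],       # display database statistics
    "clear": ["-C"],       # clear the database
}


def indexer_command(step):
    """Return the argument vector for one indexer step."""
    return [INDEXER] + STEPS[step]


def refresh_index(run=subprocess.check_call):
    """Crawl, build the blob index, then show statistics.

    With the default runner, a failing step raises and stops the sequence.
    """
    for step in ("crawl", "index", "stats"):
        run(indexer_command(step))
```

The clear step is deliberately left out of refresh_index so that a routine refresh cannot wipe the database.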

Nutch ^[9] - akistler examined
- Description
written in Java

based on Lucene

the crawler/indexer is a Java command line application; the default depth is 5; the default number of threads is 10

the search tool runs in a Java servlet container, e.g., Tomcat
- Evaluation
There's nothing to build. Simply configure the crawler (which is actually the indexer, too) and deploy/configure the searcher. The crawler caches the pages it indexes, making the cache available to the search tool. The search interface is extremely simple and is multi-lingual, but is almost entirely an advertisement for the Nutch project. It doesn't look particularly easy to rebrand. Overall, the polish of the finished product means it's less flexible to custom modifications, like programmable keywords. After creating a new index (i.e., after a new crawl), the search application must be reloaded in Tomcat manager. The crawler is more flexible than a brief investigation could reveal. The official documentation leaves a lot to be desired. Searches are for single terms only, no multiple terms or +/- Booleans.
- Requirements
java-1.6.0-openjdk, java-1.6.0-openjdk-devel (explicit specification short-circuits yum's attempt to install BEA for Tomcat)

tomcat5, tomcat5-webapps, tomcat5-admin-webapps (tomcat5-admin-webapps is probably not actually required, but is very handy)
- Set-up Notes
The best "tutorial" is http://wiki.apache.org/nutch/NutchTutorial

Create /etc/profile.d/java.sh for "export JAVA_HOME=/etc/alternatives/jre" (required for the crawler, architecture independent definition)

In /opt (in this example, though anyplace will do) "tar xzf /path/to/tar/file/nutch-1.0.tar.gz"

Set http.agent.name in /opt/nutch-1.0/conf/nutch-site.xml (e.g., FedoraProject)

Create a flat file of starting URLs in /opt/nutch-1.0/urls/crawl-start.txt (it can be any file name in the directory)

Edit /opt/nutch-1.0/conf/crawl-urlfilter.txt to set regular expressions which keep or discard links for processing (e.g., the domain of the servers crawled)

If not running the crawler as root (Why would you?), create directories /opt/nutch-1.0/crawl and /opt/nutch-1.0/logs writable by whatever uid/gid used to crawl
Note: The crawler also needs to be able to create temporary files in its working directory

Run the crawler, putting the database in /opt/nutch-1.0/crawl

Deploy the WAR file from /opt/nutch-1.0/nutch-1.0.war as /nutch in Tomcat manager

Set searcher.dir in /var/lib/tomcat5/webapps/nutch/WEB-INF/classes/nutch-site.xml to /opt/nutch-1.0/crawl
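
The keep-or-discard rules in crawl-urlfilter.txt can be tried out before a long crawl. The sketch below approximates the file's convention in Python: each rule line starts with + (keep) or - (discard) followed by a regular expression, the first matching rule decides, and a URL that matches nothing is discarded. It is not Nutch code, Java and Python regular expressions differ in places, and the two rules shown are hypothetical examples.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches nothing is discarded."""
    for keep, pattern in rules:
        if pattern.search(url):
            return keep
    return False


# Hypothetical rule set for illustration only.
RULES = load_rules(r"""
# skip the raw MediaWiki script path
-^http://fedoraproject\.org/w/
# keep everything else on fedoraproject.org and its subdomains
+^http://([a-z0-9]*\.)*fedoraproject\.org/
""")
```

Feeding a handful of known URLs through accepts shows quickly whether a rule ordering mistake would exclude the whole site or let the crawl wander off it.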

Swish-e ^[10] - akistler examined
- Description
written in C

Note: Swish++ is a rewrite in C++ (not evaluated here)
- Evaluation
Search engine with a built-in web crawler, a built-in file system crawler, and an interface for an external crawler. The distribution includes sample search pages which use the Perl API. There is also a C API. The index is not customizable, but does include a facility for including metawords (exact match) and the path in the index for each document. The documentation acknowledges that the software only supports ASCII, but some MBCS may also work.
- Requirements
buildrequires
- gcc
- make
- libxml2-devel
- zlib-devel
requires
- libxml2
- zlib
- perl-libwww-perl (for the built-in spider)
- others as desired to index documents (pdf, etc.)

Xapian ^[11] - akistler examined
- Description
written in C++

bindings to Python, Ruby, and Perl XS

xapian-core, xapian-bindings, and perl-Search-Xapian in Fedora already; xapian-omega is not

additional bindings to PHP, Java, and more (?)

Omega provides a Xapian front-end for indexing (via script) and searching (command line or CGI)

Omega provides glue scripts for ht://Dig, mbox files, and perl DBI

Flax ^[12] is another search engine built on top of Xapian and CherryPy
- Evaluation
Xapian is a search engine library. Omega adds functionality on top of Xapian. The Xapian database is very flexible, supporting an entirely user-designed schema. Usage through Omega loses very little, if any, of that flexibility; however, the supplied Omega CGI is extremely rudimentary. The supplied Omega CGI also requires the database to be named "default," although that can be changed. Database columns are of type field or index. Fields are stored verbatim (e.g., URL, date, MIME type, keywords). Indices are input as blocks of text or other content to be indexed, but not stored (e.g., the corpus of a file or web page). The Omega scriptindex utility can be combined with an external web crawler for HTML. Making Omega work with Apache requires relabeling /var/lib/omega as httpd_sys_content, or moving /var/lib/omega to /var/www/omega and using the default context there. In this evaluation, /var/lib/omega was moved to /var/www/omega. Xapian only works with UTF-8.
- Requirements
xapian-core buildrequires
- gcc gcc-c++
- make
- zlib-devel
xapian-bindings buildrequires (not including gcc gcc-c++ make)
- python python-devel
- ruby ruby-devel
- xapian-core-devel
perl-Search-Xapian buildrequires (not including gcc gcc-c++ make)
- perl
- xapian-core-devel
xapian-omega buildrequires (not including gcc gcc-c++ make)
- libtool
- xapian-core-devel
xapian-core requires
- xapian-core-libs
xapian-bindings requires
- coreutils
- python
- xapian-core-libs
perl-Search-Xapian requires
- perl
- xapian-core-libs
xapian-omega requires
- httpd
- perl
- perl-DBI
- xapian-core-libs
- Set-up Notes
Using custom crawler that requires perl-libwww-perl

Updated Omega rpm .spec to move /var/lib/omega to /var/www/omega, including updating /etc/omega.conf

Use default database at /var/www/omega/data/default

/var/www/omega/data/default.index is
URI : field boolean=Q unique=Q

ContentType : field=MIMEtype index=T

LastModified : field=Date date=unix

Title : field index=S

Content : index

To index, run (as someone who can write to /var/www/omega/data/default)
./crawl.pl http://fedoraproject.org | scriptindex /var/www/omega/data/default /var/www/omega/data/default.index

Note that "Content : unhtml index" would be preferable in the index, but unhtml apparently has bugs
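
For the pipeline above to work, the crawler has to write records that scriptindex can read: one name=value line per field, a leading = on each continuation line of a multi-line value, and a blank line between records. The sketch below shows that output format in Python; it is not the crawl.pl used in this evaluation, the function name is hypothetical, and the sample values are made up. The field names are the ones in default.index.

```python
def scriptindex_record(fields):
    """Format one document as a scriptindex input record.

    Each field becomes a 'name=value' line. Newlines inside a value are
    continued on the next line with a leading '='. A blank line ends the record.
    """
    lines = []
    for name, value in fields:
        lines.append(f"{name}=" + str(value).replace("\n", "\n="))
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    # Made-up sample document using the field names from default.index.
    print(scriptindex_record([
        ("URI", "http://fedoraproject.org/"),
        ("ContentType", "text/html"),
        ("LastModified", 1296550000),  # Unix time, matching date=unix
        ("Title", "Fedora Project"),
        ("Content", "First line of page text\nSecond line of page text"),
    ]), end="")
```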

Eliminated from Consideration

CLucene ^[13]

Reason for elimination

It has no crawling/spidering facility. It is a library toolkit (API) only.

Description

C++ port of Lucene

in Fedora already

described as beta by the developers

Isearch ^[14]
- Reason for elimination
It has no crawling/spidering facility. It is intended for local file indexing only.

The web site appears to have fallen out of maintenance.
- Description
written in C++

Lucene ^[15] - akistler examined

Reason for elimination

It has no crawling/spidering facility. It has no user query interface. There are no samples.

Description

written in Java, but ported to others ^[16]

Requires/uses GCJ

in Fedora already

PyLucene ^[17] is a Python wrapper around Java Lucene

Evaluation

Search engine library meant to be integrated into applications

Requirements

buildrequires (based on 1.4.3-f7)

ant
ant-junit
java-1.4.2-gcj-compat-devel
javacc
jpackage-utils
junit
make

requires (based on 1.4.3-f7)

java-1.4.2-gcj-compat

Namazu ^[18]

Reason for elimination

It has no crawling/spidering facility. It indexes local documents only.

Description

written in Perl

in Fedora already

Solr ^[19] - akistler examined

Reason for elimination

It has no crawling/spidering facility. It has no user query interface. There are no samples.

Description

written in Java

based on Lucene

Evaluation

The documentation describes installing Sun Java to run Solr, but OpenJDK 1.5 or later is fine. Solr needs a Java servlet container in which to run. It comes with Jetty, but other containers should work, as well (e.g., Tomcat). Currently only supports UTF-8 characters.

Basically Solr provides an HTTP admin GUI for a search engine that uses a superset of the Lucene query syntax. The schema is very flexible. Set-up is essentially entirely through XML files. Applications can query the servlet port and get XML or JSON responses.
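
To illustrate what querying the servlet port looks like from another application, here is a small Python sketch that builds a query URL for Solr's select handler. The base URL (including the port) is an assumption for a local test instance, and the helper name is hypothetical; q, rows, and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode


def solr_select_url(base, query, rows=10, fmt="json"):
    """Build a select-handler URL; fmt is passed as wt ('json' or 'xml')."""
    params = urlencode({"q": query, "rows": rows, "wt": fmt})
    return f"{base.rstrip('/')}/select?{params}"


if __name__ == "__main__":
    # Assumed local test instance; adjust host, port, and path as deployed.
    print(solr_select_url("http://localhost:8983/solr", "fedora kernel"))
```

An application would fetch that URL and parse the JSON or XML body; the sketch stops at URL construction so that it runs without a Solr server.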

Requirements

buildrequires

ant (note that ant currently pulls in java-gcj-compat, too, but it appears not to be a problem)
ant-junit
java-1.6.0-openjdk-devel
junit

requires

java-1.6.0-openjdk

Terrier (TERabyte RetrIEveR) ^[20]

Reason for elimination

It has no crawler or user search tool. It does not run as a service (as provided), only interactively.

Description

written in Java

runs from the command line (i.e., not a Tomcat servlet)

YaCy ^[21] - huzaifa examined

Reason for elimination

Requires Sun Java

Description

written in Java
well maintained
support for peer search engine database exchanges
customized search parameters
fast indexing and web interface for querying the back-end db

Zettair ^[22]

Reason for elimination

No crawling capability, only indexes local documents

User search/retrieval tool is command-line only, no web interface

Description

written in C

Never Considered

Gonzui ^[23]

written in Ruby
specializes in source code search
not actively maintained

Grub ^[24]

written in C#

Heritrix ^[25]

written in Java
archives content rather than indexing it

ht://Dig ^[26]

written in C++
not actively maintained

HtdigSearch ^[27]

It's just a MediaWiki plugin, not suitable for searching non-wiki sites

MWSearch ^[28]

Requires EzMwLucene (Java) to be running on the servers to be searched
EzMwLucene is wiki-only, therefore MWSearch is wiki-only

OpenFTS ^[29]

written in Perl or TCL on top of PostgreSQL
Python interface available
not actively maintained

Plucene ^[30]

Perl port of Lucene
not actively maintained

RigorousSearch ^[31]

Crawls the MediaWiki database, not the web site
Doesn't work for non-MediaWiki web sites, including any non-wiki web site

Sphinx ^[32]

written in C++
designed to index SQL tables, not web pages

SphinxSearch ^[33]

Written in C++
MediaWiki plug-in, so it's wiki-only

Whoosh ^[34]

written in Python
inspired by Lucene, but closer to a Python port of parts of KinoSearch combined with some features of Terrier
toolkit only, not even sample crawlers and user interfaces

Public Testing

Public Testing is taking place on publictest3.

Search Engines

DataparkSearch (http://publictest3.fedoraproject.org/cgi-bin/dpsearch)

PostgreSQL installed (postgresql, postgresql-devel, postgresql-libs, postgresql-server)

SELinux (not present on publictest3, but needed eventually) needs:

"setsebool -P httpd_can_network_connect=1" to connect to PostgreSQL

Crawling trials (with database cleared each time, i.e., not incremental)

Memory and CPU utilization are modest, less than 10% each. Most CPU time is spent in I/O Wait for the database.

Depth=4, 2.5 hrs crawling, 2k documents, db = 320M (700M with clone detection off)

Depth=5, 16.5 hrs crawling, 12k documents, db = 1.1G

mnoGoSearch (http://publictest3.fedoraproject.org/cgi-bin/mnoGoSearch)

PostgreSQL installed (postgresql, postgresql-devel, postgresql-libs, postgresql-server)

SELinux (not present on publictest3, but needed eventually) needs:

"setsebool -P httpd_can_network_connect=1" to connect to PostgreSQL

Crawling trials (with database cleared each time, i.e., not incremental)

Memory and CPU utilization are quite modest, about 1% each. Most CPU time is spent in I/O Wait for the database.

Depth=4, 2 hrs crawling, 1.5 min indexing, 11k documents

Depth=5, 4.5 hrs crawling, 12 min indexing, 25k documents

Depth=6, 6.5 hrs crawling, 16 min indexing, 34k documents

Depth=7, 12 hrs crawling, 23 min indexing, 40k documents, db = 2.6G

Nutch (http://publictest3.fedoraproject.org/nutch)

The Nutch tarball was unpacked in /opt/nutch-1.0, just as in preliminary local testing

Tomcat is reverse proxied through Apache (see notes below)

Nutch's definition/conception of depth appears to be unusual. The crawler must be directed to spider much more deeply than should be necessary.

Crawls are executed as (e.g.) "/opt/nutch-1.0/bin/nutch crawl /opt/nutch-1.0/urls -dir /opt/nutch-1.0/crawl -depth 5 -threads 2"

Crawling trials

java process uses about 18% of 6G of memory (4G RAM, 2G swap), regardless of depth

Depth=4, 2 threads, 1.5k documents

Depth=5, 2 threads, 3 hrs, 8k documents

Depth=6, 1 thread, 8.5 hrs, 23k documents

Depth=7, 1 thread, 14.5 hrs, 37k documents, db = 400M

Depth=8, 1 thread, 16.5 hrs, 44k documents, db = 440M

Xapian (http://publictest3.fedoraproject.org/cgi-bin/omega)

At this time, only installed xapian-core-libs, xapian-core, and xapian-omega (i.e., no xapian-bindings or perl-Search-Xapian)

Enabled cgi-bin in /etc/httpd/conf.d/cgi-bin.conf (see notes below)

Omega bombs on http://fedoraproject.org/wiki/Overview (and would possibly on others later) with "unhtml index"

Resolution: Don't use "unhtml"

Omega bombs on long URIs (longer than 244 chars)

Example: http://fedoraproject.org/wiki/Special:WhatLinksHere/Ru_RU/%D0%9F%D0%BB%D0%B0%D0%BD_%D1%80%D0%B0%D0%B1%D0%BE%D1%82%D1%8B_%D0%BF%D0%BE_%D0%BF%D0%B5%D1%80%D0%B5%D0%B2%D0%BE%D0%B4%D1%83_web-%D1%81%D0%B0%D0%B9%D1%82%D0%B0_%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82%D0%B0_Fedora

Resolution: Enhanced custom crawler to filter URIs better (fedoraproject.org/w/ from the wiki); added capability to discard URIs that are too long (mostly hex URIs translated from other DBCS)

Perl custom crawler prints warnings for (and refuses to translate) URIs with Unicode characters outside the Latin 1 range

Resolution: None. This issue is known for URI.pm.^[35]
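
The two crawler-side filters described above (skipping the fedoraproject.org/w/ path and discarding over-long URIs) amount to a simple predicate. The custom crawler in this evaluation was written in Perl; the following is only a Python sketch of the same idea, with a hypothetical function name and the 244-character limit taken from the failure noted above.

```python
from urllib.parse import urlsplit

MAX_URI_LENGTH = 244  # longer URIs made Omega fail in the trial above


def keep_uri(uri, skip_prefixes=("/w/",)):
    """Return True if the crawler should pass this URI on for indexing."""
    if len(uri) > MAX_URI_LENGTH:
        return False
    path = urlsplit(uri).path
    return not any(path.startswith(prefix) for prefix in skip_prefixes)
```

Checking the length first means percent-encoded URIs of translated page titles are dropped without any further parsing.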

Crawling trials

Failed, Depth=5, scriptindex used 70% of 4G of memory (2G RAM, 2G swap, 0% free)

terminated, system sluggish with swap I/O

Depth=4, 4 hrs, 15218 documents, index is about 500M on disk, scriptindex used 40% of 4G of memory (2G RAM, 2G swap)

Depth=5, 8 hrs, 41171 documents, index is about 1G on disk, scriptindex used 20% of 6G of memory (4G RAM, 2G swap)
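
A rough way to read the crawling trials is documents per crawl hour. The sketch below uses the deepest completed trial reported above for each engine. Document counts given as "k" are rounded, the mnoGoSearch figure excludes its separate indexing time, and depth does not mean the same thing in every engine, so the results are indicative only.

```python
# (engine, depth, crawl hours, documents) from the deepest completed trial
# reported for each engine in the Public Testing notes.
TRIALS = [
    ("DataparkSearch", 5, 16.5, 12000),
    ("mnoGoSearch", 7, 12.0, 40000),
    ("Nutch", 8, 16.5, 44000),
    ("Xapian", 5, 8.0, 41171),
]


def docs_per_hour(hours, documents):
    """Average number of documents handled per hour of crawling."""
    return documents / hours


if __name__ == "__main__":
    for engine, depth, hours, docs in TRIALS:
        rate = docs_per_hour(hours, docs)
        print(f"{engine:15s} depth={depth} {rate:8.0f} docs/hour")
```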

Apache Configuration Notes

CGI for Xapian Omega and mnoGoSearch

In /etc/httpd/conf.d/cgi-bin.conf, just use the default configuration normally commented out in httpd.conf.

 # ScriptAlias: This controls which directories contain server scripts.
 # ScriptAliases are essentially the same as Aliases, except that
 # documents in the realname directory are treated as applications and
 # run by the server when requested rather than as documents sent to the client.
 # The same rules about trailing "/" apply to ScriptAlias directives as to
 # Alias.
 #
 ScriptAlias /cgi-bin/ "/var/www/cgi-bin/"
 #
 # "/var/www/cgi-bin" should be changed to whatever your ScriptAliased
 # CGI directory exists, if you have that configured.
 #
 <Directory "/var/www/cgi-bin">
     AllowOverride None
     Options None
     Order allow,deny
     Allow from all
 </Directory>

Reverse proxy for Tomcat

In /etc/httpd/conf.d/tomcat5.conf:

 <Location /admin>
   Order Allow,Deny
   Allow from ...
 </Location>
 ProxyPass        /admin   http://localhost:8082/admin
 ProxyPassReverse /admin   http://localhost:8082/admin
 #
 <Location /manager>
   Order Allow,Deny
   Allow from ...
 </Location>
 ProxyPass        /manager http://localhost:8082/manager
 ProxyPassReverse /manager http://localhost:8082/manager
 #
 ProxyPass        /nutch   http://localhost:8082/nutch
 ProxyPassReverse /nutch   http://localhost:8082/nutch

Tomcat Configuration Notes

In /etc/tomcat5/server.xml:

comment out the AJP connector on port 8009

comment out the HTTP connector on port 8080

uncomment the proxied HTTP connector on port 8082

add proxyName to the HTTP connector on port 8082

alternatively, instead of using port 8082, define proxyName and proxyPort and remove redirectPort in the HTTP connector on port 8080

SELinux (not present on publictest3, but needed eventually)

For port 8082, SELinux needs "setsebool -P httpd_can_network_connect=1"

Alternatively, for port 8080, SELinux needs "setsebool -P httpd_can_network_relay=1"

The manager and admin users (in fact, all users) are defined in /etc/tomcat5/tomcat-users.xml

Recommended, but not done here: change the shutdown password in server.xml (default is SHUTDOWN)
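
The server.xml changes listed above could be sanity-checked with a short script along these lines. This is a sketch under stated assumptions: the function name check_server_xml, the sample document, and the host name search.example.org are all hypothetical, and the sample is a stripped-down stand-in for a real /etc/tomcat5/server.xml. It relies on the standard-library XML parser skipping commented-out elements, so a connector inside a comment counts as disabled.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for /etc/tomcat5/server.xml.
SAMPLE = """<Server port="8005" shutdown="SHUTDOWN">
  <Service name="Catalina">
    <!-- <Connector port="8080" redirectPort="8443"/> -->
    <!-- <Connector port="8009" protocol="AJP/1.3"/> -->
    <Connector port="8082" proxyName="search.example.org"/>
  </Service>
</Server>"""


def check_server_xml(text):
    """Return a list of deviations from the notes above (empty if none)."""
    root = ET.fromstring(text)
    # Commented-out connectors are not part of the parsed tree.
    connectors = {c.get("port"): c for c in root.iter("Connector")}
    problems = []
    for port in ("8009", "8080"):
        if port in connectors:
            problems.append("connector on port %s is still active" % port)
    proxied = connectors.get("8082")
    if proxied is None:
        problems.append("proxied connector on port 8082 is missing")
    elif not proxied.get("proxyName"):
        problems.append("connector on port 8082 has no proxyName")
    # The shutdown password is the "shutdown" attribute of <Server>.
    if root.get("shutdown") == "SHUTDOWN":
        problems.append("shutdown password is still the default")
    return problems


if __name__ == "__main__":
    for problem in check_server_xml(SAMPLE):
        print(problem)
```

With the sample above, the only finding is the default shutdown password, which matches the state described in these notes.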

Deployment Plan

<tbd>

References

[1] "Fedora Search Engine". Infrastructure Trac. https://fedorahosted.org/fedora-infrastructure/ticket/1055.
[2] "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/.
[3] "Egothor". Egothor. http://www.egothor.org/.
[4] "Ferret". David Balmain. http://ferret.davebalmain.com/.
[5] "Lucy". Apache Software Foundation. http://lucene.apache.org/lucy/.
[6] "Indri". The Lemur Project. http://www.lemurproject.org/indri/.
[7] "KinoSearch". Rectangular Research. http://www.rectangular.com/kinosearch/.
[8] "mnoGoSearch". LavTech. http://www.mnogosearch.org/.
[9] "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/.
[10] "Swish-e". Swish-e. http://swish-e.org/.
[11] "Xapian". Xapian Project. http://xapian.org/.
[12] "Flax". Flax. http://www.flax.co.uk/products.shtml.
[13] "CLucene". CLucene Project. http://sourceforge.net/projects/clucene/.
[14] "Isearch". Isite. http://isite.awcubed.com/.
[15] "Lucene". Apache Software Foundation. http://lucene.apache.org/.
[16] "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations.
[17] "PyLucene". Apache Software Foundation. http://lucene.apache.org/pylucene/.
[18] "Namazu". Namazu Project. http://www.namazu.org/.
[19] "Solr". Apache Software Foundation. http://lucene.apache.org/solr/.
[20] "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/.
[21] "YaCy". Karlsruhe Institute of Technology. http://yacy.net/.
[22] "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/.
[23] "Gonzui". SourceForge. http://gonzui.sourceforge.net/.
[24] "Grub". Wikia, Inc. http://grub.org/.
[25] "Heritrix". Internet Archive. http://crawler.archive.org/.
[26] "ht://Dig". The ht://Dig Group. http://www.htdig.org/.
[27] "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch.
[28] "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch.
[29] "OpenFTS". XWare. http://www.astronet.ru/xware/#fts.
[30] "Plucene". CPAN. http://search.cpan.org/~tmtm/Plucene-1.25.
[31] "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch.
[32] "Sphinx". Sphinx Technologies. http://sphinxsearch.com/.
[33] "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch.
[34] "Whoosh". Matt Chaput. http://whoosh.ca/.
[35] "URI.pm error". Usenet. http://www.nntp.perl.org/group/perl.libwww/2006/01/msg6540.html.


Infrastructure/Search: Difference between revisions

@@ Line 38: / Line 38: @@
 * Programmable keywords to have control over what pages get displayed for certain keywords
 * XML or library interface so other applications can use it
+* Support multiple-language text (i.e., Unicode)
 === Project Plan ===
@@ Line 60: / Line 61: @@
 {| class="wikitable sortable" style="width: 100%; text-align: center; font-size: smaller; table-layout: fixed;"
 |-
-!{{rh}}| Engine Name
+! Engine Name
-!{{rh}}| Source Language
+! Source Language
-!{{rh}}| Integrated Crawler
+! Integrated Crawler
-!{{rh}}| Integrated Search Tool
+! Integrated Search Tool
-!{{rh}}| Programmable Categories
+! Programmable Categories
-!{{rh}}| Application Interface
+! Application Interface
-|-
-!{{rh}}| CLucene
-| C++
-|
-|
-|
-|
 |-
 !{{rh}}| DataparkSearch
-| C
+| C || {{Yes}} || {{Yes}} || {{Yes}} <br/> (Tags) || {{Yes}} <br/> (Native C API)
-|
-|
-|
-|
 |-
 !{{rh}}| Egothor
@@ Line 97: / Line 87: @@
 !{{rh}}| Indri
 | C/C++
-|
-|
-|
-|
-|-
-!{{rh}}| Isearch
-| C++
 |
 |
@@ Line 110: / Line 93: @@
 |-
 !{{rh}}| KinoSearch
-| Perl/C
+| Perl/C || {{No}} <br/> (sample file crawler included) || {{No}} <br/> (sample included) || {{Yes}} || {{Yes}} <br/> (BDB/JSON)
-| {{No}} <br/> (sample file crawler included)
-| {{No}} <br/> (sample included)
-| {{Yes}}
-| {{Yes}} <br/> (BDB/JSON)
 |-
-!{{rh}}| Namazu
+!{{rh}}| mnoGoSearch
-| Perl
+| C || {{Yes}} || {{Yes}} || {{Yes}} <br/> (Tags and Hierarchical categories) || {{Yes}} <br/> (Native C API)
-|
-|
-|
-|
 |-
 !{{rh}}| Nutch
-| Java
+| Java || {{Yes}} <br/> (OpenJDK command line) || {{Yes}} <br/> (Tomcat servlet) || {{No}} || {{Yes}} <br/> (Java)
-| {{Yes}} <br/> (OpenJDK command line)
-| {{Yes}} <br/> (Tomcat servlet)
-| {{No}}
-| {{Yes}} <br/> (Java)
 |-
 !{{rh}}| Swish-e
-| C/Perl
+| C/Perl || {{Yes}} <br/> (Perl) || {{Maybe|Sort Of}} <br/> (sample included, but has problems) || {{No}} <br/> (but can search on META tags) || {{Yes}} <br/> (Perl and C APIs)
-| {{Yes}} <br/> (Perl)
-| {{Maybe|Sort Of}} <br/> (sample included, but has problems)
-| {{No}} <br/> (but can search on META tags)
-| {{Yes}} <br/> (Perl and C APIs)
-|-
-!{{rh}}| Terrier
-| Java
-|
-|
-|
-|
 |-
 !{{rh}}| Xapian
-| C++
+| C++ || {{Maybe|Sort Of}} <br/> (combined Omega with custom Perl) || {{Yes}} <br/> (rudimentary Omega CGI) || {{Yes}} || {{Yes}} <br/> (C++, Perl, Python, Ruby)
-| {{Maybe|Sort Of}} <br/> (combined Omega with custom Perl)
-| {{Yes}} <br/> (rudimentary Omega CGI)
-| {{Yes}}
-| {{Yes}} <br/> (C++, Perl, Python, Ruby)
-|-
-!{{rh}}| Zettair
-| C
-|
-|
-|
-|
 |-class="sortbottom"
 ! Engine Name
@@ Line 168: / Line 117: @@
 === In Progress ===
-* CLucene <ref name="CLucene">{{cite web|url=http://sourceforge.net/projects/clucene/|title=CLucene|publisher=CLucene Project}}</ref>
+* DataparkSearch <ref name="DataparkSearch">{{cite web|url=http://www.dataparksearch.org/|title=DataparkSearch|publisher=DataparkSearch}}</ref> - akistler examined
-: C++ port of Lucene
+** Description
-: in Fedora already
+*: written in C
-: described as beta by the developers
+*: forked from mnoGoSearch in 2003 when mnoGoSearch went semi-commercial
+*: Indices are stored in a database; Supported databases include MySQL and PostgreSQL (SQLite not advertised, but listed in documentation)
-* DataparkSearch <ref name="DataparkSearch">{{cite web|url=http://www.dataparksearch.org/|title=DataparkSearch|publisher=DataparkSearch}}</ref>
+*: Support for all SBCS and some DBCS
-: written in C
+*: Search tool supports Booleans (AND, OR, NOT, NEAR)
+** Evaluation
+*** Evaluated after mnoGoSearch.
+*** Database modes are single and multi, like mnoGoSearch, but also include hashed modes, which do not support string and substring searches, and cached mode, which stores only URI indices in the database and stores word data in disk files through an additional daemon. The absence of mnoGoSearch's blob mode and the presence of cached mode in DataparkSearch appear to be a major difference between the two.
+*** As written, the setup expects that the application process logs in to the database as the schema owner, however with additional manual steps it can be made to work as not the schema owner. The call handler setup and custom stored procedure language definition present in mnoGoSearch is commented out in the setup of DataparkSearch, so PostgreSQL superuser is not required, as written. (Their presence in mnoGoSearch is questionable, anyway.) In general, DataparkSearch does appear to be a more slowly developed version of mnoGoSearch.
+*** Multiple character set support is not the default, but is specified explicitly before compilation.
+*** Bugs in create.multi.sql
+**** 185: ERROR: relation "cachedchk2" already exists
+**** 186: ERROR: column "url_id" does not exist
+***: Caused by duplicate CREATE statements in the script. Comment out one set to resolve the bug.
+*** Bugs in drop.multi.sql
+**** 16: ERROR: sequence "url_rec_id_seq" does not exist
+**** 17: ERROR: sequence "categories_rec_id_seq" does not exist
+**** 18: ERROR: sequence "qtrack_rec_id_seq" does not exist
+**** 19: ERROR: sequence "server" does not exist
+***: Caused by extraneous DROP statements. Comment them out or ignore them. They're harmless.
+*** Extended search mode appears broken. Only the first result is returned. All other results are lost. This might be only an error in the search form, in which case it can be easily debugged and fixed.
+** Requirements
+*: buildrequires
+*:* gcc make
+*:* postgresql-devel (for PostgreSQL support)
+*:* zlib-devel
+*: requires
+*:* postgresql-libs (for PostgreSQL support)
+*:* httpd
+*:* zlib
+*:* others as desired to index documents (pdf, etc.)
+** Setup Notes
+*** The tarball was compiled into /opt/dpsearch, although it should be possible to create an RPM for more conventional locations (/bin, /etc, /sbin, etc.)
+*** The code compiles and runs fine on x86_64 (compare to mnoGoSearch)
+*** Create database user (dp/search) different from database owner (dbowner/dbowner) different from superuser (postgres)
+*** Run create.multi.sql as dbowner, then grant privileges to dp
+**:<pre>
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE cachedchk  TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE cachedchk2 TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE categories TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE cookies    TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE crossdict  TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict       TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict10     TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict11     TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict12     TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict16     TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict2      TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict3      TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict32     TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict4      TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict5      TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict6      TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict7      TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict8      TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict9      TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE links      TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qinfo      TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qtrack     TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE robots     TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE server     TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE srvinfo    TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE storedchk  TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE url        TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE urlinfo    TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE categories_rec_id_seq TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qtrack_rec_id_seq     TO dp;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE url_rec_id_seq        TO dp;
+**:</pre>
+*** cp /opt/dpsearch/etc/indexer.conf-dist /opt/dpsearch/etc/indexer.conf
+**: chmod and chgrp to protect passwords!
+*** vi /opt/dpsearch/etc/indexer.conf
+**: set DBAddr, including mode=multi
+**: set LocalCharset to UTF-8 for full Unicode support
+**: set MaxHops to the maximum depth (default is 256), if desired/required
+**: add CrawlDelay, if desired/required
+**: add "Server Disallow <nowiki>http://fedoraproject.org/w/</nowiki>" and "Server Disallow <nowiki>http://fedoraproject.org/wikiold/</nowiki>"
+**: set Server to <nowiki>http://fedoraproject.org/</nowiki> (the URI to be crawled and indexed)
+*** cp /opt/dpsearch/etc/stopwords.conf-dist /opt/dpsearch/etc/stopwords.conf
+**: Note that there are no Chinese stopwords, so don't uncomment the line to configure them
+*** cp /opt/dpsearch/etc/langmap.conf-dist /opt/dpsearch/etc/langmap.conf
+*** vi /opt/dpsearch/etc/langmap.conf
+**** Comment out the lines for ca.latin1.lit.lm, ga.latin1.lit.lm, ko.utf8.lit.lm, and pt-pt.latin1.lm, because the files for them don't exist
+**** Uncomment the lines for ja.euc-jp.lm, ja.sjis.lm, ta.tscii.lm, zh.big5.lm, and zh.gb2312.lm, because the files for them do exist
+*** cp /opt/dpsearch/etc/sections.conf-dist /opt/dpsearch/etc/sections.conf
+*** cp /opt/dpsearch/etc/search.htm-dist /opt/dpsearch/etc/search.htm (htm name must not change; compare to mnoGoSearch)
+**: chmod and chgrp to protect passwords!
+*** vi /opt/dpsearch/etc/search.htm
+**: set DBAddr and LocalCharset to correspond to /opt/dpsearch/etc/indexer.conf
-* Egothor <ref name="Egothor">{{cite web|url=http://www.egothor.org/|title=Egothor|publisher=Egothor}}</ref> - '''Allen investigating'''
+* Egothor <ref name="Egothor">{{cite web|url=http://www.egothor.org/|title=Egothor|publisher=Egothor}}</ref>
-: written in Java
+** Description
+*: written in Java
 * Ferret <ref name="Ferret">{{cite web|url=http://ferret.davebalmain.com/|title=Ferret|publisher=David Balmain}}</ref>
-: Ruby port of Lucene
+** Description
-: KinoSearch and Ferret intend to merge as Lucy <ref name="Lucy"/>
+*: Ruby port of Lucene
+*: KinoSearch and Ferret intend to merge as Lucy <ref name="Lucy"/>
 * Indri <ref name="Indri">{{cite web|url=http://www.lemurproject.org/indri/|title=Indri|publisher=The Lemur Project}}</ref>
-: written in C/C++
+** Description
+*: Written in C/C++
-* Isearch <ref name="Isearch">{{cite web|url=http://isite.awcubed.com/|title=Isearch|publisher=Isite}}</ref>
+*: Indexes text and HTML, can be extended with custom parsers
-: written in C++
+*: Provides Java, PHP, and C++ APIs and, if desired, a SOAP interface
+*: Search tool is a local Java application or a PHP CGI application
 * KinoSearch <ref name="KinoSearch">{{cite web|url=http://www.rectangular.com/kinosearch/|title=KinoSearch|publisher=Rectangular Research}}</ref> - akistler examined
-:* Description
+** Description
-:: Perl/C port of Lucene
+*: Perl/C port of Lucene
-:: in Fedora already
+*: in Fedora already
-:: maintainer Rectangular Research appears to be just one guy, who considers KinoSearch to be alpha software
+*: maintainer Rectangular Research appears to be just one guy, who considers KinoSearch to be alpha software
-:: KinoSearch and Ferret intend to merge as Lucy <ref name="Lucy">{{cite web|url=http://lucene.apache.org/lucy/|title=Lucy|publisher=Apache Software Foundation}}</ref>
+*: KinoSearch and Ferret intend to merge as Lucy <ref name="Lucy">{{cite web|url=http://lucene.apache.org/lucy/|title=Lucy|publisher=Apache Software Foundation}}</ref>
-:* Evaluation
+** Evaluation
-:: Search engine library with sample indexer and search page rather than fully-functional application.  Stores indices in Berkeley DB files with JSON interfaces.  Allows custom-designed indices, including categories (exact match) to fulfill "programmable keywords" requirement.  Each document index on each document source is a single write-once file collection (BDB and JSON) in a unique directory.  Rerunning the indexer creates a new directory, obsoleting the old directory if all the old documents are included.  The old directory then needs to be cleaned up.  Postings can, however, be deleted from an index.  Additionally, only the new documents can be indexed, but that's not efficient.
+*: Search engine library with sample indexer and search page rather than fully-functional application.  Stores indices in Berkeley DB files with JSON interfaces.  Allows custom-designed indices, including categories (exact match) to fulfill "programmable keywords" requirement.  Each document index on each document source is a single write-once file collection (BDB and JSON) in a unique directory.  Rerunning the indexer creates a new directory, obsoleting the old directory if all the old documents are included.  The old directory then needs to be cleaned up.  Postings can, however, be deleted from an index.  Additionally, only the new documents can be indexed, but that's not efficient.
-:* Requirements
+** Requirements
-:: buildrequires
+*: buildrequires
-::* gcc
+*:* gcc
-::* (EPEL) perl-Module-Build
+*:* (EPEL) perl-Module-Build
-:: requires
+*: requires
-::* (EPEL) perl-JSON-XS
+*:* (EPEL) perl-JSON-XS
-:::  Problem: Desires 1.53, but EPEL has 1.43
+*::  Problem: Desires 1.53, but EPEL has 1.43
-:::: Note: <nowiki>http://web.archive.org/web/20071122035408/search.cpan.org/src/MLEHMANN/JSON-XS-1.53/</nowiki>
+*::: Note: <nowiki>http://web.archive.org/web/20071122035408/search.cpan.org/src/MLEHMANN/JSON-XS-1.53/</nowiki>
-:::: Note: works with 1.43, anyway
+*::: Note: works with 1.43, anyway
-::* (EPEL) perl-Lingua-Stem-Snowball
+*:* (EPEL) perl-Lingua-Stem-Snowball
-::* (EPEL) perl-Lingua-StopWords
+*:* (EPEL) perl-Lingua-StopWords
-::* (EPEL) perl-Parse-RecDescent
+*:* (EPEL) perl-Parse-RecDescent
-:: sample indexer reads files from the file system and requires
+*: sample indexer reads files from the file system and requires
-::* (EPEL) perl-HTML-Tree
+*:* (EPEL) perl-HTML-Tree
-:: sample cgi search script requires
+*: sample cgi search script requires
-::* (CPAN) Data::Pageset (which requires Data::Page)
+*:* (CPAN) Data::Pageset (which requires Data::Page)
-::* (EPEL) perl-Test-Exception
+*:* (EPEL) perl-Test-Exception
-::* (EPEL) perl-Class-Accessor-Chained
+*:* (EPEL) perl-Class-Accessor-Chained
-* Namazu <ref name="Namazu">{{cite web|url=http://www.namazu.org/|title=Namazu|publisher=Namazu Project}}</ref> - '''Huzaifa investigating'''
+* mnoGoSearch <ref name="mnoGoSearch">{{cite web|url=http://www.mnogosearch.org/|title=mnoGoSearch|publisher=LavTech}}</ref> - akistler examined
-: written in Perl
+** Description
-: in Fedora already
+*: written in C
+*: UNIX/Linux source code is GPL; Windows binaries are commercial, likely based on the GPL UNIX/Linux code, and lag a few versions behind
+*: Indices are stored in a database; Supported databases include MySQL, PostgreSQL, and SQLite (among others)
+*: HTTP, FTP, and NNTP crawling
+*: C, PHP, and Perl APIs (advertised, apparently only C API really included)
+*: SBCS and most MBCS supported, including most eastern Asian languages
+** Evaluation
+*** The supplied install.pl script generates a configure command, but does not support SQLite.
+*** Adding --with-sqlite3 to the generated command adds SQLite support.  An empty database must be created manually.  A URI in the indexer.conf file specifies the location of the database.  According to the documentation, sqlite:/path/to/db/file should work, but doesn't.  According to the message boards on mnoGoSearch.org, sqlite://localhost/path/to/db/file should work, but doesn't.
+*** No problems compiling with PostgreSQL support. Configure and Makefile may need significant rewriting to make good RPMs. There is no support for cross-compiling to 32-bit architectures from 64-bit machines.
+*** As written, the setup expects that the application process logs in to the database as a PostgreSQL superuser and schema owner, however with additional manual steps it can be made to work as neither the superuser nor the schema owner. Several schemata are available (single, multi, and blob). LavTech recommends blob for sites indexing more than 50k documents. The crawler is very flexible with quite a complex configuration file. The CGI search page also has nice features for "advanced" searching, although it can be customized to suit each site. Tags are labels configured within the crawler, usually by URI server component. Categories are numerical hierarchies, up to 6 levels deep, also specified in the crawler configuration.
+*** Bugs in pgsql/drop.blob.sql
+***# drop function clean_srvinfo(); (the () is omitted, but needs to be included)
+***# DROP LANGUAGE plpgsql; (missing)
+***# DROP FUNCTION plpgsql_call_handler(); (missing, has to be run twice, once for postgres, once for dbowner?)
+***#: ''Does this definition in postgres (dangerously) shortcut anything inherent for other DBs on the same server?''
+** Requirements
+*: buildrequires
+*:* gcc make
+*:* sqlite-devel (for SQLite support)
+*:* postgresql-devel (for PostgreSQL support)
+*:* zlib-devel
+*: requires
+*:* sqlite (for SQLite support)
+*:* postgresql-libs (for PostgreSQL support)
+*:* httpd
+*:* zlib
+*:* others as desired to index documents (pdf, etc.)
+** Setup Notes
+*** The tarball was compiled into /opt/mnoGoSearch, although it should be possible to create an RPM for more conventional locations (/bin, /etc, /sbin, etc.)
+*** RPM for Fedora [https://bugzilla.redhat.com/show_bug.cgi?id=673175 currently awaiting review]
+*** On x86_64 architecture, the x86_64 binary fails when indexing a crawl with the error "indexer[21272]: PQexecPrepared: ERROR:  incorrect binary data format in bind parameter 2." The tarball refuses to cross-compile to 32-bit architecture, despite tweaking the ./configure options. Compiling on a 32-bit machine and moving the binaries to the 64-bit machine works.
+*** Create database user (mno/search) different from database owner (dbowner/dbowner) different from superuser (postgres)
+*** /opt/mnoGoSearch/sbin/indexer -Ecreate, to create the tables, etc., and /opt/mnoGoSearch/sbin/indexer -Edrop, to drop tables, etc., just runs a script from /opt/mnoGoSearch/share/<db-type>/, but needs to be run as postgres
+*** Run create.blob.sql as postgres, change owners to dbowner, and grant privileges to mno
+**:<pre>
+**:ALTER TABLE bdict      OWNER TO dbowner;
+**:ALTER TABLE bdicti     OWNER TO dbowner;
+**:ALTER TABLE categories OWNER TO dbowner;
+**:ALTER TABLE crossdict  OWNER TO dbowner;
+**:ALTER TABLE dict       OWNER TO dbowner;
+**:ALTER TABLE links      OWNER TO dbowner;
+**:ALTER TABLE qcache     OWNER TO dbowner;
+**:ALTER TABLE qinfo      OWNER TO dbowner;
+**:ALTER TABLE qtrack     OWNER TO dbowner;
+**:ALTER TABLE server     OWNER TO dbowner;
+**:ALTER TABLE srvinfo    OWNER TO dbowner;
+**:ALTER TABLE url        OWNER TO dbowner;
+**:ALTER TABLE urlinfo    OWNER TO dbowner;
+**:ALTER TABLE wrdstat    OWNER TO dbowner;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE bdict      TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE bdicti     TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE categories TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE crossdict  TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE dict       TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE links      TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qcache     TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qinfo      TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qtrack     TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE server     TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE srvinfo    TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE url        TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE urlinfo    TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE wrdstat    TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE categories_rec_id_seq TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE qtrack_rec_id_seq     TO mno;
+**:GRANT SELECT,INSERT,UPDATE,DELETE ON TABLE url_rec_id_seq        TO mno;
+**:</pre>
+*** cp /opt/mnoGoSearch/etc/indexer.conf-dist /opt/mnoGoSearch/etc/indexer.conf
+**: chmod and chgrp to protect passwords!
+*** vi /opt/mnoGoSearch/etc/indexer.conf
+**: set DBAddr, including mode=blob for > 50k documents
+**: set LocalCharset to UTF-8 for full Unicode support
+**: set MaxHops to the maximum depth (default is 256), if desired/required
+**: add CrawlDelay, if desired/required
+**: add "Server Disallow <nowiki>http://fedoraproject.org/w/</nowiki>" and "Server Disallow <nowiki>http://fedoraproject.org/wikiold/</nowiki>"
+**: set Server to <nowiki>http://fedoraproject.org/</nowiki> (the URI to be crawled and indexed)
+*** cp /opt/mnoGoSearch/etc/search.htm-dist /opt/mnoGoSearch/etc/mnoGoSearch.htm (htm name must match cgi name below)
+**: chmod and chgrp to protect passwords!
+*** vi /opt/mnoGoSearch/etc/mnoGoSearch.htm
+**: set DBAddr and LocalCharset to correspond to /opt/mnoGoSearch/etc/indexer.conf
+*** cp /opt/mnoGoSearch/bin/search.cgi to /var/www/cgi-bin/mnoGoSearch
+*** To crawl the sites configured, /opt/mnoGoSearch/sbin/indexer
+*** To index the data collected, /opt/mnoGoSearch/sbin/indexer -Eblob
+*** To display database statistics, /opt/mnoGoSearch/sbin/indexer -S
+*** To clear the database, /opt/mnoGoSearch/sbin/indexer -C
 * Nutch <ref name="Nutch">{{cite web|url=http://lucene.apache.org/nutch/|title=Nutch|publisher=Apache Software Foundation}}</ref> - akistler examined
-:* Description
+** Description
-:: written in Java
+*: written in Java
-:: based on Lucene
+*: based on Lucene
-:: the crawler/indexer is a Java command line application; the default depth is 5; the default number of threads is 10
+*: the crawler/indexer is a Java command line application; the default depth is 5; the default number of threads is 10
-:: the search tool runs in a Java servelet container, e.g., Tomcat
+*: the search tool runs in a Java servelet container, e.g., Tomcat
-:* Evaluation
+** Evaluation
-:: There's nothing to build.  Simply configure the crawler (which is actually the indexer, too) and deploy/configure the searcher.  The crawler caches the pages it indexes, making the cache available to the search tool.  The search interface is extremely simple and is multi-lingual, but is almost entirely an advertisement for the Nutch project.  It doesn't look particularly easy to rebrand.  Overall, the polish of the finished product means it's less flexible to custom modifications, like programmable keywords.  After creating a new index (i.e., after a new crawl), the search application must be reloaded in Tomcat manager.  The crawler is more flexible than a brief investigation could reveal.  The official documentation leaves a lot to be desired.  Searches are for single terms only, no multiple terms or +/- Booleans.
+*: There's nothing to build.  Simply configure the crawler (which is actually the indexer, too) and deploy/configure the searcher.  The crawler caches the pages it indexes, making the cache available to the search tool.  The search interface is extremely simple and is multi-lingual, but is almost entirely an advertisement for the Nutch project.  It doesn't look particularly easy to rebrand.  Overall, the polish of the finished product means it's less flexible to custom modifications, like programmable keywords.  After creating a new index (i.e., after a new crawl), the search application must be reloaded in Tomcat manager.  The crawler is more flexible than a brief investigation could reveal.  The official documentation leaves a lot to be desired.  Searches are for single terms only, no multiple terms or +/- Booleans.
-:* Requirements
+** Requirements
-:: java-1.6.0-openjdk, java-1.6.0-openjdk-devel (explicit specification short-circuits yum's attempt to install BEA for Tomcat)
+*: java-1.6.0-openjdk, java-1.6.0-openjdk-devel (explicit specification short-circuits yum's attempt to install BEA for Tomcat)
-:: tomcat5, tomcat5-webapps, tomcat5-admin-webapps (tomcat5-admin-webapps is probably not actually required, but is very handy)
+*: tomcat5, tomcat5-webapps, tomcat5-admin-webapps (tomcat5-admin-webapps is probably not actually required, but is very handy)
-:* Set-up Notes
+** Set-up Notes
-:: The best "tutorial" is <nowiki>http://wiki.apache.org/nutch/NutchTutorial</nowiki>
+*: The best "tutorial" is <nowiki>http://wiki.apache.org/nutch/NutchTutorial</nowiki>
-:: Create /etc/profile.d/java.sh for "export JAVA_HOME=/etc/alternatives/jre" (required for the crawler, architecture independent definition)
+*: Create /etc/profile.d/java.sh for "export JAVA_HOME=/etc/alternatives/jre" (required for the crawler, architecture independent definition)
-:: In /opt (in this example, though anyplace will do) "tar xzf /path/to/tar/file/nutch-1.0.tar.gz"
+*: In /opt (in this example, though anyplace will do) "tar xzf /path/to/tar/file/nutch-1.0.tar.gz"
-:: Set http.agent.name in /opt/nutch-1.0/conf/nutch-site.xml (e.g., FedoraProject)
+*: Set http.agent.name in /opt/nutch-1.0/conf/nutch-site.xml (e.g., FedoraProject)
-:: Create a flat file of starting URLs in /opt/nutch-1.0/urls/crawl-start.txt (it can be any file name in the directory)
+*: Create a flat file of starting URLs in /opt/nutch-1.0/urls/crawl-start.txt (it can be any file name in the directory)
-:: Edit /opt/nutch-1.0/conf/crawl-urlfilter.txt to set regular expressions which keep or discard links for processing (e.g., the domain of the servers crawled)
+*: Edit /opt/nutch-1.0/conf/crawl-urlfilter.txt to set regular expressions which keep or discard links for processing (e.g., the domain of the servers crawled)
-:: If not running the crawler as root (Why would you?), create directories /opt/nutch-1.0/crawl and /opt/nutch-1.0/logs writable by whatever uid/gid used to crawl
+*: If not running the crawler as root (Why would you?), create directories /opt/nutch-1.0/crawl and /opt/nutch-1.0/logs writable by whatever uid/gid used to crawl
-::: Note: The crawler also needs to be able to create temporary files in its working directory
+*:: Note: The crawler also needs to be able to create temporary files in its working directory
-:: Run the crawler, putting the database in /opt/nutch-1.0/crawl
+*: Run the crawler, putting the database in /opt/nutch-1.0/crawl
-:: Deploy the WAR file from /opt/nutch-1.0/nutch-1.0.war as /nutch in Tomcat manager
+*: Deploy the WAR file from /opt/nutch-1.0/nutch-1.0.war as /nutch in Tomcat manager
-:: Set searcher.dir in /var/lib/tomcat5/webapps/nutch/WEB-INF/classes/nutch-site.xml to /opt/nutch-1.0/crawl
+*: Set searcher.dir in /var/lib/tomcat5/webapps/nutch/WEB-INF/classes/nutch-site.xml to /opt/nutch-1.0/crawl
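The keep/discard rules in crawl-urlfilter.txt can be sketched in a few lines of Python. This is only an approximation under stated assumptions: it assumes each rule is a "+" (keep) or "-" (discard) prefix followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is discarded. The sample rules and the names load_rules and accepts are hypothetical, and Python's regular expression dialect is not identical to the Java one Nutch uses.

```python
import re

# Illustrative rules only; they are not copied from the evaluated configuration.
SAMPLE_RULES = r"""
# skip URLs containing characters that usually indicate queries
-[?*!@=]
# keep anything under fedoraproject.org
+^http://([a-z0-9]*\.)*fedoraproject\.org/
# discard everything else
-.
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for keep, regex in rules:
        if regex.search(url):
            return keep
    return False  # assumption: a URL that matches no rule is discarded


if __name__ == "__main__":
    rules = load_rules(SAMPLE_RULES)
    for url in ("http://fedoraproject.org/wiki/Overview",
                "http://fedoraproject.org/w/index.php?title=Overview",
                "http://example.com/"):
        print(url, "keep" if accepts(rules, url) else "discard")
```

Because the first match wins, rule order matters: the query-character rule has to come before the domain rule, or wiki URLs with query strings would be kept.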
 * Swish-e <ref name="Swish-e">{{cite web|url=http://swish-e.org/|title=Swish-e|publisher=Swish-e}}</ref> - akistler examined
-:* Description
+** Description
-:: written in C
+*: written in C
-:: Note: Swish++ is a rewrite in C++ (not evaluated here)
+*: Note: Swish++ is a rewrite in C++ (not evaluated here)
-:* Evaluation
+** Evaluation
-:: Search engine with a built-in web crawler, a built-in file system crawler, and an interface for an external crawler.  The distribution includes sample search pages which use the Perl API.  There is also a C API.  The index is not customizable, but does include a facility for including metawords (exact match) and the path in the index for each document.  The documentation acknowledges that the software only supports ASCII, but some MBCS may also work.
+*: Search engine with a built-in web crawler, a built-in file system crawler, and an interface for an external crawler.  The distribution includes sample search pages which use the Perl API.  There is also a C API.  The index is not customizable, but does include a facility for including metawords (exact match) and the path in the index for each document.  The documentation acknowledges that the software only supports ASCII, but some MBCS may also work.
-:* Requirements
+** Requirements
-:: buildrequires
+*: buildrequires
-::* gcc
+*:* gcc
-::* make
+*:* make
-::* libxml2-devel
+*:* libxml2-devel
-::* zlib-devel
+*:* zlib-devel
-:: requires
+*: requires
-::* libxml2
+*:* libxml2
-::* zlib
+*:* zlib
-::* perl-libwww-perl (for the built-in spider)
+*:* perl-libwww-perl (for the built-in spider)
-::* others as desired to index documents (pdf, etc.)
+*:* others as desired to index documents (pdf, etc.)
+* Xapian <ref name="Xapian">{{cite web|url=http://xapian.org/|title=Xapian|publisher=Xapian Project}}</ref> - akistler examined
+** Description
+*: written in C++
+*: bindings to Python, Ruby, and Perl XS
+*: xapian-core, xapian-bindings, and perl-Search-Xapian in Fedora already; xapian-omega is not
+*: additional bindings to PHP, Java, and more (?)
+*: Omega provides a Xapian front-end for indexing (via script) and searching (command line or CGI)
+*: Omega provides glue scripts for ht://Dig, mbox files, and perl DBI
+*: Flax <ref name="Flax">{{cite web|url=http://www.flax.co.uk/products.shtml|title=Flax|publisher=Flax}}</ref> is another search engine built on top of Xapian and CherryPy
+** Evaluation
+*: Xapian is a search engine library.  Omega adds functionality on top of Xapian.  The Xapian database is very flexible, supporting an entirely user-designed schema.  Usage through Omega loses very little, if any, of that flexibility; however, the supplied Omega CGI is extremely rudimentary.  The supplied Omega CGI also requires the database to be named "default," although that can be changed.  Database columns are of type field or index.  Fields are stored verbatim (e.g., URL, date, MIME type, keywords).  Indices are input as blocks of text or other content to be indexed, but not stored (e.g., the corpus of a file or web page).  The Omega scriptindex utility can be combined with an external web crawler for HTML.  Making Omega work with Apache requires relabeling /var/lib/omega as httpd_sys_content, or moving /var/lib/omega to /var/www/omega and using the default context there.  In this evaluation, /var/lib/omega was moved to /var/www/omega.  Xapian only works with UTF-8.
+** Requirements
+*: xapian-core buildrequires
+*:* gcc gcc-c++
+*:* make
+*:* zlib-devel
+*: xapian-bindings buildrequires (not including gcc gcc-c++ make)
+*:* python python-devel
+*:* ruby ruby-devel
+*:* xapian-core-devel
+*: perl-Search-Xapian buildrequires (not including gcc gcc-c++ make)
+*:* perl
+*:* xapian-core-devel
+*: xapian-omega buildrequires (not including gcc gcc-c++ make)
+*:* libtool
+*:* xapian-core-devel
+*: xapian-core requires
+*:* xapian-core-libs
+*: xapian-bindings requires
+*:* coreutils
+*:* python
+*:* xapian-core-libs
+*: perl-Search-Xapian requires
+*:* perl
+*:* xapian-core-libs
+*: xapian-omega requires
+*:* httpd
+*:* perl
+*:* perl-DBI
+*:* xapian-core-libs
+** Set-up Notes
+*: Using custom crawler that requires perl-libwww-perl
+*: Updated Omega rpm .spec to move /var/lib/omega to /var/www/omega, including updating /etc/omega.conf
+*: Use default database at /var/www/omega/data/default
+*: /var/www/omega/data/default.index is
+*:: URI : field boolean=Q unique=Q
+*:: ContentType : field=MIMEtype index=T
+*:: LastModified : field=Date date=unix
+*:: Title : field index=S
+*:: Content : index
+*: To index, run (as someone who can write to /var/www/omega/data/default)
+*:: ./crawl.pl <nowiki>http://fedoraproject.org</nowiki> | scriptindex /var/www/omega/data/default /var/www/omega/data/default.index
+*: Note that "Content : unhtml index" would be preferable in the index, but unhtml apparently has bugs
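The output of crawl.pl is not shown here, so the following Python sketch only illustrates the kind of record a crawler would need to write to standard output for scriptindex. It assumes Omega's documented input format: one "name=value" line per field, continuation lines of a multi-line value prefixed with "=", and a blank line ending each record. The field names match the default.index script above; the function name format_record and the sample values are hypothetical.

```python
def format_record(fields):
    """Render one document as a scriptindex input record.

    Each field becomes 'name=value'. Newlines inside a value are escaped by
    starting every continuation line with '='. A blank line ends the record.
    """
    lines = []
    for name, value in fields.items():
        first, *rest = str(value).split("\n")
        lines.append(f"{name}={first}")
        lines.extend("=" + continuation for continuation in rest)
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    # Placeholder values for illustration; a real crawler would fill these
    # in from the HTTP response and the parsed page.
    record = format_record({
        "URI": "http://fedoraproject.org/wiki/Overview",
        "ContentType": "text/html",
        "LastModified": 1296551100,   # Unix time, matching "date=unix"
        "Title": "Overview",
        "Content": "first line of page text\nsecond line of page text",
    })
    print(record, end="")
```

Writing Unix timestamps for LastModified keeps the record consistent with the "date=unix" action in the index script.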
-* Terrier (TERabyte RetrIEveR) <ref name="Terrier">{{cite web|url=http://ir.dcs.gla.ac.uk/terrier/|title=Terrier|publisher=Terrier Project}}</ref>
+=== Eliminated from Consideration ===
-: written in Java
-* Xapian <ref name="Xapian">{{cite web|url=http://xapian.org/|title=Xapian|publisher=Xapian Project}}</ref> - akistler examined
+* CLucene <ref name="CLucene">{{cite web|url=http://sourceforge.net/projects/clucene/|title=CLucene|publisher=CLucene Project}}</ref>
+:* Reason for elimination
+:: It has no crawling/spidering facility.  It is a library toolkit (API) only.
 :* Description
-:: written in C++
+:: C++ port of Lucene
-:: bindings to Python, Ruby, and Perl XS
+:: in Fedora already
-:: xapian-core, xapian-bindings, and perl-Search-Xapian in Fedora already; xapian-omega is not
+:: described as beta by the developers
-:: additional bindings to PHP, Java, and more (?)
-:: Omega provides a Xapian front-end for indexing (via script) and searching (command line or CGI)
-:: Omega provides glue scripts for ht://Dig, mbox files, and perl DBI
-:: Flax <ref name="Flax">{{cite web|url=http://www.flax.co.uk/products.shtml|title=Flax|publisher=Flax}}</ref> is another search engine built on top of Xapian and CherryPy
-:* Evaluation
-:: Xapian is a search engine library.  Omega adds functionality on top of Xapian.  The Xapian database is very flexible, supporting an entirely user-designed schema.  Usage through Omega loses very little, if any, of that flexibility, however the supplied Omega CGI is extremely rudimentary.  The supplied Omega CGI also requires the database to be named "default," although that can be changed.  Database columns are of type field or index.  Fields are stored verbatim (e.g., URL, date, MIME type, keywords).  Indices are input as blocks of text or other content to be indexed, but not stored (e.g., the corpus of a file or web page).  The Omega scriptindex utility can be combined with an external web crawler for HTML.  Making Omega work with Apache requires relabeling /var/lib/omega as httpd_sys_content, or moving /var/lib/omega to /var/www/omega and using the default context there.  In this evaluation, /var/lib/omega was moved to /var/www/omega.  Xapian only works with UTF-8.
-:* Requirements
-:: xapian-core buildrequires
-::* gcc gcc-c++
-::* make
-::* zlib-devel
-:: xapian-bindings buildrequires (not including gcc gcc-c++ make)
-::* python python-devel
-::* ruby ruby-devel
-::* xapian-core-devel
-:: perl-Search-Xapian buildrequires (not including gcc gcc-c++ make)
-::* perl
-::* xapian-core-devel
-:: xapian-omega buildrequires (not including gcc gcc-c++ make)
-::* libtool
-::* xapian-core-devel
-:: xapian-core requires
-::* xapian-core-libs
-:: xapian-bindings requires
-::* coreutils
-::* python
-::* xapian-core-libs
-:: perl-Search-Xapian requires
-::* perl
-::* xapian-core-libs
-:: xapian-omega requires
-::* httpd
-::* perl
-::* perl-DBI
-::* xapian-core-libs
-:* Set-up Notes
-:: Using custom crawler that requires perl-libwww-perl
-:: Updated Omega rpm .spec to move /var/lib/omega to /var/www/omega, including updating /etc/omega.conf
-:: Use default database at /var/www/omega/data/default
-:: /var/www/omega/data/default.index is
-::: URI : field boolean=Q unique=Q
-::: ContentType : field=MIMEtype index=T
-::: LastModified : field=Date date=unix
-::: Title : field index=S
-::: Content : index
-:: To index, run (as someone who can write to /var/www/omega/data/default)
-::: ./crawl.pl <nowiki>http://fedoraproject.org</nowiki> | scriptindex /var/www/omega/data/default /var/www/omega/data/default.index
-:: Note that "Content : unhtml index" would be preferable in the index, but unhtml apparently has bugs
-* Zettair <ref name="Zettair">{{cite web|url=http://www.seg.rmit.edu.au/zettair/|title=Zettair|publisher=Search Engine Group, Royal Melbourne Institute of Technology}}</ref>
+* Isearch <ref name="Isearch">{{cite web|url=http://isite.awcubed.com/|title=Isearch|publisher=Isite}}</ref>
-: written in C
+** Reason for elimination
+*: It has no crawling/spidering facility. It is intended for local file indexing only.
-=== Eliminated from Consideration ===
+*: The web site appears to have fallen out of maintenance.
+** Description
+*: written in C++
 * Lucene <ref name="Lucene">{{cite web|url=http://lucene.apache.org/|title=Lucene|publisher=Apache Software Foundation}}</ref> - akistler examined
@@ Line 346: / Line 476: @@
 ::* java-1.4.2-gcj-compat
-* mnoGoSearch <ref name="mnoGoSearch">{{cite web|url=http://www.mnogosearch.org/|title=mnoGoSearch|publisher=LavTech}}</ref> - akistler examined
+* Namazu <ref name="Namazu">{{cite web|url=http://www.namazu.org/|title=Namazu|publisher=Namazu Project}}</ref>
 :* Reason for elimination
-:: Uses an external database.  Tested against SQLite.  Didn't work.
+:: It has no crawling/spidering facility.  It indexes local documents only.
 :* Description
-:: written in C
+:: written in Perl
-:: UNIX/Linux source code is GPL; Windows binaries are commercial, likely based on the GPL UNIX/Linux code, and lag a few versions behind
+:: in Fedora already
-:: Indices are stored in a database; Supported databases include MySQL, PostgreSQL, and SQLite (among others)
-:: HTTP, FTP, and NNTP crawling
-:: C, PHP, and Perl APIs
-:: SBCS and most MBCS supported, including most eastern Asian languages
-:* Evaluation
-:: The supplied install.pl script generates a configure command, but does not support SQLite.  Adding --with-sqlite3 to the generated command adds SQLite support.  An empty database must be created manually.  A URI in the indexer.conf file specifies the location of the database.  According to the documentation, sqlite:/path/to/db/file should work, but doesn't.  According to the message boards on mnoGoSearch.org, sqlite://localhost/path/to/db/file should work, but doesn't.  No other databases were tested for evaluation.
-:* Requirements
-:: buildrequires
-::* gcc make
-::* sqlite-devel (for SQLite support)
-::* zlib-devel
-:: requires
-::* sqlite (for SQLite support)
-::* zlib
 * Solr <ref name="Solr">{{cite web|url=http://lucene.apache.org/solr/|title=Solr|publisher=Apache Software Foundation}}</ref> - akistler examined
@@ Line 384: / Line 500: @@
 :: requires
 ::* java-1.6.0-openjdk
+* Terrier (TERabyte RetrIEveR) <ref name="Terrier">{{cite web|url=http://ir.dcs.gla.ac.uk/terrier/|title=Terrier|publisher=Terrier Project}}</ref>
+:* Reason for elimination
+:: It has no crawler or user search tool. It does not run as a service (as provided), only interactively.
+:* Description
+:: written in Java
+:: runs from the command line (i.e., not a Tomcat servlet)
 * YaCy <ref name="YaCy">{{cite web|url=http://yacy.net/|title=YaCy|publisher=Karlsruhe Institute of Technology}}</ref> - huzaifa examined
@@ Line 393: / Line 516: @@
 :* customized search parameters
 :* fast indexing and web interface for querying the back-end db
+* Zettair <ref name="Zettair">{{cite web|url=http://www.seg.rmit.edu.au/zettair/|title=Zettair|publisher=Search Engine Group, Royal Melbourne Institute of Technology}}</ref>
+:* Reason for elimination
+:: No crawling capability, only indexes local documents
+:: User search/retrieval tool is command-line only, no web interface
+:* Description
+:: written in C
 === Never Considered ===
@@ Line 451: / Line 581: @@
 === Search Engines ===
-*Xapian (<nowiki>http://publictest3.fedoraproject.org/cgi-bin/omega</nowiki>)
+* DataparkSearch (<nowiki>http://publictest3.fedoraproject.org/cgi-bin/dpsearch</nowiki>)
-:At this time, only installed xapian-core-libs, xapian-core, and xapian-omega (i.e., no xapian-bindings or perl-Search-xapian)
+: PostgreSQL installed (postgresql, postgresql-devel, postgresql-libs, postgresql-server)
-:Enabled cgi-bin in /etc/httpd/conf.d/cgi-bin.conf
+: SELinux (not present on publictest3, but needed eventually) needs:
-:Omega bombs on <nowiki>http://fedoraproject.org/wiki/Overview</nowiki> (and would possibly on others later) with "unhtml index"
+:: "setsebool -P httpd_can_network_connect=1" to connect to PostgreSQL
-::Resolution: Don't use "unhtml"
+: Crawling trials (with database cleared each time, i.e., not incremental)
-:Omega bombs on long URIs (longer than 24 chars)
+:: Memory and CPU utilization are modest, less than 10% each. Most CPU time is spent in I/O Wait for the database.
-::Example: <nowiki>http://fedoraproject.org/wiki/Special:WhatLinksHere/Ru_RU/%D0%9F%D0%BB%D0%B0%D0%BD_%D1%80%D0%B0%D0%B1%D0%BE%D1%82%D1%8B_%D0%BF%D0%BE_%D0%BF%D0%B5%D1%80%D0%B5%D0%B2%D0%BE%D0%B4%D1%83_web-%D1%81%D0%B0%D0%B9%D1%82%D0%B0_%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82%D0%B0_Fedora</nowiki>
+:: Depth=4, 2.5 hrs crawling, 2k documents, db = 320M (700M with clone detection off)
-::Resolution: Enhanced custom crawler to filter URIs better (fedoraproject.org/w/ from the wiki); added capability to discard URIs that are too long (mostly hex URIs translated from other DBCS)
+:: Depth=5, 16.5 hrs crawling, 12k documents, db = 1.1G
-:Perl custom crawler prints warnings for (and refuses to translate) URIs with Unicode characters outside the Latin 1 range
-::Resolution: None. This issue is known for URI.pm.<ref name="URI.pm">{{cite web|url=http://www.nntp.perl.org/group/perl.libwww/2006/01/msg6540.html|title="URI.pm error"|publisher=Usenet}}</ref>
+* mnoGoSearch (<nowiki>http://publictest3.fedoraproject.org/cgi-bin/mnoGoSearch</nowiki>)
-:First successful crawl
+: PostgreSQL installed (postgresql, postgresql-devel, postgresql-libs, postgresql-server)
-::4 hour run, scriptindex used 40% of 4G of memory (2G RAM, 2G swap) for Depth=4 (15218 documents)
+: SELinux (not present on publictest3, but needed eventually) needs:
-::Index is about 500M on disk
+:: "setsebool -P httpd_can_network_connect=1" to connect to PostgreSQL
-::In a previous attempt, scriptindex used 70% of memory (0% free) at Depth=5 (terminated, system sluggish with swap I/O)
+: Crawling trials (with database cleared each time, i.e., not incremental)
-:Second successful crawl
+:: Memory and CPU utilization are quite modest, about 1% each. Most CPU time is spent in I/O Wait for the database.
-::8 hour run, scriptindex used 20% of 6G of memory (4G RAM, 2G swap) for Depth=5 (41171 documents)
+:: Depth=4, 2 hrs crawling, 1.5 min indexing, 11k documents
-::Index is about 1G on disk
+:: Depth=5, 4.5 hrs crawling, 12 min indexing, 25k documents
+:: Depth=6, 6.5 hrs crawling, 16 min indexing, 34k documents
+:: Depth=7, 12 hrs crawling, 23 min indexing, 40k documents, db = 2.6G
+* Nutch (<nowiki>http://publictest3.fedoraproject.org/nutch</nowiki>)
+: The Nutch tarball was unpacked in /opt/nutch-1.0, just as in preliminary local testing
+: Tomcat is reverse proxied through Apache (see notes below)
+: Nutch's definition/conception of depth appears to be unusual.  The crawler must be directed to spider much more deeply than should be necessary.
+: Crawls are executed as (e.g.) "/opt/nutch-1.0/bin/nutch crawl /opt/nutch-1.0/urls -dir /opt/nutch-1.0/crawl -depth 5 -threads 2"
+: Crawling trials
+:: java process uses about 18% of 6G of memory (4G RAM, 2G swap), regardless of depth
+:: Depth=4, 2 threads, 1.5k documents
+:: Depth=5, 2 threads, 3 hrs, 8k documents
+:: Depth=6, 1 thread, 8.5 hrs, 23k documents
+:: Depth=7, 1 thread, 14.5 hrs, 37k documents, db = 400M
+:: Depth=8, 1 thread, 16.5 hrs, 44k documents, db = 440M
+* Xapian (<nowiki>http://publictest3.fedoraproject.org/cgi-bin/omega</nowiki>)
+: At this time, only installed xapian-core-libs, xapian-core, and xapian-omega (i.e., no xapian-bindings or perl-Search-Xapian)
+: Enabled cgi-bin in /etc/httpd/conf.d/cgi-bin.conf (see notes below)
+: Omega bombs on <nowiki>http://fedoraproject.org/wiki/Overview</nowiki> (and would possibly on others later) with "unhtml index"
+:: Resolution: Don't use "unhtml"
+: Omega bombs on long URIs (longer than 244 chars)
+:: Example: <nowiki>http://fedoraproject.org/wiki/Special:WhatLinksHere/Ru_RU/%D0%9F%D0%BB%D0%B0%D0%BD_%D1%80%D0%B0%D0%B1%D0%BE%D1%82%D1%8B_%D0%BF%D0%BE_%D0%BF%D0%B5%D1%80%D0%B5%D0%B2%D0%BE%D0%B4%D1%83_web-%D1%81%D0%B0%D0%B9%D1%82%D0%B0_%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82%D0%B0_Fedora</nowiki>
+:: Resolution: Enhanced custom crawler to filter URIs better (fedoraproject.org/w/ from the wiki); added capability to discard URIs that are too long (mostly hex URIs translated from other DBCS)
+: Perl custom crawler prints warnings for (and refuses to translate) URIs with Unicode characters outside the Latin 1 range
+:: Resolution: None. This issue is known for URI.pm.<ref name="URI.pm">{{cite web|url=http://www.nntp.perl.org/group/perl.libwww/2006/01/msg6540.html|title="URI.pm error"|publisher=Usenet}}</ref>
+: Crawling trials
+:: Failed, Depth=5, scriptindex used 70% of 4G of memory (2G RAM, 2G swap, 0% free)
+::: terminated, system sluggish with swap I/O
+:: Depth=4, 4 hrs, 15218 documents, index is about 500M on disk, scriptindex used 40% of 4G of memory (2G RAM, 2G swap)
+:: Depth=5, 8 hrs, 41171 documents, index is about 1G on disk, scriptindex used 20% of 6G of memory (4G RAM, 2G swap)
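To make the crawling trials above easier to compare, the following Python sketch converts each trial into an approximate documents-per-hour figure. The numbers are copied from the trial notes. The comparison is rough: most document counts are rounded to the nearest thousand, memory and thread counts differed between runs, mnoGoSearch's separate indexing minutes are left out, and the Nutch Depth=4 trial is omitted because no duration was recorded. The names TRIALS and summarize are hypothetical.

```python
# (depth, crawl hours, documents) per trial, taken from the notes above.
TRIALS = {
    "DataparkSearch": [(4, 2.5, 2000), (5, 16.5, 12000)],
    "mnoGoSearch": [(4, 2.0, 11000), (5, 4.5, 25000),
                    (6, 6.5, 34000), (7, 12.0, 40000)],
    "Nutch": [(5, 3.0, 8000), (6, 8.5, 23000),
              (7, 14.5, 37000), (8, 16.5, 44000)],
    "Xapian": [(4, 4.0, 15218), (5, 8.0, 41171)],
}


def summarize(trials):
    """Map each engine to a list of (depth, documents per hour) pairs."""
    return {
        engine: [(depth, round(documents / hours))
                 for depth, hours, documents in runs]
        for engine, runs in trials.items()
    }


if __name__ == "__main__":
    for engine, rows in summarize(TRIALS).items():
        for depth, rate in rows:
            print(f"{engine:15s} depth={depth}  about {rate} documents/hour")
```

The rates say nothing about index quality or search behavior; they only put the crawl durations on a common scale.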
 === Apache Configuration Notes ===
-*CGI for Xapian Omega
+* CGI for Xapian Omega and mnoGoSearch
-:In /etc/httpd/conf.d/cgi-bin.conf, just use the default configuration normally commented out in httpd.conf.
+: In /etc/httpd/conf.d/cgi-bin.conf, just use the default configuration normally commented out in httpd.conf.
+<pre>
   # ScriptAlias: This controls which directories contain server scripts.
   # ScriptAliases are essentially the same as Aliases, except that
@@ Line 492: / Line 653: @@
       Allow from all
   </Directory>
+</pre>
-*Reverse proxy for Tomcat
+* Reverse proxy for Tomcat
-:In /etc/httpd/conf.d/tomcat5.conf:
+: In /etc/httpd/conf.d/tomcat5.conf:
+<pre>
   <Location /admin>
     Order Allow,Deny
@@ Line 512: / Line 674: @@
   ProxyPass        /nutch   <nowiki>http://localhost:8082/nutch</nowiki>
   ProxyPassReverse /nutch   <nowiki>http://localhost:8082/nutch</nowiki>
+</pre>
 === Tomcat Configuration Notes ===
-*In /etc/tomcat5/server.xml:
+* In /etc/tomcat5/server.xml:
-:comment out the AJP connector on port 8009
+: comment out the AJP connector on port 8009
-:comment out the HTTP connector on port 8080
+: comment out the HTTP connector on port 8080
-:uncomment the proxied HTTP connector on 8082
+: uncomment the proxied HTTP connector on 8082
-:add proxyName to the HTTP connector on 8082
+: add proxyName to the HTTP connector on 8082
-::could alternatively define proxyName and proxyPort and undefine redirectPort in the HTTP connector on port 8080
+:: could alternatively define proxyName and proxyPort and undefine redirectPort in the HTTP connector on port 8080
-*SELinux (not present on publictest3, but needed eventually)
+* SELinux (not present on publictest3, but needed eventually)
-:For port 8082, SELinux needs "setsebool -P httpd_can_network_connect=1"
+: For port 8082, SELinux needs "setsebool -P httpd_can_network_connect=1"
-:Alternatively, for port 8080, SELinux needs "setsebool -P httpd_can_network_relay=1"
+: Alternatively, for port 8080, SELinux needs "setsebool -P httpd_can_network_relay=1"
-*manager and admin (in fact, all users) defined in /etc/tomcat5/tomcat-users.xml
+* manager and admin (in fact, all users) defined in /etc/tomcat5/tomcat-users.xml
-*Recommended, but not done here, change the shutdown password in server.xml (default is SHUTDOWN)
+* Recommended, but not done here: change the shutdown password in server.xml (default is SHUTDOWN)
 == Deployment Plan ==