Points of Contact
Project Sponsor
Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath
Secondary Contact info
Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure
Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure
Project Info
Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13
Description/Summary
Fedora needs a search engine[1]
Requirements
- Crawl the web sites (wiki and non-wiki)
- Search the web sites (wiki and non-wiki)
- Java, if any, must be the GCJ/OpenJDK versions in RHEL5; Sun/IBM/BEA Java is not acceptable
Preferences
- Python-based
- Programmable keywords to have control over what pages get displayed for certain keywords
- XML or library interface so other applications can use it
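The "programmable keywords" preference can be pictured as a small lookup layer in front of whichever engine is selected. The following Python sketch is only an illustration under stated assumptions: the PINNED table, the example.org URLs, and the apply_pinned function are hypothetical and do not come from any candidate engine.

```python
# Hypothetical keyword table: keyword -> pages that should be listed first.
# The URLs are placeholders, not real Fedora pages.
PINNED = {
    "download": ["http://example.org/get-fedora"],
    "join": ["http://example.org/wiki/Join"],
}


def apply_pinned(query, engine_results, pinned=PINNED):
    """Return pinned URLs for keywords in the query, then the engine's results.

    Duplicates are dropped so a pinned page is never listed twice.
    """
    ordered = []
    for word in query.lower().split():
        ordered.extend(pinned.get(word, []))
    ordered.extend(engine_results)

    seen = set()
    unique = []
    for url in ordered:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

A layer like this would sit outside the engine, so it could be kept even if the engine underneath changes. Engines with native category or metaword support (see the comparison below) might make it unnecessary.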
Project Plan
- Investigate and evaluate existing open source search engines
- Select candidate software
- Create public test instances of candidate software
- Test for functionality, performance, and impact (re-evaluate, if necessary)
- Create capacity and deployment plans
- Deploy
Resources Needed
- Public test instance for testing candidate software
- Permanent home(s) for deployment
- Web server(s)
- Database server(s) (maybe)
Software Investigation and Evaluation
Comparison by Requirements
Engine Name | Source Language | Integrated Crawler | Integrated Search Tool | Programmable Categories | Application Interface |
---|---|---|---|---|---|
CLucene | C++ | | | | |
DataparkSearch | C | | | | |
Egothor | Java | | | | |
Ferret | Ruby | | | | |
Indri | C/C++ | | | | |
Isearch | C++ | | | | |
KinoSearch | Perl/C | No (sample file crawler included) | No (sample included) | Yes | Yes (BDB/JSON) |
Namazu | Perl | | | | |
Nutch | Java | Yes (OpenJDK command line) | Yes (Tomcat servlet) | No | Yes (Java) |
Swish-e | C/Perl | Yes (Perl) | Sort Of (sample included, but has problems) | No (but can search on META tags) | Yes (Perl and C APIs) |
Xapian | C++ | Sort Of (combined Omega with custom Perl) | Yes (rudimentary Omega CGI) | Yes | Yes (C++, Perl, Python, Ruby) |
Zettair | C | | | | |
In Progress
- CLucene [2]
- C++ port of Lucene
- in Fedora already
- described as beta by the developers
- DataparkSearch [3]
- Description
- written in C
- forked from mnoGoSearch in 2003 when mnoGoSearch went semi-commercial
- Indices are stored in a database; Supported databases include MySQL and PostgreSQL (among others, but not SQLite)
- Support for all SBCS and some DBCS
- Search tool supports Booleans (AND, OR, NOT, NEAR)
- Egothor [4]
- written in Java
- Ferret [5]
- Ruby port of Lucene
- KinoSearch and Ferret intend to merge as Lucy [6]
- Indri [7]
- written in C/C++
- Isearch [8]
- written in C++
- KinoSearch [9] - akistler examined
- Description
- Perl/C port of Lucene
- in Fedora already
- maintainer Rectangular Research appears to be just one guy, who considers KinoSearch to be alpha software
- KinoSearch and Ferret intend to merge as Lucy [6]
- Evaluation
- KinoSearch is a search engine library with a sample indexer and search page rather than a fully functional application. It stores indices in Berkeley DB files with JSON interfaces. It allows custom-designed indices, including categories (exact match), which would fulfill the "programmable keywords" preference. Each document index on each document source is a single write-once file collection (BDB and JSON) in a unique directory. Rerunning the indexer creates a new directory, which obsoletes the old directory if all the old documents are included. The old directory then needs to be cleaned up. Postings can, however, be deleted from an index. Alternatively, the indexer can be run on only the new documents, but that is not efficient.
- Requirements
- buildrequires
- gcc
- (EPEL) perl-Module-Build
- requires
- (EPEL) perl-JSON-XS
- Problem: Desires 1.53, but EPEL has 1.43
- Note: http://web.archive.org/web/20071122035408/search.cpan.org/src/MLEHMANN/JSON-XS-1.53/
- Note: works with 1.43, anyway
- (EPEL) perl-Lingua-Stem-Snowball
- (EPEL) perl-Lingua-StopWords
- (EPEL) perl-Parse-RecDescent
- sample indexer reads files from the file system and requires
- (EPEL) perl-HTML-Tree
- sample cgi search script requires
- (CPAN) Data::Pageset (which requires Data::Page)
- (EPEL) perl-Test-Exception
- (EPEL) perl-Class-Accessor-Chained
- mnoGoSearch [10] - Allen reexamining
- Reason for elimination
- Uses an external database. Tested against SQLite. Didn't work.
- Description
- written in C
- UNIX/Linux source code is GPL; Windows binaries are commercial, likely based on the GPL UNIX/Linux code, and lag a few versions behind
- Indices are stored in a database; Supported databases include MySQL, PostgreSQL, and SQLite (among others)
- HTTP, FTP, and NNTP crawling
- C, PHP, and Perl APIs
- SBCS and most MBCS supported, including most eastern Asian languages
- Evaluation
- The supplied install.pl script generates a configure command, but does not support SQLite. Adding --with-sqlite3 to the generated command adds SQLite support. An empty database must be created manually. A URI in the indexer.conf file specifies the location of the database. According to the documentation, sqlite:/path/to/db/file should work, but doesn't. According to the message boards on mnoGoSearch.org, sqlite://localhost/path/to/db/file should work, but doesn't. No other databases were tested for evaluation.
- Requirements
- buildrequires
- gcc make
- sqlite-devel (for SQLite support)
- zlib-devel
- requires
- sqlite (for SQLite support)
- zlib
- Namazu [11] - Huzaifa investigating
- written in Perl
- in Fedora already
- Nutch [12] - akistler examined
- Description
- written in Java
- based on Lucene
- the crawler/indexer is a Java command line application; the default depth is 5; the default number of threads is 10
- the search tool runs in a Java servlet container, e.g., Tomcat
- Evaluation
- There's nothing to build. Simply configure the crawler (which is actually the indexer, too) and deploy/configure the searcher. The crawler caches the pages it indexes, making the cache available to the search tool. The search interface is extremely simple and is multi-lingual, but is almost entirely an advertisement for the Nutch project. It doesn't look particularly easy to rebrand. Overall, the polish of the finished product means it's less flexible to custom modifications, like programmable keywords. After creating a new index (i.e., after a new crawl), the search application must be reloaded in Tomcat manager. The crawler is more flexible than a brief investigation could reveal. The official documentation leaves a lot to be desired. Searches are for single terms only, no multiple terms or +/- Booleans.
- Requirements
- java-1.6.0-openjdk, java-1.6.0-openjdk-devel (explicit specification short-circuits yum's attempt to install BEA for Tomcat)
- tomcat5, tomcat5-webapps, tomcat5-admin-webapps (tomcat5-admin-webapps is probably not actually required, but is very handy)
- Set-up Notes
- The best "tutorial" is http://wiki.apache.org/nutch/NutchTutorial
- Create /etc/profile.d/java.sh for "export JAVA_HOME=/etc/alternatives/jre" (required for the crawler, architecture independent definition)
- In /opt (in this example, though anyplace will do) "tar xzf /path/to/tar/file/nutch-1.0.tar.gz"
- Set http.agent.name in /opt/nutch-1.0/conf/nutch-site.xml (e.g., FedoraProject)
- Create a flat file of starting URLs in /opt/nutch-1.0/urls/crawl-start.txt (it can be any file name in the directory)
- Edit /opt/nutch-1.0/conf/crawl-urlfilter.txt to set regular expressions which keep or discard links for processing (e.g., the domain of the servers crawled)
- If not running the crawler as root (Why would you?), create directories /opt/nutch-1.0/crawl and /opt/nutch-1.0/logs writable by whatever uid/gid used to crawl
- Note: The crawler also needs to be able to create temporary files in its working directory
- Run the crawler, putting the database in /opt/nutch-1.0/crawl
- Deploy the WAR file from /opt/nutch-1.0/nutch-1.0.war as /nutch in Tomcat manager
- Set searcher.dir in /var/lib/tomcat5/webapps/nutch/WEB-INF/classes/nutch-site.xml to /opt/nutch-1.0/crawl
- Swish-e [13] - akistler examined
- Description
- written in C
- Note: Swish++ is a rewrite in C++ (not evaluated here)
- Evaluation
- Search engine with a built-in web crawler, a built-in file system crawler, and an interface for an external crawler. The distribution includes sample search pages which use the Perl API. There is also a C API. The index is not customizable, but does include a facility for including metawords (exact match) and the path in the index for each document. The documentation acknowledges that the software only supports ASCII, but some MBCS may also work.
- Requirements
- buildrequires
- gcc
- make
- libxml2-devel
- zlib-devel
- requires
- libxml2
- zlib
- perl-libwww-perl (for the built-in spider)
- others as desired to index documents (pdf, etc.)
- Xapian [14] - akistler examined
- Description
- written in C++
- bindings to Python, Ruby, and Perl XS
- xapian-core, xapian-bindings, and perl-Search-Xapian in Fedora already; xapian-omega is not
- additional bindings to PHP, Java, and more (?)
- Omega provides a Xapian front-end for indexing (via script) and searching (command line or CGI)
- Omega provides glue scripts for ht://Dig, mbox files, and perl DBI
- Flax [15] is another search engine built on top of Xapian and CherryPy
- Evaluation
- Xapian is a search engine library. Omega adds functionality on top of Xapian. The Xapian database is very flexible, supporting an entirely user-designed schema. Usage through Omega loses very little, if any, of that flexibility; however, the supplied Omega CGI is extremely rudimentary. The supplied Omega CGI also requires the database to be named "default," although that can be changed. Database columns are of type field or index. Fields are stored verbatim (e.g., URL, date, MIME type, keywords). Indices are input as blocks of text or other content to be indexed, but not stored (e.g., the corpus of a file or web page). The Omega scriptindex utility can be combined with an external web crawler for HTML. Making Omega work with Apache under SELinux requires relabeling /var/lib/omega as httpd_sys_content_t, or moving /var/lib/omega to /var/www/omega and using the default context there. In this evaluation, /var/lib/omega was moved to /var/www/omega. Xapian only works with UTF-8.
- Requirements
- xapian-core buildrequires
- gcc gcc-c++
- make
- zlib-devel
- xapian-bindings buildrequires (not including gcc gcc-c++ make)
- python python-devel
- ruby ruby-devel
- xapian-core-devel
- perl-Search-Xapian buildrequires (not including gcc gcc-c++ make)
- perl
- xapian-core-devel
- xapian-omega buildrequires (not including gcc gcc-c++ make)
- libtool
- xapian-core-devel
- xapian-core requires
- xapian-core-libs
- xapian-bindings requires
- coreutils
- python
- xapian-core-libs
- perl-Search-Xapian requires
- perl
- xapian-core-libs
- xapian-omega requires
- httpd
- perl
- perl-DBI
- xapian-core-libs
- Set-up Notes
- Using custom crawler that requires perl-libwww-perl
- Updated Omega rpm .spec to move /var/lib/omega to /var/www/omega, including updating /etc/omega.conf
- Use default database at /var/www/omega/data/default
- /var/www/omega/data/default.index is
- URI : field boolean=Q unique=Q
- ContentType : field=MIMEtype index=T
- LastModified : field=Date date=unix
- Title : field index=S
- Content : index
- To index, run (as someone who can write to /var/www/omega/data/default)
- ./crawl.pl http://fedoraproject.org | scriptindex /var/www/omega/data/default /var/www/omega/data/default.index
- Note that "Content : unhtml index" would be preferable in the index, but unhtml apparently has bugs
- Zettair [16]
- written in C
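For the Xapian set-up notes above, scriptindex reads plain-text records made of name=value lines, one record per document, with a blank line between records and continuation lines that start with "=". The custom crawl.pl is not reproduced on this page, so the following Python sketch is not that crawler. It only illustrates, under that assumption about the input format, how a crawler could emit records whose field names match the default.index script shown above. The function name format_record and the sample values are hypothetical.

```python
def format_record(fields):
    """Render one document as a scriptindex input record.

    `fields` maps field names (matching the index script) to values.
    Each field becomes "name=value"; embedded newlines continue on lines
    that start with "=". A trailing blank line terminates the record.
    """
    lines = []
    for name, value in fields.items():
        first, *rest = str(value).split("\n")
        lines.append(f"{name}={first}")
        lines.extend(f"={more}" for more in rest)
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    # Sample values are placeholders for what a crawler would fetch.
    print(format_record({
        "URI": "http://example.org/wiki/Overview",
        "ContentType": "text/html",
        "LastModified": 1262304000,  # Unix time, as "date=unix" expects
        "Title": "Overview",
        "Content": "First line of page text\nSecond line of page text",
    }), end="")
```

Output in this shape could be piped to scriptindex in the same way as the crawl.pl output in the set-up notes.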
Eliminated from Consideration
- Lucene [17] - akistler examined
- Reason for elimination
- It has no crawling/spidering facility. It has no user query interface. There are no samples.
- Description
- written in Java, but ported to others [18]
- Requires/uses GCJ
- in Fedora already
- PyLucene [19] is a Python wrapper around Java Lucene
- Evaluation
- Search engine library meant to be integrated into applications
- Requirements
- buildrequires (based on 1.4.3-f7)
- ant
- ant-junit
- java-1.4.2-gcj-compat-devel
- javacc
- jpackage-utils
- junit
- make
- requires (based on 1.4.3-f7)
- java-1.4.2-gcj-compat
- Solr [20] - akistler examined
- Reason for elimination
- It has no crawling/spidering facility. It has no user query interface. There are no samples.
- Description
- written in Java
- based on Lucene
- Evaluation
- The documentation describes installing Sun Java to run Solr, but OpenJDK 1.5 or later is fine. Solr needs a Java servlet container in which to run. It comes with Jetty, but other containers should work, as well (e.g., Tomcat). Solr currently supports only UTF-8 characters.
- Basically Solr provides an HTTP admin GUI for a search engine that uses a superset of the Lucene query syntax. The schema is very flexible. Set-up is essentially entirely through XML files. Applications can query the servlet port and get XML or JSON responses.
- Requirements
- buildrequires
- ant (note that ant currently pulls in java-gcj-compat, too, but it appears not to be a problem)
- ant-junit
- java-1.6.0-openjdk-devel
- junit
- requires
- java-1.6.0-openjdk
- Terrier (TERabyte RetrIEveR) [21]
- Reason for elimination
- It has no crawler or user search tool. It does not run as a service (as provided), only interactively.
- Description
- written in Java
- runs from the command line (i.e., not a Tomcat servlet)
- YaCy [22] - huzaifa examined
- Reason for elimination
- Requires Sun Java
- written in Java
- well maintained
- support for peer search engine database exchanges
- customized search parameters
- fast indexing and web interface for querying the back-end db
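The Solr evaluation above notes that applications can query the servlet port and get XML or JSON responses. As a rough illustration, the Python sketch below builds such a query URL without sending it. The base URL (the port used by the bundled Jetty example) and the function name solr_select_url are assumptions for illustration, not part of this evaluation.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10, fmt="json"):
    """Build a URL for Solr's select handler.

    `base_url` is the Solr web application root (assumed here);
    `fmt` is passed as the `wt` (writer type) parameter, e.g. "json" or "xml".
    """
    params = {"q": query, "wt": fmt, "rows": rows}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Assumed local test instance; adjust host and port as deployed.
    print(solr_select_url("http://localhost:8983/solr", "fedora wiki", rows=5))
```

The response could then be fetched with any HTTP client and parsed as JSON or XML, depending on the wt value.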
Never Considered
- Gonzui [23]
- written in Ruby
- specializes in source code search
- not actively maintained
- Grub [24]
- written in C#
- Heritrix [25]
- written in Java
- archives content rather than indexing it
- ht://Dig [26]
- written in C++
- not actively maintained
- HtdigSearch [27]
- It's just a MediaWiki plugin, not suitable for searching non-wiki sites
- MWSearch [28]
- Requires EzMwLucene (Java) to be running on the servers to be searched
- EzMwLucene is wiki-only, therefore MWSearch is wiki-only
- OpenFTS [29]
- written in Perl or TCL on top of PostgreSQL
- Python interface available
- not actively maintained
- Plucene [30]
- Perl port of Lucene
- not actively maintained
- RigorousSearch [31]
- Crawls the MediaWiki database, not the web site
- Doesn't work for non-MediaWiki web sites, including any non-wiki web site
- Sphinx [32]
- written in C++
- designed to index SQL tables, not web pages
- SphinxSearch [33]
- written in C++
- MediaWiki plug-in, so it's wiki-only
- Whoosh [34]
- written in Python
- inspired by Lucene, but closer to a Python port of parts of KinoSearch combined with some features of Terrier
- toolkit only, not even sample crawlers and user interfaces
Public Testing
Public Testing is taking place on publictest3.
Search Engines
- Nutch (http://publictest3.fedoraproject.org/nutch)
- The Nutch tarball was unpacked in /opt/nutch-1.0, just as in preliminary local testing
- Tomcat is reverse proxied through Apache (see notes below)
- Nutch's definition/conception of depth appears to be unusual. The crawler must be directed to spider much more deeply than should be necessary.
- Crawls are executed as (e.g.) "/opt/nutch-1.0/bin/nutch crawl /opt/nutch-1.0/urls -dir /opt/nutch-1.0/crawl -depth 5 -threads 2"
- Crawling trials
- java process uses about 18% of 6G of memory (4G RAM, 2G swap), regardless of depth
- Depth=4, 2 threads, 1.5k documents
- Depth=5, 2 threads, 3 hrs, 8k documents
- Depth=6, 1 thread, 8.5 hrs, 23k documents
- Depth=7, 1 thread, 14.5 hrs, 37k documents, db = 400M
- Depth=8, 1 thread, 16.5 hrs, 44k documents, db = 440M
- Xapian (http://publictest3.fedoraproject.org/cgi-bin/omega)
- At this time, only xapian-core-libs, xapian-core, and xapian-omega are installed (i.e., no xapian-bindings or perl-Search-Xapian)
- Enabled cgi-bin in /etc/httpd/conf.d/cgi-bin.conf (see notes below)
- Omega bombs on http://fedoraproject.org/wiki/Overview (and would possibly on others later) with "unhtml index"
- Resolution: Don't use "unhtml"
- Omega bombs on long URIs (longer than 244 chars)
- Example: http://fedoraproject.org/wiki/Special:WhatLinksHere/Ru_RU/%D0%9F%D0%BB%D0%B0%D0%BD_%D1%80%D0%B0%D0%B1%D0%BE%D1%82%D1%8B_%D0%BF%D0%BE_%D0%BF%D0%B5%D1%80%D0%B5%D0%B2%D0%BE%D0%B4%D1%83_web-%D1%81%D0%B0%D0%B9%D1%82%D0%B0_%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82%D0%B0_Fedora
- Resolution: Enhanced custom crawler to filter URIs better (fedoraproject.org/w/ from the wiki); added capability to discard URIs that are too long (mostly hex URIs translated from other DBCS)
- Perl custom crawler prints warnings for (and refuses to translate) URIs with Unicode characters outside the Latin 1 range
- Resolution: None. This issue is known for URI.pm.[35]
- Crawling trials
- Failed, Depth=5, scriptindex used 70% of 4G of memory (2G RAM, 2G swap, 0% free)
- terminated, system sluggish with swap I/O
- Depth=4, 4 hrs, 15218 documents, index is about 500M on disk, scriptindex used 40% of 4G of memory (2G RAM, 2G swap)
- Depth=5, 8 hrs, 41171 documents, index is about 1G on disk, scriptindex used 20% of 6G of memory (4G RAM, 2G swap)
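The URI filtering described in the Xapian resolutions above (skip fedoraproject.org/w/ and discard URIs that are too long) can be sketched as follows. This is not the Perl crawler used in testing. It is a minimal Python illustration in which the function name keep_uri, the host check, and the default host are assumptions. The 244-character limit is the one observed above.

```python
from urllib.parse import urlsplit

# Omega failed on URIs longer than this in the trials above.
MAX_URI_LENGTH = 244


def keep_uri(uri, host="fedoraproject.org", max_length=MAX_URI_LENGTH):
    """Return True if the crawler should fetch and index this URI."""
    # Discard over-long URIs (mostly percent-encoded non-Latin titles).
    if len(uri) > max_length:
        return False
    parts = urlsplit(uri)
    # Assumption: stay on the one host being crawled.
    if parts.hostname != host:
        return False
    # Skip the /w/ tree of the wiki, as in the resolution above.
    if parts.path.startswith("/w/"):
        return False
    return True


if __name__ == "__main__":
    for candidate in (
        "http://fedoraproject.org/wiki/Overview",
        "http://fedoraproject.org/w/index.php?title=Overview",
    ):
        print(candidate, keep_uri(candidate))
```

Filtering before fetching also keeps rejected pages out of the crawl depth count, which may matter given the memory limits seen in the trials.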
Apache Configuration Notes
- CGI for Xapian Omega
- In /etc/httpd/conf.d/cgi-bin.conf, just use the default configuration normally commented out in httpd.conf.
    # ScriptAlias: This controls which directories contain server scripts.
    # ScriptAliases are essentially the same as Aliases, except that
    # documents in the realname directory are treated as applications and
    # run by the server when requested rather than as documents sent to the client.
    # The same rules about trailing "/" apply to ScriptAlias directives as to
    # Alias.
    #
    ScriptAlias /cgi-bin/ "/var/www/cgi-bin/"
    #
    # "/var/www/cgi-bin" should be changed to whatever your ScriptAliased
    # CGI directory exists, if you have that configured.
    #
    <Directory "/var/www/cgi-bin">
        AllowOverride None
        Options None
        Order allow,deny
        Allow from all
    </Directory>
- Reverse proxy for Tomcat
- In /etc/httpd/conf.d/tomcat5.conf:
    <Location /admin>
        Order Allow,Deny
        Allow from ...
    </Location>
    ProxyPass /admin http://localhost:8082/admin
    ProxyPassReverse /admin http://localhost:8082/admin
    #
    <Location /manager>
        Order Allow,Deny
        Allow from ...
    </Location>
    ProxyPass /manager http://localhost:8082/manager
    ProxyPassReverse /manager http://localhost:8082/manager
    #
    ProxyPass /nutch http://localhost:8082/nutch
    ProxyPassReverse /nutch http://localhost:8082/nutch
Tomcat Configuration Notes
- In /etc/tomcat5/server.xml:
- comment out the AJP connector on port 8009
- comment out the HTTP connector on port 8080
- uncomment the proxied HTTP connector on 8082
- add proxyName to the HTTP connector on 8082
- could alternatively define proxyName and proxyPort and undefine redirectPort in the HTTP connector on port 8080
- SELinux (not present on publictest3, but needed eventually)
- For port 8082, SELinux needs "setsebool -P httpd_can_network_connect=1"
- Alternatively, for port 8080, SELinux needs "setsebool -P httpd_can_network_relay=1"
- manager and admin (in fact, all users) defined in /etc/tomcat5/tomcat-users.xml
- Recommended, but not done here, change the shutdown password in server.xml (default is SHUTDOWN)
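One way to double-check the connector changes listed above is to parse server.xml and list the connectors that are still active. The Python sketch below is an assumption-laden illustration: the function name active_connectors and the sample XML (including the proxyName value) are made up for the example, and a real server.xml has more elements. Because the standard parser drops XML comments, commented-out connectors are not reported.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only; not the actual file from publictest3.
SAMPLE = """<Server port="8005" shutdown="SHUTDOWN">
  <Service name="Catalina">
    <!-- <Connector port="8080" /> -->
    <!-- <Connector port="8009" protocol="AJP/1.3" /> -->
    <Connector port="8082" proxyName="publictest3.fedoraproject.org" proxyPort="80" />
  </Service>
</Server>"""


def active_connectors(server_xml_text):
    """Return (port, proxyName) for every Connector that is not commented out."""
    root = ET.fromstring(server_xml_text)
    return [(c.get("port"), c.get("proxyName")) for c in root.iter("Connector")]


if __name__ == "__main__":
    for port, proxy_name in active_connectors(SAMPLE):
        print(port, proxy_name)
```

The same function could also read the shutdown attribute on the root element to confirm whether the default shutdown password has been changed.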
Deployment Plan
<tbd>
References
- [1] "Fedora Search Engine". Infrastructure Trac. https://fedorahosted.org/fedora-infrastructure/ticket/1055.
- [2] "CLucene". CLucene Project. http://sourceforge.net/projects/clucene/.
- [3] "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/.
- [4] "Egothor". Egothor. http://www.egothor.org/.
- [5] "Ferret". David Balmain. http://ferret.davebalmain.com/.
- [6] "Lucy". Apache Software Foundation. http://lucene.apache.org/lucy/.
- [7] "Indri". The Lemur Project. http://www.lemurproject.org/indri/.
- [8] "Isearch". Isite. http://isite.awcubed.com/.
- [9] "KinoSearch". Rectangular Research. http://www.rectangular.com/kinosearch/.
- [10] "mnoGoSearch". LavTech. http://www.mnogosearch.org/.
- [11] "Namazu". Namazu Project. http://www.namazu.org/.
- [12] "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/.
- [13] "Swish-e". Swish-e. http://swish-e.org/.
- [14] "Xapian". Xapian Project. http://xapian.org/.
- [15] "Flax". Flax. http://www.flax.co.uk/products.shtml.
- [16] "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/.
- [17] "Lucene". Apache Software Foundation. http://lucene.apache.org/.
- [18] "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations.
- [19] "PyLucene". Apache Software Foundation. http://lucene.apache.org/pylucene/.
- [20] "Solr". Apache Software Foundation. http://lucene.apache.org/solr/.
- [21] "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/.
- [22] "YaCy". Karlsruhe Institute of Technology. http://yacy.net/.
- [23] "Gonzui". SourceForge. http://gonzui.sourceforge.net/.
- [24] "Grub". Wikia, Inc. http://grub.org/.
- [25] "Heritrix". Internet Archive. http://crawler.archive.org/.
- [26] "ht://Dig". The ht://Dig Group. http://www.htdig.org/.
- [27] "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch.
- [28] "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch.
- [29] "OpenFTS". XWare. http://www.astronet.ru/xware/#fts.
- [30] "Plucene". CPAN. http://search.cpan.org/~tmtm/Plucene-1.25.
- [31] "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch.
- [32] "Sphinx". Sphinx Technologies. http://sphinxsearch.com/.
- [33] "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch.
- [34] "Whoosh". Matt Chaput. http://whoosh.ca/.
- [35] "URI.pm error". Usenet. http://www.nntp.perl.org/group/perl.libwww/2006/01/msg6540.html.