A common feature of websites is a built-in search facility for retrieving data of interest to the user. Developers generally incorporate in their websites the customized search APIs of popular search engines such as Google, Yahoo!, MSLive, Amazon, etc. These companies crawl the related websites and provide a search facility over the documents of those websites as well as the World Wide Web. The search box may also act as an advertisement for them on those websites. As a matter of pride, many organizations would prefer to have their own search engine embedded in the website.

A decade ago, many search engines like Altavista, Lycos, Yahoo and Ask Jeeves were popular. Later, Google, with its sophisticated ranking strategy, ensured acceptable results for different types of user queries. But Google's customized search service has to be paid for. Then different search engines like cuil, guruji, khoj, terrior came up with their own ranking strategies on the web, supporting multiple languages. Alongside these developments, Open Source search engines also emerged.

Nutch

Nutch is an Open Source search engine developed in JAVA on top of Lucene, which itself is a free Open Source information retrieval system. Nutch can be deployed in Internet or Intranet environments and can be customized for building small or large scale information retrieval systems supporting multiple languages.

Prerequisites

1. JAVA and JRE should be installed and path variables for JAVA_HOME and JRE_HOME should be set.

2. Set Path to current ANT build, if not done already. Apache Ant is a JAVA-based build tool which builds the project using configuration files that are XML based. Its current version (1.7.1) can be downloaded.
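The two prerequisites above can be verified before going further. The following is a minimal shell sketch, not part of the original article; the function names check_var, check_tool and check_prereqs are hypothetical helpers introduced only for illustration.

```shell
# Print the status of one environment variable; return 1 if it is unset or empty.
check_var() {
    eval "value=\${$1:-}"
    if [ -n "$value" ]; then
        echo "ok: $1=$value"
    else
        echo "missing: $1 is not set"
        return 1
    fi
}

# Print whether a command is reachable through PATH; return 1 if it is not.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1 found on PATH"
    else
        echo "missing: $1 not found on PATH"
        return 1
    fi
}

# Check everything the prerequisites list asks for (hypothetical wrapper).
check_prereqs() {
    result=0
    check_var JAVA_HOME || result=1
    check_var JRE_HOME || result=1
    check_tool java || result=1
    check_tool ant || result=1
    return "$result"
}
```

Calling check_prereqs prints one line per item and returns a non-zero status if anything is missing, so it can be used as a guard in front of the build step.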


Installing and configuring Nutch

The latest version of Nutch (ver 0.9) is available for download. Assume that the login is pcquest and the home folder is /home/pcquest. Create a folder, say <mySearch>, download the file nutch-0.9.tar.gz (size 68MB) into it, extract the contents there, and then go to the folder /home/pcquest/mySearch/nutch-0.9/, which is the root folder of Nutch. Now Nutch has to be configured, which includes two tasks:

1. Configuring Crawl Filter: Edit the file conf/crawl-urlfilter.txt and change - to + at only one place, on the rule that follows the line "# skip everything else", so that this rule begins with + instead of -.

2. Modification to Nutch configuration: This specifies the folder containing the crawled data and enables the Nutch Searcher to search the crawled web data. Initially the file conf/nutch-site.xml does not contain any configuration details. We have to modify it by including the target folder <myCrawled>, which contains the crawled data; the required lines go between the <configuration> </configuration> tags. The file conf/nutch-default.xml should also be modified to include the agent name between the <value></value> tags. We use 'pcquest' as the agent name.
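The original listings for these two tasks are not reproduced here, so the following shell sketch is a reconstruction under stated assumptions: that the stock filter file ends with the catch-all rule "-.", that the searcher reads the crawl location from a property named searcher.dir, and that the agent name lives in the http.agent.name property with its <value> on the line after the <name>. The three function names are hypothetical. Note that write_site_config overwrites conf/nutch-site.xml rather than editing it in place, which is reasonable only because the file starts out empty.

```shell
# Task 1: flip the catch-all rule after "# skip everything else"
# from "-." (reject) to "+." (accept). Assumes the stock rule is "-.".
# $1 = path to conf/crawl-urlfilter.txt (a .bak copy is kept).
allow_all_urls() {
    sed -i.bak '/^# skip ever/,$ s/^-\.$/+./' "$1"
}

# Task 2a: write conf/nutch-site.xml so the searcher can find the crawled data.
# $1 = Nutch root folder, $2 = folder that will hold the crawled data.
write_site_config() {
    cat > "$1/conf/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>$2</value>
  </property>
</configuration>
EOF
}

# Task 2b: fill in the agent name in conf/nutch-default.xml.
# Assumes <value></value> sits on the line right after the <name> line.
# $1 = path to conf/nutch-default.xml, $2 = agent name (a .bak copy is kept).
set_agent_name() {
    sed -i.bak "/<name>http\.agent\.name<\/name>/{n;s|<value>[^<]*</value>|<value>$2</value>|;}" "$1"
}
```

With the folder names used in this article, the calls would be allow_all_urls conf/crawl-urlfilter.txt, write_site_config /home/pcquest/mySearch/nutch-0.9 /home/pcquest/mySearch/nutch-0.9/myCrawled and set_agent_name conf/nutch-default.xml pcquest, all run from the Nutch root folder.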

Crawling, indexing and searching Website

Nutch first crawls and indexes the websites; it is then ready to serve user queries by searching the indexed data.

1. Crawling and Indexing websites:

In the Nutch folder, /home/pcquest/mySearch/nutch-0.9/, make a directory named urls and in it create a text file named seed_urls containing the list of URLs to crawl, one per line. Then build the system using the command "ant && ant war". Now remove the ROOT folder from the webapps folder of the Apache Tomcat installation, copy nutch-0.9.war into the webapps folder of Tomcat under the name ROOT.war, and restart the Tomcat server. Now the system should perform crawling using the following command:
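The original command line is not reproduced here. As a hedged sketch, the Nutch 0.9 crawl tool is normally invoked as bin/nutch crawl with the seed folder followed by the -dir, -depth and -threads options; the wrapper functions write_seed_list and crawl_site below are hypothetical, and the depth and thread values in the usage note are illustrative assumptions, not the article's settings.

```shell
# Create the seed list: a urls/ folder holding seed_urls, one URL per line.
# $1 = Nutch root folder; remaining arguments = URLs to start crawling from.
write_seed_list() {
    root=$1
    shift
    mkdir -p "$root/urls"
    printf '%s\n' "$@" > "$root/urls/seed_urls"
}

# Run the crawl from the Nutch root folder. Refuses to start when the
# output folder already exists, since the crawler would terminate anyway.
# $1 = Nutch root, $2 = output folder name, $3 = depth, $4 = threads.
crawl_site() {
    if [ -e "$1/$2" ]; then
        echo "error: $1/$2 already exists; remove it or pick another name" >&2
        return 1
    fi
    (cd "$1" && bin/nutch crawl urls -dir "$2" -depth "$3" -threads "$4")
}
```

For example, crawl_site /home/pcquest/mySearch/nutch-0.9 myCrawled 3 5 would run bin/nutch crawl urls -dir myCrawled -depth 3 -threads 5 inside the Nutch root folder.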

It should be ensured that the folder myCrawled does not already exist. The above command creates the folder named myCrawled and stores the crawled and indexed data in it. If this folder already exists, the crawler terminates. The values of the parameters depth and threads are user defined: depth is the number of link levels of the websites to be crawled, and threads is the number of concurrent crawling processes. Once crawling is over, searching can start from the Nutch user interface.

2. Searching for User Query among Indexed documents.

The deployment of the search engine can be tested using the address http://localhost:8080/. This loads the default Nutch user interface (below), which can be modified to fit your website. Instead of using the default interface, the following code can be used to include a search box with a submit button in your website:
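The original HTML listing is not reproduced here. The sketch below writes a plausible equivalent to a file, assuming the deployed Nutch web application answers queries at search.jsp and reads the search terms from a request parameter named query; the function name write_search_box and the output file name are hypothetical.

```shell
# Write an HTML fragment containing the search box to the file named by $1.
# $2 = base URL of the deployed Nutch web application (no trailing slash).
write_search_box() {
    cat > "$1" <<EOF
<form name="search" action="$2/search.jsp" method="get">
  <input name="query" size="40">
  <input type="submit" value="Search">
</form>
EOF
}
```

Running write_search_box searchbox.html http://localhost:8080 produces a fragment that can be pasted into any page of the website; submitting the form sends the query to the Nutch result page.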