Search is an integral part of most websites. Most current WordPress and other sites use a built-in search capability or rely on Google custom search. However, there are times when you will want to add your own search engine that you control yourself. This can be particularly useful if you have a small intranet. In this post we will see how to integrate a small PHP search engine into any website to add custom search capabilities.
Sphider is a lightweight (less than 100 KB) web spider and search engine written in PHP, using MySQL as its back-end database. It supports all standard search options and also includes advanced features such as word autocompletion and spelling suggestions. The administration interface provides a simple way to index sites and control the search features. Sphider includes an automated crawler, which follows links found on a site, and an indexer, which builds an index of all the search terms found in the pages.
Download the library from sphider.eu and follow the instructions in the ‘install.txt’ file. Basically, all you need to do is create an empty database named ‘sphider_db’ and update the database details in the file ‘settings/database.php’. Once that is done, open the admin page ‘admin/admin.php’ and log in with ‘admin’ as both the username and the password.
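The empty database can also be created from a terminal. The following is only a minimal sketch: it assumes a local MySQL server and a ‘root’ account with permission to create databases, neither of which is stated in ‘install.txt’ as quoted here. It prints the statement rather than running it, so you can review it first.

```shell
#!/bin/sh
# Database name taken from the text above.
DB_NAME="sphider_db"

# SQL statement that creates the empty database if it does not exist yet.
SQL="CREATE DATABASE IF NOT EXISTS ${DB_NAME};"

# Dry run: only print the statement.
echo "$SQL"

# To actually create the database, pipe the statement to the mysql client.
# Assumption: a MySQL account named root; replace it with your own account.
#   echo "$SQL" | mysql -u root -p
```

After the database exists, continue with the remaining steps from ‘install.txt’.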
Before we proceed, here is how the final search result page looks after indexing a site. Note that most of the elements can be customized from the admin section. You can also check a live demo at mathisfun.com.
Indexing your site
Once you are logged in to the admin area, you will be greeted with the following screen.
Our first task is to add a site and start indexing. The indexing page is shown below, with various options to set the indexing depth. Note that if you select full, it can take some time to index a large site. Indexing depth means how many “clicks” away a page can be from the starting page. Depth 0 means that only the starting page is indexed, depth 1 indexes the starting page and all the pages linked from it, and so on. For initial testing, it is recommended to keep the depth at 1.
You can further customize the crawl process by clicking on the ‘Advanced Options’ link on the left. This will display additional options as shown below. By default, Sphider never leaves a given domain, so links from domain.com pointing to domain2.com are not followed. Checking the ‘Spider can leave domain’ option allows Sphider to leave the domain; in that case, however, it is highly advisable to define proper must include / must not include string lists to prevent the spider from going too far.
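To make the idea of depth concrete, here is a small sketch of a depth-limited crawl over a toy link graph. It is not Sphider's own code: the page names and the `links_of` and `crawl` functions are made up for illustration, and no network access is involved.

```shell
#!/bin/sh
# Toy link graph (hypothetical pages): prints the pages linked from page $1.
links_of() {
    case "$1" in
        index.html)    echo "about.html products.html" ;;
        about.html)    echo "team.html" ;;
        products.html) echo "index.html item1.html" ;;
        *)             echo "" ;;
    esac
}

# Print every page that is at most $2 clicks away from the start page $1.
crawl() {
    start=$1
    max_depth=$2
    seen=" $start "      # pages collected so far, space separated
    frontier=$start      # pages found at the current depth
    depth=0
    while [ "$depth" -lt "$max_depth" ] && [ -n "$frontier" ]; do
        next=""
        for page in $frontier; do
            for link in $(links_of "$page"); do
                case "$seen" in
                    *" $link "*) ;;   # already collected, skip it
                    *) seen="$seen$link "; next="$next $link" ;;
                esac
            done
        done
        frontier=$next
        depth=$((depth + 1))
    done
    echo $seen
}

crawl index.html 0   # prints: index.html
crawl index.html 1   # prints: index.html about.html products.html
crawl index.html 2   # prints: index.html about.html products.html team.html item1.html
```

Each extra level of depth adds the pages linked from the previous level, which is why a full-depth crawl of a large site can take a long time.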
If you want to change the default behaviour of Sphider, you can do this either through the admin interface, or by directly editing ‘settings/conf.php’.
Once the indexing is done, you can check the statistics (links indexed, words indexed, etc.) in the ‘statistics’ tab. You can also check which keywords were most frequently searched, along with other similar stats.
Using the indexer from the command line
It is possible to spider web pages from the command line. This can be extremely useful if you want to automate the process or run Sphider at regular intervals using cron.
For example, to spider and index http://www.domain.com/test.html to depth 2, we can use the following.
php spider.php -u http://www.domain.com/test.html -d 2
If you want to reindex the same URL, use:
php spider.php -u http://www.domain.com/test.html -r
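For running the spider at regular intervals, a crontab entry along the following lines could be used. This is a sketch under stated assumptions: the install path /var/www/sphider/admin is hypothetical, as is the choice of 03:00 every night, so adjust both to your setup. The -all option comes from the option list in this post.

```shell
#!/bin/sh
# Assumption: directory that contains spider.php on your server (hypothetical path).
SPHIDER_DIR="/var/www/sphider/admin"

# Crontab fields: minute hour day-of-month month day-of-week command.
# "0 3 * * *" means 03:00 every day; -all reindexes everything in the database.
CRON_LINE="0 3 * * * cd $SPHIDER_DIR && php spider.php -all"

# Print the entry; to install it, run "crontab -e" and paste the printed line.
echo "$CRON_LINE"
```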
The complete options are given below.
php spider.php <options>
  -all          Reindex everything in the database
  -u <url>      Set the url to index
  -f            Set indexing depth to full (unlimited depth)
  -d <num>      Set indexing depth to <num>
  -l            Allow spider to leave the initial domain
  -r            Set spider to reindex a site
  -m <string>   Set the string(s) that an url must include (use \n as a delimiter between multiple strings)
  -n <string>   Set the string(s) that an url must not include (use \n as a delimiter between multiple strings)
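As an illustration of combining these options, the sketch below builds a command that lets the spider leave the starting domain while keeping it in check with one must include and one must not include string. The URL and both filter strings are hypothetical values, not recommendations. The script only prints the command, so it is safe to try anywhere.

```shell
#!/bin/sh
# Hypothetical values: adjust the start URL and the filter strings to your site.
START_URL="http://www.domain.com/"
MUST_INCLUDE="domain"        # only follow URLs that contain this string
MUST_NOT_INCLUDE="/print/"   # skip URLs that contain this string

# Index to depth 2 (-d 2), allow leaving the initial domain (-l),
# and apply both filters (-m and -n).
CMD="php spider.php -u $START_URL -d 2 -l -m $MUST_INCLUDE -n $MUST_NOT_INCLUDE"

# Dry run: print the command. To run it for real, execute the printed
# line from the directory that contains spider.php.
echo "$CMD"
```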
Customizing the result page
Note that each element of the search result page can be customized from the ‘settings’ section of the admin page. You can also modify the CSS to match your site design.