Building a Simple Search Engine with PHP
by Daniel Solin
A little while ago, I was working on an intranet site for a mid-sized company. As the site grew in both size and popularity, I was asked to extend it with a search feature. Since one of the rules of the intranet was that all logic code should be written in-house, using an existing open source engine was not an option.
Within a day, the engine was essentially complete, and the result turned out better than expected. With PHP, MySQL, and a few simple techniques, small projects like this are very easy. This article presents a cut-down version of the search engine. I hope it will encourage you to develop an engine that suits your particular needs, with the exact features you desire.
Database Design and Logic
We'll use MySQL as the database backend to store our search data. It's
also possible to shell out to Unix commands such as find, but that would
mean running the search engine on the machine hosting the files. It would
also be more difficult to index pages served from a database. We'll tackle
the database first.
The database for the search engine consists of three tables: page, word,
and occurrence. The page table holds all indexed web pages, and the word
table holds all of the words found on the indexed pages. The rows in
occurrence correlate words to their containing pages. Each row represents
one occurrence of one particular word on one particular page. The SQL for
creating these tables is shown below.
-- One row per indexed web page
CREATE TABLE page (
  page_id int(10) unsigned NOT NULL auto_increment,
  page_url varchar(200) NOT NULL default '',
  PRIMARY KEY (page_id)
) ENGINE=MyISAM;

-- One row per distinct word
CREATE TABLE word (
  word_id int(10) unsigned NOT NULL auto_increment,
  word_word varchar(50) NOT NULL default '',
  PRIMARY KEY (word_id)
) ENGINE=MyISAM;

-- One row per occurrence of a word on a page
CREATE TABLE occurrence (
  occurrence_id int(10) unsigned NOT NULL auto_increment,
  word_id int(10) unsigned NOT NULL default '0',
  page_id int(10) unsigned NOT NULL default '0',
  PRIMARY KEY (occurrence_id)
) ENGINE=MyISAM;
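To make the role of each table concrete, here is a small sketch using made-up rows. The URL and the words are illustrative assumptions, not data from the intranet site. Because each occurrence row stands for a single appearance of a word on a page, a word that appears twice on the same page produces two rows.

```sql
-- Hypothetical sample data: one page and two distinct words.
INSERT INTO page (page_id, page_url)
VALUES (1, 'http://intranet.example/index.html');

INSERT INTO word (word_id, word_word)
VALUES (1, 'search'), (2, 'engine');

-- 'search' appears twice on page 1 and 'engine' once,
-- so occurrence receives three rows in total.
INSERT INTO occurrence (word_id, page_id)
VALUES (1, 1), (1, 1), (2, 1);
```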
While page and word hold actual data, occurrence acts only as a reference
table. By joining occurrence with page and word, we can determine which
pages contain a word, as well as how many times the word occurs. Before
that, though, we need some data.
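As a rough sketch of the join described above, a query along the following lines would list the pages containing a given word, most frequent first. The search term 'search' and the column alias occurrences are assumptions chosen for illustration; the query the finished engine runs may well differ.

```sql
-- Count how many times the word 'search' occurs on each page.
SELECT p.page_url, COUNT(*) AS occurrences
FROM word AS w
JOIN occurrence AS o ON o.word_id = w.word_id
JOIN page AS p ON p.page_id = o.page_id
WHERE w.word_word = 'search'
GROUP BY p.page_id, p.page_url
ORDER BY occurrences DESC;
```

Since occurrence stores one row per appearance, COUNT(*) per page gives the number of times the word occurs there, which can double as a simple ranking.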