  Building a Simple Search Engine with PHP
Subject:   A more difficult search engine
Date:   2003-12-18 08:16:26
From:   anonymous2
I need to develop a PHP/MySQL search engine for my company (the publisher of an academic magazine), and my boss has expressed a desire for an "exact phrase" option. For instance, when a user types in "martin luther king", the search engine would only bring up articles on Dr. King, rather than articles about Martin Luther and his dealings with German royalty. However, this seems like a terribly difficult thing to implement. Your tutorial has been a great deal of help on this topic, and I was wondering if you might point me in the right direction.

- Giff

  • RE: A more difficult search engine
    2003-12-19 00:49:31  dsolin

    Hi Giff,

    From what I can understand by reading your description, this is something that you will need to implement in the indexing mechanism of your search engine -- when the user provides the engine with a search phrase, the backend needs to already know the difference between "an article on Dr. King" and "an article about Martin Luther and his dealings with German royalty".

    So, without being able to get into much detail, I think you need to implement this in your backend database. Maybe you should add a column that indicates what a given URL is about -- is it an article on Dr. King, or one about Martin Luther and his dealings with German royalty? Of course, the hardest part of such a project would be to build logic into the indexing mechanism that calculates the value of this column. Maybe it needs to be done manually?

    If you find a working solution, Giff, please feel free to post it here. I'm sure that would be interesting reading for many of us. Good luck!

    • RE: A more difficult search engine
      2005-02-09 21:25:14  corich

      Giff is probably writing for Google by now, but maybe this will help someone else. This is actually not as tough as it might appear. First, add an extra field to the page table. Call it page_text, and give it the TEXT datatype so that it isn't tightly constrained size-wise (MySQL's TEXT holds up to 65,535 bytes; use MEDIUMTEXT if your pages run longer). Next, in your spider, insert an extra line as follows:

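      /* Try to remove all HTML-tags: */
      $buf = strip_tags($buf);
      /* ereg_replace() takes no /.../ delimiters, so use preg_replace() and match whole entities: */
      $buf = preg_replace('/&#?\w+;/', '', $buf);
      /* the above is for context, here's the new stuff: */
      /* escape the text first so a quote in the page cannot break the query */
      $text = mysql_real_escape_string($buf);
      mysql_query("UPDATE page SET page_text = '$text' WHERE page_id = " . (int) $page_id);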

      This stores the entire page text, stripped of tags, in the table as a contiguous string.
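
      One thing to watch for: a phrase match only works if the stored words are separated exactly the way the user types them, so line breaks or doubled spaces left over from the HTML can hide a match. Here is a minimal sketch of a cleanup step that collapses whitespace before the text is stored. The function name normalize_page_text is my own invention, not part of the tutorial's spider, and it decodes entities rather than deleting them, which is a different choice from the snippet above.

```php
<?php
// Hypothetical helper (not from the tutorial's spider): turns raw HTML into
// a single-spaced plain-text string that is easier to phrase-match.
function normalize_page_text(string $html): string
{
    // Remove tags, as the spider already does.
    $text = strip_tags($html);
    // Decode entities such as &amp; or &nbsp; instead of deleting them.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Treat the non-breaking space (U+00A0) as an ordinary space.
    $text = str_replace("\u{00A0}", ' ', $text);
    // Collapse runs of whitespace (including newlines) into one space.
    // preg_replace() returns null on invalid UTF-8; keep the text in that case.
    $text = preg_replace('/\s+/u', ' ', $text) ?? $text;
    return trim($text);
}

// Example: the newline and extra spaces between the words disappear.
echo normalize_page_text("<p>Martin\n  Luther <b>King</b> &amp; the march</p>"), "\n";
```

      Keep in mind that strip_tags() does not insert a space where a tag used to be, so words in adjacent block elements can still run together.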

      The only other thing is to add code that checks for the exact search phrase to the search engine portion of the project. The simplest way to write an exact phrase match search (and this will only find exact matches) would be to replace the search query with something like this:

      SELECT p.page_url AS url
      FROM page p
      WHERE page_text LIKE '%$keyword%'

      This query searches the page table for instances of the keyword phrase within the full text of the page. As with the UPDATE above, $keyword should be escaped before it is placed in the query.
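
      If the phrase comes straight from a search form, any % or _ in it will act as a LIKE wildcard, and quotes can break the query. Below is a rough sketch of one way to handle both, assuming a PDO connection to MySQL (where backslash is the default LIKE escape character) and the page table with page_url and page_text columns described above. The function names like_phrase_pattern and find_exact_phrase are hypothetical, and this swaps the tutorial's mysql_* calls for a prepared statement.

```php
<?php
// Hypothetical helper: builds a LIKE pattern that matches $phrase literally.
function like_phrase_pattern(string $phrase): string
{
    // Prefix backslash, % and _ with a backslash so they lose their
    // special meaning inside LIKE (assumes MySQL's default escape character).
    $escaped = addcslashes($phrase, '\\%_');
    return '%' . $escaped . '%';
}

// Hypothetical search function; $db is assumed to be an open PDO connection.
function find_exact_phrase(PDO $db, string $phrase): array
{
    // The placeholder keeps the phrase out of the SQL text entirely.
    $stmt = $db->prepare(
        'SELECT p.page_url AS url FROM page p WHERE p.page_text LIKE ?'
    );
    $stmt->execute([like_phrase_pattern($phrase)]);
    // Return a flat list of matching URLs.
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Example: an ordinary phrase is simply wrapped in wildcards.
echo like_phrase_pattern('martin luther king'), "\n";
```

      Calling find_exact_phrase($db, 'martin luther king') would then return only the URLs whose stored text contains those three words in a row.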

      Hope this helps!