Article:   Building a Simple Search Engine with PHP
Subject:   RE: A more difficult search engine
Date:   2003-12-19 00:49:31
From:   dsolin
Response to: A more difficult search engine

Hi Giff,


From what I can understand from your description, this is something that you will need to implement in the indexing mechanism of your search engine. By the time the user gives the engine a search phrase, the backend needs to already know the difference between "an article on Dr. King" and "an article about Martin Luther and his dealings with German royalty".


So, without being able to get into much detail, I think you need to implement this in your backend database. Maybe you should add a column that indicates the category of each URL: is it an article on Dr. King, or one about Martin Luther and his dealings with German royalty? Of course, the hardest part of such a project would be building logic into the indexing mechanism that calculates the value of this column. Maybe it needs to be done manually?


If you find a working solution, Giff, please feel free to post it here. I'm sure that would be interesting reading for many of us. Good luck!


-Daniel


  • RE: A more difficult search engine
    2005-02-09 21:25:14  corich

    Giff is probably writing for Google by now, but maybe this will help someone else. This is actually not as tough as it might appear. First, add an extra field to the page table (call it page_text) and give it the TEXT datatype, or MEDIUMTEXT if your pages can exceed 64 KB, so that long pages fit. Next, in your spider, add the following after the tag-stripping code:


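    /* Try to remove all HTML-tags: */
    $buf = strip_tags($buf);
    /* preg_replace, since the pattern is Perl-style; \w+ covers whole entity names */
    $buf = preg_replace('/&#?\w+;/', '', $buf);
    /* the above is for context, here's the new stuff: */
    /* escape quotes in the page text so the query stays valid */
    $page_text = mysql_real_escape_string($buf);
    mysql_query("UPDATE page SET page_text = '$page_text' WHERE page_id = $page_id");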

    This stores the entire page text, stripped of tags, in the table as a contiguous string.
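
    One thing to watch with the stored text: LIKE compares characters literally, so a phrase that is split across a line break or a run of spaces in the HTML source will not match. A minimal sketch of a cleanup step that could run before the UPDATE is below. The function name normalize_page_text is made up for illustration, and decoding entities (rather than deleting them, as the spider does) is an assumption, not part of the original spider.

```php
<?php
// Hypothetical helper, not part of the original spider.
// Turns raw HTML into one line of plain text for phrase matching.
function normalize_page_text($html)
{
    // Drop the markup, as the spider already does.
    $text = strip_tags($html);
    // Assumption: decode entities (&amp; becomes &) instead of deleting them.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse newlines, tabs and repeated spaces into single spaces.
    $text = preg_replace('/\s+/', ' ', $text);
    return trim($text);
}

$html = "<p>Martin\n   Luther &amp; the <b>German</b> royalty</p>";
echo normalize_page_text($html), "\n";
// prints: Martin Luther & the German royalty
```

    Note that strip_tags() removes tags without leaving a space behind, so text from adjacent elements such as <td>a</td><td>b</td> ends up joined as "ab".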


    The only other change is in the search portion of the project, which needs to check for the exact search phrase. The simplest way to write an exact phrase search (and this will find only exact matches) is to replace the search query with something like this:


    SELECT p.page_url AS url
    FROM page p
    WHERE p.page_text LIKE '%$keyword%'

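    This query searches the page table for instances of the keyword phrase anywhere in the full text of each page. Since $keyword comes straight from the user, escape it (for example with mysql_real_escape_string()) before placing it in the query.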
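
    One caveat with LIKE: % and _ inside the search phrase are themselves wildcards, so a search for "100%" matches more than intended. The sketch below backslash-escapes them before building the pattern, which relies on backslash being MySQL's default LIKE escape character. The function name like_pattern is hypothetical, and the result still has to go through the usual SQL string escaping (or a bound parameter) before it reaches the query.

```php
<?php
// Hypothetical helper: builds a "contains this phrase" LIKE pattern.
function like_pattern($phrase)
{
    // strtr() replaces in a single pass, so the backslashes added
    // for % and _ are not escaped a second time.
    $escaped = strtr($phrase, array(
        '\\' => '\\\\',
        '%'  => '\\%',
        '_'  => '\\_',
    ));
    return '%' . $escaped . '%';
}

echo like_pattern('Dr. King'), "\n";   // prints: %Dr. King%
echo like_pattern('100%_done'), "\n";  // prints: %100\%\_done%
```

    The returned pattern would take the place of '%$keyword%' in the query. Escape the wildcards first and apply the SQL string escaping second, so the added backslashes survive into the LIKE comparison.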

    Hope this helps!

    --Rich