Related link: http://www.cdbaby.com/

Heard any good strategies for super-fast catalog-database search results?

We’ve got a database of 70,000 albums with 5 varchar fields that people may search:

- artist name
- album title
- description
- mis-spellings of artist name
- similar/related artists

The problem is that most people just like to use the one generic SEARCH _______ [search] box on the front page, and expect to find exactly what they were looking for.

I used to return the results using “LIKE” like this:
SELECT * FROM albums WHERE artistname=’$word’ — (exact match always gets first priority)
UNION
SELECT * FROM albums WHERE albumname=’$word’
UNION
SELECT * FROM albums WHERE (artistname LIKE ‘%$word%’ OR albumname LIKE ‘%$word%’ OR style LIKE ‘%$word%’ OR mispeling LIKE ‘%$word%’);

Everyone was happy with the results they got, except the search would take 5 to 40 seconds! (Yes all those fields named above are indexed.)

So I switched to MySQL’s FULLTEXT search:
SELECT * FROM albums WHERE artistname=’$word’
UNION
SELECT * FROM albums WHERE albumname=’$word’
UNION
SELECT * FROM albums WHERE MATCH (albumname, soundlike, mispeling, style) AGAINST (’$word’)

MUCH faster! Way way faster! Customers stopped complaining. Only problem is… artists started complaining. If the musician’s name is Bob “Red” Thomas and someone searches for Bob Thomas, he’ll be in the FULLTEXT search results… under about 1000 others with any “Bob” and “Thomas” in their name.

But it was “good enough” so I left it that way, and apologized to the folks affected by the fulltext search results.

But now that I’m thinking of switching to PostgreSQL for various reasons, the issue of search-results-speed has come up again.

We already have a dual-Xeon fast-RAID-SCSI box dedicated to be the master, and a slave dual-Xeon fast-RAID-SCSI box dedicated to be used for nothing but read-only search-results, and STILL it’s not fast enough.

(I should mention I’ve tried lots of other ways in the past, too, such as having PHP serialize the results of every search, saved in a filename that’s the MD5 hash of the SQL query itself, so that if someone made that same query later that day it would unserialize the previous search-results, which were flushed each night at midnight. That helped a bit for common searches.)

So here I am asking the O’Reilly crowd - how do the big online catalog sites search many fields in their massive databases in millseconds? Do you just keep throwing more hardware at it, or am I going about things the wrong way entirely?

Heard any good strategies for super-fast catalog-database search results?