This project upgrades an online forum to add a search engine, using Test Driven Development. Our tools are RoR’s Beast, Sphinx, and (naturally) assert{ 2.0 }.

We follow this MVC guideline:

Anything a user can do to the data through the Views,
a unit test can do, the same way, through the Models.

Our test cases simulate a user searching.

Adding a View to this project is left as an exercise for the reader. It will be Rails-easy.

Start by learning to install Sphinx, at Data Noise: acts_as_sphinx plugin.

sphinx.conf

A query like this provides a Sphinx index:

  SELECT posts.id     as id, 
         topics.title as title,
         posts.body   as body, 
         users.login  as user, 
         forums.name  as forum,
         topics.hits  as hits 
    FROM posts 
    LEFT OUTER JOIN topics ON topics.id = posts.topic_id 
    LEFT OUTER JOIN users  ON  users.id = posts.user_id
    LEFT OUTER JOIN forums ON forums.id = posts.forum_id
That yields a table like this:

  id | title              | body                 | user  | forum  | hits
   1 | PDI this!          | P D I pdi            | aaron | rails  |    0
   2 | PDI this!          | what? pdi            | sam   | rails  |    0
   3 | PDI this!          | you heard me pdi     | aaron | rails  |    0
   6 | Galactus is coming | galactus gah-lac-tus | aaron | comics |   42
   7 | Galactus is coming | Galactus oh no!      | sam   | comics |   42
   8 | Agent of SHIELD    | agent of SHIELD      | sam   | comics |   19
   9 | Agent of SHIELD    | Blah Blah            | aaron | comics |   19
  10 | il8n in rails?     | il8n in rails!       | aaron | rails  |    0
  etc…

That’s Beast’s Post fixture data. Our test cases will lead Sphinx to return results from them. To generate that big SELECT statement, I wrote this little test case:

  def test_pre_sphinx
    posts = Post.find(:all, :include => [:topic, :user, :forum])
    assert 'keep the fixture system on its toes!' do posts.any? end
  end
Then I read the test.log, and copied out the list of LEFT OUTER JOINs required to write that big query. (Incidentally, other Sphinx plugins, such as Ultrasphinx, read your ActiveRecord associations for you, so they might simplify this step…).

Now copy the SELECT statement into your config/sphinx.conf file, and use it to index and start Sphinx:

source posts
{
    type            = mysql
    sql_host        = 127.0.0.1
    sql_user        = root
    sql_pass        = 
    sql_db          = beast_dev
    sql_sock        = /tmp/mysql.sock

    sql_query       = \
      SELECT posts.id     as id,   \
            topics.title as title, \
            posts.body   as body,  \
            users.login  as user,  \
            forums.name  as forum, \
            topics.hits  as hits   \
        FROM posts                  \
        LEFT OUTER JOIN topics ON topics.id = posts.topic_id \
        LEFT OUTER JOIN users  ON  users.id = posts.user_id  \
        LEFT OUTER JOIN forums ON forums.id = posts.forum_id 

    sql_attr_uint = hits  #  for sort_attr
}

index posts
{
    source          = posts
    path            = ../tmp/sphinx/posts
    morphology      = stem_en
}

The line sql_attr_uint declares we can sort results by hits.

ask_sphinx

Now add acts_as_sphinx to the Post model, and write a test to find some sample records:

require 'fileutils'

class PostTest < Test::Unit::TestCase
  include FileUtils
  all_fixtures
  
  def test_aardvark
    system 'rake RAILS_ENV=development db:fixtures:load'
    cd 'config' do
        #  index our fixtures before a test run!
      indexer_response = `indexer --rotate --all --quiet`
      assert{ indexer_response == '' }
    end
    sleep 0.7
  end
  
  def test_ask_sphinx
    response = Post.ask_sphinx('pdi')
    assert{ response[:total] == 3 }
    sphinx_found = response[:matches].keys  # <-- the post IDs!
    posts = Post.find(:all, :conditions => 'body like "%pdi%"')
    mysql_found = posts.map(&:id)
    assert{ sphinx_found.to_set == mysql_found.to_set }
  end
...

The test_aardvark case runs before all the others, because its name sorts first. Sphinx likes MySQL, but Beast's tests like SQLite3, so the two must meet each other halfway: any change to our fixtures must appear immediately in our Sphinx index. After the rake db:fixtures:load task pushes our fixtures into the beast_dev database, we invoke the indexer, then sleep briefly while it SIGHUPs the searchd process.

(Suggestions for better techniques are welcome!)
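
One possible refinement, offered only as a sketch: rather than relying on the name test_aardvark to sort first, guard the indexing step so it runs at most once per process, from whichever test reaches it first. IndexOnce and its runner block are hypothetical names; the block stands in for the rake and indexer calls above.

```ruby
# Hypothetical helper: run an expensive setup step at most once per process.
class IndexOnce
  def initialize(&runner)
    @runner = runner   # stand-in for the rake + indexer calls
    @done   = false
  end

  # Call this from setup in every test; only the first call does the work.
  # Returns true when it ran the step, false when it skipped it.
  def ensure_indexed
    return false if @done
    @runner.call
    @done = true
  end
end

calls   = []
indexer = IndexOnce.new { calls << :indexed }
first   = indexer.ensure_indexed
second  = indexer.ensure_indexed
puts calls.length   # the block ran once, despite two calls
```

A test class could keep one such object in a constant and call ensure_indexed from its setup method, so no test depends on another's name.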

The test itself finds the three posts that discuss “PDI” - whatever that is. We then challenge Sphinx by finding the posts ourselves with a plain SQL LIKE, and compare the two sets of returned IDs.
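
The reason for comparing sets rather than arrays can be shown without Sphinx or a database. In this sketch the response hash is a stand-in whose shape is inferred from the tests in this article (:total, plus :matches keyed by post ID); the weights are invented.

```ruby
require 'set'

# Stand-in for the hash ask_sphinx returns; the weights are invented.
response = {
  :total   => 3,
  :matches => {
    1 => { :weight => 2, :attrs => { :hits => 0 } },
    2 => { :weight => 1, :attrs => { :hits => 0 } },
    3 => { :weight => 1, :attrs => { :hits => 0 } }
  }
}

sphinx_found = response[:matches].keys   # post IDs, in Sphinx's order
mysql_found  = [3, 1, 2]                 # same IDs, in some other order

same_ids   = sphinx_found.to_set == mysql_found.to_set   # ignores order
same_order = sphinx_found == mysql_found                 # order-sensitive
puts same_ids, same_order
```

Converting both sides to sets keeps the assertion about which posts were found, and leaves ordering to the later tests.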

find_with_sphinx

Now we clone the test, and modify it to use find_with_sphinx:

  def test_find_with_sphinx
    seek = 'agent of shield'
    sphinx_posts = Post.find_with_sphinx(seek)
    assert{ sphinx_posts.total == 2 }  #  <-- find_with_sphinx provides that accessor
    best_match = sphinx_posts[0]
    _2nd_match = sphinx_posts[1]
    sought = /#{seek}/i
    assert{ best_match.topic.title =~ sought and best_match.body =~ sought }
    assert{ _2nd_match.topic.title =~ sought and _2nd_match.body !~ sought }
  end

When you clone a test, sometimes you should change its assembled sample data. Test cases should diversify, to extend their coverage.

The test finds two posts. The first one has "agent of SHIELD" in both its topic and body, and the second has it only in its title.
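
The two assertions can be tried in isolation. This sketch uses Struct stand-ins (TopicRow and PostRow are hypothetical names) with values copied from the fixture table, and adds Regexp.escape as a precaution for search terms that contain regex metacharacters.

```ruby
# Stand-ins for Topic and Post records; only the fields the test reads.
TopicRow = Struct.new(:title)
PostRow  = Struct.new(:topic, :body)

seek   = 'agent of shield'
sought = /#{Regexp.escape(seek)}/i   # case-insensitive, metacharacters escaped

shield     = TopicRow.new('Agent of SHIELD')
best_match = PostRow.new(shield, 'agent of SHIELD')
second     = PostRow.new(shield, 'Blah Blah')

# The best match mentions the phrase in both its title and its body.
both_fields = [best_match.topic.title, best_match.body].all? { |text| text =~ sought }
# The second match mentions it only in the title.
title_only  = (second.topic.title =~ sought) && (second.body !~ sought)
puts both_fields, title_only
```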

sort_mode

Next, suppose our boss asks us to return posts in order of popularity. Beast ticks the Topic.hits attribute each time a user hits a page, but the topics.yml file has no positive hits, so we edit topics.yml, and declare the Galactus thread the most popular, and the Agent of SHIELD thread second-most popular.

galactus:
  id: 6
  hits: 42
  title: Galactus is coming
...
shield:
  id: 7
  hits: 19
  title: Agent of SHIELD
...
il8n:
  id: 8
  hits: 55
  title: il8n in rails?

To order results by the hits attribute, we must overcome a missing feature in find_with_sphinx. It only sorts by weight, but we need our batch of Post records sorted by hits. This code works around that issue:

  def sort_posts_by_hits(matches)
    posts = Post.find_all_by_id(matches.keys)
    return posts.sort_by{|post| -matches[post.id][:attrs][:hits] }
  end

  def test_sphinx_should_sort_by_hits
    popularity = { :mode => :extended, :sort_mode => [ :attr_desc, 'hits'] }
    results = Post.ask_sphinx('Galactus', popularity)
    posts   = sort_posts_by_hits(results[:matches])
    titles  = posts.map(&:topic).map(&:title).uniq
    assert{ titles == ['il8n in rails?', 'Galactus is coming'] }
  end

The sphinx.conf line sql_attr_uint = hits registered hits as an “attribute” (not a searchable), so we can sort by it. But find_with_sphinx did not sort using that variable.
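
The re-sort inside sort_posts_by_hits can be exercised without a database. This is a minimal sketch: PostRow is a hypothetical stand-in for a Post record, the matches hash follows the shape the tests read, and the hits values are borrowed from topics.yml. It assumes the rows come back in primary-key order, though a lookup by a list of IDs may return them in any order.

```ruby
# Hypothetical stand-in for a Post record.
PostRow = Struct.new(:id, :title)

# Shape inferred from the tests: post ID => { :attrs => { :hits => n } }.
matches = {
  6  => { :attrs => { :hits => 42 } },
  10 => { :attrs => { :hits => 55 } },
  7  => { :attrs => { :hits => 42 } }
}

# Pretend the database returned the rows in primary-key order.
rows = [PostRow.new(6,  'Galactus is coming'),
        PostRow.new(7,  'Galactus is coming'),
        PostRow.new(10, 'il8n in rails?')]

# Negating the key turns sort_by's ascending order into descending.
by_hits = rows.sort_by { |row| -matches[row.id][:attrs][:hits] }
titles  = by_hits.map(&:title).uniq
p titles   # => ["il8n in rails?", "Galactus is coming"]
```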

We have now run off the end of acts_as_sphinx’s feature set. I chose it because it runs closest to the raw Sphinx.

That ask_sphinx call sorts our results by hits, but our imaginary boss is not satisfied yet!

SPH_SORT_EXPR

Our boss requests that matches in a Topic.title rank higher than popular posts without matches in the title. To test that, we add this stray “Galactus”, in an unrelated thread, to posts.yml:

il8n:
  id: 10
  forum_id: 1  <%# rails, not comics! %>
  body: il8n in rails! Galactus!

…and this failing test:

  def test_sphinx_should_sort_by_relevant_hits
    popularity = { :mode => :extended, :sort_mode => [:attr_desc, 'hits'] }
    results = Post.ask_sphinx('Galactus', popularity)
    posts   = sort_posts_by_hits(results[:matches])
    titles  = posts.map(&:topic).map(&:title).uniq
    assert{ titles == ['Galactus is coming', 'il8n in rails?'] }
  end

The test fails with this diagnostic; the posts are out of order:

assert{ titles == ["Galactus is coming", "il8n in rails?"] }    --> false - should pass
    titles --> ["il8n in rails?", "Galactus is coming"]

To pass that test, you may need to first edit sphinx.rb, and add the last line of this list, SPH_SORT_EXPR, if your copy lacks it:

  SPH_SORT_RELEVANCE     = 0
  SPH_SORT_ATTR_DESC     = 1
  SPH_SORT_ATTR_ASC      = 2
  SPH_SORT_TIME_SEGMENTS = 3
  SPH_SORT_EXTENDED      = 4
  SPH_SORT_EXPR          = 5

Upgrade the test like this:

  def sort_posts_by_weight(matches)
    posts = Post.find_all_by_id(matches.keys)
    return posts.sort_by{|post| -matches[post.id][:weight] }
  end
  
  def test_sphinx_should_sort_by_relevant_hits
    popularity = { :mode => :extended, 
                   :weights => [10, 1],
                   :sort_mode => [:expr, '@weight + hits * 1000'] }
    results = Post.ask_sphinx('Galactus', popularity)
    posts   = sort_posts_by_weight(results[:matches])
    titles  = posts.map(&:topic).map(&:title).uniq
    assert{ titles[0..1] == ['Galactus is coming', 'il8n in rails?'] }
  end
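
To see how the two orderings interact, here is the arithmetic of that expression worked in plain Ruby. It is a sketch under stated assumptions: the weights are invented (10 for a title match, 1 for a body-only match, echoing the :weights option), the hits come from topics.yml, and :weight is assumed to carry the relevance weight rather than the expression's value.

```ruby
# Hypothetical matches: weights are invented, hits come from topics.yml.
matches = {
  6  => { :weight => 10, :attrs => { :hits => 42 } },  # "Galactus" in the title
  10 => { :weight => 1,  :attrs => { :hits => 55 } }   # "Galactus" only in the body
}

# The same arithmetic as the :expr string '@weight + hits * 1000'.
expr = lambda { |match| match[:weight] + match[:attrs][:hits] * 1000 }

by_expr   = matches.keys.sort_by { |id| -expr.call(matches[id]) }
by_weight = matches.keys.sort_by { |id| -matches[id][:weight] }
p by_expr     # => [10, 6]   (55001 beats 42010)
p by_weight   # => [6, 10]   (title match beats body match)
```

With these numbers the factor of 1000 lets hits dominate the expression, so the final order the test asserts comes from the Ruby-side sort_posts_by_weight. A smaller multiplier on hits would be one way to let title weight influence the expression itself.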

To Do

Now that our tests have taught us how to operate Sphinx, we can use Extract Method Refactor to pull out the code we want to use in production.
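
As one possible shape for that extraction, this sketch moves the query options and the re-sort into a small class. PopularSearch is a hypothetical name; its two collaborators are passed in so the sketch runs standalone, with searcher standing in for Post.ask_sphinx and finder for Post.find_all_by_id.

```ruby
# Hypothetical production class extracted from the tests above.
class PopularSearch
  OPTIONS = { :mode => :extended, :sort_mode => [:attr_desc, 'hits'] }

  # searcher: callable taking (term, options), returns { :matches => {...} }
  # finder:   callable taking an array of IDs, returns records with #id
  def initialize(searcher, finder)
    @searcher = searcher
    @finder   = finder
  end

  # Returns the found records, most hits first.
  def call(term)
    matches = @searcher.call(term, OPTIONS)[:matches]
    @finder.call(matches.keys).sort_by { |post| -matches[post.id][:attrs][:hits] }
  end
end

# Fakes for the demonstration; the hit counts are invented.
Row = Struct.new(:id)
fake_searcher = lambda do |_term, _options|
  { :matches => { 1 => { :attrs => { :hits => 5 } },
                  2 => { :attrs => { :hits => 9 } } } }
end
fake_finder = lambda { |ids| ids.map { |id| Row.new(id) } }

ids = PopularSearch.new(fake_searcher, fake_finder).call('Galactus').map(&:id)
p ids   # => [2, 1]
```

In the application, the same object could be built with Post.method(:ask_sphinx) and Post.method(:find_all_by_id), assuming those methods keep the signatures used in the tests.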

The next Sphinx feature to explore is its query language. Using :mode => :extended, a search term of “Galactus @user (sam)” will filter out other users.
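
A test exploring that syntax might build its query strings with a small helper. extended_query is a hypothetical name, and the sketch only assembles the string in the form quoted above; it does not talk to Sphinx.

```ruby
# Hypothetical helper: append "@field (value)" filters to a search term.
def extended_query(term, filters = {})
  clauses = filters.map { |field, value| "@#{field} (#{value})" }
  ([term] + clauses).join(' ')
end

query = extended_query('Galactus', :user => 'sam')
puts query   # => Galactus @user (sam)
```

The result would then be passed to ask_sphinx along with :mode => :extended.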

The interfaces between Sphinx and Ruby are young and growing, and these testing techniques should help them grow!