KinoSearch is a CPAN distribution that’s a Perlish port of the powerful Apache Lucene search engine. (In one sense, it’s a competitor to the earlier Plucene project.) Here’s what I learned from playing with it one afternoon.

How Search Works

Despite its low version number, KinoSearch was powerful and stable in my tests. (Admittedly, I’ve seen the author, Marvin Humphrey, give an extended talk about it, and I discussed the module and some of my questions with him before writing this review.) I did run into one large caveat when I gave the wrong path for the index location, but Marvin promised to fix this very soon. I recommend upgrading to a version after 0.06 for this reason.

KinoSearch builds and queries an inverted index from a large document corpus. Marvin’s original goal was to provide a usable search engine for large web sites, but apart from the documentation having a bias toward the web, my experimentation showed that it should work just as well in other contexts.

KinoSearch first scans the complete corpus of documents to build its index, before performing any searches. After that, each search queries the index for the terms and reports the documents in which those terms appear. It also throws out stopwords and stems words, which is nice (but probably relies on having a stopword list and stemmer for your particular language).

You don’t really have to know this to make the code work. All you have to know is that preparing the search engine is a two-step process: build the index, then query it.
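For the curious, the index-then-query idea can be sketched in a few lines of plain Perl. This is only a toy illustration, not how KinoSearch works internally: the stopword list, the crude suffix-stripping "stemmer", and the names analyze(), build_index(), and search() are all assumptions made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stopword list: a stand-in for a real per-language list.
my %stopword = map { $_ => 1 } qw( a an and the of to in is it );

# Split text into lowercase terms, drop stopwords, strip a few suffixes.
sub analyze {
    my ($text) = @_;
    my @terms;
    for my $raw ( $text =~ /\w+/g ) {
        my $word = lc $raw;
        next if $stopword{$word};
        $word =~ s/(?:ing|ed|s)$// if length($word) > 4;    # crude stemming
        push @terms, $word;
    }
    return @terms;
}

# Step one: scan every document and record which documents hold each term.
sub build_index {
    my (%docs) = @_;
    my %index;
    while ( my ( $name, $text ) = each %docs ) {
        $index{$_}{$name}++ for analyze($text);
    }
    return \%index;
}

# Step two: look up each query term; keep documents that contain all of them.
sub search {
    my ( $index, $query ) = @_;
    my @terms = analyze($query);
    return unless @terms;

    my %seen;
    for my $term (@terms) {
        $seen{$_}++ for keys %{ $index->{$term} || {} };
    }
    my @matches = grep { $seen{$_} == @terms } keys %seen;
    return sort @matches;
}

my $index = build_index(
    one => 'Searching the index is fast',
    two => 'The indexer builds an index',
);
print join( ', ', search( $index, 'searched index' ) ), "\n";    # prints: one
```

Because both documents and queries pass through the same analyze() routine, "searched" in the query matches "Searching" in the document. That is the reason a search engine needs the same analyzer at indexing time and at query time.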

The Code

To see how the project works, I borrowed and modified the code from KinoSearch::Docs::Tutorial to search the text of the book Perl Hacks. For the most part, the indexer code did what I wanted. I did modify it slightly, partly to change the file paths, and partly because my files are POD and not HTML.

#!/usr/bin/perl

use strict;
use warnings;

use Cwd;
use File::Spec;
use KinoSearch::InvIndexer;
use KinoSearch::Analysis::PolyAnalyzer;

my $dir              = cwd();
my $source_dir       = '/home/chromatic/work/books/perl_hacks/';
my $path_to_invindex = File::Spec->catdir( $dir, 'kindex' );

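# Glob relative to the source directory, then return to where we started.
chdir $source_dir or die "couldn't chdir to '$source_dir': $!";

my @filenames  = <chapter_0*/*.pod>;

chdir $dir or die "couldn't chdir to '$dir': $!";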

my $analyzer   = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en', );

my $invindexer = KinoSearch::InvIndexer->new(
	analyzer   => $analyzer,
	invindex   => $path_to_invindex,
	create     => 1,
);

$invindexer->spec_field( name => 'title' );
$invindexer->spec_field( name => 'text'  );

foreach my $filename (@filenames)
{
	my $filepath = File::Spec->catfile( $source_dir, $filename );
	open( my $fh, '<', $filepath )
		or die "couldn't open file '$filepath': $!";

	my $content  = do { local $/; <$fh> };
	# Skip files that lack a =head1 line before creating a document for them.
	next unless $content =~ /=head1 (.*)$/m;
	my $title    = $1;
	my $text     = $content;
	my $doc      = $invindexer->new_doc();

	$doc->set_value( text  => $text  );
	$doc->set_value( title => $title );

	$invindexer->add_doc($doc);
}

$invindexer->finish();

Notice that there’s a separate directory provided for the location of the kindex (that is, KinoSearch’s index). In version 0.06, the code clears the destination directory rather too aggressively. Marvin has promised to change the code in a new release.

This code may seem a bit long for merely creating an index, but it performs a function besides configuration: it allows you to define arbitrary fields for each document in the index. For example, the tutorial stores a base URL for each document. That’s unnecessary here, where I’m not creating a web search engine. However, if you need to add that field to the index, you can do so. You can add any other field you want too. Of course, to do that you have to write special code to process your documents, but you can’t avoid that anyway.
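As a rough illustration of that "special code", here is a sketch of a helper that turns one POD file into a hash of field values. The helper name extract_fields() and the extra chapter field are hypothetical; they are not part of KinoSearch or of the tutorial.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the field values for one POD document.
sub extract_fields {
    my ( $filename, $content ) = @_;

    # Use the first =head1 line as the title; skip files without one.
    return unless $content =~ /^=head1[ \t]+(.*?)[ \t]*$/m;
    my $title = $1;

    # Illustrative extra field: the chapter directory, such as "chapter_03".
    my ($chapter) = $filename =~ m{(chapter_\d+)/};

    return {
        title   => $title,
        text    => $content,
        chapter => defined $chapter ? $chapter : '',
    };
}
```

Each key in the returned hash would need its own spec_field() call on the indexer and a matching set_value() call on the document, just as title and text do in the program above.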

With the index created, how do you search for something?

I took the search.cgi file from the tutorial and turned it into a standalone program. (Don’t let the tutorial fool you with all of the HTML-generation and paging code; it’s easier than it looks.)

#!/usr/bin/perl
use strict;
use warnings;

use KinoSearch::Searcher;
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::Highlight::Highlighter;

my $path_to_invindex = 'kindex';

die "No search string\n" unless @ARGV;

my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en', );

my $searcher = KinoSearch::Searcher->new(
	invindex => $path_to_invindex,
	analyzer => $analyzer,
);

my $hits         = $searcher->search( query => join ' ', @ARGV );
my $highlighter  = KinoSearch::Highlight::Highlighter->new(
	excerpt_field => 'text', pre_tag => '*', post_tag => '*',
);

$hits->create_excerpts( highlighter => $highlighter );
$hits->seek( 0, 100 );

while ( my $hit = $hits->fetch_hit_hashref() )
{
	printf "%s: %.03f\n%s\n", @{$hit}{qw( title score excerpt )};
}

my $total_hits  = $hits->total_hits();
print $total_hits ?  "$total_hits total hits\n" : "No matches for query.\n";

Notice the use of KinoSearch::Highlight::Highlighter. This module allows you to highlight the occurrence of key terms in search results. I didn’t explore it in detail, but it seemed to work decently for prepending and appending tokens to the terms. (A future enhancement might be to pass in a subroutine reference to transform the terms with much more power. I’ve wanted this in similar systems.)
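To make that wish concrete, here is a sketch of what a callback-based highlighter could look like. This is plain Perl and not KinoSearch API; highlight_terms() is a made-up name, and it matches literal query terms rather than stemmed ones.

```perl
use strict;
use warnings;

# Hypothetical sketch: pass every matched term through a caller-supplied sub.
sub highlight_terms {
    my ( $text, $terms, $transform ) = @_;
    return $text unless @$terms;

    # Build one case-insensitive alternation from the escaped terms.
    my $alternation = join '|', map { quotemeta($_) } @$terms;
    $text =~ s/\b($alternation)\b/$transform->($1)/gie;
    return $text;
}

# Mimic the pre_tag and post_tag behavior with a subroutine reference.
print highlight_terms(
    'Build the index, then search the Index.',
    [ 'index', 'search' ],
    sub { "*$_[0]*" },
), "\n";
```

With a subroutine reference, the caller could just as easily uppercase the term, wrap it in markup, or turn it into a link.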

Caveats

I found one frustration in having to call seek() on the $hits object before I could retrieve any results. While I can understand the value of having this method available (especially when paging results on a web site), it seems like the default should be to start with the first values. It’s a minor issue, but I didn’t see an explanation in the tutorial.
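For paging, the arithmetic is simple enough to wrap in a helper. The sketch below assumes, as the seek( 0, 100 ) call in the search program suggests, that the two arguments are a starting offset and a number of hits to retrieve; page_window() is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a 1-based page number into (offset, count).
sub page_window {
    my ( $page, $per_page ) = @_;
    $page     = 1  if !defined $page     || $page < 1;
    $per_page = 10 if !defined $per_page || $per_page < 1;
    return ( ( $page - 1 ) * $per_page, $per_page );
}

# With a $hits object from the search program, page three would be:
#   $hits->seek( page_window( 3, 10 ) );    # same as seek( 20, 10 )
```

Calling page_window() with no arguments falls back to the first page, which is the default I would have liked from seek() itself.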

Another frustration is that the documentation on creating special search objects is a bit unclear. If you need something more than searching for simple words and phrases, you may want to wait until the documentation and tutorial are more complete. (Of course, if you don’t already have a search engine available, anything is better than nothing, and I have confidence in Marvin’s maintenance.)

The current version of the distribution must rebuild the index whenever you add a document. This is a limitation Marvin plans to address in the near future — incremental indexing is very valuable for large projects that regularly add documents.

Recommendations

How did it work? In my simple tests, unconcerned with speed or optimization, KinoSearch found what I expected it to find. The example code was easy to use and modify, and I could readily experiment with the module in more detail if I were to deploy it for a real project. Would I do so? Definitely.