Speedup at What Cost?by Christopher Dwan
Homology search through sequence similarity, using BLAST (the Basic Local Alignment Search Tool), is the most common task in bioinformatics. In spite of its wild popularity, the algorithmic details of BLAST are poorly understood by its community of users. Of particular importance is the fact that BLAST uses a heuristic approximation algorithm that gains speed at the expense of accuracy of results.
BLAST is arguably the most valuable bioinformatics tool developed to date, but it is not the only program that can perform sequence similarity searches, nor is it the best choice for all applications. As other choices become more prevalent in the community, it is vital that we develop a shared understanding of the fundamental biological questions addressed as well as a methodology for transforming, comparing, and replicating results produced with different implementations and algorithms.
In this article, I discuss some example scenarios in which life scientists perform sequence-based homology searches, and the pitfalls they should be aware of when using any heuristic algorithm. I will discuss this at much greater length at the O'Reilly Bioinformatics Technology conference in January of 2002. In addition, I will present comparative results on a large (80,000 query sequences versus DB-EST) set of homology searches across several algorithms.
O'Reilly's Bioinformatics Technology Conference, January 28-31, 2002, in Tucson, Arizona, will present rich technical training for biologists and computational scientists, with a practical emphasis on how to choose and use the right tools. Register by December 7 and save up to $545.
Homology and Search
Two items are homologs if they fill the same role in different settings. Coffee in the U.S. might be homologous to tea in the U.K. In genomics, this concept has several interpretations. The most accessible of these is that when a particular biochemical pathway is present in related species, two genes that do the same thing are said to be homologs.
Post any questions or concerns, and Chris will try to address them at his conference session.
A more formal definition (based on the history of the word itself) is that two biochemical sequences are homologous if they are derived from a common ancestor. This definition is a minefield of arguments that I wish to sidestep in this short article. I will, however, take one piece of the more formal definition and leave the rest: Homology is a Boolean property. There is no such thing as a degree or a percent of homology. There might be percent similarities or degrees of match, but two items are either homologous or they are not.
In attempting to decipher the function of a bit of biochemical sequence, one of the first steps undertaken by a life scientist is to perform a search for homologs across all of the sequences publicly available in a repository like NCBI's GenBank. If a close relative (indicated by a high degree of sequence similarity) is found, which has already been extensively studied, it provides valuable leverage for determining the function and behavior of the sequence in question.
The most popular metric used to formalize sequence similarity (and the one used in BLAST) is related to a computer science standby: edit distance. Loosely, this is the minimum number of single character replacements, insertions, and deletions by which one sequence can be transformed into another. A lower edit distance indicates a more similar sequence.
Judging homology (a functional similarity) based on string similarity relies on the (rather weak) assertion that sequence similarity can imply functional similarity. There are many rather glaring counterexamples on either side of this claim. Sequences with greater than 70 percent identity have been found that fill totally different roles, while sequence pairs with less than 35 percent identity have been found that not only perform the same function, but do it in precisely the same way.
Despite this weakness, sequence similarity is the most powerful bioinformatics technique currently available, and BLAST is the most popular tool to implement it.
Christopher Dwan will be speaking on this topic at O'Reilly's first Bioinformatics Technology Conference, January 28-31, 2002, in Tucson, Arizona. For his session description, read Speedup at What Cost? An Evaluation of Heuristic vs. Complete Homology Search Techniques.
Time versus Accuracy
BLAST is a heuristic, rather than complete, search technique. Heuristic techniques sacrifice accuracy of results in favor of the speed with which those results are produced. The canonical example of a complete search on (approximately) the same metric distance as BLAST is the Smith-Waterman algorithm first published in 1981. Smith-Waterman can be thought of as the standard to which BLAST emerged as an approximation.
Using Smith-Waterman, the time required to search a dataset containing N letters for matches to a query M-letters-long grows with at least (M + N) squared. In some cases the value is closer to the cube than the square of that sum. BLAST produces its results in approximately linear time with the same sum.
For the publicly available datasets, N is truly monstrous: Millions of letters for even a small genome, and billions of letters for the sets of genomic data that are of real interest. Confounding the problem is that modern automation techniques permit the generation of thousands of query sequences per day even by small facilities. Obviously, to keep up with the deluge of data, we will have to compromise somewhere.
The obvious question is, "What do we lose?". Setting aside subtle (but important) differences in the metric space searched by the two algorithms, the answer to that question depends on the exact parameters of the search as well as how the results are to be used. For the question as it has been described to this point, BLAST turns out to be a remarkably good approximation to Smith-Waterman. This has been shown several times since BLAST's initial publication in 1990, but as current datasets pass orders of magnitude beyond those original test sets it is important to check that those initial results still hold true.
In experiments with a large set of data, using a hardware-accelerated, Smith-Waterman algorithm on a Decypher board from TimeLogic Corporation, the correlation between very strong matches was nearly complete. For anything that wasn't very nearly identical, however, the correlation fell off rapidly. Further, while the sensitivity of BLAST (percentage of the hits available to be found, which were found) was high, its selectivity (percentage of the hits returned that were truly correct) was surprisingly low once the very best hits were discarded.
The message is that, while BLAST is a fine tool for finding very highly conserved sequences, researchers who are looking for distant homology, or who are training automated recognizers, or who are performing other nontraditional searches might be better served by another heuristic, or by investing the time to get the real numbers on a smaller set of queries.
For a list of O'Reilly's books on bioinformatics, visit bio.oreilly.com.
Once a good pair (or more) of homologs is found, a question very much on the mind of a life scientist is "How similar are these sequences, and which of those similarities are the most important?". A detailed discussion of techniques for answering these questions is beyond the scope of this article. The important bit here is that most of the techniques for answering these questions are based on sequence alignments, in which every letter of one sequence is mapped to either a letter or a gap in the other of the sequences in question.
For each of the pairs of similar sequences that it finds, BLAST reports an alignment. This alignment is frequently close to, but not precisely the same as, one that would be found by a complete search of the alignment-space for the two sequences in question. Again, it's a time versus correctness tradeoff.
Using known homologs from two closely related species, I compared a couple of the popular analyses based on alignments from both BLAST and Smith-Waterman. As might be expected, the results based on BLAST consistently indicated a greater distance between the sequences than those results shown by the optimal alignment.
More disturbing was the fact that I could change the alignments generated by BLAST by using different substitution matrixes, which specify the cost involved in changing one letter to another. Most published results based on experiments of this sort do not mention the exact parameters used to generate hits and alignments. Numbers that are seemingly based on the same technique might in fact be quite different, destroying the reproducibility of the experiment on publication.
BLAST is a heuristic search technique, which means that users accept a cost (less than accurate results) in exchange for a benefit (quicker turnaround on those results). As the field explodes with new researchers and new ideas, it is vital that that tradeoff be taken into account when selecting the tool to use to search for and analyze homologs.
It is also imperative that bioinformaticians running the actual searches not be too distant from the real biochemical questions being addressed. Otherwise details like those shown above, and many others, can get swept under the rug.
Christopher Dwan has spent five years applying machine learning and computational techniques to scientific investigations in vision processing, sensor analysis, and bioinformatics.
Return to the Linux DevCenter.