Twenty-First Century Rocket Science
by Ian Korf, coauthor of BLAST08/11/2003
Bioinformatics, genomics, proteomics: these are the buzzwords of the 21st century. Gone are the halcyon days where cool was breaking the sound barrier or sending a rocket to the moon. We can now read the book of life and marvel at its passages. Now that's cool.
Technological advances in computer technology and molecular biology in the last 30 years have changed the life sciences forever. Yes, it's true that a lot of biology is the same, and that biologists still travel to exotic lands and go unwashed for weeks just so they can observe some squishy thing or other. But computational molecular biology changes the way we look at the living world around us. We can now ask questions about the insides of organisms, and I don't mean their stomachs.
Genes and proteins: this is where the action is. Bioinformatics is a new field full of new ideas. It doesn't take much to begin. A computer with an Internet connection is a must, but equally important is your own curiosity. Hopefully if you're new to the field, the mixture of math, biology, and computer jargon won't put you off. It isn't all that complicated in the end, but it will seem like it at first. Three books that will make your life easier are BLAST, Learning Perl for Bioinformatics, and Developing Bioinformatics Computer Skills.
Germs and Disease
Despite modern medicine, diseases caused by microorganisms are still common, dangerous, and downright scary. I don't know about you, but the recent SARS outbreaks scared me silly. And I live in the UK! The sad truth is that we will always be fighting against germs. To combat them, we need to know as much about them as possible. One way we can begin to understand them is to look at their genes.
The bacterium that causes bubonic plague is named Yersinia pestis. What is it about Y. pestis that makes it so vile? One way to answer this question is to compare its genes to those of a not-so-vile bacterium. By the way, this is the typical way that scientists set up an experiment. They start with a question (why so vile?) and then think of an approach that will attempt to answer the question (compare genes). Our experiment will offer some explanations, but they are only part of a much larger story.
BLAST, Unix, and Perl
In this article, we'll use a mixture of BLAST, Unix, and Perl to answer the question posed above.
- Why BLAST? Because you can learn a lot by comparing sequences, and BLAST is the standard program for this task. BLAST software is available from the National Center for Biotechnology Information (NCBI) and Washington University (WU). NCBI-BLAST is free software in the public domain. WU-BLAST must be licensed, but is free for non-commercial users. We'll use NCBI-BLAST in this article because it is more easily obtained. Installation, tutorials, and other essential information on BLAST are presented in the BLAST book.
- Why Unix? Because BLAST, like many other bioinformatics applications, was originally developed in Unix. It is the most common operating system for computational biology research. If you have a Windows PC, you can install one of the inexpensive Unix-like operating systems such as Linux or FreeBSD on a spare partition or drive. An alternative is to install Cygwin, which will give you a Unix-like environment in Windows. Or you can just use Windows throughout, but the examples below will have to be adjusted for the Windows command line. If you have a Mac, you're already running Unix if you're using OS X. OS 9 users should upgrade to OS X if possible or install one of the PPC Linux distributions if you don't upgrade.
- Why Perl? Because Perl is the most popular programming language in bioinformatics. If you have Unix/Linux/Cygwin, no doubt you already have Perl installed. You could equally well use Python, Ruby, or some other scripting language with a rapid development cycle. For the tasks in this article, Java, C++, and the like are overkill.
GenBank
Let's review what we want to accomplish. We want to find the genes that are in Y. pestis but not some other nice bacteria. We're going to keep things simple in this exercise and equate genes and proteins. The first thing we need is the entire set of proteins (also known as the proteome) for Y. pestis. So where do you get sequences? The first and last stop is GenBank. GenBank is a huge repository for biological sequences and associated information. You can thank the United States government for putting U.S. taxpayers' money to a good cause that benefits everyone.
Start by going to the bacterial genomes page. There are quite a few genomes there. Here's a link directly to the Y. pestis genome and to the ftp site for the proteome. We also need a nice bacterium to compare it to. For this we'll use Eschericia coli. E. coli is a bacterium that lives in your gut and is a favorite organism for various molecular biology techniques. There are a few strains that can give you food poisoning, but E. coli is generally harmless. Here's a link to its proteome. You should now have two new files on your computer, named NC_004088.faa and NC_000913.faa.
Pilot Experiments
Despite the fact that art and science play by very different rules, there are some important commonalties. They both take thought and effort. Artists often do studies in pencil or charcoal to try out their ideas before they put paint to canvas. For scientists, pilot experiments are charcoal.
But while haphazard art can be fun, haphazard science rarely is. Just
because it's easy to do a BLAST search doesn't mean one should rush
into it. Taking the time to do things carefully can actually save you
time in the end. In our pilot experiment, we'll search 1 percent of the Y.
pestis proteome against the E. coli proteome. The lessons
we learn on 1% of the data will help use when we do the full experiment.
Let's begin by looking at the files we just downloaded. We'll use the
Unix head command to look at the first 10 lines.
% head NC_004088.faa
>gi|22123923|ref|NP_667346.1| initiation of chromosome replication [Yersinia pestis KIM]
MADITLISGSTLGSAEYVAEHLADKLEEAGFSTEILHGPELDELTLNGLWLIVTSTHGAGDLPDNLQPLL
EQIEQQKPDLSQVRFGAVGLGSSEYDTFCGAIIKLDQQLIAQGAQRLGEILEIDVIQHEIPEDPAEIWVK
DWINLL
>gi|22123924|ref|NP_667347.1| regulator for asnA, asnC and gidA [Yersinia pestis KIM]
MSEIYQIDNLDRSILKALMENARTPYAELAKNLAVSPGTIHVRVEKMRQAGIITAACVHVNPKQLGYDVC
CFIGIILKSAKDYPSALKKLESLEEVVEAYYTTGHYSIFIKVMCKSIDALQQVLINKIQTIDEIQSTETL
ISLQNPIMRTIVP
>gi|22123925|ref|NP_667348.1| asparagine synthetase A [Yersinia pestis KIM]
MKKQFIQKQQQISFVKSFFSRQLEQQLGLIEVQAPILSRVGDGTQDNLSGSEKAVQVKVKSLPDSTFEVV
The sequences are in FASTA format. The line beginning with ">" is called
the definition line. There is always exactly one. The sequence lines
follow, and there may be any number of sequence lines. The last sequence
has been truncated by the head command. The file names
are a little confusing, so let's make an alias for each.
% ln -s NC_004088.faa Y.pestis
% ln -s NC_000913.faa E.coli
One of the first things we might want to know is how many proteins are
in each file. For this we can use grep with the -c
option.
% grep -c ">" Y.pestis
4090
% grep -c ">" E.coli
4279
Both proteomes contain a little over 4,000 genes. The next step is to
create a subset of the Y.pestis file for our pilot experiment that contains
roughly 1% of the sequence. The wc command tells us how
many lines, words, and characters there are in a file.
% wc Y.pestis
24269 47944 1619914 Y.pestis
Now we'll use the head command again to select about 1%
of the sequence, and we'll check how many sequences we actually have.
head -243 Y.pestis > yp-pilot
grep -c ">" yp-pilot
37
Before we can search this against the E. coli proteome, we must
format the E.coli file as a BLAST database. We do this with the formatdb
program. We'll also format the Y.pestis file because BLAST databases
can be used to retrieve individual sequences, and we'll use this functionality
later.
% formatdb -i Y.pestis -o
% formatdb -i E.coli -o
We're now ready to do the pilot experiment, which is to search each of the 37 Y. pestis sequences in the yp-pilot file against the BLAST database containing the E. coli proteome. This will produce 37 BLAST reports all in the same file. We'll eventually process these with a Perl script, but for this pilot experiment we want to look at the reports by eye.
One of the greatest mistakes one can make is to always use the default
BLAST parameters. For BLASTP, which compares proteins, the defaults
are often fine, just not optimal. We'll begin with these parameters
and tune them later. We can add the Unix time program to
keep track of CPU usage so we can estimate the time for the larger search.
% time blastall -p blastp -d E.coli -i yp-pilot > blast-default
9.420u 1.030s 0:10.98 95.1%
The search took about 11 seconds on my trusty 550 MHz PowerBook. With these parameters, we can expect about 1,100 seconds, or about 18 minutes for the real search. Let's look at a few alignments now. The first query sequence in the report is described as:
Query= gi|22123923|ref|NP_667346.1| initiation of chromosome
replication [Yersinia pestis KIM]
(146 letters)
The best alignment between this sequence and the E. coli proteome is at the top of the first BLAST report. Here it is:
>ref|NP_418198.1| initiation of chromosome replication [Escherichia coli K12]
Length = 147
Score = 216 bits (550), Expect = 5e-58
Identities = 98/146 (67%), Positives = 124/146 (84%)
Query: 1 MADITLISGSTLGSAEYVAEHLADKLEEAGFSTEILHGPELDELTLNGLWLIVTSTHGAG 60
MADITLISGSTLG AEYVAEHLA+KLEEAGF+TE LHGP L++L +G+WL+++STHGAG
Sbjct: 1 MADITLISGSTLGGAEYVAEHLAEKLEEAGFTTETLHGPLLEDLPASGIWLVISSTHGAG 60
Query: 61 DLPDNLQPLLEQIEQQKPDLSQVRFGAVGLGSSEYDTFCGAIIKLDQQLIAQGAQRLGEI 120
D+PDNL P E +++QKPDLS VRFGA+G+GS EYDTFCGAI KL+ +L GA++ GE
Sbjct: 61 DIPDNLSPFYEALQEQKPDLSAVRFGAIGIGSREYDTFCGAIDKLEAELKNSGAKQTGET 120
Query: 121 LEIDVIQHEIPEDPAEIWVKDWINLL 146
L+I+++ H+IPEDPAE W+ W+NLL
Sbjct: 121 LKINILDHDIPEDPAEEWLGSWVNLL 146
Notice the low Expect value of 5e-58 and the high level of sequence identity (67%). If you're new to BLAST, these numbers may be very mysterious. We explain these in-depth in the BLAST book. Basically, this Y. pestis protein (NP_667346) is very similar to an E. coli protein (NP_418198). So this protein is not unique to Y. pestis.
Take some time to peruse some of the other 36 reports. You will find a repeating pattern. Like the report above, the first alignment in each report has high identity and a low Expect. This is because the two proteins are orthologous. In other words, they are the same gene in two different organisms.
There are also some reports where the Y. pestis protein does not have any good alignments to E. coli. Some of the Y. pestis sequences are labeled as "hypothetical" proteins and are very short. These might be overzealous gene predictions and not real proteins. Scroll down to the report with the following header (search for the unique identifier 22123943 or NP_667366).
Query= gi|22123943|ref|NP_667366.1| hemolysin precursor [Yersinia
pestis KIM]
(1654 letters)
Database: E.coli
4279 sequences; 1,361,003 total letters
Searching.........done
Score E
Sequences producing significant alignments: (bits) Value
ref|NP_415919.1| split orf [Escherichia coli K12] 48 5e-06
ref|NP_417134.1| putative ATP-binding component of a transport s... 42 3e-04
ref|NP_415687.1| putative ATP-binding component of a transport s... 38 0.005
ref|NP_416485.1| putative factor [Escherichia coli K12] 33 0.16
ref|NP_415923.1| split orf [Escherichia coli K12] 33 0.21
ref|NP_418376.1| 5,10-methylenetetrahydrofolate reductase [Esche... 32 0.46
ref|NP_416012.1| orf, hypothetical protein [Escherichia coli K12] 31 0.79
The top match is reported with an Expect value of 5e-06. This is relatively low by some people's standards, and one might stop here and accept that the query has found a significant match in the database. Let's look at the best alignment to be sure.
>ref|NP_415919.1| split orf [Escherichia coli K12]
Length = 852
Score = 48.1 bits (113), Expect = 5e-06
Identities = 66/254 (25%), Positives = 94/254 (37%), Gaps = 48/254 (18%)
Query: 1175 GKDLTLQGTEFGSKNAAT---GDVQLTAGGK-VDFQAAQSQSSKQDMTWSASVKAGKSKS 1230
GKD T GTE N GD+ ++ GG +D S+ D + +V +
Sbjct: 236 GKDST--GTEINGNNGKVIQDGDLDVSGGGHGIDITG---DSATVDNKGTMTV----TDP 286
Query: 1231 NSTSENDDGTKTHTNNKGFSA----GAEANVKNTDETSLTHQGGIINSNGAVTIKADGKD 1286
S DG K NN+G S G + D T+ N+NG T+ DGKD
Sbjct: 287 ESMGIQIDGDKAIVNNEGESTITNGGTGTQINGDDATA--------NNNGKTTV--DGKD 336
Query: 1287 QQAIHLQNTN-IISQETILNADNGGIVMESAQDKEHKNNWNVNTNLNGSQKNTIKSDDQG 1345
+ N + Q+ L+ GG ++ D +N T + DQ
Sbjct: 337 STGTEINGNNGKVIQDGDLDVSGGGHGIDITGDSATVDNKGTMTVTDPESIGIQVDGDQA 396
Query: 1346 VVDKDSAKKIHNAAIKLDGGVDKLDSVTQQNTHINSDKVTLNSAKNTELAGAVLQANQIS 1405
VV+ + I N GG T IN D T N+ T + G +I+
Sbjct: 397 VVNNEGESAITN------GGT---------GTQINGDDATANNNGKTTVDGKDSTGTEIA 441
Query: 1406 GQIG-----GDLNV 1414
G G GDL+V
Sbjct: 442 GNNGKVIQDGDLDV 455
There are many gaps and the percent identity is a low 25%. The sequences might be distantly related, but there doesn't appear to be anything in the E. coli proteome that is very similar to this Y. pestis protein. Super! Unique Y. pestis proteins are exactly what we are looking for!
Scaling Up
When it comes time to do the real experiment, there are two additional factors we must address.
- How are we going to be able to identify unique proteins? We don't want to look at all of the files by eye, especially if we do this kind of experiment with many genomes.
- How can we optimize our search to make it faster? We might want to do these kinds of experiments many times.
We'll solve these issues one at a time.
Parsing BLAST Reports
From our brief pilot experiment we can guess that orthologs tend to have high percent identities (> 50 percent) and low Expect values (< 1e-5). We'll use these values for this article, but we would probably come up with more precise conditions given more time. We can set the Expect threshold on the BLAST command line, but percent identity is not a command line option for BLAST. So we must use a parser that can retrieve this information. There are two easy ways to parse BLAST reports in Perl:
- Bioperl
- Tabular BLAST output
Bioperl is a worldwide group of developers who have banded together to make life easier for bioinformatics professionals. Bioperl has a lot more to offer than just BLAST parsers. If you're serious about bioinformatics, learn how to use Bioperl. It will save you a lot of time and effort.
For this particular application, we won't use Bioperl because it's often helpful to see the gory details. NCBI-BLAST optionally outputs data in a convenient tabular format. Tabular data is compact and simple to parse, and there are many situations where a simple table is superior to XML, ASN.1, or some other highly structured format. The following script reads a BLAST report and outputs a FASTA file containing the unique proteins:
#!/usr/bin/perl -w
use strict;
die "usage: table-parser.pl sequences blast-report\n"
unless @ARGV == 2;
my %Sequence;
open(SEQ, $ARGV[0]) or die;
while (<SEQ>) { # Egads, the <> operator becomes invisible!
if (/^>(\S+)/) {$Sequence{$1} = 1}
}
my %Match;
open(BLAST, $ARGV[1]);
while (<blast>) { # Egads, the <> operator becomes invisible!
my ($query, $sbjct, $percent) = split;
if ($percent > 50) {$Match{$query} = 1}
}
foreach my $key (keys %Match) {
delete $Sequence{$key};
}
print join("\n", keys %Sequence), "\n";
Optimizing BLAST Parameters
Since we're looking for sequences that align with high-percent identity,
we can use parameters that are less sensitive and therefore faster than
the defaults. We also don't care about anything other than the best
alignment. In the following command line, -f 999 reduces
sensitivity by turning off neighborhood words; -v 1 limits
reports to the best match; -m 8 sets the output to tabular;
and -e 1e-5 sets the Expect threshold. These options and
more are explained in the BLAST book. Let's time the search for comparison.
% time blastall -p blastp -d E.coli -i yp-pilot -f 999 -v 1 -m 8 -e 1e-5 > blast-opt
2.650u 0.650s 0:03.49 94.5%
The new search takes about one third of the time of the default search. Let's process the sequences and BLAST report and see what we get.
gi|22123956|ref|NP_667379.1|
gi|22123936|ref|NP_667359.1|
gi|22123958|ref|NP_667381.1|
gi|22123955|ref|NP_667378.1|
gi|22123951|ref|NP_667374.1|
gi|22123957|ref|NP_667380.1|
gi|22123950|ref|NP_667373.1|
gi|22123940|ref|NP_667363.1|
gi|22123934|ref|NP_667357.1|
gi|22123943|ref|NP_667366.1|
If you look up these identifiers in the original report you'll find that these Y. pestis proteins do not have matching E. coli proteins. Some of these are labeled as hypothetical and might not be real proteins, but we can't figure that out from the BLAST report itself.
The Big Experiment
Now it's time to do the complete experiment and search all Y. pestis proteins against all E. coli proteins.
% time blastall -p blastp -d E.coli -i Y.pestis -f 999 -v 1 -m 8 -e 1e-5 > big-experiment
298.690u 66.640s 6:16.36 97.0%
The search was completed in about 6 minutes, which is about what we expected (one third of 18 minutes). Now, let's process the report and find out how many unique proteins there are to Y. pestis.
% table-parser.pl Y.pestis big-experiment > yp-unique
% wc yp-unique
1955 1955 56695 yp-unique
Wow, 1,995 sequences. We know that some of these are hypothetical and might not be real. Let's create a custom Perl script to retrieve the non-hypothetical sequences from the Y.pestis BLAST database. (This implementation is not the most efficient, but it demonstrates how to fetch sequences from a BLAST database).
#!/usr/bin/perl
# not-hypothetical.pl
while (<>) {
chomp;
open(FETCH, "fastacmd -d Y.pestis -s '$_' |") or die;
my ($def, @seq) = <fetch>;
close FETCH;
next if $def =~ /hypothetical/i;
print $def, @seq;
}
We use it like this:
not-hypothetical.pl <yp-unique > yp-real
% grep -c ">" yp-real
832
Bioinformatics Hero
We've just identified 832 proteins that are in Y. pestis but not E. coli. And the whole procedure didn't take very long. I don't know about you, but it makes me feel like a bioinformatics hero. We live in a truly amazing age when it's possible to do cutting-edge science with a personal computer from the comfort of our living room. One of the great things about this field is that we don't need much to begin. An old PC can be picked up for next to nothing, and Linux, BLAST, and Perl are free. Bioinformatics is fun, inexpensive, and cool; great for a career or a hobby.
What Next?
From a medical/scientific perspective, the experiment we just performed is the beginning and not the end of the story. What are those unique Y. pestis genes and what do they do? If we compared those genes to the genomes of other non-harmful bacteria, would we be able to prune the list further? If we found genes that are truly unique to Y. pestis, could we develop drugs to specifically target this bacterium without side-effects to the patient or the environment? These kinds of questions are the new frontier. Why not join in? Probably the best place to begin is with the new BLAST book. If you let your curiosity support you through the unfamiliar bits, and temper your enthusiasm with an attention to details, you might be surprised at just how quickly you become a bioinformatics hero.
Ian Korf is a post-doctoral fellow researching comparative genomics and gene prediction at the Wellcome Trust Sanger Institute in the U.K. He is also a coauthor of O'Reilly's recently released BLAST.
O'Reilly & Associates recently released (July 2003) BLAST.
-
Sample Chapter 4, Sequence Similarity, is available free online.
-
You can also look at the Table of Contents, the Index, and the Full Description of the book.
-
For more information, or to order the book, click here.
Return to bio.oreilly.com
You must be logged in to the O'Reilly Network to post a talkback.
Showing messages 1 through 2 of 2.
-
not-hypothetical.pl <yp-unique > yp-real
2003-09-14 12:16:18 julian_cook [Reply | View]
Only problem I had was with not-hypothetical.pl. The script (actually fastacmd) did not like the string "Y.pestisbig-experiment" which was inserted into the first line of yp-unique. If you remove it manually then not-hypothetical.pl runs, but it does not write any output to yp-real. Possibly because it doesn t find any hypothetical entries in yp-unique (I could not find any from manual inspection).
-
Nice Article
2003-09-11 12:01:29 anonymous2 [Reply | View]
I enjoyed your article and although I have not yet actually run it myself I am looking forward to firing up my iBook and trying it out. The article felt a little rushed at the end though and inevitably there were plugs for the BLAST book, which I think I will be getting.
One quick Q...do I need to download BLAST and install it on my Mac...and if so how? At the Terminal cml? Thanks!




