Relational Modeling of Biological Data: Trees and Graphs
by Aaron Mackey, speaker at O'Reilly's upcoming Bioinformatics Technology Conference11/27/2002
Editor's note: Aaron Mackey discovered bioinformatics during the second year of his Ph.D. program in Immunology at Washington University in St. Louis. After taking a class in computational molecular biology, Aaron realized that it was possible to combine an aptitude for mathematics and computer science with non-theoretical biological research, and was hooked. Aaron will be giving a tutorial on Relational Databases for Bioinformatics at O'Reilly's upcoming Bioinformatics Technology Conference.
Sources of Hierarchical Data
|
Related Reading
Beginning Perl for Bioinformatics |
Many biological data sources represent information in hierarchies. Remember the mnemonic "King Philip Came Over From Germany Speedily" you once learned in grade school biology class? It was supposed to help you remember the hierarchical taxonomy classifications: Kingdom Phylum Class Order Family Genus Species. Today, this information is a click away: the NCBI Taxonomy database contains the common and scientific names of every organism that is associated with a biological sequence in Entrez, in addition to the complete hierarchical taxonomic tree of species. At the NCBI Taxonomy Web site you can search for a given species, or browse up and down the taxonomic tree. One can learn, for instance, that the human species has the taxonomic definition "cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostome; Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Primates; Catarrhini; Hominidae; Homo/Pan/Gorilla Group; Homo; Homo sapiens."
The Structural Classification of Proteins, or SCOP, is another example of biological data organized in a hierarchy. SCOP manually clusters small families of related proteins into larger superfamilies that share common structure, and even larger clusters of classes that share more general fold topology. SCOP, and other databases like it, are commonly used as gold standards to compare the performances of homology-identifying sequence or structural-comparison algorithms such as SSEARCH, PSI-BLAST, HMMER, DALI, or VAST. Though not as deep a hierarchy as the NCBI Taxonomy tree, working with the SCOP dataset requires similar methods for hierarchical modeling.
Finally, the Gene Ontology consortium curates a controlled vocabulary dictionary of known biological processes, molecular functions, and cellular localizations. The GO database contains both general terms like enzyme or cell, and more specific terms like gluthione transferase and vacuole. These lists of terms and definitions are useful alone, but the real gold mine of information from the GO project is the carefully constructed ontology that relates more specific terms to their general counterparts.
While these projects have rich public Web sites for browsing and downloading
the data, none of them provide the means to compute on the data, to
relate your own data to theirs, or to query across the disparate
resources. Fortunately, all of these data sources can be freely obtained by
ftp or anonymous cvs for integration into a local,
custom-made relational database. But how does one represent hierarchical data in
a relational database? And, more importantly, how can one make efficient use of
such data?
seqdb_demo: A Simple Sequence Database
Before we start looking at these issues, we'll briefly introduce a very simple
biosequence database schema, seqdb_demo, meant to store the
contents of the NCBI nonredundant
("nr") flatfile protein database. This database has approximately 1.2
million protein sequences, each of which has one or more descriptions (separated by ctrl-A characters):
>gi|10732787|gb|AAG22538.1| homocysteine S-methyltransferase-2 [Zea mays]
MVVTAAGSAEEAVRRWVDAAGGRLVLDGGLATELEANGADLNDPLWSAKCLLSSPHLIRK
VHMDYLEAGANIIITASYQATIQGFESKGFSKEQSENLLTKSVEIALEAREMFLKEHLEK
CKDGAVLIGGCCRTTPNTIRAIHRTLNKSPNKQQLPAVE
>gi|15241446|ref|NP_196966.1| putative protein; protein id: At5g14620.1
[Arabidopsis thaliana]^A
gi|11281152|pir||T48635 hypothetical protein T15N1.110 - Arabidopsis thaliana^A
gi|7573311|emb|CAB87629.1| putative protein [Arabidopsis thaliana]
MIVISGENVDIAELTDFLCAAQMAREFSEFYTEHEEQKPRHNIKKRRFESKGEPRSSVDD
EPIRLPNPMIGFGVPNEPGLITHRSLPELARGPPFFYYENVALTPKGVWETISRHLFEIP
KYGGFDLVIGGSPCNNLAGGNRVSRVGLEGDQSSLFFEYCRILEVVRARMRGS
The seqdb_demo database has a
seq table to hold each biosequence and an associated descriptions
table, descr that contains all descriptions of the sequence and
its various accession numbers from NCBI (the GI number) and/or its parent
database (SwissProt, EMBL, PDB, and so on):
|
|
Here's an example query that shows the data from the nr flatfile snippet:
SELECT descr.*, SUBSTRING(seq, 1, 20) AS subseq
FROM seq
INNER JOIN descr USING (seq_id)
| descr_id | seq_id | gi | acc | db | descr | subseq |
|---|---|---|---|---|---|---|
| 1 | 1 | 10732787 | AAG22538.1 | gb | homocysteine S-methyltransferase-2 | MVVTAAGSAEEAVRRWVDAA |
| 2 | 2 | 15241446 | NP_196966.1 | ref | putative protein; protein id: At5g14620.1 | MIVISGENVDIAELTDFLCA |
| 3 | 2 | 11281152 | T48635 | pir | hypothetical protein T15N1.110 | MIVISGENVDIAELTDFLCA |
| 4 | 2 | 7573311 | CAB87629.1 | emb | putative protein | MIVISGENVDIAELTDFLCA |
We'll use and expand on this simple schema throughout this article. Our goal will be to see how to integrate other sources of hierarchical data into our simple sequence database, writing efficient SQL queries to make use of that data.
Modeling Trees
We'll start by storing and querying the NCBI Taxonomy database. From the NCBI FTP site, you can download the taxonomy-related data in tabular format and import the tables directly into your relational database. NCBI provides a separate table of the many possible names for each taxon, but we'll simplify the matter and store only the unique scientific name.
Adjacency List Representation
To represent the taxonomic hierarchy, NCBI has chosen a common
representation of tree structures. Using an endogenous foreign key
(parent_id), reference the primary key (taxon_id) of
the parental row:
|
|
This representation, called an adjacency list, is easy to understand. Every
parent-child relationship is explicitly defined. Each taxon's
parent_id points to its parent taxon's taxon_id. To find the parent taxon of humans, we write:
SELECT *
FROM taxon AS parent
INNER JOIN taxon AS child
ON (child.parent_id = parent.taxon_id)
WHERE child.name = 'Homo sapiens'
We could manually repeat this query for each parent taxon we retrieve, up to the root of the tree. We can also find out how many unique species there are in the taxonomy tree by counting how many taxa are "terminal", in other words, they have no children:
SELECT COUNT(*)
FROM taxon AS terminal
LEFT JOIN taxon AS child
ON (child.parent_id = terminal.taxon_id)
WHERE child.parent_id IS NULL
This yields approximately 120,000 unique species. Estimates of the planet's true biodiversity are far greater than this value, so we know that the NCBI Taxonomy database is far from a complete representation of all organisms.
Finally, we can also download and capture the biosequence/species association information, adding a taxon_id foreign key to the descr table that references the taxon.taxon_id primary key. Now we can easily
write an SQL query that fetches all the human sequences present in GenBank:
SELECT descr.gi, descr.descr, seq.seq
FROM seq
INNER JOIN descr USING (seq_id)
INNER JOIN taxon USING (taxon_id)
WHERE taxon.name = "Homo sapiens"
But what if we want to retrieve all Primate sequences, including humans and other ape-like species? Or perhaps we'd like to retrieve all Bacterial sequences but only those that aren't also from the more specific Enterobacteria. With the adjacency list representation, we'd have to "walk" through the tree procedurally (usually recursively) to include or exclude the taxonomies of interest, collecting sequences as we walked along. While potentially prohibitively expensive in both time and memory requirements, Brian Jepson's DBIx::Tree Perl module, or database-native procedural TransactSQL-like languages, provides some facility for these activities, but are neither efficient nor a solution in a pure SQL environment.
Nested-Set Representation
Fortunately, an attractive alternative exists, called the nested-set
representation. SQL for Smarties author Joe Celko has long
endeavored to make people aware of this methodology, and I will carry on his
crusade here. The basic idea is this: instead of explicitly representing
relationships between immediate child-parent pairs with parent_id
foreign keys, we'll use two surrogate, calculated values--left_id
and right_id--appended to our taxon table. Without
showing yet how they're calculated, we'll simply require that the values in
these two fields have the following, useful property: for all parent-child
pairs, the child's left_id and right_id will be
between the parent's left_id and right_id
values. The same will be true of the child's children, and so on all the way
down the hierarchy:
|
|
At first, these extra numbers may not seem very useful. Are they foreign
keys? What field do they reference? The answers are: No and they
don't. These two numbers represent the same hierarchical information as the
taxon_id, parent_id pairs, but implictly,
instead of explicitly. Fundamentally, the values in left_id and
right_id are just numbers, nothing else and could be changed to any
other set of numbers, as long as the between-ness property holds. But
it's exactly this property that provides the utility we need to select entire
subsets of taxonomies in one step. Now to select all Primate sequences we
can use this SQL:
SELECT descr.gi, descr.descr, seq.seq
FROM seq
INNER JOIN descr USING (seq_id)
INNER JOIN taxon USING (taxon_id)
INNER JOIN taxon AS include
ON (taxon.left_id BETWEEN include.left_id AND
include.right_id)
WHERE include.name = 'Primate'
Here, we've joined the taxon table onto itself in a
self-join governed by a BETWEEN clause that enforces the
implicit relationship between taxonomies encoded by the included parental
left_id and right_id values. The WHERE
clause specifies which taxonomy we wish to specifically include in the result
set. We can extend the same approach to select all Bacterial sequences
but now exclude any that are also Enterobacterial in origin:
SELECT descr.gi, descr.descr, seq.seq
FROM seq
INNER JOIN descr USING (seq_id)
INNER JOIN taxon USING (taxon_id)
INNER JOIN taxon AS include
ON (taxon.left_id BETWEEN include.left_id AND
include.right_id)
INNER JOIN taxon AS exclude
ON (taxon.left_id NOT BETWEEN exclude.left_id AND
exclude.right_id)
WHERE include.name = 'Bacteria'
AND exclude.name = 'Enterobacteria'
Pages: 1, 2 |




