O'Reilly Network    
 Published on O'Reilly Network (http://www.oreillynet.com/)
 See this if you're having trouble printing code examples


Relational Modeling of Biological Data: Trees and Graphs

by Aaron Mackey, speaker at O'Reilly's upcoming Bioinformatics Technology Conference
11/27/2002

Editor's note: Aaron Mackey discovered bioinformatics during the second year of his Ph.D. program in Immunology at Washington University in St. Louis. After taking a class in computational molecular biology, Aaron realized that it was possible to combine an aptitude for mathematics and computer science with non-theoretical biological research, and was hooked. Aaron will be giving a tutorial on Relational Databases for Bioinformatics at O'Reilly's upcoming Bioinformatics Technology Conference.

Sources of Hierarchical Data

Related Reading

Beginning Perl for Bioinformatics
By James Tisdall

Many biological data sources represent information in hierarchies. Remember the mnemonic "King Philip Came Over From Germany Speedily" you once learned in grade school biology class? It was supposed to help you remember the hierarchical taxonomy classifications: Kingdom Phylum Class Order Family Genus Species. Today, this information is a click away: the NCBI Taxonomy database contains the common and scientific names of every organism that is associated with a biological sequence in Entrez, in addition to the complete hierarchical taxonomic tree of species. At the NCBI Taxonomy Web site you can search for a given species, or browse up and down the taxonomic tree. One can learn, for instance, that the human species has the taxonomic definition "cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostome; Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Primates; Catarrhini; Hominidae; Homo/Pan/Gorilla Group; Homo; Homo sapiens."

The Structural Classification of Proteins, or SCOP, is another example of biological data organized in a hierarchy. SCOP manually clusters small families of related proteins into larger superfamilies that share common structure, and even larger clusters of classes that share more general fold topology. SCOP, and other databases like it, are commonly used as gold standards to compare the performances of homology-identifying sequence or structural-comparison algorithms such as SSEARCH, PSI-BLAST, HMMER, DALI, or VAST. Though not as deep a hierarchy as the NCBI Taxonomy tree, working with the SCOP dataset requires similar methods for hierarchical modeling.

Finally, the Gene Ontology consortium curates a controlled vocabulary dictionary of known biological processes, molecular functions, and cellular localizations. The GO database contains both general terms like enzyme or cell, and more specific terms like gluthione transferase and vacuole. These lists of terms and definitions are useful alone, but the real gold mine of information from the GO project is the carefully constructed ontology that relates more specific terms to their general counterparts.

While these projects have rich public Web sites for browsing and downloading the data, none of them provide the means to compute on the data, to relate your own data to theirs, or to query across the disparate resources. Fortunately, all of these data sources can be freely obtained by ftp or anonymous cvs for integration into a local, custom-made relational database. But how does one represent hierarchical data in a relational database? And, more importantly, how can one make efficient use of such data?

seqdb_demo: A Simple Sequence Database

Before we start looking at these issues, we'll briefly introduce a very simple biosequence database schema, seqdb_demo, meant to store the contents of the NCBI nonredundant ("nr") flatfile protein database. This database has approximately 1.2 million protein sequences, each of which has one or more descriptions (separated by ctrl-A characters):


>gi|10732787|gb|AAG22538.1| homocysteine S-methyltransferase-2 [Zea mays]
MVVTAAGSAEEAVRRWVDAAGGRLVLDGGLATELEANGADLNDPLWSAKCLLSSPHLIRK
VHMDYLEAGANIIITASYQATIQGFESKGFSKEQSENLLTKSVEIALEAREMFLKEHLEK
CKDGAVLIGGCCRTTPNTIRAIHRTLNKSPNKQQLPAVE

>gi|15241446|ref|NP_196966.1| putative protein; protein id: At5g14620.1 
    [Arabidopsis thaliana]^A
gi|11281152|pir||T48635 hypothetical protein T15N1.110 - Arabidopsis thaliana^A
gi|7573311|emb|CAB87629.1| putative protein [Arabidopsis thaliana]
MIVISGENVDIAELTDFLCAAQMAREFSEFYTEHEEQKPRHNIKKRRFESKGEPRSSVDD
EPIRLPNPMIGFGVPNEPGLITHRSLPELARGPPFFYYENVALTPKGVWETISRHLFEIP
KYGGFDLVIGGSPCNNLAGGNRVSRVGLEGDQSSLFFEYCRILEVVRARMRGS

The seqdb_demo database has a seq table to hold each biosequence and an associated descriptions table, descr that contains all descriptions of the sequence and its various accession numbers from NCBI (the GI number) and/or its parent database (SwissProt, EMBL, PDB, and so on):

Definition: seq
  field datatype
PK seq_id INT
  seq TEXT
  len INT
Definition: descr
  field datatype
PK descr_id INT
FK seq_id INT REFERENCES (seq.seq_id)
  gi INT
  acc VARCHAR(20)
  db VARCHAR(3)
  descr TEXT

Here's an example query that shows the data from the nr flatfile snippet:


SELECT descr.*, SUBSTRING(seq, 1, 20) AS subseq
FROM   seq
       INNER JOIN descr USING (seq_id)
Query results:
descr_idseq_idgiaccdbdescrsubseq
1110732787AAG22538.1gbhomocysteine S-methyltransferase-2MVVTAAGSAEEAVRRWVDAA
2215241446NP_196966.1refputative protein; protein id: At5g14620.1MIVISGENVDIAELTDFLCA
3211281152T48635pirhypothetical protein T15N1.110MIVISGENVDIAELTDFLCA
427573311CAB87629.1embputative proteinMIVISGENVDIAELTDFLCA

We'll use and expand on this simple schema throughout this article. Our goal will be to see how to integrate other sources of hierarchical data into our simple sequence database, writing efficient SQL queries to make use of that data.

Modeling Trees

We'll start by storing and querying the NCBI Taxonomy database. From the NCBI FTP site, you can download the taxonomy-related data in tabular format and import the tables directly into your relational database. NCBI provides a separate table of the many possible names for each taxon, but we'll simplify the matter and store only the unique scientific name.

Adjacency List Representation

To represent the taxonomic hierarchy, NCBI has chosen a common representation of tree structures. Using an endogenous foreign key (parent_id), reference the primary key (taxon_id) of the parental row:

NCBI Taxonomy table taxon
Definition: taxon
  field datatype
PK taxon_id INT
FK parent_id INT REFERENCES (taxon.taxon_id)
  name TEXT
Example data:
taxon_id name parent_id
1 root NULL
2 Bacteria 1
2157 Archaea 1
2759 Eukaryota 1
1224 Proteobacteria 2
... ... ...
543 Enterobacteriaceae 91347
561> Escherichia 543
562 Escherichia coli 561
83333 Escherichia coli K12 562

This representation, called an adjacency list, is easy to understand. Every parent-child relationship is explicitly defined. Each taxon's parent_id points to its parent taxon's taxon_id. To find the parent taxon of humans, we write:

O'Reilly Bioinformatics Conference.

SELECT *
FROM   taxon AS parent
       INNER JOIN taxon AS child
         ON (child.parent_id = parent.taxon_id)
WHERE  child.name = 'Homo sapiens'

We could manually repeat this query for each parent taxon we retrieve, up to the root of the tree. We can also find out how many unique species there are in the taxonomy tree by counting how many taxa are "terminal", in other words, they have no children:


SELECT COUNT(*)
FROM   taxon AS terminal
       LEFT JOIN taxon AS child
         ON (child.parent_id = terminal.taxon_id)
WHERE  child.parent_id IS NULL

This yields approximately 120,000 unique species. Estimates of the planet's true biodiversity are far greater than this value, so we know that the NCBI Taxonomy database is far from a complete representation of all organisms.

Finally, we can also download and capture the biosequence/species association information, adding a taxon_id foreign key to the descr table that references the taxon.taxon_id primary key. Now we can easily write an SQL query that fetches all the human sequences present in GenBank:


SELECT descr.gi, descr.descr, seq.seq
FROM   seq
       INNER JOIN descr USING (seq_id)
       INNER JOIN taxon USING (taxon_id)
WHERE  taxon.name = "Homo sapiens"

But what if we want to retrieve all Primate sequences, including humans and other ape-like species? Or perhaps we'd like to retrieve all Bacterial sequences but only those that aren't also from the more specific Enterobacteria. With the adjacency list representation, we'd have to "walk" through the tree procedurally (usually recursively) to include or exclude the taxonomies of interest, collecting sequences as we walked along. While potentially prohibitively expensive in both time and memory requirements, Brian Jepson's DBIx::Tree Perl module, or database-native procedural TransactSQL-like languages, provides some facility for these activities, but are neither efficient nor a solution in a pure SQL environment.

Nested-Set Representation

Fortunately, an attractive alternative exists, called the nested-set representation. SQL for Smarties author Joe Celko has long endeavored to make people aware of this methodology, and I will carry on his crusade here. The basic idea is this: instead of explicitly representing relationships between immediate child-parent pairs with parent_id foreign keys, we'll use two surrogate, calculated values--left_id and right_id--appended to our taxon table. Without showing yet how they're calculated, we'll simply require that the values in these two fields have the following, useful property: for all parent-child pairs, the child's left_id and right_id will be between the parent's left_id and right_id values. The same will be true of the child's children, and so on all the way down the hierarchy:

Definition: taxon
  field datatype
PK taxon_id INT
FK parent_id INT REFERENCES (taxon.taxon_id)
  left_id INT
  right_id INT
  name TEXT
Example data:
taxon_id name parent_id left_id right_id
1 root NULL 1 323458
2 Bacteria 1 21703 87862
2157 Archaea 1 87863 92266
2759 Eukaryota 1 92267 323456
1224 Proteobacteria 2 23982 49591
... ... ... ... ...
543 Enterobacteriaceae 91347 26681 27938
561> Escherichia 543 26852 26891
562 Escherichia coli 561 26853 26868
83333 Escherichia coli K12 562 26856 26857

At first, these extra numbers may not seem very useful. Are they foreign keys? What field do they reference? The answers are: No and they don't. These two numbers represent the same hierarchical information as the taxon_id, parent_id pairs, but implictly, instead of explicitly. Fundamentally, the values in left_id and right_id are just numbers, nothing else and could be changed to any other set of numbers, as long as the between-ness property holds. But it's exactly this property that provides the utility we need to select entire subsets of taxonomies in one step. Now to select all Primate sequences we can use this SQL:


SELECT descr.gi, descr.descr, seq.seq
FROM   seq
       INNER JOIN descr USING (seq_id)
       INNER JOIN taxon USING (taxon_id)

       INNER JOIN taxon AS include
         ON (taxon.left_id BETWEEN include.left_id AND
                                   include.right_id)

WHERE  include.name = 'Primate'

Here, we've joined the taxon table onto itself in a self-join governed by a BETWEEN clause that enforces the implicit relationship between taxonomies encoded by the included parental left_id and right_id values. The WHERE clause specifies which taxonomy we wish to specifically include in the result set. We can extend the same approach to select all Bacterial sequences but now exclude any that are also Enterobacterial in origin:


SELECT descr.gi, descr.descr, seq.seq
FROM   seq
       INNER JOIN descr USING (seq_id)
       INNER JOIN taxon USING (taxon_id)
       INNER JOIN taxon AS include
         ON (taxon.left_id BETWEEN include.left_id AND
                                   include.right_id)

       INNER JOIN taxon AS exclude
         ON (taxon.left_id NOT BETWEEN exclude.left_id AND
                                       exclude.right_id)

WHERE  include.name = 'Bacteria'
  AND  exclude.name = 'Enterobacteria'

Generating left_id and right_id

So where do these magic left_id and right_id values come from? To generate these values, we must resort to a procedural language (in this case Perl), in which we "walk" down the tree, visiting all of a taxon's children before moving on to a taxon's sibling. (For example, we perform a depth-first traversal of the tree). Whenever we first visit a taxon, we set its left_id value before moving on to its children. When we're finished with all of its children, we then set its right_id. We keep a continuously increasing counter to use for the values. This process is diagrammed in Figure 1, and shown in the Perl code in Example 1 below:


#!/usr/bin/perl -w

# depth-first tree traversal to generate left_id, right_id

use strict;
use DBI;

my $dbh = DBI->connect("dbi:mysql:seqdb_demo",
                       "seqdb_user",
                       "seqdb_pass")
  or die $DBI::errstr;

my $children = $dbh->prepare("SELECT taxon_id
                              FROM taxon
                              WHERE parent_id = ?");
my $setleft  = $dbh->prepare("UPDATE taxon
                              SET left_id = ?
                              WHERE taxon_id = ?");
my $setright = $dbh->prepare("UPDATE taxon
                              SET right_id = ?
                              WHERE taxon_id = ?");

my $ctr = 1;
my $rootid = 1;

walktree($rootid);
$dbh->disconnect;

sub walktree {
  my $id = shift;
  $setleft->execute($ctr++, $id);
  $children->execute($id);
  while(my ($id) = $children->fetchrow_array) {
    walktree($id);
  }
  $setright->execute($ctr++, $id);
}
Figure 1
Figure 1

On a modest database server running MySQL, this process takes about 15 minutes to execute with the NCBI Taxonomy data. Since we only update our local database weekly (by reloading the entire NCBI Taxonomy datasets), we rebuild the entire nested-set representation after each database load. Methods to maintain the values in an incremental update/delete scenario are well known but will not be discussed here. For further information and an excellent treatment of adjacency list, and nested-set representations for tree-like data, Joe Celko's book, SQL For Smarties is highly recommended.


O'Reilly Bioinformatics Conference.

Representing Graphs

To build upon our simple biosequence database, we'll bring in additional information from the Gene Ontology project. Just as we previously asked for all sequences from any Primate, we now want to ask for all sequences that are either enzymes, or are found in mitochondria, or are enzymes found in mitochondria.

The difference between the Taxonomy data and the GO data is that specific GO terms may have multiple general parent terms. For instance, the focal adhesion kinase term has parental kinase, transferase, and signal tranducer terms. Fortunately, no term's parents can also be a child of the term. The hierarchy is one way, or directed. Generally, the GO term ontology is implemented as a specific graph structure called a directed acyclic graph or DAG.

Storing Nodes and Edges in a Graph

Unlike the adjacency-list representation for trees, graphs cannot generally be stored in a single table. The GO terms (or other graph nodes) are stored in one table, and a second table is used to store a list of edges between nodes, with pairs of foreign keys referencing the node table:

Definition: go_term
  field datatype
PK go_acc CHAR(10)
  name TEXT
  class CHAR(1)
Definition: go_edge
  field datatype
PK go_edge_id INT
FK child CHAR(10) REFERENCES (go_term.go_acc)
FK parent CHAR(10) REFERENCES (go_term.go_acc)

The go_edge table is another form of the adjacency list. To move up and down the hierarchy requires a procedural language. Unfortunately, there is no equivalent to the nested-set representation for these more general graph structures. Instead, we'll build an entirely separate accessory table that contains one row for every unique pair of nodes that can be connected transitively; that is, if A is connected to B, and B is connected to C, then A is transitively connected to C. Our transitive closure tables will contain three rows corresponding to this relationship, one each for A & B, B & C, and A & C; see the example below and Figure 2.

GO transitive closure table go_tc
Definition: go_tc
  field datatype
FK child CHAR(10) REFERENCES (go_term.go_acc)
FK parent CHAR(10) REFERENCES (go_term.go_acc)
  length INT
PRIMARY KEY (child, parent)
Example data:
child parent length
A B 1
B C 1
C D 1
C E 1
A C 2
A D 3
A E 3
B D 2
B E 2
Figure 2
Figure 2

To fill the go_tc table, we will again require a procedural, iterative approach. We begin by first copying all the rows from go_edge into go_tc, setting length to one (for completeness, we also add all the terms in go, setting length to zero). We then iteratively add each new pair of terms that are two edges away from each other, three edges away, four edges away, and so on until no new rows have been inserted:


#!/usr/bin/perl -w

use strict;
use DBI;

my $dbh = DBI->connect("dbi:mysql:seqdb_demo",
                       "seqdb_user",
                       "seqdb_pass")
  or die $DBI::errstr;

$dbh->do("INSERT INTO go_tc
          SELECT go_acc AS child,
                 go_acc AS parent,
                 0 AS length
          FROM   go");

$dbh->do("INSERT INTO go_tc
          SELECT child, parent, 1 AS length
          FROM   go_edge");

my $select = $dbh->prepare(q{
  SELECT DISTINCT tc1.child,
                  tc2.parent,
                  tc1.length + 1
  FROM   go_tc AS tc1
         INNER JOIN go_tc AS tc2
           ON (tc1.parent = tc2.child)
  WHERE  tc2.length = 1
    AND  tc1.length = ?
});

my $insert = $dbh->prepare(q{
  REPLACE INTO go_tc (child, parent, length)
              VALUES (    ?,      ?,      ?)
});

my ($oldsize) =
  $dbh->selectrow_array("SELECT COUNT(*) FROM go_tc");
my $newsize;

my $len = 1;
while (!$newsize || $oldsize < $newsize) {
  $oldsize = $newsize || $oldsize;
  $newsize = $oldsize;
  $select->execute($len++);
  while(my @data = $select->fetchrow_array) {
    $insert->execute(@data);
    $newsize++;
  }
}

Building the go_tc table on our system takes just a few minutes. As with the nested-set tree data, we won't bother to figure out how to make incremental updates to go_tc when data changes in go_edge. We'll simply run the transitive closure script immediately following any changes to go or go_edge.

The last part of the GO data is the associations between some protein sequences and GO terms; unlike an organism in the Taxonomy tree, a protein may have multiple GO terms with which it's associated. Furthermore, the associations are often made to SwissProt or TREMBL accessions, not NCBI GI numbers, so instead of keying off of gi, we'll reference the acc field of the descr table:


SELECT descr.gi, descr.descr, seq.seq
FROM   seq
       INNER JOIN descr USING (seq_id)
       INNER JOIN go_assoc ON (descr.acc = go_assoc.seq_acc)
       INNER JOIN go_tc ON (go_assoc.go_acc = go_tc.child)
       INNER JOIN go_term ON (go_tc.parent = go_term.go_acc)
WHERE  go_term.name = "enzyme"
  AND  descr.gi = ( SELECT MAX(gi)
                    FROM descr AS d
                    WHERE d.seq_id = descr.seq_id )

Unfortunately, because of MySQL's continued lack of support for SQL statements with sub-selects, we cannot use the previous solution directly. Instead, we break the previous statement into two parts: first we create a temporary table full of GI numbers from the sequence we want, and second, we use the temporary table to pull out the rest of the sequence-specific information:


CREATE TEMPORARY TABLE go_gi
SELECT MAX(descr.gi) AS gi
FROM   descr
       INNER JOIN go_assoc ON (descr.acc = go_assoc.seq_acc)
       INNER JOIN go_tc ON (go_assoc.go_acc = go_tc.child)
       INNER JOIN go_term ON (go_tc.parent = go_term.go_acc)
WHERE  go_term.name = 'enzyme'
GROUP BY descr.seq_id

Then, to get at the corresponding sequence data:


SELECT descr.gi, descr.descr, seq.seq
FROM   seq
       INNER JOIN descr USING (seq_id)
       INNER JOIN go_gi USING (gi)

The approach we've shown here can be extended to include any number of boolean conditions, such as "enzyme AND mitochondrion;" the trick will be to find sequences that have an accession (or accessions) associated with both terms.


CREATE TEMPORARY TABLE go_gi
SELECT MAX(descr.gi) AS gi,
       COUNT(DISTINCT go_term.name) AS count
FROM   descr
       INNER JOIN go_assoc ON (descr.acc = go_assoc.seq_acc)
       INNER JOIN go_tc ON (go_assoc.go_acc = go_tc.child)
       INNER JOIN go_term ON (go_tc.parent = go_term.go_acc)
WHERE  ( go_term.name = "enzyme"
         OR  go_term.name = "mitochondrion"
       )
GROUP BY descr.seq_id
HAVING count = 2

We've done a few things differently here. For each unique sequence, we're collecting the count of how many different requested GO terms it satisfies. To enforce the boolean logic for "enzyme AND mitochondrion," we use an accessory value, count, and require (via the HAVING count = 2 clause) that the rows returned have the correct number of GO terms associated with them. To get "enzyme OR mitochondrion," simply remove the HAVING clause, and the associated count field.

To implement exclusions, such as "enzyme AND mitochondrion NOT mitochondrial matrix," the usual mechanism is to extend the WHERE clause with a NOT EXISTS sub-select:


WHERE ( go_term.name = "enzyme"
        OR go_term.name = "mitochondrion"
      ) AND NOT EXISTS (

SELECT seq_id
FROM   descr AS d
       INNER JOIN go_assoc ON (d.acc = go_assoc.seq_acc)
       INNER JOIN go_tc ON (go_assoc.go_acc = go_tc.child)
       INNER JOIN go_term ON (go_tc.parent = go_term.go_acc)
WHERE  go_term.name = "mitochondrial matrix"
  AND  descr.seq_id = d.seq_id

      )

But again, since MySQL doesn't support subselect statements, we have to resort to placing the results of the subselect statement into a temporary table, go_gi_exclude, and altering our normal SELECT statement to implement a LEFT JOIN:


CREATE TEMPORARY TABLE go_gi
SELECT MAX(descr.gi) AS gi,
       COUNT(DISTINCT go_term.name) AS count
FROM   descr
       INNER JOIN go_assoc ON (descr.acc = go_assoc.seq_acc)
       INNER JOIN go_tc ON (go_assoc.go_acc = go_tc.child)
       INNER JOIN go_term ON (go_tc.parent = go_term.go_acc)
       LEFT JOIN go_gi_exclude
         ON (descr.seq_id = go_gi_exclude.seq_id)
WHERE  ( go_term.name = "enzyme"
         OR  go_term.name = "mitochondrion"
       ) AND go_gi_exclude.seq_id IS NULL
GROUP BY descr.seq_id
HAVING count = 2

Between the inclusionary HAVING count = n mechanism and the exclusionary NOT EXISTS (or LEFT JOIN alternative) mechanism, all manners of boolean queries can be constructed.

It's All in the Way You Look at Things

The lesson here is that often the obvious way to model data (such as adjacency lists of tree or graph edges) is not the most efficient data model with which to query or compute. Alternatives such as the nested-set implicit representation of a tree-like hierarchy, or the transitive closure of a graph structure costs very little to create, and provides one-step mechanisms to extract informative subsets of the data. Commercial relational database products that advertise native support for hierarchical data are often simply transparently building and managing these same representations. Now you know how to it yourself.

In the next installment, we'll look at some alternatives to modeling temporal or historical biological data.

Aaron Mackey is involved with the BioPerl and GMOD software projects, and distributes his own simple biosequence relational database tool package, seqdb, for educational purposes.

Copyright © 2009 O'Reilly Media, Inc.