Relational Modeling of Biological Data: Trees and Graphs
Pages: 1, 2
Generating left_id and right_id
So where do
these magic left_id and right_id values come from? To
generate these values, we must resort to a procedural language (in this case
Perl), in which we "walk" down the tree, visiting all of a taxon's children
before moving on to a taxon's sibling. (For example, we perform a depth-first
traversal of the tree). Whenever we first visit a taxon, we set its
left_id value before moving on to its children. When we're finished
with all of its children, we then set its right_id. We keep a
continuously increasing counter to use for the values. This process is
diagrammed in Figure 1, and shown in the Perl code in Example 1 below:
#!/usr/bin/perl -w
# depth-first tree traversal to generate left_id, right_id
use strict;
use DBI;
my $dbh = DBI->connect("dbi:mysql:seqdb_demo",
"seqdb_user",
"seqdb_pass")
or die $DBI::errstr;
my $children = $dbh->prepare("SELECT taxon_id
FROM taxon
WHERE parent_id = ?");
my $setleft = $dbh->prepare("UPDATE taxon
SET left_id = ?
WHERE taxon_id = ?");
my $setright = $dbh->prepare("UPDATE taxon
SET right_id = ?
WHERE taxon_id = ?");
my $ctr = 1;
my $rootid = 1;
walktree($rootid);
$dbh->disconnect;
sub walktree {
my $id = shift;
$setleft->execute($ctr++, $id);
$children->execute($id);
while(my ($id) = $children->fetchrow_array) {
walktree($id);
}
$setright->execute($ctr++, $id);
}
|
| Figure 1 |
On a modest database server running MySQL, this process takes about 15 minutes to execute with the NCBI Taxonomy data. Since we only update our local database weekly (by reloading the entire NCBI Taxonomy datasets), we rebuild the entire nested-set representation after each database load. Methods to maintain the values in an incremental update/delete scenario are well known but will not be discussed here. For further information and an excellent treatment of adjacency list, and nested-set representations for tree-like data, Joe Celko's book, SQL For Smarties is highly recommended.
Representing Graphs
To build upon our simple biosequence database, we'll bring in additional information from the Gene Ontology project. Just as we previously asked for all sequences from any Primate, we now want to ask for all sequences that are either enzymes, or are found in mitochondria, or are enzymes found in mitochondria.
The difference between the Taxonomy data and the GO data is that specific GO terms may have multiple general parent terms. For instance, the focal adhesion kinase term has parental kinase, transferase, and signal tranducer terms. Fortunately, no term's parents can also be a child of the term. The hierarchy is one way, or directed. Generally, the GO term ontology is implemented as a specific graph structure called a directed acyclic graph or DAG.
Storing Nodes and Edges in a Graph
Unlike the adjacency-list representation for trees, graphs cannot generally be stored in a single table. The GO terms (or other graph nodes) are stored in one table, and a second table is used to store a list of edges between nodes, with pairs of foreign keys referencing the node table:
|
|
The go_edge table is another form of the adjacency list. To move
up and down the hierarchy requires a procedural language. Unfortunately, there
is no equivalent to the nested-set representation for these more general graph
structures. Instead, we'll build an entirely separate accessory table that
contains one row for every unique pair of nodes that can be connected
transitively; that is, if A is connected to B, and B is connected to C,
then A is transitively connected to C. Our transitive closure tables will
contain three rows corresponding to this relationship, one each for A & B, B
& C, and A & C; see the example below and Figure 2.
|
| |||||||||||||||||||||||||||||||||||||||||||||
|
| Figure 2 |
To fill the go_tc table, we will again require a procedural,
iterative approach. We begin by first copying all the rows from
go_edge into go_tc, setting length to one (for
completeness, we also add all the terms in go, setting length to
zero). We then iteratively add each new pair of terms that are two edges away
from each other, three edges away, four edges away, and so on until no new rows
have been inserted:
#!/usr/bin/perl -w
use strict;
use DBI;
my $dbh = DBI->connect("dbi:mysql:seqdb_demo",
"seqdb_user",
"seqdb_pass")
or die $DBI::errstr;
$dbh->do("INSERT INTO go_tc
SELECT go_acc AS child,
go_acc AS parent,
0 AS length
FROM go");
$dbh->do("INSERT INTO go_tc
SELECT child, parent, 1 AS length
FROM go_edge");
my $select = $dbh->prepare(q{
SELECT DISTINCT tc1.child,
tc2.parent,
tc1.length + 1
FROM go_tc AS tc1
INNER JOIN go_tc AS tc2
ON (tc1.parent = tc2.child)
WHERE tc2.length = 1
AND tc1.length = ?
});
my $insert = $dbh->prepare(q{
REPLACE INTO go_tc (child, parent, length)
VALUES ( ?, ?, ?)
});
my ($oldsize) =
$dbh->selectrow_array("SELECT COUNT(*) FROM go_tc");
my $newsize;
my $len = 1;
while (!$newsize || $oldsize < $newsize) {
$oldsize = $newsize || $oldsize;
$newsize = $oldsize;
$select->execute($len++);
while(my @data = $select->fetchrow_array) {
$insert->execute(@data);
$newsize++;
}
}
Building the go_tc table on our system takes just a few minutes.
As with the nested-set tree data, we won't bother to figure out how to make
incremental updates to go_tc when data changes in
go_edge. We'll simply run the transitive closure script immediately
following any changes to go or go_edge.
The last part of the GO data is the associations between some protein
sequences and GO terms; unlike an organism in the Taxonomy tree, a protein may
have multiple GO terms with which it's associated. Furthermore, the associations
are often made to SwissProt or TREMBL accessions, not NCBI GI numbers, so
instead of keying off of gi, we'll reference the acc
field of the descr table:
SELECT descr.gi, descr.descr, seq.seq
FROM seq
INNER JOIN descr USING (seq_id)
INNER JOIN go_assoc ON (descr.acc = go_assoc.seq_acc)
INNER JOIN go_tc ON (go_assoc.go_acc = go_tc.child)
INNER JOIN go_term ON (go_tc.parent = go_term.go_acc)
WHERE go_term.name = "enzyme"
AND descr.gi = ( SELECT MAX(gi)
FROM descr AS d
WHERE d.seq_id = descr.seq_id )
Unfortunately, because of MySQL's continued lack of support for SQL statements with sub-selects, we cannot use the previous solution directly. Instead, we break the previous statement into two parts: first we create a temporary table full of GI numbers from the sequence we want, and second, we use the temporary table to pull out the rest of the sequence-specific information:
CREATE TEMPORARY TABLE go_gi
SELECT MAX(descr.gi) AS gi
FROM descr
INNER JOIN go_assoc ON (descr.acc = go_assoc.seq_acc)
INNER JOIN go_tc ON (go_assoc.go_acc = go_tc.child)
INNER JOIN go_term ON (go_tc.parent = go_term.go_acc)
WHERE go_term.name = 'enzyme'
GROUP BY descr.seq_id
Then, to get at the corresponding sequence data:
SELECT descr.gi, descr.descr, seq.seq
FROM seq
INNER JOIN descr USING (seq_id)
INNER JOIN go_gi USING (gi)
The approach we've shown here can be extended to include any number of boolean conditions, such as "enzyme AND mitochondrion;" the trick will be to find sequences that have an accession (or accessions) associated with both terms.
CREATE TEMPORARY TABLE go_gi
SELECT MAX(descr.gi) AS gi,
COUNT(DISTINCT go_term.name) AS count
FROM descr
INNER JOIN go_assoc ON (descr.acc = go_assoc.seq_acc)
INNER JOIN go_tc ON (go_assoc.go_acc = go_tc.child)
INNER JOIN go_term ON (go_tc.parent = go_term.go_acc)
WHERE ( go_term.name = "enzyme"
OR go_term.name = "mitochondrion"
)
GROUP BY descr.seq_id
HAVING count = 2
We've done a few things differently here. For each unique sequence,
we're collecting the count of how many different requested GO terms it
satisfies. To enforce the boolean logic for "enzyme AND mitochondrion," we use
an accessory value, count, and require (via the HAVING count
= 2 clause) that the rows returned have the correct number of GO terms
associated with them. To get "enzyme OR mitochondrion," simply remove the
HAVING clause, and the associated count field.
To implement exclusions, such as "enzyme AND mitochondrion NOT mitochondrial
matrix," the usual mechanism is to extend the WHERE clause with a NOT
EXISTS sub-select:
WHERE ( go_term.name = "enzyme"
OR go_term.name = "mitochondrion"
) AND NOT EXISTS (
SELECT seq_id
FROM descr AS d
INNER JOIN go_assoc ON (d.acc = go_assoc.seq_acc)
INNER JOIN go_tc ON (go_assoc.go_acc = go_tc.child)
INNER JOIN go_term ON (go_tc.parent = go_term.go_acc)
WHERE go_term.name = "mitochondrial matrix"
AND descr.seq_id = d.seq_id
)
But again, since MySQL doesn't support subselect statements, we
have to resort to placing the results of the subselect statement into a
temporary table, go_gi_exclude, and altering our normal
SELECT statement to implement a LEFT JOIN:
CREATE TEMPORARY TABLE go_gi
SELECT MAX(descr.gi) AS gi,
COUNT(DISTINCT go_term.name) AS count
FROM descr
INNER JOIN go_assoc ON (descr.acc = go_assoc.seq_acc)
INNER JOIN go_tc ON (go_assoc.go_acc = go_tc.child)
INNER JOIN go_term ON (go_tc.parent = go_term.go_acc)
LEFT JOIN go_gi_exclude
ON (descr.seq_id = go_gi_exclude.seq_id)
WHERE ( go_term.name = "enzyme"
OR go_term.name = "mitochondrion"
) AND go_gi_exclude.seq_id IS NULL
GROUP BY descr.seq_id
HAVING count = 2
Between the inclusionary HAVING count = n mechanism
and the exclusionary NOT EXISTS (or LEFT JOIN
alternative) mechanism, all manners of boolean queries can be constructed.
It's All in the Way You Look at Things
The lesson here is that often the obvious way to model data (such as adjacency lists of tree or graph edges) is not the most efficient data model with which to query or compute. Alternatives such as the nested-set implicit representation of a tree-like hierarchy, or the transitive closure of a graph structure costs very little to create, and provides one-step mechanisms to extract informative subsets of the data. Commercial relational database products that advertise native support for hierarchical data are often simply transparently building and managing these same representations. Now you know how to it yourself.
In the next installment, we'll look at some alternatives to modeling temporal or historical biological data.
Aaron Mackey is involved with the BioPerl and GMOD software projects, and distributes his own simple biosequence relational database tool package, seqdb, for educational purposes.
Showing messages 1 through 6 of 6.
-
PostBIS (Biospatial Information Systems)
2004-05-25 16:44:02 postgresinc [View]
-
Comment
2003-08-13 05:02:07 anonymous2 [View]
O'Reilly,
Excellent article. Perhaps not so interesting to those with CS backgrounds but great for those of us who come to bionformmatics from biology. More Aaron Mackey please!
BIO
-
fetch() without execute() error in Example 1
2003-06-14 16:24:32 gee [View]
I've been looking most of the day for a script like the one in Example 1 (converting adjacency list to nested set) to fix a mysql table that I fubared this morning.
Fortunately, I had the foresight to track parent id's as well as record left and right id's thinking I might need to repair a mistake sometime. Unfortunately, I can't get the script in Example 1 to run correctly.
The error I get is:
DBD::mysql::st fetchrow_array failed: fetch() without execute() at ./fixnest line 33.
I have determined error occurs once the script begins to process a leaf node -- one that does not have any children. After that, I confess I am stumped, this being my first exposure to the DBI module.
Has anyone else encountered this problem? If so, is there a workaround to suggest? At the very least, is there a confirmed case of the script above working (perhaps there is a line left out in the published version)?
Thanks in advance for any help -- feel free to write direct to geestarr at the domain geedev dot com. I'll summarize any solutions here.
-
Tree graphic
2003-03-31 09:24:35 taeric [View]
Maybe I am simply misunderstanding something, but I figured I would go ahead and ask.
In the graphic for the tree, it seems to be saying that node 3 is a child of node 4. Further, it seems to say that every node other then the root is a child of node 2. Clearly, I am misunderstanding something here.
Could someone please explain what it is I'm missing?
Thanks,
-josh
-
Standard issue
2002-12-07 14:11:13 rlucente [View]
The question of storing hierarchical data in a relationship model has been around for a long time. Joe Celko has a book out which addresses this topic extensively. David Rozenshtein supposedly published an entire book on this subject. I can't seem to find it anywhere. Leads on this book would be greatly appreciated. O'Reilly actually publishes a book Mastering Oracle SQL which also contains an extensive discussion on this topic.
-
Data Warehousing of Forest Genetics Data
2002-12-03 10:17:41 anonymous2 [View]
If anyone can direct me to info sources for data warehousing and forest genetics (or in the ballpark) I'd be very grateful.
Thanks,
Bob Berger








http://postgresinc.com ; http://bioinformatics-research.com