Ewan Birney will be a keynote speaker at O'Reilly's upcoming Bioinformatics Technology conference. Ewan is a biochemist by training and a programmer by practice, and he is currently a Team Leader for Genomic Annotation at the European Bioinformatics Institute (EBI).
Ewan received his B.A. in biochemistry from Balliol College, Oxford, and his Ph.D. from the Sanger Centre in association with Cambridge University. He is known as the "chief cheerleader" for open source bioinformatics, and he has provided code and leadership to the Bioperl and other Bio* projects. Ewan is also one of the leaders of Ensembl, a joint project between the EBI and the Sanger Centre, that delivers a completely free view of the human genome to the world.
We spoke to Ewan about some of the many projects he is involved in and the future of open source bioinformatics.
Stewart: When did you first get interested in bioinformatics?
Birney: I was working at Cold Spring Harbor Laboratory, one of the pioneering labs in molecular biology work since the 1970s, and I had a problem that I couldn't get the computer people interested in. A friend of mine who had gotten into bioinformatics bought me Kernighan and Ritchie's The C Programming Language, and I never looked back.
Some of my first programs were written on a rather cranky Unix system and produced PostScript, which as I later learned, is a rather cranky language. The only thing I could do to test a program was print it out; either it printed (and worked) or nothing happened. It was a rather extreme debugging experience.
My real breakthrough into bioinformatics happened the next summer, when I was doing summer jobs at CSHL while studying at Oxford for my undergraduate degree. The EST database had just come out--this was the first really large-scale, automated DNA sequencing to happen. I wanted to find my proteins of interest, RNA binding proteins, inside of it. Sadly no one had written the tool I wanted, an on-the-fly translation profile search. So I settled down with a "Methods in Enzymology" chapter that was written by Bill Pearson and taught myself dynamic programming, which is the bedrock of most DNA sequence analysis.
This was tremendously successful. Yes, I found RNA binding proteins in the EST database, but more importantly I realized that I could cope with frameshift errors in the DNA inside the program. This is standard stuff these days, but back in 1993 it was pretty radical, and unbelievably exciting for a 20-year-old undergraduate who didn't know better.
That program, originally named "PairWise," eventually became "GeneWise," which is the algorithm I am most associated with and which really started my career in this field.
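For readers who have not met dynamic programming in this setting, here is a minimal sketch of local sequence alignment scoring in the Smith-Waterman style. It is not PairWise or GeneWise, and it does not handle translation or frameshifts; the function name and the match, mismatch, and gap scores are illustrative assumptions chosen only to show how the table is filled.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Scoring values are arbitrary illustrative defaults, not those of any
    published tool.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score of a local alignment ending at
    # a[i-1] and b[j-1]; a cell is reset to 0 when no alignment pays off.
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = score[i - 1][j] + gap   # a[i-1] aligned to a gap
            gap_in_a = score[i][j - 1] + gap   # b[j-1] aligned to a gap
            score[i][j] = max(0, diagonal, gap_in_b, gap_in_a)
            best = max(best, score[i][j])
    return best


if __name__ == "__main__":
    # The shared "ACG" stretch gives three matches.
    print(local_alignment_score("TTACGTTT", "GGACGGG"))
```

Each cell depends only on its three neighbours, so the whole table is filled in time proportional to the product of the two sequence lengths. Tolerating frameshifts, as described above, would mean adding further transitions to the same kind of recurrence.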
Stewart: What do you think are the most interesting tools and techniques being developed in bioinformatics today?
Birney: It is so tough to choose! Bioinformatics stretches from what looks on the outside as utterly mundane "database design" issues (they are never mundane) to cutting-edge algorithms. What I love about working at EBI is that nearly all of this is represented. We do stuff from "How do you sensibly store 2 terabytes of data for random access" to "Now we've got 200 expression analysis results; what on earth do we do with them?"
I am deeply impressed by Alvis Brazma's group next door to mine, which works on expression--they manage the ArrayExpress database and, more importantly, figure out how to use it. In terms of capturing information, I think work like Midori Harris's on GO (Gene Ontology project) is incredible--I feel like I'm watching some old cartoon where strange, amorphous biology concepts are being squeezed down into single, pithy descriptions, each "bar-coded" with a GO accession number.
Closer to home, Michele Clamp and Val Curwen, working over the road from me at the Sanger Centre, take my algorithms (which usually take, say, an odd decade of computer time to run) and figure out how to actually run them in sensible time across the human genome. That impresses me no end.
Stewart: What can be done to improve the current tools?
Birney: All sorts. Like any dispersed software development group we don't do enough code reuse. There is too much reinvention of basic stuff, despite the best efforts of projects like Bioperl and BioJava.
More fundamentally, we are starting in the big bioinformatics set-ups to really push the limits of hardware and algorithms. A consultant came around and looked at our system and was shocked at the level we push our machines--when everything is up and running we cream through a year's worth of CPU cycles each day, and over that time track about 2 million unique processes. The consultant was gobsmacked. Each time we take it to the next level we discover something--the hardware, the network, the OS, the algorithm--scaling not quite right.
Stewart: How much do you think the Internet has had to do with the genome progress?
Birney: An immense amount. Every day EBI (Europe), NCBI (U.S.) and DDBJ (Japan) synchronize the world DNA database via the Internet. The democratization of the Internet is ideal for getting this data out to the 100,000 individual labs that actually do the wet work. We're not like astronomy where everyone knows each other; we have to have the Web to get our message out and to deliver the actual data.
It is a two-way street--a lot of the Web development was actually fostered by bioinformatics. Lincoln Stein wrote CGI.pm, probably the most common way of interfacing to the Web, because of his needs in bioinformatics. The needs of molecular biologists were probably one of the reasons why many campuses have upgraded their Internet connectivity over the last decade.
Interestingly, the Internet is now not scaling for us. The NCBI and Hinxton Campus (where EBI and Sanger Centre are located) swap trace files (raw DNA data dumps) across the Atlantic: sadly the Internet just isn't up to the job, and we have to use DAT tape. Tape by DHL has a higher bandwidth than fiber when you are shifting terabytes.
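The tape-versus-fiber comparison comes down to simple arithmetic, sketched below. Every number here is a hypothetical assumption for illustration (tape count, tape capacity, courier time, and sustained network throughput); none of them comes from the interview.

```python
def effective_bandwidth_mbps(total_bytes, transit_seconds):
    """Average throughput, in megabits per second, of moving total_bytes in transit_seconds."""
    return total_bytes * 8 / transit_seconds / 1e6


def transfer_days(total_bytes, link_mbps):
    """Days needed to push total_bytes through a link sustaining link_mbps."""
    return total_bytes * 8 / (link_mbps * 1e6) / 86400


# Hypothetical shipment: 100 tapes of 20 GB each, delivered in 48 hours.
shipment_bytes = 100 * 20 * 10**9
courier_seconds = 48 * 3600

# Hypothetical sustained throughput actually achieved over the network.
network_mbps = 10.0

if __name__ == "__main__":
    courier_mbps = effective_bandwidth_mbps(shipment_bytes, courier_seconds)
    print(f"Courier: about {courier_mbps:.1f} Mbit/s averaged over the trip")
    print(f"Network: about {transfer_days(shipment_bytes, network_mbps):.1f} days for the same data")
```

Under these assumed figures the box of tapes averages several times the throughput of the network link, at the cost of two days of latency. The balance shifts with any of the inputs, so the sketch is a way to reason about the trade-off rather than a measurement.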
Stewart: What kind of coordination is there between researchers at EBI and NCBI?
Birney: There is a real deep coordination between NCBI and EBI. For the DNA database, which is the main archival dataset shared worldwide, there is nightly synchronization of the data (this also includes DDBJ in Japan). In my area we work very closely with NCBI on providing resources to track the "finishing" of the human genome (finishing here is used in a technical way as well as its common meaning of completing).
In other areas we duke it out as to who is best, but in a pretty good-natured way (if anyone is interested, of course, EBI is the best...).
Stewart: How do you see the relationship between open source, open science, and entrepreneurial enterprise?
Birney: In some ways, it is as it always has been. Science should be open--it provides the infrastructure for research and discoveries--but as things head towards something that "does" something, you often need to protect it to actually make extreme investment pay out. So, I think you keep DNA sequence completely open (the infrastructure) but you have to patent diagnostic-level discoveries (say, the precise role of a protein in a particular disease) to allow real, useful products to be developed.
I see software in the same light. The infrastructure, and in particular the libraries, should be open--we can share and reuse without any barriers. However, when people start to complain about the color scheme and expect people to fix it, then you start charging. In software I think people should really focus on paying for the time and effort to have a developer on the project to get something done, not really for the software itself. In that sense I am a real open source believer, but more Eric Raymond than Richard Stallman. I GPL my algorithms, but have licensed the libraries BSD-style. Makes sense to me.
Stewart: Let's talk a little more about patents. Should gene patents be allowed?
Birney: Gene patents when you've just randomly sequenced bits of DNA--no. Gene patents when you have a very good idea of what it does in this disease, then yes, for this gene, for that disease.
It is a pretty obvious distinction in my mind. I can't believe people argue about this the way they do, or support the two extremes--always patent or never patent. It is something I try to stay out of.
As an aside, I think the world should be very thankful to the stance on DNA sequence taken by the Wellcome Trust, the U.K. charity that funded about one-third of the human genome project. They were able to make reasoned decisions about the openness of data as scientists over the latter part of the last decade, and it is their stance, along with John Sulston, who headed the Sanger Centre at the time, that kept the genome open. I think we would be living in a different world if it wasn't for the Wellcome Trust.
Stewart: You're involved in the Bioperl project. What is the objective of Bioperl and how is that going?
Birney: We've had to get more serious. With our sister projects BioJava and BioPython colocated on the same server (currently being upgraded to Sun Microsystems hardware due to a very nice grant from Sun--many thanks!), we've had to register as a charity and all sorts. It gives us more coherence and lets us do stuff like the forthcoming "hackathon" organized with O'Reilly and Electric Genetics' help (thanks all!), where we can pull together the key open source hackers to produce a real "infrastructure" for bioinformatics. I call this project the "/etc" for bioinformatics, as we need some way for bioinformatics tools to bootstrap their way into an infrastructure that is both global and also customized to the location and setup of the machine.
Bioperl is the granddaddy of the projects and we are heading towards a 1.0 release, probably by the end of the year. We don't want to claim that we've hit 1.0 until we are feature-complete for sequence analysis in bioinformatics. Just in case anyone thinks we are young, actually we are about seven years old and our code is used all over the place. We're just conservative in our release numbering.
Stewart: Another project you're involved with is the Ensembl Human Genome Server. Can you describe what Ensembl is?
Birney: Ensembl is many things. For molecular biologists we hope it is the first place they go to on the Web for questions about the human and mouse genome. We hope to land them in a nice Web site they can spend time in and that they quickly discover things of interest to them, so they can get on with designing and running real experiments and not waste time on some infuriating trail through the data.
Along with that we are the "infrastructure" supporting a number of really exciting projects both locally and worldwide. For example, Mike Stratton's group in the Sanger Centre is looking at cancer genes--but not just one or two genes in one or two cancers but *all* genes in *all* common cancers. Wow. We deliver the phrase "all genes" to him.
To make that Web site and the infrastructure behind it we have a serious software challenge. Everything is painful when one does stuff with the human genome, as it is huge and there have been 50 years of associated data to somehow gather and place. To do this we went to the Wellcome Trust, the U.K. charity I mentioned earlier, to bid for a large grant in terms of bioinformatics (8 million pounds sterling, but this is minor compared to the total cost of sequencing). Thankfully we got it and now, one year later, we are a team of 30 bioinformaticians with one of the largest computing resources dedicated to biology. You might wonder, "Why so many people?" but as I mentioned, nearly everything is painful when you are dealing with the human genome--its data size, the breadth of data resources. We are all working flat out to do this, and there are still mountains to climb.
Ensembl was started by three of us--Michele Clamp, Tim Hubbard, and myself--and we were able to say "we are making this open source" from the start. Our motto is to be "as open as possible, at every level". Like any good open source project you can get our source code via anonymous CVS and participate in our discussions on firstname.lastname@example.org. All our data is also accessible--either as raw MySQL dumps (open source database, of course!) or as an Internet-accessible MySQL server (kaka.sanger.ac.uk, username anonymous, database name "current"). We play well with other open data groups like the UCSC genome folks in the U.S. (Jim Kent, the programmer who put together the draft genome, is over there and a regular email correspondent with us) and NCBI, our American counterpart.
Being open has fostered a real community around us. We've had contributions worldwide, and pieces of Ensembl have been reused in all sorts of contexts. One of the most exciting things is that we are starting to really spread worldwide. A very gifted programmer in my group, Elia Stupka, is heading up the Fugu annotation project in Singapore. This is less of Elia leaving Ensembl and more of Ensembl stretching to Singapore. Elia and his entire group should be working off the same CVS code base--expect some interesting discussions on our mailing list as that happens--and we are looking forward to 24x7 response on ensembl-dev as Singapore is 7 hours ahead of us in the U.K.
On a personal level, Ensembl reuses the Bioperl infrastructure (tick for code reuse) and it makes heavy use of the algorithm that got me into all of this, GeneWise (although it would be impossible to use without the magic of Michele and Val). So I get a lot of pleasure from the project as "just another Ensembl hacker."
Stewart: What do you think are the unsolved problems in bioinformatics that will yield the greatest scientific advances?
Birney: Oh, take your pick. Protein Folding? Expression? Regulation? Systems Biology? Being able to reliably transmit 50GB across the Internet? Who knows....
The main thing that I think will happen is that bioinformatics will merge with molecular biology just as molecular biology merged with biochemistry. It will just be part of "how you do" biological science. When molecular biology courses have a mandatory requirement for basic information science (data down a noisy channel) that's when we'll be making real advances.
Stewart: What will you be talking about in your keynote at the O'Reilly Bioinformatics Conference?
Birney: I will be talking about open source software in bioinformatics. I think I'll use three examples from my own work: GeneWise being a one-man project about an algorithm, Bioperl being a completely open hacker project about a framework, and Ensembl being a funded project for large infrastructure. I'll touch on what makes some of the technology interesting and what makes some of the biology interesting.
O'Reilly & Associates has plans to publish more books on bioinformatics in 2002. Currently, O'Reilly has two bioinformatics publications:
Developing Bioinformatics Computer Skills (April 2001)
Beginning Perl for Bioinformatics (due out in October 2001)
Copyright © 2009 O'Reilly Media, Inc.