An Interview with Ewan Birney: Keynote Speaker at O'Reilly's Bioinformatics Technology Conference
Stewart: You're involved in the Bioperl project. What is the objective of Bioperl and how is that going?
Birney: We've had to get more serious. With our sister projects BioJava and BioPython colocated on the same server (currently being upgraded to Sun Microsystems hardware due to a very nice grant from Sun--many thanks!), we've had to register as a charity and all sorts. It gives us more coherence and lets us do stuff like the forthcoming "hackathon" organized with O'Reilly and Electric Genetics' help (thanks all!), where we can pull together the key open source hackers to produce a real "infrastructure" for bioinformatics. I call this project the "/etc" for bioinformatics, as we need some way for bioinformatics tools to bootstrap their way into an infrastructure that is both global and also customized to the location and setup of the machine.
Bioperl is the granddaddy of the projects and we are heading towards a 1.0 release, probably by the end of the year. We don't want to claim that we've hit 1.0 until we are feature-complete for sequence analysis in bioinformatics. Just in case anyone thinks we are young, actually we are about seven years old and our code is used all over the place. We're just conservative in our release numbering.
Stewart: Another project you're involved with is the Ensembl Human Genome Server. Can you describe what Ensembl is?
Birney: Ensembl is many things. For molecular biologists we hope it is the first place they go to on the Web for questions about the human and mouse genome. We hope to land them in a nice Web site they can spend time in and that they quickly discover things of interest to them, so they can get on with designing and running real experiments and not waste time on some infuriating trail through the data.
Along with that we are the "infrastructure" supporting a number of really exciting projects both locally and worldwide. For example, Mike Stratton's group in the Sanger Centre is looking at cancer genes--but not just one or two genes in one or two cancers but *all* genes in *all* common cancers. Wow. We deliver the phrase "all genes" to him.
To make that Web site and the infrastructure behind it we have a serious software challenge. Everything is painful when one does stuff with the human genome, as it is huge and there are 50 years of associated data to somehow gather and place. To do this we went to the Wellcome Trust, the U.K. charity I mentioned earlier, to bid for a large bioinformatics grant (8 million pounds sterling, but this is minor compared to the total cost of sequencing). Thankfully we got it and now, one year later, we are a team of 30 bioinformaticians with one of the largest computing resources dedicated to biology. You might wonder, "Why so many people?" but as I mentioned, nearly everything is painful when you are dealing with the human genome: the sheer size of the data and the breadth of data resources. We are all working flat out to do this, and there are still mountains to climb.
Ensembl was started by three of us--Michele Clamp, Tim Hubbard, and myself--and we were able to say "we are making this open source" from the start. Our motto is to be "as open as possible, at every level". Like any good open source project you can get our source code via anonymous CVS and participate in our discussions on firstname.lastname@example.org. All our data is also accessible--either as raw MySQL dumps (open source database, of course!) or as an Internet-accessible MySQL server (kaka.sanger.ac.uk, username anonymous, database name "current"). We play well with other open data groups like the UCSC genome folks in the U.S. (Jim Kent, the programmer who put together the draft genome, is over there and a regular email correspondent with us) and NCBI, our American counterpart.
Being open has fostered a real community around us. We've had contributions worldwide, and pieces of Ensembl have been reused in all sorts of contexts. One of the most exciting things is that we are starting to really spread worldwide. A very gifted programmer in my group, Elia Stupka, is heading up the Fugu annotation project in Singapore. This is less a case of Elia leaving Ensembl and more one of Ensembl stretching to Singapore. Elia and his entire group should be working off the same CVS code base--expect some interesting discussions on our mailing list as that happens--and we are looking forward to 24x7 response on ensembl-dev as Singapore is 7 hours ahead of us in the U.K.
On a personal level, Ensembl reuses the Bioperl infrastructure (tick for code reuse) and it makes heavy use of the algorithm that got me into all of this, GeneWise (although it would be impossible to use without the magic of Michele and Val). So I get a lot of pleasure from the project as "just another Ensembl hacker."
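The public database access Birney describes above (host kaka.sanger.ac.uk, username anonymous, database "current") can be sketched as a command line for the standard `mysql` client. This is only an illustrative sketch: the connection details are the ones quoted in the interview and may well no longer be reachable, the helper name `mysql_command` is hypothetical, and `SHOW TABLES` is just an example query, not something the interview mentions.

```python
import shlex

# Connection details exactly as quoted in the interview; they date from the
# time of the interview and the server may no longer be reachable.
ENSEMBL_HOST = "kaka.sanger.ac.uk"
ENSEMBL_USER = "anonymous"
ENSEMBL_DATABASE = "current"


def mysql_command(host, user, database, query=None):
    """Build the argument list for the standard `mysql` command-line client.

    Hypothetical helper for illustration; it only assembles the command and
    does not open a network connection.
    """
    args = ["mysql", f"--host={host}", f"--user={user}"]
    if query is not None:
        # --execute runs a single statement and then exits.
        args.append(f"--execute={query}")
    # The database name is passed as the final positional argument.
    args.append(database)
    return args


if __name__ == "__main__":
    cmd = mysql_command(ENSEMBL_HOST, ENSEMBL_USER, ENSEMBL_DATABASE, "SHOW TABLES")
    # Print a shell-safe version of the command rather than running it.
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch free of any dependency on a live server or a MySQL driver; the resulting line could be pasted into a shell where the `mysql` client is installed.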
Stewart: What do you think are the unsolved problems in bioinformatics that will yield the greatest scientific advances?
Birney: Oh, take your pick. Protein Folding? Expression? Regulation? Systems Biology? Being able to reliably transmit 50GB across the Internet? Who knows....
The main thing that I think will happen is that bioinformatics will merge with molecular biology just as molecular biology merged with biochemistry. It will just be part of "how you do" biological science. When molecular biology courses have a mandatory requirement for basic information science (data down a noisy channel), that's when we'll be making real advances.
Stewart: What will you be talking about in your keynote at the O'Reilly Bioinformatics Conference?
Birney: I will be talking about open source software in bioinformatics. I think I'll use three examples from my own work: GeneWise being a one-man project about an algorithm, Bioperl being a completely open hacker project about a framework, and Ensembl being a funded project for large infrastructure. I'll touch on what makes some of the technology interesting and what makes some of the biology interesting.
O'Reilly & Associates has plans to publish more books on bioinformatics in 2002. Currently, O'Reilly has two bioinformatics publications:
Developing Bioinformatics Computer Skills (April 2001)
Beginning Perl for Bioinformatics (due out in October 2001)