An Interview with Ewan Birney: Keynote Speaker at O'Reilly's Bioinformatics Technology Conference
Ewan Birney will be a keynote speaker at O'Reilly's upcoming Bioinformatics Technology conference. Ewan is a biochemist by training and a programmer by practice, and he is currently a Team Leader for Genomic Annotation at the European Bioinformatics Institute (EBI).
Ewan received his B.A. in biochemistry from Balliol College, Oxford, and his Ph.D. from the Sanger Centre in association with Cambridge University. He is known as the "chief cheerleader" for open source bioinformatics, and he has provided code and leadership to the Bioperl and other Bio* projects. Ewan is also one of the leaders of Ensembl, a joint project between the EBI and the Sanger Centre that delivers a completely free view of the human genome to the world.
We spoke to Ewan about some of the many projects he is involved in and the future of open source bioinformatics.
Stewart: When did you first get interested in bioinformatics?
Birney: I was working at Cold Spring Harbor Laboratory, one of the pioneering labs in molecular biology work since the 1970s, and I had a problem that I couldn't get the computer people interested in. A friend of mine who had gotten into bioinformatics bought me Kernighan and Ritchie's The C Programming Language, and I never looked back.
Some of my first programs were written on a rather cranky Unix system and produced PostScript, which, as I later learned, is a rather cranky language. The only thing I could do to test a program was print it out; either it printed (and worked) or nothing happened. It was a rather extreme debugging experience.
My real breakthrough into bioinformatics happened the next summer, when I was doing summer jobs at CSHL while studying at Oxford for my undergraduate degree. The EST database had just come out--this was the first really large-scale, automated DNA sequencing to happen. I wanted to find my proteins of interest, RNA binding proteins, inside of it. Sadly no one had written the tool I wanted, an on-the-fly translation profile search. So I settled down with a "Methods in Enzymology" chapter that was written by Bill Pearson and taught myself dynamic programming, which is the bedrock of most DNA sequence analysis.
This was tremendously successful. Yes, I found RNA binding proteins in the EST database, but more importantly I realized that I could cope with frameshift errors in the DNA inside the program. This is standard stuff these days, but back in 1993 it was pretty radical, and unbelievably exciting for a 20-year-old undergraduate who didn't know better.
That program, originally named "PairWise," eventually became "GeneWise," which is the algorithm I am most associated with and which really started my career in this field.
Stewart: What do you think are the most interesting tools and techniques being developed in bioinformatics today?
Birney: It is so tough to choose! Bioinformatics stretches from what looks on the outside as utterly mundane "database design" issues (they are never mundane) to cutting-edge algorithms. What I love about working at EBI is that nearly all of this is represented. We do stuff from "How do you sensibly store 2 terabytes of data for random access" to "Now we've got 200 expression analysis results; what on earth do we do with them?"
I am deeply impressed by Alvis Brazma's group next door to mine, which works on expression--they manage the ArrayExpress database and, more importantly, figure out how to use it. In terms of capturing information, I think work like Midori Harris's on GO (Gene Ontology project) is incredible--I feel like I'm watching some old cartoon where strange, amorphous biology concepts are being squeezed down into single, pithy descriptions, each "bar-coded" with a GO accession number.
Closer to home, Michele Clamp and Val Curwen, working over the road from me at the Sanger Centre, take my algorithms (which usually take, say, an odd decade of computer time to run) and figure out how to actually run them in sensible time across the human genome. That impresses me no end.
Stewart: What can be done to improve the current tools?
Birney: All sorts. Like any dispersed software development group we don't do enough code reuse. There is too much reinvention of basic stuff, despite the best efforts of projects like Bioperl and BioJava.
More fundamentally, we are starting in the big bioinformatics set-ups to really push the limits of hardware and algorithms. A consultant came around and looked at our system and was shocked at the level we push our machines--when everything is up and running we cream through a year's worth of CPU cycles each day, and over that time track about 2 million unique processes. The consultant was gobsmacked. Each time we take it to the next level we discover something--the hardware, the network, the OS, the algorithm--scaling not quite right.
Stewart: How much do you think the Internet has had to do with the genome progress?
Birney: An immense amount. Every day, EBI (Europe), NCBI (U.S.) and DDBJ (Japan) synchronize the world DNA database via the Internet. The democratization of the Internet is ideal for getting this data out to the 100,000 individual labs that actually do the wet work. We're not like astronomy where everyone knows each other; we have to have the Web to get our message out and to deliver the actual data.
It is a two-way street--a lot of the Web development was actually fostered by bioinformatics. Lincoln Stein wrote CGI.pm, probably the most common way of interfacing to the Web, because of his needs in bioinformatics. The needs of molecular biologists were probably one of the reasons why many campuses have upgraded their Internet connectivity over the last decade.
Interestingly, the Internet is now not scaling for us. The NCBI and Hinxton Campus (where EBI and Sanger Centre are located) swap trace files (raw DNA data dumps) across the Atlantic: sadly the Internet just isn't up to the job, and we have to use DAT tape. Tape by DHL has a higher bandwidth than fiber when you are shifting terabytes.
Stewart: What kind of coordination is there between researchers at EBI and NCBI?
Birney: There is a real deep coordination between NCBI and EBI. For the DNA database, which is the main archival dataset shared worldwide, there is nightly synchronization of the data (this also includes DDBJ in Japan). In my area we work very closely with NCBI on providing resources to track the "finishing" of the human genome (finishing here is used in a technical way as well as its common meaning of completing).
In other areas we duke it out as to who is best, but in a pretty good-natured way (if anyone is interested, of course, EBI is the best...).
Stewart: How do you see the relationship between open source, open science, and entrepreneurial enterprise?
Birney: In some ways, as it always has been. Science should be open--it provides the infrastructure for research and discoveries--but as things head towards something that "does" something, you often need to protect it to actually make extreme investment pay out. So, I think you keep DNA sequence completely open (the infrastructure) but you have to patent diagnostic-level discoveries (say, the precise role of a protein in a particular disease) to allow real, useful products to be developed.
I see software in the same light. The infrastructure, and in particular the libraries, should be open--we can share and reuse without any barriers. However, when people start to complain about the color scheme and expect people to fix it, then you start charging. In software I think people should really focus on paying for the time and effort to have a developer on the project to get something done, not really for the software itself. In that sense I am a real open source believer, but more Eric Raymond than Richard Stallman. I GPL my algorithms, but have licensed the libraries BSD-style. Makes sense to me.
Stewart: Let's talk a little more about patents. Should gene patents be allowed?
Birney: Gene patents when you've just randomly sequenced bits of DNA--no. Gene patents when you have a very good idea of what it does in this disease, then yes, for this gene, for that disease.
It is a pretty obvious distinction in my mind. I can't believe people argue about this the way they do, or support the two extremes--always patent or never patent. It is something I try to stay out of.
As an aside, I think the world should be very thankful for the stance on DNA sequence taken by the Wellcome Trust, the U.K. charity that funded about one-third of the human genome project. They were able to make reasoned decisions about the openness of data as scientists over the latter part of the last decade, and it is their stance, along with that of John Sulston, who headed the Sanger Centre at the time, that kept the genome open. I think we would be living in a different world if it wasn't for the Wellcome Trust.