Keeping Genome Data Open
by Bruce Stewart
Jim Kent was a graduate student in biology at the University of California, Santa Cruz (UCSC), when he wrote the program that allowed the public human genome team to assemble its fragments just before Celera's private, commercial effort did. His program ensured that the human genome data would remain in the public domain. Kent wrote the 10,000-line program in a month because he didn't want to see the genome data locked up by commercial patents.
Tim O'Reilly has called Kent's crucial role in this project "the most significant work of open source development in the past year." Kent's work illustrates the need to think about more than just open source code; in the scientific community there is a growing awareness of the importance of open data.
Kent will give a keynote speech at O'Reilly's upcoming Open Source Convention, July 22-26, in San Diego. Here we talk to Kent about his genome assembly program, as well as the future of open data and human genome efforts.
Bruce Stewart: First, congratulations on your recent successful PhD dissertation defense. How does it feel to be Dr. Kent?
Jim Kent: It feels good.
Stewart: You're a hero to many for your efforts in making the human genome data freely available. What motivated you to write the program that assembled the genome sequence fragments?
Kent: I needed it for my research. David Haussler needed it for his research. Bob Waterston needed it for his research. Pretty much everyone needed it. There was not a heck of a lot that the human genome project could say about the genome that was more informative than "it's got a lot of As, Cs, Gs, and Ts" without an assembly. We were afraid that if we couldn't say anything informative, and thereby demonstrate "prior art," much of the human genome would end up tied up in patents by Celera and their subscribers.
Stewart: What are your thoughts on the state of the patent process today with regard to scientific research? Is it ever appropriate for patents to be granted for scientific discoveries of naturally occurring phenomena?
Kent: They've tightened up the requirements for genomic patents somewhat. I'm not sure if they've gone as far as I'd like. I don't think discoveries of natural phenomena themselves should be patentable, though inventions created to make discoveries, or clever processes that exploit discoveries, certainly should be patentable.
Stewart: You were essentially competing with Celera Genomics in a race to assemble the genome, and they had procured what was reportedly the most powerful civilian computer in history for their effort. What tools did you use to beat them to the result?
Kent: One hundred 800 MHz Pentium processors with 256 MB of RAM each, running Linux, the gcc compiler, the vim editor, a whiteboard, and occasional ice packs for the wrists.
Stewart: Ouch. This obviously involved some very prolonged coding sessions. Did you sleep much that month?
Kent: Probably about half of what I usually do.
Stewart: Did you feel a lot of pressure during this time, and did you have doubts about whether you'd get the assembler finished in time?
Kent: I develop very incrementally. I had something very simple working very quickly. Even the initial working versions were a large improvement over no assembly at all. I didn't have doubts that I'd have something useful in time; it was more a question of how good I could make it in a short time frame. We actually put 50 versions through testing in the first month of development. Even version 50 was far from perfect, but we could do interesting biological analysis even by version 3 or 4. (Currently the software is in version 102.)
Initially most of the pressure I felt was internally generated, because the assembly really wasn't a job I was expected to do. Other people soon came to depend on it, though. It was quite a responsibility! I had to put a lot of my other research on hold for quite some time to support the assemblies.
Thankfully, as of this year the National Center for Biotechnology Information has taken over the human genome assembly, so I have more time for other things. The International Human Sequencing Consortium is an amazing group of people though, and the work is already starting to help a lot of people with medical problems. In the end, I wouldn't trade the experience for anything.
Stewart: How much do you think the Internet has had to do with the rapid genome progress?
Kent: It's been extremely helpful. The genome project has actually helped the Internet as well. Lincoln Stein's code, originally developed in the context of sharing data for the genome project, is the foundation for many Web applications. I'm in the midst of writing some articles on some useful general-purpose code that came out of the genome project as well -- SQL and XML code generators, and the Parasol job control system for Beowulf clusters.
Stewart: We deal a lot with open source software here at O'Reilly, but your important contributions in this area are really more about open data than open software. Do you see a relationship between the two?
Kent: There is a relationship. My own source is open (see my source code files) though it isn't "copy left." I do reserve the right to charge commercial users licensing fees for it, though for non-commercial purposes it's free. Some of it's entirely free. In my experience, if you keep the licensing rates reasonable [then] commercial users are actually happier paying a little something, because then they have the right to bug you some for support and maintenance.
I guess the main model open source people use is that the source, etc., is free, but support costs, and perhaps someone else will do the support. The more robust the code is though, the less support it requires. Making code robust is a lot of work. I'm happier taking money for something I've already debugged, than taking money down the road to debug something someone's already depending on.
That's an answer I guess, but not exactly to the question you asked. I'm not fanatical that source code should be open. I do really like having access to the source for something I'm using, and definitely would pay more for it. Genome data I am fanatical about, though.
The value of the genome has accumulated over 3 billion years of evolution, and it's a value I strongly feel belongs to us all if it belongs to anyone. I view it a lot like the European discovery of the new world. Sure, it was an accomplishment learning how to build ships, and sextants, and so forth, so that you could cross an ocean. It shouldn't mean that you own the continent though! Stop genomic neo-colonialism!!!
Stewart: What will be involved in the next phase in human genome research?
Kent: The very next phase will be identifying all of the roughly 30,000 genes. We've got solid evidence on half of them already, and fragmentary evidence on another third. Once we've got a complete, or near complete, gene catalog we'll be trying to figure out the function of each gene, probably starting with where in the body it is used, and where in the cell it is used. Figuring out the gene functions, and especially how the genes interact with each other, will keep us busy for a very long time.
Stewart: What projects are you working on now?
Kent: Our 1,000-CPU computer cluster is being pretty fritzy at the moment. We've gone through two job control systems -- Condor and Codine. We've looked at several others, including LSF and PBS. I'm not happy with any of them, and am in the midst of writing our own, which I call Parasol. It's a fun project, though not without its challenges.
On the biological side I'm doing work to figure out gene expression -- that is, how genes are switched on and off. Basically every cell in your body has the same genome, but the body has over 200 readily distinguishable types of cells (and likely many more types we cannot yet distinguish). How a single egg develops into a complex organism like ourselves is one of the great mysteries of biology. The genome and other large-scale data, such as that available from microarrays, can potentially shed a lot of light on this.
Stewart: What will you be talking about in your keynote at the O'Reilly Open Source Convention?
Kent: I'll be talking about evolution mostly -- how the "blind watchmaker" leads not only to the development of species, but also to the immune response and cancer. I think people will enjoy it, even though it doesn't have that much to do with Linux.
Bruce Stewart is a freelance technology writer and editor.