An Interview with Lincoln Stein
by Bruce Stewart
It's well known that Perl is making valuable contributions to bioinformatics, but less well known is the contribution that bioinformatics has made to Perl and the World Wide Web. It so happens that one of the most widely used Perl modules, CGI.pm, was written by a researcher who wanted to publish genome maps. The CGI.pm module, in turn, helped the Web develop from a directory of static pages into a dynamic, database-driven medium. And all because Lincoln Stein needed a quick, easy, and cheap tool to manage text. Thus began the fruitful marriage between Perl and bioinformatics.
Today, Lincoln is a researcher at the Cold Spring Harbor Laboratory in Cold Spring Harbor, New York, as well as a prolific programmer and author. In addition to the significant contributions he has made to Perl and the Web, he writes software for biological databases, data analysis and visualization, and sharing results. He writes for Web Techniques and The Perl Journal, and he has written several books on related subjects.
Lincoln will be a keynote speaker at O'Reilly's upcoming Bioinformatics Technology conference. We spoke to Lincoln about his current projects, his opinion on patent issues in biology, and why Perl has become the programming language of choice in bioinformatics.
Lincoln Stein will deliver his keynote, "Bioinformatics - Building a Nation from a Land of City States," Tuesday, January 29 at the O'Reilly Bioinformatics Technology Conference.
Stewart: When did you first get interested in bioinformatics?
Stein: It was a complete accident. I was a graduate student in cell biology, working on the embryonic development of a parasitic worm, and I had sequenced a gene involved in the switch from the nonparasitic, free-living form of the worm to the infective version. I had sequenced the gene and I wanted to use some sequence-analysis software to assemble it (reassemble the various fragments of the gene into the full gene), but the cost for using the departmental VAX was $15 per month, and I didn't have that kind of money.
I had a Macintosh at home that I used for word processing, so I figured it couldn't be too hard to program it to do sequence assembly. So I learned 68000 Assembly Language, used it to write a sequence assembler, assembled my sequence, published my thesis, and lived happily ever after.
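For readers unfamiliar with sequence assembly, the idea of rebuilding a gene from overlapping fragments can be sketched in a few lines. This is only an illustrative greedy overlap merge in Python, not a reconstruction of Stein's 68000 assembly-language program; the function names, the `min_len` threshold, and the sample fragments are all assumptions made for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Returns the list of contigs left when no pair overlaps by at least
    `min_len` characters. Sequencing errors and reverse-complement reads
    are deliberately ignored in this sketch.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical fragments that overlap by four bases each.
contigs = greedy_assemble(["ATGCGTAC", "GTACCTGA", "CTGATTAG"])
```

Real assemblers have to cope with read errors, repeats, and both DNA strands, so a greedy merge like this is a teaching aid rather than a practical tool.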
Stewart: Who do you think has a tougher time learning the other's discipline, computer scientists or biologists, and why?
Stein: Computer scientists have a much harder time learning biology than vice versa. This is because biology is an experimental science, and computer scientists have to undergo a full paradigm shift in order to understand it. In contrast, physicists have no problem learning biology.
To the biologist, software development is just another skill to pick up.
Stewart: Can you briefly describe what AcePerl and AceBrowser are?
Stein: AcePerl is a Perl module that acts as the API for the Acedb database. Acedb is an object-oriented database that is widely used for biological data modeling as well as other specialized fields, such as geographic databases. Before AcePerl, the only API was in C. A related project is Jade, which is a Java API.
AceBrowser is a Web-based front end to Acedb. It allows users to browse Acedb databases via the Web.
Stewart: Tell us a little bit about the BoulderIO project.
Stein: This is a defunct project whose goal was to send biological objects around the Web using a simple tag/value syntax. It has been superseded by XML, which promises to do much the same thing.
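To give a feel for what "a simple tag/value syntax" can look like, here is a minimal Python sketch of a parser for such a stream. It is an illustration of the general idea, not the BoulderIO implementation: the `TAG=value` line layout, the lone `=` used as an end-of-record marker, and the tag names in the sample are all assumptions made for this example.

```python
def parse_records(text):
    """Parse tag/value text into a list of records.

    Each record is a dict mapping a tag to the list of values seen for it,
    so repeated tags within one record are preserved.
    """
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "=":  # assumed end-of-record marker
            if current:
                records.append(current)
            current = {}
            continue
        tag, _, value = line.partition("=")
        current.setdefault(tag, []).append(value)
    if current:  # tolerate a missing final marker
        records.append(current)
    return records


# Hypothetical sample data.
sample = """NAME=clone1
LENGTH=1200
=
NAME=clone2
LENGTH=800
FEATURE=exon
FEATURE=intron
=
"""
records = parse_records(sample)
```

A format like this is easy to stream between programs one record at a time, which is much the same role XML later took on with richer structure.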
Stewart: Is that happening now? Is XML emerging as the answer to the problems associated with the disparate databases that currently contain biological data?
Stein: XML is now the preferred solution for exchanging information among biological databases. It doesn't magically solve the problem, of course, but it enables the solution.
Stewart: What projects are you most excited about right now?
Stein: I'm very excited by my own pet project, the Distributed Annotation System. It allows genomic annotations (statements about the significance of certain regions of the genome, such as where the genes are) to be shared. This is the only example I know about where one can write a genome-sequence viewer and have it work on multiple different databases, without regard to the nature of the underlying database or data model.
Stewart: What can be done to improve the current tools used in bioinformatics?
Stein: More attention to software engineering: standards compliance, quality control, documentation. Also I am a strong advocate of open source. If more bioinformatics software were organized along the lines of the BioPerl and BioJava projects, we would be in a much better position today than we are. However, in the current mix of open, closed, and proprietary software, we are stuck with a tasteless stew of mismatched components, half solutions, buggy software, and unfulfilled promises.
Stewart: You have a long history of involvement with Perl. Why do you think Perl has emerged as the primary programming language used by bioinformaticians?
Stein: Perl deals well with text data, and DNA and protein sequences are mostly text.
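The point that sequences are mostly text is easy to demonstrate: common operations reduce to ordinary string handling. The sketch below uses Python rather than Perl purely for illustration; the helper names and the sample sequence are hypothetical, and only standard-library calls are used.

```python
import re

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def find_sites(seq, pattern):
    """Start positions of every match, including overlapping ones."""
    # A zero-width lookahead lets the regex engine report overlapping hits.
    return [m.start() for m in re.finditer(f"(?={pattern})", seq)]


dna = "ATGCGAATTCGGAATTCTAA"      # made-up example sequence
sites = find_sites(dna, "GAATTC")  # GAATTC is the EcoRI recognition site
```

Each of these is a one-line string or regular-expression operation, which is the kind of task a text-oriented language handles comfortably.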
Stewart: Ewan Birney recently stated that a lot of the Web's development was actually fostered by bioinformatics, and your CGI.pm module in particular. CGI.pm is probably the most widely used Perl module in existence. What motivated you to write this module, and do you plan any changes to CGI.pm in the future?
Stein: I wrote CGI.pm when I was at the Whitehead Institute/MIT Center for Genome Research. I needed it to publish the Whitehead's physical maps of the genome. I also wrote the GD module at the same time.
CGI.pm is continually updated to keep up with the changes in the World Wide Web protocols. For example, the latest release adds support for P3P cookies.
Stewart: How do you see the relationship between open source, open science, and entrepreneurial enterprise?
Stein: If a bioinformatics research group publishes work based on the output of a piece of software, the source code for the software should be available for inspection. This is basic for the principles of verifiability and reproducibility that are applied to all aspects of biology. This does not mean that all bioinformatics software should be developed using an open source development model, or that it should be available for use on a royalty-free basis. However, if researchers are going to publish some result that I'm going to work with, I want to be able to reproduce their work.
Stewart: Should gene patents be allowed? Protein patents? Naturally occurring proteins?
Stein: I feel that you can patent the novel use of a naturally occurring product, such as a gene or a protein, but not patent the product itself.
Stewart: What do you think are the unsolved problems in bioinformatics that will yield the greatest scientific advances?
Stein: How are genes regulated? How are proteins targeted to their subcellular destination? How does the brain store memory?
Stewart: What will you be talking about in your keynote at the O'Reilly Bioinformatics Conference?
Stein: Another unsolved problem. I don't know yet!
Editor's Note: We actually have some idea, as Lincoln will be delivering the keynote, "Bioinformatics - Building a Nation from a Land of City States."
Bruce Stewart is a freelance technology writer and editor.
Copyright © 2009 O'Reilly Media, Inc.