An Interview with Nat Torkington and Lorrie LeJeune

by Bruce Stewart
11/21/2001

Nathan Torkington and Lorrie LeJeune are sharing the program chair responsibilities for O'Reilly's upcoming Bioinformatics Technology Conference. Nathan is an editor in O'Reilly's Web and Scripting group, a Perl hacker, and a mad blogger when it comes to bioinformatics. Lorrie is the editor for O'Reilly's bioinformatics program, and she has worked as a molecular biologist in the biotechnology field.

We squeezed a few minutes out of their busy schedules recently to ask them about O'Reilly's interest in bioinformatics, and about what to expect at the first O'Reilly Bioinformatics Technology Conference to be held January 28-31, 2002, in Tucson, Arizona.

Stewart: Tell us about your backgrounds, and what motivated you both to get involved in bioinformatics?

Torkington: My background is nowhere near as interesting as Lorrie's. I'm a computer geek--I have a bachelor's degree in computer science from Victoria University in Wellington, New Zealand, and I've worked as a system administrator. After writing the Perl Cookbook with Tom Christiansen, I worked for Tom as a Perl trainer, until I was hired by O'Reilly in November 2000. I'm an editor 80 percent of the time and a conference planner 20 percent. Well, in theory--those numbers are more like 120 percent and 80 percent respectively.

LeJeune: I've been a molecular biologist in the biotechnology field, I've worked on a dairy farm, I've designed and typeset books, and I've been a freelance illustrator and a metalsmith. My educational background includes a B.S. in animal science and microbiology, and a lot of art courses. Nat may tell you otherwise, but I've never been an international spy. Though I've been in publishing longer than I was in science, I've always been deeply interested in medicine and biology. When bioinformatics appeared on O'Reilly's radar, I already knew what it was, and I knew I wanted to be part of it.


O'Reilly's Bioinformatics Technology Conference, January 28-31, 2002, in Tucson, Arizona, will present rich technical training for biologists and computational scientists, with a practical emphasis on how to choose and use the right tools. Register by December 7 and save up to $545.

Stewart: Why is bioinformatics important?

Torkington: The classic scientific process is to formulate a hypothesis ("kicking concrete hurts"), design and conduct experiments to test the hypothesis ("I paid 50 undergraduates to kick concrete and 49 of them said it hurt"), and draw conclusions about the hypothesis ("yup, kicking concrete hurts").

Biological science follows the same process. Researchers are trying to discover how the human body works at the molecular level. However, experiments take time and cost money. As if that wasn't bad enough, because we don't know a lot about the very low-level mechanics of molecular structure and interactions, there are a zillion hypotheses that seem plausible.

More and more, computers are being used to help the research process. A big part of the genomics work is taking the huge volumes of data collected by researchers and trying to identify patterns and similarities that will suggest hypotheses ("hey, this part of the human genome looks a lot like this gene in donkey DNA that controls evening flatulence--maybe the chunk of human DNA controls the same thing!"). This is an important part of bioinformatics.

Another part is tracking and presenting the huge volumes of diverse data from different types of experiments so that researchers can use the data to check their hypotheses. When experiments generate gigabytes and this kind of data is being collected faster than anyone could ever pore through it, computers become more and more essential to the research.

LeJeune: Computational tools and complex databases are the only way to manage the volumes of data being generated these days. Back in the dark ages--the mid-80s--when I was running sequencing gels, you had to collect the data by hand. You couldn't generate it fast enough to fall behind on the organization or analysis. Now, with data pouring in by the gigabyte, there's no way you can hope to keep track of it without software tools.

The data we're collecting is also very rich, in that if we can find the patterns and relationships at the cellular level, we have a better chance of understanding how systems work at the organismal level. That understanding will hopefully lead to advances in health care, medicine, drug discovery, you name it. But you're not likely to find interesting patterns and relationships in terabytes of data unless you use software tools.

Stewart: Why has O'Reilly decided to enter the bioinformatics space?

LeJeune: Tim O'Reilly has always felt that one of the most important missions of O'Reilly books is to address "information pain." Information pain happens when people want to do something and they can't find the resources to help them. Bioinformatics, being an intersection point between two very different disciplines, is filled with information pain. Biologists are learning to program, programmers are struggling to understand genetics. We saw it as an opportunity to help both communities learn the critical skills of the other.

Stewart: What is the focus of the O'Reilly Bioinformatics Technology Conference?

Torkington: How to do a great job. That's what all our conferences try to focus on: real tools, real situations, real people, resulting in real knowledge that you can take back to your desk and use to solve problems faster. So we looked at the things that bioinformaticians do, and found presentations that address both the here-and-now and emerging technologies in those fields.

Because each research group has a different protocol, a different organization, and different needs for data, there isn't much in the way of "canned" bioinformatics solutions. The most used tools are those that can be tweaked and incorporated into larger solutions. So we have a good mixture of specific technologies, the C++ toolkit from NCBI, for example, and case studies like how the Rat Genome Project managed its bulk data pipeline to curate incoming data.

Stewart: Who is this conference for?

Torkington: We really kept two audiences in mind. The first is obviously bioinformaticians and other molecular biology researchers who use computers to manage and mine their data. We really wanted them to be able to come and learn skills that will let them do their jobs better.

But we also wanted to attract software developers who aren't presently in the biology fields. Bioinformatics and biotech are major growth areas, and skilled professionals are in very high demand. We wanted developers to be able to come, learn about the field and the kinds of needs that researchers have, and be able to make informed choices about whether their existing skills or products will be needed by the biotech companies.

Stewart: What are some of the hot issues right now in bioinformatics? What do you think are the unsolved problems in bioinformatics that will yield the greatest scientific advances?

LeJeune: That's hard to know, since the scientific advances that eventually emerge--like a drug or a treatment that targets a specific condition--may not have an obvious link to bioinformatics. If I had to pick one, I'd say that predicting cellular function from genomic sequence is it. Given a map of all an organism's genes, and what they do, will we be able to predict how a cell--or an organism--will function? Bioinformatics tools will play a significant role in figuring this out.

Torkington: Personally, I'm actually pretty skeptical about the -omics--genomics, proteomics--approach. The "gather a ton of noisy data and then try to find meaning in it" approach doesn't seem very rigorous or focused. I wonder whether it's going to burn out, that we'll fail to deliver on the big promises of machine learning and artificial intelligence.

The biggest unsolved problem in molecular biology and bioinformatics is protein structure prediction. Each protein has a unique three-dimensional shape, determined only by the sequence of amino acid building blocks that comprise it. Discovery of the underlying rules that would let us predict the folds and contortions of the molecule has been incredibly slow in coming. We're hampered at every turn by the difficulty in gathering data at that scale--it's not like we can just whip out the digital camera and take a few snaps to see how a protein folds. At that scale, everything interacts with everything else and there's no obvious way to see how an electron distribution here will attract that nucleus there and fold the protein up to look like a sock puppet.

At a more pragmatic level, there are huge problems to be solved in data linking. Every genome project names its genes differently, so it's not trivial to search for similar genes in different species because in general the function of a gene is not usefully represented in its name. The Gene Ontology Consortium is doing good work in this field, and I'm really happy that they're going to be having a public meeting at our event.

Also, data quality is a major issue. Because data are being generated faster than they can be checked by humans, quality really varies between the different repositories. This noisy, lossy, contradictory data makes it much harder for computers to do the data mining tasks of identifying patterns.

Stewart: How do you see the relationship between open source, open science, and entrepreneurial enterprise?

Torkington: Open science is redundant. Closed science, research that cannot be independently verified, cannot be trusted. The only useful science, science which can be built upon and used to develop more understanding of the way the world works, is open.

For this reason, I really support the Open Bioinformatics petition at openinformatics.org. Science based on computer analysis of data is not open unless the code that analyzed the data is also open. How can reviewers verify the reasoning embodied in the code if the results are published, but the source is not?

I am keenly aware of the need for commercial support, though. I'm a big fan of space exploration, and I think that if biological research was left to the government to fund, we'd see money wither away as it has done at NASA. It's very hard to draw a line and say "this is acceptable, this is wrong".

My line is based on the fundamental distinction between discovered and applied science. If you map out genes from any species, that shouldn't be patentable. That's like claiming you own a planet you saw in the sky. If you identify a chemical that doesn't normally appear in the human body but which has a specific effect, then you should be able to protect that research and benefit from it.

Stewart: Nat, you're the king of posting bioinformatics Weblogs on the O'Reilly Network. Where do you find all that stuff? What do you think are the best resources for keeping current in bioinformatics?

Torkington: I look on Web sites, mainly because I'm too cheap to pay for subscriptions to the various journals. I look at ScienceDaily News (sciencedaily.com) and search Yahoo for "bioinformatics" and "genome". I read snowdeal.org, and search newsisfree.com. Basically, I try to do the dog-work to find interesting stories so that everyone else doesn't have to scruffle through search results and poorly designed framesets.

Stewart: Lorrie, since editing O'Reilly's bioinformatics books, how has your view of bioinformatics changed?

LeJeune: My view hasn't so much changed as it has broadened. Bioinformatics encompasses a wide range of skills, from biochemistry and molecular biology, to evolution and genetics, to mathematics, to computer science, to information science. I'm amazed at how much knowledge a practitioner of the art needs just to begin understanding biological data.

Stewart: What are the biggest challenges you've faced in putting together a conference on bioinformatics?

Torkington: Learning the molecular biology that underpins it all! I don't have the benefit of Lorrie's experience in this area--I actively avoided biology in high school. As soon as it became an elective, I abandoned it. So when I started reading about it again, it was like I was getting interested in French literature and all I had to go on was my high school French.

The other challenge was selecting only 50 or so speakers from the vast pool available. There are so many interesting branches of this field that we honestly could have filled six or ten rooms for the three days and still turned away good talks.

LeJeune: Figuring out what the most important topics are, and how to fit them together into a cohesive story. Bioinformatics is not unlike a three-ring circus: There's so much going on that it's hard to know where to look first. We want this conference to appeal to both programmers and biologists and we've had to put a lot of thought into what to cover and how deeply to cover it.


For a list of O'Reilly's books on bioinformatics, visit bio.oreilly.com.

Stewart: Nat, you have a long history of involvement with the development of Perl. Why do you think Perl has emerged as the primary programming language used by bioinformaticians?

Torkington: Two words: Lincoln Stein. To create a pointless geographical metaphor, Lincoln is the Bering Strait land bridge, connecting the Russia of computing with the Americas of bioinformatics. Because he's a Perl guru, Perl's the language he brought and it has settled and adapted to the new world.

Less colorfully, Perl is a great language for casual programmers. It's not ideologically pure, so you can write whatever kind of code you need to write to get your job done. If we had to train biologists in computer science, we might get better programmers able to code perfectly in everything from Lisp to Java, but we'd only have about five of them. Perl is a very flexible tool that tries hard to Do What You Meant, so even biologists who couldn't tell a method call from a methodology can be productive.

That's not to say that Perl's the only language in biology. Java has a strong following, Python has traction in the structural areas of molecular modeling, and good old C and C++ are widely used. Each language has its strengths, and ultimately it's not the language that's important but the applications that are written in it.

Stewart: Nat, tell us about the Hackathon. What are the expectations for this event?

Torkington: There are a lot of open source bioinformatics groups, with loose coordination between them. Each code base grows with the needs and experience of the developers in that language. This leads to divergence in object models, functionality, data representation, and capabilities. The Hackathon hopes to bring them together to get them to work on integration and standardization, so it'll be possible to easily develop and install a project with components in different languages.

Stewart: Bioinformatics appeals to both traditional biologists and computer scientists. Which group do you think has a harder time understanding the other's field?

Torkington: My sense is that those rooted in biology understand less of computer science than former computer scientists understand of the biology. This is partially because the biology is the heart of the research--if you manipulate data to find patterns, it's incredibly easy to come up with a finding that has no biological significance. You need to know the biology to design the software processes and evaluate their results.

But I think that the biology is the hardest thing to understand, because it's infinitely detailed. At every step there are things we don't know and exceptions to rules. Even the Central Dogma of molecular biology omits a lot of cell processes. It's a formidable task to learn the biology you need in order to design and evaluate bioinformatics software. The computer science is almost trivial in comparison--at least there aren't huge areas of "here be dragons" unknown territory in algorithms!

LeJeune: I think programmers have a harder time understanding biology than the other way around. Programming is usually logical, but biology isn't. The relationships between molecules, systems, and organisms are complex and subtle, and you have to have a fair bit of knowledge in several subject areas to really understand what's going on, or to begin to make rational hypotheses.

Stewart: What part of the conference are you most looking forward to?

Torkington: Everything! I know it sounds a total marketing cop-out to say that, but I am looking forward to almost every aspect of it. I'm really looking forward to talking to attendees and finding out what they do and what they're having problems with at work. I've got enough academic knowledge now that I want to understand what they're doing and how next year's conference can help them.

I'm also looking forward to the talks. The biggest reward to being a conference planner is that you get to put together the conference you'd want to attend! We've had a lot of input on the program from bioinformaticians such as Cynthia Gibas and Per Jambeck, and they've got me fired up about everything. I'm trying to convince them that they need to clone me so I can attend every talk.

LeJeune: I'm looking forward to the talks and learning lots of new things. I'm also looking forward to meeting people whom I only know by reputation or with whom I've only exchanged email. I think it's going to be a great conference.


For more information about O'Reilly's upcoming Bioinformatics conference, visit the O'Reilly Bioinformatics Technology Conference Web site.

O'Reilly & Associates has plans to publish more books on bioinformatics in 2002. Currently, O'Reilly has two bioinformatics publications: