Jim Kent Keynotes O'Reilly's Bioinformatics Conference

by Daniel H. Steinberg

Ben Franklin is remembered for his many great ideas, inventions, and accomplishments. In presenting this year's Ben Franklin Award on behalf of Bioinformatics.Org, J.W. Bizzaro reminded us that Franklin also refused to profit from many of his ideas and instead made them freely available. Bizzaro pointed out that Jim Kent, this year's recipient of the award, embodies those characteristics in his work, which includes creating a browser for the human genome that is used by many thousands of biomedical researchers every day.

Why We Need Detailed Information About Genetics

"You've got to admire a man who will flirt with lightning," Kent begins. Franklin was one of his childhood heroes. As if a train of thought has begun, Kent mentions other childhood memories: a rock collection, a shell collection, a collection of bottle caps, and, his pride and joy, a bug collection. You can see a pattern in Kent's life of collecting information, looking for patterns, and asking more questions. He gives the audience a quick smile as he displays some computer code on the screen and notes that he now has a modern bug collection.

You can try to follow Kent's mind as he asks and starts to answer a string of questions. He casts them as naive questions in biology and then muses briefly on each one in turn. "Is an ant an individual or is it the hive? Do dolphins talk with each other? How do amphibians and worms regenerate? How does an animal develop out of an egg?"

The last question leads him towards the core of his talk. A single fertilized cell eventually differentiates into the 300 different types of cells that make up an adult body. The cells have the same genome, but each expresses only a subset of the genes. We often don't think about the fact that much of the egg doesn't become part of the actual animal. Most of it actually becomes placenta and amnion. Contrast this with adult liver cells, which can only become liver cells. Taken to the extreme, we find neurons that can't reproduce. The cell type is determined by the parent types and interactions with other cells.

Once a cell is differentiated, it is still possible for it to dedifferentiate. Kent offers the extreme example of the cloning of Dolly the sheep. The technique had been used before with less success. The scientists introduce material into an egg. The fear is that because the egg is huge, the existing material will swamp the little piece you are introducing. Kent follows this thread of regeneration and considers the regrowth of amphibian limbs. First the stubs consist of cells with little differentiation, and then the limb begins to re-form into something much like its former shape.

If we knew more about how to regulate this regeneration in humans, we might be able to treat diseases that involve small populations of cells. For example, Parkinson's disease results from the death of a small population of cells: the dopamine-producing neurons in the substantia nigra. Similarly, Type I diabetes results from the death of the insulin-producing cells in the pancreas.

Kent thinks that treating some of these diseases may turn out to be easier than we expect. Consider the many types of stem cells that we have. These are cells that still divide and reproduce and are less differentiated. Kent explains that if you put a stem cell into the proper environment, it will go along and find its place there. As an example, he points to research done on a small population of bone marrow transplant patients who went on to have strokes. New neurons found in the brains of these transplant recipients show that they developed from the bone marrow.

This means that like amphibians, humans have a lot of capability for regeneration. Unlike amphibians, however, the level of regeneration is too low to be useful. Kent puts it plainly, "This is why we need very detailed information about genetics."

What Genetic Information Do We Need?

Kent identifies the central requirements for genetic information as: the genome, a comprehensive list of genes, gene expression data, and protein localization data. The good news is that the human genome is about 95 percent complete and will be about 98 percent complete in April 2003. The last two percent is hard and Kent suggests that maybe our efforts are better spent downstream filling in some of the other information that is needed. Through RefSeq we know about 75 percent of the coding regions but less than half of the transcription start sites. When it comes to gene expression data, there is currently publicly available data for about one third of the genes and commercially available data for another third. Information on protein localization tends to be a bit spotty.

Kent suggests that our time is best spent identifying the rest of the genes. The actual title of his address is "The Genes, the Whole Genes, and Nothing But the Genes." The second part of the title refers to Kent's emphasis that it is not enough to have just the coding region of a gene. Without the whole gene you end up with fragmentation and fusion artifacts. Finally, this sort of research is expensive and time consuming, so it is important to guard against unreal genes. The "nothing but the genes" admonition is that one bogus gene can lead to another, as much annotation is done via homology.

How Do We Identify Genes?

The remainder of the talk looked at the different techniques for finding genes and their advantages and disadvantages. Many still look at cDNA sequencing as the gold standard, but Kent cautions that it is subject to many artifacts. The remaining genes tend to be harder to find than the ones that have already been identified. In addition to looking at different methods, Kent sees benefits in using computational results to identify areas for scientists to investigate with more traditional methods.

With cDNA sequencing you begin by extracting RNA from cells and use a retroviral enzyme, reverse transcriptase, to convert it to cDNA. The cDNA is inserted into vectors that grow in E. coli. Then a sequence (an EST) is read from one or both ends of the insert. If the EST looks new, then maybe it is, or maybe it is an artifact. Although there has been much success with this method, there are problems. For rarely expressed genes, little RNA is available. Also, because splicing is not instantaneous, you can get retained introns. A third problem is that reverse transcriptase has a high error rate and is prone to small deletions. These three problems can be addressed by more careful techniques, but Kent cautions that a fourth problem is that "at a low level the cell seems to tolerate a certain degree of nonsense transcription and splicing." There is no way to get around this difficulty unless you ignore everything that's not coding.

Other techniques include Whole Genome Microarrays, Model Organism Genetics, Cross-Species Genome Comparisons, and Computational Gene Finding. A problem with Whole Genome Microarrays is that what you're seeing may be cross-hybridization rather than real expression. You need other tools to help you decide what is and isn't real.

With Model Organism Genetics, you zap yeast, worms, flies, and mice, and then inbreed offspring and look for interesting results. The good news is that you get hints of the functions right away. The problem is finding the part of the DNA that is actually mutated and is suspected to be responsible for the results. In mice this process can take three years. Another issue in using mice is that five percent of human genes don't have clear mouse orthologs.

Computational Gene Finding is getting more sophisticated. For bacteria the process is straightforward. You look for open reading frames by identifying long stretches between start and stop codons. This is harder in eukaryotes because there are introns that are often larger than the stuff you're looking for. One technique is to look for coding exons that are bounded by AG/GT. Hidden Markov Models (HMMs) can model coding regions and splice sites simultaneously. Again the difficulty is that introns are big in comparison with the GT/AG splice signals. Pure HMM approaches tend to overpredict, while pure homology approaches tell you what you already know.
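The bacterial case described above can be sketched in a few lines of Python. This is only an illustration and not the method used by any of the tools Kent discussed: the function name find_orfs and the min_codons threshold are assumptions made up for the example, the scan covers only the forward strand, and it assumes the standard ATG start codon and TAA, TAG, and TGA stop codons.

```python
# Minimal open reading frame (ORF) scan, forward strand only.
# find_orfs and min_codons are hypothetical names chosen for this sketch.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) index pairs of ORFs found in seq.

    start is the index of the A in the ATG start codon and end is one
    past the stop codon, so seq[start:end] is the full ORF.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == START:
                    start = i
            elif codon in STOPS:
                # Keep the ORF only if it is long enough to be interesting.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    example = "CCATGAAACCCGGGTAACC"
    for begin, end in find_orfs(example):
        print(begin, end, example[begin:end])
```

A real bacterial gene finder would also scan the reverse complement and use a much larger length threshold. The eukaryotic case is harder for exactly the reason given above: the coding stretches are interrupted by long introns, so a simple start-to-stop scan like this one breaks down.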

A second-generation composite approach combines different techniques. GenomeScan, Ensembl, and fgenesh++ all use protein-homology information on top of HMMs. Genie uses EST information to constrain HMMs. SLAM, SGP, fgenesh2, and twinscan use cross-species genomic alignment on top of HMMs. Finally, there are benefits in taking the results of computational gene finding into the wet lab. Take a predicted exon and use it in the laboratory as a pointer to where to look. Sanger has been doing this on chromosome 22 using results from every genscan prediction, spliced EST, and pufferfish homology.

Kent's keynote was very well received, and it is clear that this community greatly values his work on the Human Genome Project.

Daniel H. Steinberg is the editor for the new series of Mac Developer titles for the Pragmatic Programmers. He writes feature articles for Apple's ADC web site and is a regular contributor to Mac Devcenter. He has presented at Apple's Worldwide Developer Conference, MacWorld, MacHack and other Mac developer conferences.
