ONJava.com    
 Published on ONJava.com (http://www.onjava.com/)


Java for Bioinformatics

by Stephen Montgomery
09/24/2003

Bioinformatics challenges developers to create widely distributable tools that elucidate biological relationships. The principal value of these tools is measured by their contribution to the work of research biologists. Furthermore, as the amount of available data grows and its nature changes, bioinformaticians are forced to provide rapidly evolving tools to their user communities. Java has allowed bioinformaticians to rapidly develop user-friendly, cross-platform applications that are accessible to users at all levels of computational ability.

The Bioinformatics Challenge

Many new bioinformatics applications are created in response to the biological community aggregating types of knowledge and developing new experimental techniques. The development of bioinformatics applications can be broken down into two main tasks: data management and biological analysis. The first task generally involves aggregating sequences (strings representing DNA or proteins; DNA is represented by non-random but complex patterns of As, Cs, Ts, and Gs) and/or annotation information (the known properties of a given sequence, such as the locations of genes and other biologically relevant features).
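The distinction between a sequence and its annotation can be made concrete with a small data model. The following is a minimal sketch, not code from any of the APIs discussed in this article: the class names, the 1-based inclusive coordinate convention, and the short example sequence are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: a DNA sequence together with its annotated features. */
final class AnnotatedSequenceSketch {

    /** An annotated region; coordinates are 1-based and inclusive (an assumed convention). */
    static final class Feature {
        final String type;
        final int start;
        final int end;

        Feature(String type, int start, int end) {
            if (start < 1 || end < start) {
                throw new IllegalArgumentException("Invalid range " + start + ".." + end);
            }
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    private final String residues;
    private final List<Feature> features = new ArrayList<Feature>();

    AnnotatedSequenceSketch(String residues) {
        String upper = residues.toUpperCase(Locale.ROOT);
        // This sketch rejects anything other than the four DNA letters.
        if (!upper.matches("[ACGT]*")) {
            throw new IllegalArgumentException("This sketch accepts only A, C, G and T");
        }
        this.residues = upper;
    }

    /** Attaches a feature after checking that it lies within the sequence. */
    void annotate(Feature feature) {
        if (feature.end > residues.length()) {
            throw new IllegalArgumentException("Feature extends past the end of the sequence");
        }
        features.add(feature);
    }

    /** Returns the residues covered by a feature. */
    String subsequence(Feature feature) {
        return residues.substring(feature.start - 1, feature.end);
    }

    /** Returns all attached features of the given type, in insertion order. */
    List<Feature> featuresOfType(String type) {
        List<Feature> matches = new ArrayList<Feature>();
        for (Feature f : features) {
            if (f.type.equals(type)) {
                matches.add(f);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // A made-up 12-base sequence with one made-up "gene" feature.
        AnnotatedSequenceSketch seq = new AnnotatedSequenceSketch("ATGGCCTAAGGC");
        seq.annotate(new Feature("gene", 1, 9));
        for (Feature f : seq.featuresOfType("gene")) {
            System.out.println(f.type + " " + f.start + ".." + f.end + " " + seq.subsequence(f));
        }
    }
}
```

Keeping the residues and the features in separate structures mirrors the split described above: the sequence itself rarely changes, while annotation accumulates as new properties become known.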

Genome browsers, like EnsEMBL and UCSC, actively aggregate and serve annotation for several genomes (the chemical sequences that make up an organism's DNA). Bioinformaticians wishing to exploit new data constantly need to know what data exists, how they can access this data, and when to provide it to their users. Genome browsers provide bioinformaticians with a central location for exploring the common types of annotation that are readily available.

The second bioinformatics task is primarily concerned with determining the structure and function of genomic sequences. Researchers want to know what they have sequenced and how it is used in biological processes. The first task, aggregation of the data, is designed to provide researchers with the largest context for answering this question. But bioinformaticians have been actively developing applications that allow them to ask more sophisticated questions. In this article, we will explore some Java-based applications that fit into each of these main tasks.


When to Use Java

Traditionally, the language of choice for bioinformaticians has been Perl. Perl allows the rapid collection and analysis of data to answer directed questions such as "How many genes exist in a specific chromosome?" This is primarily because, with little effort, Perl developers can leverage the power of regular expressions and the large collection of bioinformatics modules. Furthermore, Perl allows bioinformaticians to rapidly prototype Internet-based methods for delivering data. However, standalone bioinformatics applications written in Perl are of limited value to research biologists: Perl scripts usually require users to install prerequisite dependencies, and they lack the dynamic GUI interactions available in Java. For this reason, bioinformaticians have been using Java to deliver applications to researchers at all levels of computational ability, most of whom want to use computational approaches quickly to supplement other types of work.

How to Get DNA Information

Retrieving sequence data, the quintessential bioinformatics task, can be accomplished using several different Java APIs. Sequence information is stored in three large repositories worldwide: DDBJ in Japan, EMBL in Europe, and GenBank in the U.S. The code examples here show how to use Apache's Axis SOAP engine to access remote data in the DDBJ and EMBL databases. This data can also be accessed through HTML parsers, direct database access, and intermediate databases with their own APIs. For the EMBL database, also see Ethan Cerami's "Web Services for Bioinformatics."

// Required Imports

import java.net.URL;
import javax.xml.rpc.ParameterMode;
import javax.xml.namespace.QName;
import org.apache.axis.client.Call;
import org.apache.axis.client.Service;
import org.apache.axis.encoding.XMLType;
import org.apache.axis.utils.Options;

Example 1. DDBJ Remote Access

public class GetDDBJ {
    public static void main(String[] args) {
        try {
            String recordId     = "AB000001";
            String wsdlUrl      = "http://xml.nig.ac.jp/wsdl/GetEntry.wsdl";
            String namespace    = "http://www.themindelectric.com/wsdl/GetEntry/";
            String serviceName  = "GetEntry";
            String functionName = "getDDBJEntry";

            // The same name is used for both the service and the port
            QName serviceQName = new QName(namespace, serviceName);
            QName portQName    = new QName(namespace, serviceName);
            Service service    = new Service(new URL(wsdlUrl), serviceQName);

            // Build the call from the WSDL description and fetch the record
            Call call     = (Call) service.createCall(portQName, functionName);
            String result = (String) call.invoke(new Object[] {recordId});
            System.out.println(result);
        } catch (Exception e) {
            // Report the failure rather than printing an empty line
            e.printStackTrace();
        }
    }
}

Example 2. EMBL Remote Access

public class GetEMBL {
    public static void main(String[] args) {
        try {
            String recordId     = "L34020";
            String endpointUrl  = "http://www.ebi.ac.uk/axis/services/Dbfetch";
            String functionName = "fetchData";
            String database     = "embl";
            String format       = "fasta";
            String style        = "raw";

            // No WSDL is loaded here; the endpoint and operation are set by hand
            Call call = (Call) new Service().createCall();
            call.setTargetEndpointAddress(new URL(endpointUrl));
            call.setOperationName(new QName("urn:Dbfetch", functionName));

            // The record comes back as an array of lines
            String[] result = (String[]) call.invoke(new Object[] {
                database, recordId, format, style});
            for (int count = 0; count < result.length; count++) {
                System.out.println(result[count]);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
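Example 2 requests the record in "fasta" format and receives it as an array of lines. The following is a minimal sketch of how such lines might be separated into a description and a sequence. It assumes the common FASTA layout (one header line starting with ">" followed by one or more lines of residues) and reads only the first record; the class name FastaSketch and the sample lines in main are hypothetical and do not come from the EMBL service.

```java
/** Hypothetical sketch: splits FASTA-formatted lines into a header and a sequence. */
final class FastaSketch {
    final String header;
    final String sequence;

    private FastaSketch(String header, String sequence) {
        this.header = header;
        this.sequence = sequence;
    }

    /** Parses the first FASTA record found in the given lines. */
    static FastaSketch parse(String[] lines) {
        String header = null;
        StringBuilder residues = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;                       // skip blank lines
            }
            if (line.charAt(0) == '>') {
                if (header != null) {
                    break;                      // a second record starts here; stop
                }
                header = line.substring(1).trim();
            } else if (header != null) {
                residues.append(line);          // sequence lines are concatenated
            }
        }
        if (header == null) {
            throw new IllegalArgumentException("No FASTA header line found");
        }
        return new FastaSketch(header, residues.toString());
    }

    public static void main(String[] args) {
        // Made-up lines standing in for the String[] returned in Example 2.
        String[] lines = {">example_id made-up description", "ACGTAC", "GGTTAA"};
        FastaSketch record = parse(lines);
        System.out.println(record.header);
        System.out.println(record.sequence.length() + " residues");
    }
}
```

In Example 2, the result array could be passed to FastaSketch.parse in place of the printing loop, provided the service returns lines in the layout assumed here.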

How to Get Genomic Annotation

As mentioned, the EnsEMBL and UCSC web sites collate and warehouse an extensive amount of genomic annotation. Individuals seeking to data-mine specific organisms can, at a minimum, visit the following sites:

This list is by no means comprehensive or representative of all the large genome sequencing and analysis projects that are currently in progress.

Most users access these individual sites through a web browser. Once at a site, the most common tasks are tracking down a specific gene annotation in the selected organism and identifying its corresponding annotations and underlying sequence. However, bioinformaticians usually want to access these annotations automatically when performing various analyses. We will demonstrate how to access genomic annotation from EnsEMBL's MySQL databases using the ENSJ API.

// Required Imports

import java.util.ArrayList;
import java.util.Iterator;
import java.util.Properties;
import org.ensembl.datamodel.ExternalDatabase;
import org.ensembl.datamodel.ExternalRef;
import org.ensembl.datamodel.Gene;
import org.ensembl.driver.AdaptorException;
import org.ensembl.driver.ConfigurationException;
import org.ensembl.driver.Driver;
import org.ensembl.driver.DriverManager;
import org.ensembl.driver.GeneAdaptor;

Example 3. How to Get External Database References for a Gene Implicated in Cystic Fibrosis using EnsEMBL's Java API

    try {
      Properties configProps = new Properties();
      configProps.setProperty("ensembl_driver",
                              "org.ensembl.driver.plugin.standard.MySQLDriver");
      configProps.setProperty("host", "ensembldb.ensembl.org");
      configProps.setProperty("user", "anonymous");
      configProps.setProperty("database", "homo_sapiens_core_16_33");

      // Connect to EnsEMBL
      Driver driver = DriverManager.load(configProps);

      try {
        GeneAdaptor ga      = driver.getGeneAdaptor();
        ArrayList cftrGenes = (ArrayList) ga.fetchBySynonym("CFTR");
        Iterator i          = cftrGenes.iterator();

        while (i.hasNext()) {
          Gene aCftrGene = (Gene) i.next();
          System.out.println(aCftrGene.getDisplayName());

          ArrayList externalRefs = (ArrayList) aCftrGene.getExternalRefs();
          Iterator j             = externalRefs.iterator();

          while (j.hasNext()) {
            ExternalRef exRef     = (ExternalRef) j.next();
            ExternalDatabase exdb = exRef.getExternalDatabase();
            System.out.println("DB:" + exdb.getName() + " Entry:"
              + exRef.getDisplayID());
          }
        }
      }
      catch (AdaptorException ae) {
        ae.printStackTrace();
      }
    }
    catch (ConfigurationException ce) {
      ce.printStackTrace();
    }

ENSJ allows a Java developer to home in on specific annotations through different data adaptors. In Example 3, we used the GeneAdaptor to fetch gene annotations that have "CFTR" (cystic fibrosis transmembrane conductance regulator) as a synonym. After identifying the genes that meet this criterion, we printed the list of external database references associated with each gene. This lets us investigate whether a function or structure is already known. An extensive description of the services ENSJ supports is beyond the scope of this article; Java developers are encouraged to visit the Driver specification to explore which common adaptors are available in the ENSJ API.

Which Bioinformatics Applications Use Java

As mentioned, the main bioinformatics tasks are data retrieval and analysis. We have discussed several Java-based data retrieval methods for accessing genomic sequence and annotation. But what types of tools are being constructed to perform novel bioinformatics analysis? How are they delivering their analyses to users of varying computational abilities?

While certainly not a complete list, these applications and their descriptions embody several important styles of Java-based bioinformatics development. They represent the push to construct integrated genomics platforms; they are capable of a wide variety of tasks. They are also applications that provide their users with large amounts of data in an affordable and novel way. Users can navigate long sequences and have extensive control over their environment.

Future Directions

Java-based bioinformatics development is a new but rapidly growing industry. Java is facilitating academia's transition from script-based development to heavy-duty application development; these applications integrate large amounts of data and a variety of analysis algorithms to target specific niches of genomics research. Java is also changing how bioinformaticians and biologists work. With specialized APIs like JavaHelp, Java3D, and Web Services, developers can rapidly integrate tutorials, deliver more types of data, and provide novel views. With bioinformatics-driven APIs like EnsEMBL-Java and Biojava, Java developers can quickly access complex biological objects and integrate them into their applications.

Stephen Montgomery is currently a genetics graduate student at Canada's Michael Smith Genome Sciences Centre in Vancouver, BC.



Copyright © 2009 O'Reilly Media, Inc.