ONJava.com    
 Published on ONJava.com (http://www.onjava.com/)
 See this if you're having trouble printing code examples


SearchAssist: A Portable Search Engine in Java

by Ashwin Jayaprakash
10/01/2003

2257! That's how many bookmarks I have in my Mozilla browser right now, all of which I upload to my web site every month. Uploading them is not a problem, but displaying them nicely on a web site without server-side support is. That simply rules out SQL, any JSPs, and other fancy stuff. Sure, DHTML is good, but it has its limitations. So, the only solution that remained was to download everything to the client's machine and run it there — an applet, that's it! But that presents its own challenges: the applet has to be small, the index must be pre-generated, and the applet should be able to read the index, retrieve the results, and then display them — all without having to make any calls back to the server. This looks like an ideal candidate for a three-tier web application.

After four weeks of meddling with XML, XPath, Swing, and LiveConnect, I came up with SearchAssist, a portable search engine written in Java. It's a Java applet of 30KB, using a ternary search tree. The UI components are a Swing interface and HTML, as shown in Figure 1, and the input format is XML. Other cool features are:

SearchAssist using the XPathNodeAccessHTMLRenderer
Figure 1. SearchAssist

Let's go through the entire process of design and development of SearchAssist.

SearchAssist Design

Input Data Format

When you consider the functionality that is expected, the potential applications of this project become obvious. For example, using XML (instead of the HTML format of Mozilla bookmarks) opens up new usage scenarios: a lightweight help system for another application, perhaps, or a photo album that can be searched using the person's name. The possibilities are endless. The only trick is designing a suitable data format able to support multiple nested levels of bookmarks and topics, such as Mozilla allows (see Figure 2).

Mozilla Bookmark manager
Figure 2. The Mozilla bookmark manager

The XML will have a Category tag as the root. See Figure 3 for an example. A Category can have nested categories or Items as leaves. Both Category and Item have Title tags. Each Item has an additional Link tag that, as the name suggests, contains the link to the actual text, web site, or photo, as the case may be. Each Category or Item is uniquely addressable. Both Category and Item tags have an attribute called uniqueID that contains a string that does not repeat in the document. Such a format is also ideal for storing nested topics for a help system.

XML input file generated from the Mozilla bookmarks
Figure 3. An XML input file generated from Mozilla bookmarks

Indexing

The Index, like a book's index, is a map of keywords to locations (much like a book's index points to page numbers where the index entries appear in the main text). In our XML file, the keywords are the text in the Title tags of Category and Item tags. Since each Category and Item tag carries a uniqueID, they are similar to the page numbers in our book index example.

For a seasoned Java programmer, a HashMap should come to mind, as it can be serialized to a file. Although this is the right way of thinking, remember that the index must be compact and easily searchable in as few steps as possible. Although widely used (and sometimes overused), the HashMap does not completely satisfy the requirements at hand. It does not differentiate between an Integer, String, or Array, as long as the key is an Object. If there are three keywords words in the XML input file (such as "book," "bookmark," and "booklet"), then a HashMap will treat all three of them as just Objects and hence they will be hashed to completely different locations. Although all three words have a common prefix — "book" — the HashMap cannot use this pattern effectively, being unaware of the nature and content of its keys.

A ternary search tree, on the other hand, is designed so that the nodes with the same prefix share the same parent in the search tree. Instead of treating a keyword as a single, unbreakable Object, it is split into an array of Characters. Keys that share common prefixes reuse the same characters. The remaining, unmatched part of the keyword branches off. This way of storing saves a lot of space, as the tree is aware of the keyword's content. Such a data structure automatically lends itself to providing another very useful UI feature — auto-completion.

Auto-completion

Since the index is aware of the keywords, the search program navigates the tree so that the string of characters that have been visited in the tree match the text that the user has typed. If the user has typed "boo", then the tree has seen b, o, and o. The tree can then guess that the user is typing book, bookmark, or booklet, if the tree contains only these three words starting with boo. The tree will supply the closest match that appears first in alphabetical order. If there is no match, there will be no auto-completion, so the user can stop without having to type the entire search word.

Displaying Search Results

The search results should be displayed if the user's typed text matches a keyword in the index. Because the actual information can be as varied as simple text, web sites, or photos, the display of search results must be customizable. To achieve this, the actual implementation must be kept separate from the interface — a simple programming idiom.

UI Components and Layout

Now that we've hammered out most of the challenges, it's time to design the UI layout. The layout is very minimal, as seen in Figure 4. The most crucial Swing components are a custom JTextField to do auto-completion, a JTextArea to display the list of nearest matches and a JPanel containing a custom JComponent to display the search results.

Layout of the SearchAssist UI
Figure 4. The layout of the UI

Generating the Ternary Search Tree Index File

Source Code

Download the example source code for this article.

The Mozilla bookmarks file must be converted to XML format. A careful observation of the HTML bookmarks file reveals a simple structure that can be easily transformed to the XML structure using SAX. The HTML file contains several attributes that are not required for the transformation, and so can be safely ignored. See MozillaBookmark2XMLConverter.java.

The resulting XML file, containing the Category, Title, Item, and Link tags, is fed to a small Java program that just echoes the XML tags and text content into another file. It is a simple SAX Handler. It also generates a uniqueID for each Category and Item, adding it as an attribute and value pair. The IDs are generated such that the parent's ID is also preserved in the child's ID. If the parent XML node has an ID of 1_2, its children will have IDs of 1_2_1, 1_2_2, 1_2_3, and so on, as shown in Figure 5. This makes it possible to provide a "Go to Parent" link beside each subtree in the search results. Since every node carries within itself the parent's ID as a prefix, it is very easy to navigate to the parent node. I haven't yet implemented the navigation, though. See XMLIDGenerator.java.

index generation algorithm
Figure 5. The index-generation algorithm, illustrated

The code uses a combination of Stacks and HashMaps to accomplish this. The Stack is used because the parent's ID is a prefix for the child's ID. So, it helps keep track of the depth of the tree. The HashMap keeps track of which sibling the current node is, for that level. A node could be 1_1_3 if it's the second sibling, or 1_1_3, if it's the third.

A list of keywords and their corresponding uniqueIDs is required for the index. Since the uniqueIDs have been generated in the previous step, the next step is to generate the keywords that will point to the uniqueIDs, as shown in Figure 6.

XML input file with Keywords tag containing keywords picked from Mozilla bookmarks
Figure 6. The XML input file with keywords

Again, a simple SAX handler retrieves the text of the Title tags and tokenizes them. This list of tokens tag will contain non-keywords or stop words such as articles, pronouns, and prepositions that can be weeded out. The stop words are stored in a file that can be referred to by the SAX Handler. The filtered list of keywords is written back as a comma-separated list under a Keywords tag of the parent Category or Title tag. See KeywordTagAdder.java, excerpted here:

public void startElement(
    String namespaceURI,
    String lName, // local name
    String qName, // qualified name
    Attributes attrs) throws SAXException
{
    indentLevel++;
    nl();

	// element name
    String eName = lName;

    if ("".equals(eName))
    {
		// namespaceAware = false
        eName = qName;
    }

    currTagName = eName;
    emit("<" + eName);

    if (attrs != null)
    {
        for (int i = 0; i < attrs.getLength(); i++)
        {
            String aName = attrs.getLocalName(i);

            // Attr name
            if ("".equals(aName))
            {
                aName = attrs.getQName(i);
            }

            emit("\t");
            emit(aName);
            emit("=\"");
            emit(attrs.getValue(i));
            emit("\"");
        }
    }
    emit(">");
}

public void characters(char buf[], int offset, int len)
    throws SAXException
{
    String s = new String(buf, offset, len).trim();
    if (!s.equals(""))
    {
        //      nl();
        emit(s);
        keywords = null;
        if (currTagName
            .equals(getTagNameContainingKeywords()))
        {
            keywords = s;
        }
    }
}

public void endElement(
    String namespaceURI,
    String sName, // simple name
    String qName // qualified name
) throws SAXException
{
	// element name
    String eName = sName;

    if ("".equals(eName))
    {
        eName = qName;
        // namespaceAware = false
    }

    nl();
    emit("</" + eName + ">");

    //if (currTagName.equals(getTagNameContainingKeywords())) 
    //causes problems

    if (eName.equals(getTagNameContainingKeywords()))
    {
        nl();
        emit("<" + getTagNameToPutKeywordsIn() + ">");
        nl();
        emit(convertToKeywords(keywords));
        nl();
        emit("</" + getTagNameToPutKeywordsIn() + ">");
    }
    indentLevel--;
}

With the XML file containing both the Keywords and the uniqueIDs, the ternary search tree can be generated. As a keyword can occur in more than one Category or Item, the ternary search tree stores a List of uniqueIDs instead of just one. This List can provide another useful UI feature. The size() method on the List can determine the number of hits for the corresponding keyword, displayable in square brackets in the JTextArea. The ternary search tree thus populated is serialized into a file and stored on the disk. See TSTGenerator.java.

The steps preceding the actual index generation are required only if the Mozilla bookmarks file is used as the source of information. The index can be generated from any source — it is just a mapping of keywords to lists of unique IDs.

The Mozilla bookmarks file may contain some characters that cause the XML parser to fail. Because HTML allows unclosed tags such as <P>, <BR>, <HR>, and <DT>, these unclosed tags must be replaced by their XHTML equivalents — <p />, <br />, <hr />, and <dt />. As URLs may contain ampersands (&) as part of HTTP GET requests, they too must be converted to an XML entity equivalent, such as &#038;. A small Java class finds and replaces such troublemaking strings in the XML file before it is fed to the XML parser.

The steps described above constitute a simple information transformation pipeline. See SpecialCharactersFilter.java.

Search Engine Applet Implementation

Under the Hood

The ternary search tree is the heart of the applet. The index is deserialized from the file and loaded into memory in the applet's init() method. To navigate the tree (i.e., to query the index), the text content of the JTextField is used. It is wired to listen to keyPressed and keyReleased events using the addKeyListener(KeyListener) method. For every character pressed, the current position in the index moves deeper into the tree, provided the prefix exists.

To provide the auto-completion functionality, the JTextField is customized through inheritance. The AutoCompleteTextField has an anonymous inner class that extends the KeyAdapter to listen to keyPressed and keyReleased events. The AutoCompleteTextField uses an AutoCompleteDictionary to do a lookup(textFieldContents) of the typed text. The AutoCompleteDictionary is an interface, so that different implementations can be plugged in. The lookup(textFieldContents) method returns a string that is the auto-completion text suggested by the AutoCompleteDictionary. If the string is empty, textField remains unchanged, indicating that the search word does not exist. See AutoCompleteTextField.java, excerpted here:

protected void setUp()
{
    addKeyListener(new KeyAdapter()
    {
        public void keyPressed(KeyEvent e)
        {
            isTextSelected =
                getSelectionStart()
                    != getSelectionEnd();
        }
        
        public void keyReleased(KeyEvent e)
        {
            char charPressed = e.getKeyChar();
            int charCodePressed = e.getKeyCode();
            if (charCodePressed == KeyEvent.VK_DELETE
                || charPressed == KeyEvent.CHAR_UNDEFINED)
            {
                return;
            }
            if (getSelectionStart()
                != getSelectionEnd())
            {
                setText(
                    getText().substring(
                        0,
                        getSelectionStart()));
            }
            String input = getText();
            if (lookup(input) != null)
            {
                setText(lookup(input));
                setSelectionStart(input.length());
                setSelectionEnd(getText().length());
                isTextSelected = true;
            }
            else
            {
                isTextSelected = false;
            }
            if (charCodePressed
                == KeyEvent.VK_BACK_SPACE
                && isTextSelected
                && input.length() > 0)
            {
                setText(
                    input.substring(0, input.length()));
            }
        }
    });
}

Figure 7 shows the package structure of the applet.

Package structure
Figure 7. The package structure

Because the ternary search tree is the dictionary here, it can be used directly to provide the auto-completion text. The TSTDictionary implements the AutoCompleteDictionary's lookup(textFieldContents) method by invoking the matchPrefixString(textFieldContents) on the ternary search tree it contains internally. If the tree has a keyword whose prefix matches or completely matches the textFieldContents, the keyword is returned. Otherwise, if there are no matches it returns null. Apart from returning the closest match, the TSTDictionary populates the JTextArea with the remaining near-matches. The size() of the lists of keywords contained is displayed alongside the keywords in square brackets indicating the number of hits. See TSTDictionary.java, excerpted here:

public String lookup(String s)
{
    String textFieldContents = textField.getText();
    if (textFieldContents == ""
        || textFieldContents == null)
    {
        return null;
    }

    String matchingKeys = tst.matchPrefixString(textFieldContents);
    textArea.setText(matchingKeys);
    String match = tst.matchPrefixString(textFieldContents, 1);

    if (match != null)
    {
        if (match == "")
        {
            match = s;
        }
        else
        {
            int markerLocation =
                match.indexOf(
                    TernarySearchTree.MARKER_HIT_OPEN)
                    - 1;
            if (markerLocation > 0)
            {
                match = match.substring(0, markerLocation);
            }
        }
    }
    return match;
}

The Custom JTextField

When the user presses the Enter key in the AutoCompleteTextField, the ActionListener's actionPerformed() method will be invoked. This method retrieves the list of unique IDs from the ternary search tree using the textField contents. If the textField contains a keyword that exists in the tree, it will be auto-completed. The list of unique IDs thus obtained is rendered on the right-hand side. See TSTApplet.java, excerpted here:

public void init()
{
    super.init();
    final String serTSTFile =
        getParameter(PARAM_SERIALIZED_TST_FILE);
    final String idRendererImpl =
        getParameter(PARAM_ID_RENDERER_IMPL);
    try
    {
        ObjectInputStream ois =
            new ObjectInputStream(
                getClass().getResourceAsStream(
                    serTSTFile));
        tst = (TernarySearchTree) ois.readObject();
        Class clazz =
            getClass().getClassLoader().loadClass(
                idRendererImpl);
        idRenderer = (IDRenderer) clazz.newInstance();

        // Anonymous inner class implementing
        // the Private Interface pattern.
        Attributes attrs = new Attributes()
        {
            public String getParameter(String paramName)
            {
                return TSTApplet.this.getParameter(
                    paramName);
            }
            public JSObject getWindow()
            {
                return JSObject.getWindow(
                    TSTApplet.this);
            }
        };
        idRenderer.init(attrs);
        renderingComponent =
            idRenderer.getRenderingComponent();
    }

...

    initComponents();
    dictionary = new TSTDictionary(tst, textField, textArea);
    textField.setDictionary(dictionary);
}

private void textFieldActionPerformed(
    java.awt.event.ActionEvent evt)
{
    String str = textField.getText();
    if (str.equals(""))
    {
        return;
    }

    List list = tst.get(str);
    if (list == null)
    {
        return;
    }
    idRenderer.renderNodes(list);
}

Customizable Search Result Renderer

From the beginning, one of the main goals has been to keep the rendering customizable. To facilitate that, the TSTApplet uses an interface on which it invokes the renderNodes(List ids) method. This IDRenderer interface has two lifecycle methods.

The init(Attributes params) method initializes the component with the Attributes instance, and the getRenderingComponent() method returns the actual Swing component that renders the search results. This component is added to the JPanel on the right-hand side. To TSTApplet is made aware of the class that implements IDRenderer by specifying its fully qualified class name as the IDRendererImpl Applet parameter. Using this, the applet, in the init() method, dynamically loads the IDRenderer implementation class. See IDRenderer.java (shown below), TST.html, and TSTApplet.java.

/**
 * Order of invocation: (setApplet)             - Once
 *                      (getRenderingComponent) - Once
 *                      (renderNodes)           - Zero or more
 */
public interface IDRenderer
{
  public void init(Attributes params);
  public JComponent getRenderingComponent();
  public void renderNodes(List ids);
}

Rendering HTML Using JEditorPane

The index, in this case, is a map of keywords present in the Title tags of Categorys and Items to their unique IDs. Consequently, the IDRenderer implementation must display the XML fragment marked by the uniqueID. In order to do that, the XML file containing only the unique IDs (without the keywords) must be available to the Applet. Although the XML file with the IDs and keywords will serve the purpose, the additional keyword tags increase the file size and the memory required to load it. The IDRenderer that does this is XPathNodeAccessHTMLRenderer.

On instantiation, the XML file is loaded as a DOM document. To allow the XML file to be switched or replaced with an updated file, the name and location are not hardcoded. The XMLWithIdsFile Applet parameter points to the actual location. Even the index file's location is not hardcoded. It is instead fetched using the SerializedTSTFile Applet parameter. These two input files are compressed and placed in a separate .zip file called DataFiles.zip that should be present in the Applet's classpath. See XPathNodeAccessHTMLRenderer (shown here) and TST.html.

public void init(Attributes attrs)
{

...

        // Create a builder factory
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder docBuilder = factory.newDocumentBuilder();

        // Create the builder and parse the file
        final String xmlIDsFile = attrs.getParameter(PARAM_XML_WITH_IDS_FILE);
        Document inDoc = docBuilder.parse(
                getClass().getResourceAsStream( xmlIDsFile ));
        doc = inDoc;

...

    postInit(attrs);
}

protected void preRenderNodes(
    List ids,
    StringBuffer strBuf)
{
    editorPane.setCursor(waitCursor);
    strBuf.append(
        "<head><base></head><body>\n");
}

public void renderNodes(List ids)
{
    StringBuffer stringBuffer = new StringBuffer();
    preRenderNodes(ids, stringBuffer);

    for (Iterator iter = ids.iterator(); iter.hasNext();)
    {
        String id = (String) iter.next();
        try
        {
            String part = "[@uniqueID='" + id + "']";
            renderNode(
                "//category"
                    + part
                    + " | //item"
                    + part,
                stringBuffer);
        }
        catch (TransformerException e)
        {
            e.printStackTrace();
        }
    }

    postRenderNodes(ids, stringBuffer);
}

protected void postRenderNodes(
    List ids,
    StringBuffer strBuf)
{
    strBuf.append("</body></html>");
    editorPane.setText(strBuf.toString());
    editorPane.setCursor(cursor);
}

XPath to the Rescue

The XPathNodeAccessHTMLRenderer uses a JEditorPane to render the XML fragment as HTML with links. That's why getRenderingComponent() returns a JEditorPane. When the user presses the Enter key in the textField to request the search results, the AutoCompleteTextField's ActionListener fetches the list of unique IDs for the search word, if it exists, sending it to the IDRenderer's renderNodes(List ids).

For each uniqueID in the list, the renderNodes(...) method uses XPath to retrieve the XML fragment pointed to by the ID. Since the elements can only be either Category or Item tags, the XPath query is fine-tuned to improve search speeds. With the Element thus obtained, simple HTML lists are generated by iterating through it recursively. The Title tags become the List items in bold, and the Links become HTML anchor tags displayed as nested HTML lists. See XPathNodeAccessHTMLRenderer.java.

Java-JavaScript Communication

When a Link is clicked in the JEditorPane, a new browser window opens to display the URL. This requires Netscape LiveConnect to do the Java-JavaScript communication. See LinkClickListener.java, excerpted here:

public class LinkClickListener implements HyperlinkListener
{
    private JSObject window;
    public LinkClickListener(Attributes attrs)
    {
        window = attrs.getWindow();
    }

    public void hyperlinkUpdate(HyperlinkEvent event)
    {
        if (event.getEventType()
            == HyperlinkEvent.EventType.ACTIVATED)
        {
            String[] args =
                new String[] {
                    event.getURL().toExternalForm(),
                    "",
                    "" };
            window.call("open", args);
        }
    }
}

HTML Renderer Using the Mozilla Sidebar

Because the search result rendering is designed to be customizable, transforming the applet into a Mozilla Sidebar is quite simple. The XPathNodeAccessHTMLRenderer is extended by another class, the MozillaSidebar, to reuse most of the XML to HTML transformation. Instead of rendering the resulting HTML in the right-hand-side JEditorPane, it is rendered in a small browser window using JavaScript through LiveConnect, as shown in Figure 8.

SearchAssist using the MozillaSidebar renderer
Figure 8. SearchAssist in the Mozilla Sidebar

The getRenderingComponent() returns a JPanel of zero size. So the right-hand side is almost invisible, leaving only the JTextField and the JTextArea visible, which fits into the Sidebar panel on the left-hand side of the browser. See MozillaSidebar.java and TST v1.1.html, excerpted here:

public class MozillaSidebar
    extends XPathNodeAccessHTMLRenderer
{
    private static final boolean DEBUG = true;
    private JSObject window;
    public MozillaSidebar()
    {
        super();
    }

    public JComponent getRenderingComponent()
    {
        JPanel panel = new JPanel(null);
        panel.setSize(0, 0);
        return panel;
    }

    protected void postInit(Attributes attrs)
    {
        window = attrs.getWindow();
    }

    protected void preRenderNodes(
        List ids,
        StringBuffer strBuf)
    {
        strBuf.append(
            "<head><title>Search results</title>"
                + "<base></head><body>\n");
    }

    protected void postRenderNodes(
        List ids,
        StringBuffer strBuf)
    {
        strBuf.append("</body></html>");
        window.call(
            "renderHTML",
            new Object[] { strBuf.toString()});
    }
}

<!-- JavaScript code accompanying the Applet to render the search results
in a Browser pop-up window -->
<script language="JavaScript">  
function renderHTML(str) 
{     
  newWindow = window.open("","SearchResults",
              "height=300,width=350,resizable");
  newWindow.document.write(str);
  newWindow.document.close();
  
  newWindow.focus();
  return true;
} 
</script>

The .jar file must be signed, because the XML parser reads some System properties that are not available to unsigned Applets.

Customization Through Applet Parameters

The SerializedTSTFile, IDRendererImpl, and the XMLWithIdsFileApplet parameters are provided to make the necessary customizations. The Archive parameter is used to supply DataFiles.zip, which contains all the data files, especially the serialized index and the XML file.

<!-- Applet code using XPathNodeAccessHTMLRenderer -->
<APPLET CODE = "searchassist.ui.TSTApplet" 
    CODEBASE = "." 
    ARCHIVE = "SSearchAssist.jar, DataFiles.zip" 
    WIDTH = "590" HEIGHT = "600" 
    ALIGN = "baseline">
  <PARAM NAME = SerializedTSTFile 
    VALUE ="/resources/tst.ser">
  <PARAM NAME = IDRendererImpl 
    VALUE ="searchassist.ui.XPathNodeAccessHTMLRenderer">
  <PARAM NAME = XMLWithIdsFile 
    VALUE ="/test/output_index_mod.xml">
</APPLET>

<!-- Applet code using MozillaSidebar -->
<APPLET CODE = "searchassist.ui.TSTApplet" 
    CODEBASE = "." 
    ARCHIVE = "SSearchAssist.jar, DataFiles.zip" 
    WIDTH = "163" HEIGHT = "450" 
    ALIGN = "baseline">
    
  <PARAM NAME = SerializedTSTFile 
    VALUE ="/resources/tst.ser">
  
  <PARAM NAME = IDRendererImpl 
    VALUE ="searchassist.ui.MozillaSidebar">
  
  <PARAM NAME = XMLWithIdsFile 
    VALUE ="/test/output_index_mod.xml">
</APPLET>

Conclusion

SearchAssist is limited by memory consumption, as the whole index is loaded all at once. The XML file also takes up memory when loaded using DOM. More mature index systems like JavaHelp should be used for larger indices.

Resources

Requirements

Ashwin Jayaprakash is a software engineer at BEA Systems R & D Centre, Bangalore, using Java/J2EE to develop WebLogic Integration products.


Return to ONJava.com.

Copyright © 2009 O'Reilly Media, Inc.