 Published on Web DevCenter (http://www.oreillynet.com/javascript/)


Essential JavaScript

Parsing and DOM-Tree Building With JavaScript

04/13/2001

Remember the first time you heard about the DOM? Do you still get this shiver running down your back at the mere mention of it?

You may call it a revolution, especially compared to the former pseudo-standard interfaces. However, as important as the DOM is, there are utilities missing in the standard implementation that we would like to have.

For example -- for many years the only way to insert dynamic content into your documents was to use the document's writeln() method. It's still there and still has the old disadvantage of destroying your documents when writing to an already closed document. So what we need is a standards-compliant method to output any text, interspersed with HTML markup, at any position of a document.

I'm going to shy away from using proprietary extensions to solve this problem. Instead, I'll create a simple parser and talk about the process of building a DOM tree from a plain string. In case you're totally new to the DOM, you may want to read the last Essential JavaScript column, Document Mathematics: Count Your Words, for a brief introduction. Also, before we start, I want to spend a few minutes explaining why I think taking the standards approach is so important.

Standards: to be or not to be?

I have a philosophical question: Is it best to enhance your pages with a tempting proprietary extension (remember those weird filters in IE) or stick with the standards? On one hand, proprietary extensions are convenient and do offer handy functionality for those in your audience who have the right configuration to use them. On the other hand, standards open your pages to the greatest possible audience, but with the drawback of possibly limited functionality and probably more work.

On one level, this article is much more than pure philosophy. You probably wouldn't even be able to read this article without many of the critical standards that serve as the foundation for the Web. Viewed from this perspective, it's hard to argue against standards-compliant authoring.

Astonishingly enough, however, many developers have wandered away from standards-compliant authoring. Possibly they've been tempted by the ease of using proprietary extensions that save time and effort. Yet by using proprietary extensions, those developers are sending all the efforts of those guys creating standardized layers of communication and transport to the /dev/null device. That's because they're blocking independent communication by proprietary presentation on a higher level.

From my point of view, the solution is simple: Just stick to the standards and use them whenever possible. Given this scenario, you're ready for new browsers or other applications that support the standards you're already using.

OK, that's enough of that for now -- let's get technical again!

Creating elements

To interact with the DOM, you have to think in terms of nodes and elements. A string containing markup can't be inserted into the tree without transforming it. Even a pure plain-text string can't be inserted into the tree without creating a text node first. To create a text node you use a method of the document object:

var text_node = document.createTextNode(a_string);

Now you have a text node object. It implements the Text interface, which inherits from CharacterData and, through it, from Node, all defined in the DOM Level 1 spec.

Creating an HTML element is just as easy:

var html_element = document.createElement(element_name);

When working with HTML documents, the type of the returned object depends on the given element name. The DOM Level 1 HTML spec defines special objects for all HTML elements. These objects were designed to address two requirements: backwards compatibility with older, non-standard DOM implementations, and an easier view of HTML elements, as you can access the properties directly on the objects rather than dealing with special attribute nodes.
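To make the two calls concrete, here is a minimal sketch that combines them. The helper name makeElementWithText and its doc parameter are hypothetical and not part of the library discussed in this article; passing the document in as an argument merely keeps the sketch easy to try outside a browser.

```javascript
// Hypothetical helper: create an element, give it one text child, return it.
// `doc` is expected to be a DOM Document (in a browser, pass `document`).
function makeElementWithText(doc, elementName, text) {
  var element = doc.createElement(elementName);
  var textNode = doc.createTextNode(text);
  element.appendChild(textNode);   // the string has to be wrapped in a node first
  return element;
}

// Browser usage (assumes the page has a body element):
//   document.body.appendChild(makeElementWithText(document, "p", "Hello DOM"));
```

Note that the new element is not visible until it is appended somewhere in the document's tree.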

Simple parsing

So, how do we transform a plain string into DOM nodes? It sounds like a complicated task, but due to the strict and simple syntax of XML-based languages, it's not. Just remember two of the basic rules:

The XML syntax differentiates between two types of elements, empty and non-empty. A non-empty element is an element that can have some content, i.e., it can contain a set of child nodes; thus a non-empty element always has a start and an end tag.

The second type is the empty element. It is indicated by a single start tag and can't have child nodes other than attribute nodes. Empty elements are easily recognized by the slash, /, preceding the closing angle bracket.

The syntactical difference between the two types is very important for automated processing of XML-formatted data; more on that later.
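The distinction can be checked mechanically. The following is only a sketch; the function name classifyTag is hypothetical, and it assumes it is handed one complete tag that begins with "<" and ends with ">".

```javascript
// Hypothetical helper: classify a tag string such as "<b>", "</b>" or "<br/>".
// Assumes `tag` is one complete tag, from "<" up to and including ">".
function classifyTag(tag) {
  if (tag.charAt(1) == "/") {
    return "end";      // e.g. </b>
  }
  if (tag.charAt(tag.length - 2) == "/") {
    return "empty";    // e.g. <br/> or <br />
  }
  return "start";      // e.g. <b> or <a href="index.html">
}
```

Only the character right after the opening bracket and the one right before the closing bracket are inspected, so slashes inside quoted attribute values do not confuse the check.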

As tags are always delimited by a leading and a trailing angle bracket, it's easy to find them in a given string. Just scan for an opening angle bracket, then a closing one, and access everything in between:

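var s = "Some <b>bold</b> text";
for (var i = 0; i < s.length; i++) {
  var c = s.charAt(i);
  if (c == "<") {
    // look for the matching closing angle bracket
    var j = s.indexOf(">", i + 1);
    if (j != -1) {
      alert("tag=" + s.substring(i, j + 1));
      i = j;  // continue scanning after the tag
    }
  }
}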

This simple algorithm finds any tag regardless of its type. However, for building your own DOM tree, only the start tags produce new nodes. An end tag is nothing more than an indicator marking the end of an element. This means that content following the end tag of a given element can never belong to that element itself; it can only be part of the element's parent node or of that parent's other descendants.

Building the tree

Let's have a closer look at the process of building a tree. Every tree has a single root node, which forms the base of the tree.

As we want to create a method to write into an element, we have this root point already. What we actually need to do is build a smaller sub-tree inside the document's complete tree and consider the element we're gonna write into as the root node of this tree.

Here's an example of a visualized DOM tree:

A visual representation of a DOM tree.

The first step for an algorithm is to search for all tags (as described above) and textual data. In terms of computer science, this process is called scanning.

This scanning happens in a linear fashion, from left to right. If we consider the string as a one-dimensional object, our task is to add a second dimension, height, which expresses the different levels of the tree.
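As an illustration of the scanning step, the sketch below collects both the tags and the text between them into a flat list. The function name scanMarkup and the token format ({ type, value }) are assumptions made for this example only.

```javascript
// Hypothetical scanner: split a markup string into text and tag tokens,
// working strictly from left to right.
function scanMarkup(s) {
  var tokens = [];
  var i = 0;
  while (i < s.length) {
    var open = s.indexOf("<", i);
    var close = (open == -1) ? -1 : s.indexOf(">", open + 1);
    if (close == -1) {
      // No complete tag left: the rest of the string is plain text.
      tokens.push({ type: "text", value: s.substring(i) });
      break;
    }
    if (open > i) {
      // Text between the current position and the next tag.
      tokens.push({ type: "text", value: s.substring(i, open) });
    }
    tokens.push({ type: "tag", value: s.substring(open, close + 1) });
    i = close + 1;
  }
  return tokens;
}

var tokens = scanMarkup("Some <b>bold</b> text");
// tokens now holds five entries: text, tag, text, tag, text
```

The result is still flat; turning it into a tree is the job of the next step.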

How would that look on a very simple example -- let's say adding some text under a paragraph element?

An algorithm must take care to add characters to text nodes only. So, when we start with a p element, we have to create a new text node, assign the text as its value, and append the node to the p element node.

If a (non-empty) element follows the text, we have to climb up the tree again and append the element to the parent node. Then we climb down the tree again into the appended node and expect further text or markup. Everything before the end tag of this element is a descendant node of the current element.

Empty elements, on the other hand, only need to be appended to the current pointer node (which holds our position in the tree), since they can't have child nodes.

As you can see, building a tree is rather simple after becoming familiar with the structure of the DOM. Now we must be able to create elements from an extracted substring. An algorithm with this behavior is implemented in writeCode().
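The climbing up and down described above can be sketched in a few lines. This is not the source of writeCode(); the name buildSubTree, the doc parameter, and the tokenizing regular expression are assumptions for illustration, and attributes are ignored to keep the walk itself visible.

```javascript
// Hypothetical sketch of the tree-building walk; attributes are ignored here.
// `doc` is a DOM Document, `root` the element we want to write into.
function buildSubTree(doc, root, markup) {
  // Split into complete tags and runs of text (a stray "<" is dropped).
  var tokens = markup.match(/<[^>]+>|[^<]+/g) || [];
  var pointer = root;                 // our current position in the tree
  for (var i = 0; i < tokens.length; i++) {
    var token = tokens[i];
    if (token.charAt(0) != "<") {
      // Plain text: wrap it in a text node and append it.
      pointer.appendChild(doc.createTextNode(token));
    } else if (token.charAt(1) == "/") {
      // End tag: climb back up to the parent, but never above the root.
      if (pointer != root) {
        pointer = pointer.parentNode;
      }
    } else {
      var name = token.match(/^<(\w+)/);
      if (!name) {
        continue;                     // not a recognizable tag; skip it
      }
      var element = doc.createElement(name[1]);
      pointer.appendChild(element);
      if (token.charAt(token.length - 2) != "/") {
        pointer = element;            // non-empty element: climb down into it
      }
    }
  }
  return root;
}

// Browser usage (assumes an element with the id "target" exists):
//   buildSubTree(document, document.getElementById("target"),
//                "Some <b>bold</b> text<br/>");
```

The sketch trusts the end tags to be properly nested and does not compare their names with the element it is leaving.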

Creating elements

After finding a tag, we have to transform it into a DOM node. The structure of a start tag is defined as follows:

An opening angle bracket, followed by the element's name. Optionally, the tag can contain a set of attributes: key/value pairs, each joined by an equals sign, separated from the name and from each other by white space. The tag is closed with a right angle bracket. Empty elements are indicated by a slash before the closing angle bracket.

This pattern can be expressed easily with a regular expression like this (we'll handle empty elements later):

/<(\w+)(\s+)?([^>]+)?>/

The parentheses are used to group sub-patterns which can be accessed after executing the regular expression:

var a = str.match(/<(\w+)(\s+)?([^>]+)?>/);

The variable a now holds an array, which contains the matched sub-patterns starting with the left-most pattern at a[1]:

var node = document.createElement(a[1]);
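To see what each group captures, here is the same expression applied to a sample tag. The sample strings and the variable names elementName and attributeString are made up for this illustration.

```javascript
var str = '<a href="index.html" title="Home">';
var a = str.match(/<(\w+)(\s+)?([^>]+)?>/);

// a[0] is the whole match, a[1] the element name,
// a[2] the white space after the name, a[3] the raw attribute string.
var elementName = a[1];
var attributeString = a[3];

// A tag without attributes leaves the optional groups undefined,
// so a[3] should be checked before it is split.
var b = "<b>".match(/<(\w+)(\s+)?([^>]+)?>/);
var hasAttributes = (b[3] !== undefined);
```

Here elementName is "a" and attributeString is the text between the name and the closing bracket, while hasAttributes is false for the bare b tag.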

Now that we have created an element, the next step is finding the attributes and adding each to the element. If present, the complete set of white-space separated attributes can be found in a[3]; just split it to get all key/value pairs:

var attrs = a[3].split(" ");

Using a loop, we access all items and split them once again, this time at the equals sign that separates the key and the value. Then we create a new attribute node from the key; like createElement(), createAttribute() is a method of document. Before we can actually set the value, we have to remove the quotes around the value string:

var att = attrs[k].split("=");  // attrs[k] is the current key/value pair in the loop
var a_n = document.createAttribute(att[0]);
a_n.value = att[1].replace(/^['"](.+)['"]$/, "$1");

That's it. After connecting the new attribute node to our formerly created element node, we're done:

node.setAttributeNode(a_n);
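Put together, the steps of this section might look like the sketch below. It is not the library's createElementFromString(); the name sketchElementFromTag and the doc parameter are hypothetical, and the sketch assumes that attribute values contain neither white space nor equals signs.

```javascript
// Hypothetical sketch; not the library's createElementFromString().
// Assumes attribute values contain no white space and no "=" characters.
function sketchElementFromTag(doc, tag) {
  var a = tag.match(/<(\w+)(\s+)?([^>]+)?>/);
  if (!a) {
    return null;                       // not a start tag
  }
  var node = doc.createElement(a[1]);
  // Drop the trailing slash of an empty element before looking at attributes.
  var attrString = (a[3] || "").replace(/\s*\/$/, "");
  if (attrString) {
    var attrs = attrString.split(/\s+/);
    for (var i = 0; i < attrs.length; i++) {
      var att = attrs[i].split("=");
      if (att.length != 2 || att[0] == "") {
        continue;                      // skip anything that is not key=value
      }
      var a_n = doc.createAttribute(att[0]);
      // (.*) rather than (.+) so that empty values lose their quotes too
      a_n.value = att[1].replace(/^['"](.*)['"]$/, "$1");
      node.setAttributeNode(a_n);
    }
  }
  return node;
}

// Browser usage:
//   var link = sketchElementFromTag(document, '<a href="index.html">');
```

Returning null for anything that does not look like a start tag leaves it to the caller to decide how to treat end tags and plain text.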

The actual library function additionally performs some checks; I only outlined the basic process here. See the complete source code of createElementFromString().

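The algorithm is simple but effective for well-formed markup. Note that the basic process outlined here splits the attributes at white space, so it assumes attribute values that contain no spaces.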

A few final words ...

We're done! We have a function to parse strings and create elements from them, createElementFromString(), and a function that creates a sub-tree under a given element, writeCode().

Actually, the DOM Level 1 spec defines a class for sub-trees. These are called DocumentFragments. Unfortunately, current implementations we're working with don't support this very well.

Of course, this wouldn't be a real browser application unless we had to deal with some wrong behavior in the DOM implementation.

Fortunately, only a single line of this application had to be changed to make it work in Internet Explorer as well. This is because IE doesn't implement appendData(), which the DOM defines as a method of CharacterData. The solution is to operate on the node's properties directly by accessing nodeValue.
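One way to express that workaround is a small wrapper that prefers appendData() where it exists and otherwise falls back to nodeValue. This is only a sketch; the helper name appendText is hypothetical, and the library itself may simply use nodeValue everywhere.

```javascript
// Hypothetical helper: append text to an existing text node.
function appendText(textNode, text) {
  if (textNode.appendData) {
    textNode.appendData(text);     // CharacterData method from the DOM spec
  } else {
    textNode.nodeValue += text;    // fallback: change the node's value directly
  }
}

// Browser usage:
//   var t = document.createTextNode("Hello");
//   appendText(t, ", world");
```

Since assigning to nodeValue is part of the DOM as well, using it unconditionally is an equally standards-compliant choice.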

For instructions on how to use the code in your own pages, please refer to the O'Reilly JavaScript Library.

Claus Augusti is O'Reilly Network's JavaScript editor.


Copyright © 2009 O'Reilly Media, Inc.