Essential JavaScript

Document Mathematics: Count Your Words

03/30/2001

OK, finally they convinced you to use HTML as the document format of your choice. Sure, it's a great format, as more and more applications make use of it -- not to mention the Web. But, to be honest, working with HTML documents is like living in the pre-word-processing age in many cases.

Have you ever tried to find out the number of words or letters in an element or the whole document? Or ever tried to include this number dynamically into your document? A DOM implementation, a bit of JavaScript, and this article will tell you -- by the way, this document contains [word count inserted by the script] words.

Introduction

OK, let's say we want to create an application that counts the number of words in a paragraph, table, or the whole document, and prints this number exactly where we want it to. We want to be able to format the result, of course. Naturally, we need a reusable solution.

This is what we basically need to solve this task:

  1. The ability to retrieve the text of an element and all its children
  2. A mechanism to scan for words and letters in the retrieved text.

The second sub-task is an easy one; a few lines of code and some regular expressions will do it. We'll talk about that later.

Since we want to be good guys, we'll use only standards-compliant methods to work with documents when solving the first sub-task. We won't use a nonstandard (well, I have to admit it -- handy) property like innerText, which is a browser extension and not part of the DOM Level 1 specification.

Instead, I'll show you how to climb any DOM tree. You'll find this much handier, because you can recycle the techniques described for a lot of other tasks. I'll end our journey through the DOM with some final words about the drawbacks of alternative solutions and some buggy browser behavior.

A quick note...

Sometimes the term DOM might seem confusing. Depending on the context in which it's used, it can refer to:

  • The DOM (Document Object Model) Level 1 specification by the W3C. The most recent working draft is DOM Level 3, but as DOM Level 1 is the common denominator supported by both Mozilla/Netscape 6 and Internet Explorer, we'll only use this version.
  • The implementation of the DOM specification in an application
  • A live object model of a document.

To keep things simple, I'll give only a very basic overview of the concepts behind the Document Object Model; we'll only talk about the things we need to solve our task.

Climb the DOM

The DOM is a (very) basic system for organizing the parts of a document in a set of parent-child relations: every node has exactly one parent, except the root, and there is exactly one root. To put it simply for our case, a node is either an element created by some markup (the tags written between the two angle brackets, < and >) or the text in between the markup.

When you visualize such a model, you'll find it forms a flipped tree, i.e., the root is at the top. Visualizing this model on your own is pretty easy, as you're already familiar with this system. Every time you mark up a document with HTML you're using it, as you're nesting elements within elements, thus creating parent-child relations. Picture 1 shows some HTML code and the corresponding part of the DOM tree.


Picture 1: Some HTML code and the corresponding part of the DOM tree

As you can see, the code translates into the graph almost automatically.

The node object
The basic unit of the Document Object Model is the node. This basic unit is extended for various purposes; e.g., there are element nodes, text nodes, attribute nodes, etc. Here are some important properties (and there's more detailed information available from www.w3.org):

  • nodeName, which contains the name of an element (e.g., table, a, meta, img)
  • nodeValue, which holds the text contained in a text node
  • nodeType, a number indicating one of the 12 node types. All text nodes have type 3.
  • parentNode, a reference to the parent node
  • firstChild, a reference to the first child node of a node. Corresponds to childNodes[0]
  • childNodes, a list of nodes containing all child nodes in sequential order. A nodelist is actually its own object; however, current DOM implementations allow you to access a nodelist like an array with JavaScript.
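
To make these properties concrete, here is a small sketch that models the fragment <p>Hello <b>world</b></p> with plain JavaScript objects. The makeNode() helper is hypothetical and only mimics the properties listed above, so that the example runs outside a browser; in a real page the DOM implementation supplies the node objects for you.

```javascript
// Hypothetical helper: builds a plain object that mimics a DOM node.
// nodeType 1 = element node, 3 = text node.
function makeNode(type, name, value, children) {
  var node = {
    nodeType: type,
    nodeName: name,
    nodeValue: value,
    parentNode: null,
    childNodes: children || [],
    firstChild: null
  };
  // Wire up the parent-child relations in both directions.
  for (var i = 0; i < node.childNodes.length; i++) {
    node.childNodes[i].parentNode = node;
  }
  if (node.childNodes.length > 0) { node.firstChild = node.childNodes[0]; }
  return node;
}

// Models <p>Hello <b>world</b></p>
var hello = makeNode(3, "#text", "Hello ");
var world = makeNode(3, "#text", "world");
var bold  = makeNode(1, "B", null, [world]);
var para  = makeNode(1, "P", null, [hello, bold]);

var first_type  = para.firstChild.nodeType;                  // 3, a text node
var same_node   = (para.firstChild === para.childNodes[0]);  // true
var parent_name = world.parentNode.parentNode.nodeName;      // "P"
```

Note that the word "world" is a grandchild of the paragraph, not a child: it lives in a text node under the B element.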

In addition to these properties, the DOM also defines a lot of methods for the individual objects. I'll introduce them only when necessary for our purposes. In case you're interested in the complete specs, you should not only read the descriptive texts, but also the ECMA Script Language Binding. The binding describes what a DOM implementation should look like from the JavaScripter's side.

Now that we have the description of the node object, how do we connect to a browser's live DOM tree?

The document object
If you've ever tried one of these "Hello World" examples, you're already familiar with this object. It's the object associated with every document since the introduction of JavaScript (and it contains a writeln() method to print out some string). The document object has differed considerably between the two major browsers ever since. Fortunately, we don't need to fork the code for basic actions like retrieving the root element of a document:

var doc_root = document.documentElement;

doc_root now contains a live connector to the document's root element (the HTML element for HTML documents). As every element is derived from the node object, you can simply climb around the DOM tree by using only the properties described above.

To find an individual element for which the ID attribute is defined, use the getElementById() method of document:

var foo_element = document.getElementById("foo");

And access the tree:

var parent = foo_element.parentNode;
var grandpa = parent.parentNode;
var the_bunch = parent.childNodes;
var Jon_Boy = parent.childNodes[0];

OK, back to our word counter problem. We're looking for text, which is always stored in a text node (don't worry about your JavaScript code, as this goes into a different kind of node), so we can compare a node's nodeType to 3 (which is the pre-defined constant value for text nodes). Note that getElementById() always returns an element, so we look at the element's first child here:

// src holds the ID of the element to inspect
var e = document.getElementById(src).firstChild;
if (e != null && e.nodeType == 3) {
  var content = e.nodeValue;
}

Now things get a bit complicated. Our solution must handle a DOM tree of any shape, because we want to count all words under a given element (remember the flipped tree model). The answer is actually simple: recursion.

Simple recursion

Recursion is one of the most powerful mechanisms in programming. It tends to be quite confusing, and has not been necessary for most pre-DOM JavaScript applications. However, we only need some basic information about it, so I think its use will be clear for this application.

Basically, a recursive function is a function that calls itself while it's executed. The path a recursive function takes through the code must be well defined, so that at some point it stops calling itself; otherwise you end up with an endless chain of calls.

The interesting part is: What happens when a function interrupts itself at some point by calling itself? You get the same behavior that you'd expect from any other function call. After returning from some method, you'll find your local variables untouched. Even after returning from a call to the same method, you'll find them untouched, as every function call operates with its own local variables.
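
A minimal sketch may help here. The function countDown() and the log array are made up for illustration; the point is only that each call keeps its own n and label, and that the test n > 0 guarantees the chain of calls ends.

```javascript
var log = [];

function countDown(n) {
  var label = "call " + n;   // local to this particular call
  if (n > 0) {
    countDown(n - 1);        // the nested call gets its own n and label
  }
  log.push(label);           // label is untouched after the nested call returns
}

countDown(2);
// log is now ["call 0", "call 1", "call 2"]
```

The innermost call finishes first, which is why its label is logged first; the outer calls then resume exactly where they were interrupted.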

Let's look at some non-recursive code that retrieves the text nodes that are direct children of some element:

var content = "";
if (node.childNodes != null) {
  for (var i = 0; i < node.childNodes.length; i++) {
    var e = node.childNodes[i];   // the child currently being inspected
    if (e.nodeType == 3) { content += e.nodeValue; }
  }
}

To retrieve the text nodes at every level, you'd need some code that calls itself for every child node found, as each of these could contain more elements or text nodes. The solution is obvious: We wrap the code snippet above into a function:

var words = "";
traverse( document.getElementById("p1") );

function traverse(node) {
  if (node.nodeType == 3) { words += node.nodeValue; }
  if (node.childNodes != null) {
    for (var i=0; i < node.childNodes.length; i++) {
      traverse(node.childNodes.item(i));
    }
  }
}

That's it, a recursive function to walk through any DOM tree, no matter which form it takes. The statement that collects the text has moved out of the loop, so that it applies to every node visited; inside the loop, the function now calls itself instead. Because each loop picks up where it left off after returning from a (nested) call, this algorithm catches all nodes under the root node given in the first call to traverse(), e.g., the element with the ID p1.
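
As a variation, and only as a sketch, the same walk can return the collected text instead of appending to a global variable. The name collectText() is made up for this example, and the tree is a plain-object stand-in for real DOM nodes (with an array for childNodes) so that the code also runs outside a browser.

```javascript
// Hypothetical variant of traverse() that returns the text it finds.
function collectText(node) {
  var text = "";
  if (node.nodeType == 3) { text += node.nodeValue; }
  if (node.childNodes != null) {
    for (var i = 0; i < node.childNodes.length; i++) {
      text += collectText(node.childNodes[i]);   // recurse into each child
    }
  }
  return text;
}

// Plain-object stand-in for <p>Count <b>these <i>four</i></b> words</p>
var tree = {
  nodeType: 1, childNodes: [
    { nodeType: 3, nodeValue: "Count ", childNodes: [] },
    { nodeType: 1, childNodes: [
      { nodeType: 3, nodeValue: "these ", childNodes: [] },
      { nodeType: 1, childNodes: [
        { nodeType: 3, nodeValue: "four", childNodes: [] }
      ] }
    ] },
    { nodeType: 3, nodeValue: " words", childNodes: [] }
  ]
};

var collected = collectText(tree);   // "Count these four words"
```

Because every call builds its own text variable, this version can be called repeatedly without resetting anything in between.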

You may wonder why I initialized the words variable. Take the statement words += node.nodeValue (line 5 of the previous example) without an initialized string variable. On the first run, this code would append a node's value to an undefined variable, which is converted to the string "undefined", so the collected text would start with that word. That's why the words variable is initialized with an empty string.

We're almost finished now. Before wrapping the new code into another function, let's crack the next nut.

Regular expressions

As I mentioned before, finding words in a text is an easy task using regular expressions (sometimes called regexps). A regexp is a pattern to match against a string. This pattern contains literal characters, i.e., characters that are not interpreted, and special characters.

These special characters can be used to group and build classes of characters, but can also act as quantifiers for the character that precedes them. With quantifiers, you can write patterns that say: "I need at least one letter of that" (+), or "I need at least 2, but a maximum of 5 letters" ({2,5}). Quantifiers used in JavaScript's regular expression language are shown in Table 1.

*      Match the preceding character zero or more times, taking as many as possible
+      Match the preceding character at least once, taking as many as possible
?      Match the preceding character zero times or once
{n,}   Match the preceding character at least n times
{0,n}  Match the preceding character at most n times (JavaScript does not accept the short form {,n})
{n,m}  Match the preceding character at least n and at most m times
Table 1: JavaScript's regular expression quantifiers.
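
The quantifiers are easiest to understand by trying them. The following sketch uses the test() and match() methods with made-up sample strings; the ^ and $ characters, which are not covered in this article, anchor a pattern to the start and end of the string so that the whole string has to fit.

```javascript
var zero_or_more = /^ab*c$/.test("ac");        // true: * accepts zero b's
var one_or_more  = /^ab+c$/.test("ac");        // false: + needs at least one b
var optional     = /^colou?r$/.test("color");  // true: ? makes the u optional
var at_least_two = /^a{2,}$/.test("a");        // false: one a is not enough
var two_to_five  = /^a{2,5}$/.test("aaaaaa");  // false: six a's is one too many
var greedy       = "aaaa".match(/a{2,3}/)[0];  // "aaa": takes as many as allowed
```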

In addition to that, the regular expression language defines a couple of special characters, each introduced by a backslash (\). The backslash is used:

  • to include characters into patterns that are part of the special character set (including the backslash).
  • to express characters that can't be (directly) displayed, like the new line, \n, or \b for a word-boundary.
  • to build classes of characters, e.g., all characters that are numbers are part of the \d class; all characters that are letters, numbers, or the underscore (_) are grouped in the \w class.
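
Here is a short sketch of these three uses of the backslash. The sample strings are invented for illustration.

```javascript
// Escaping: \. matches only a literal dot, while a bare . matches any character
var escaped   = /a\.b/.test("axb");                  // false
var unescaped = /a.b/.test("axb");                   // true

// Character classes: \d for digits, \w for letters, digits, and the underscore
var digits     = "Room 42, floor 7".match(/\d+/g);   // ["42", "7"]
var word_chars = "snake_case-2".match(/\w+/g);       // ["snake_case", "2"]

// Word boundary: \b matches a position, not a character
var inside_word = /\bcat\b/.test("concatenate");     // false
var whole_word  = /\bcat\b/.test("the cat sat");     // true
```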

You probably noticed that I already introduced everything we need. This is a simple definition of a word:

Find a word boundary, at least one word character (a letter, a number, or the underscore), and then a word boundary.

Which translates into this regexp:

/\b\w+\b/

A regexp is defined in between a pair of slashes (/), so I've included them here. To find all occurrences of this pattern inside a string, we have to add a modifier to make the search global:

/\b\w+\b/g

With JavaScript's regexps, you have two options on how to invoke pattern matching. You can either invoke matching directly on some RegExp object, or you can use the methods defined for strings to work with regular expressions. We only need the match() method, which is defined as a method of String since JavaScript 1.2. The match() method takes a regexp as the first argument and, for a global search, returns an array that holds each found occurrence of the pattern. Thus you can retrieve the number of words simply by accessing the length property. When nothing matches at all, match() returns null instead of an empty array, so we check for that first:

var w_a = words.match(/\b\w+\b/g);
var l = (w_a != null) ? w_a.length : 0;   // null means no words were found
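
The following sketch, with made-up sample strings, shows what the g modifier changes and two edge cases worth knowing about when counting words this way.

```javascript
var sample = "Count your words";

var first_only = sample.match(/\b\w+\b/);    // without g: only the first match
var all_words  = sample.match(/\b\w+\b/g);   // with g: every match in the string

// match() returns null, not an empty array, when nothing matches
var none = "?!".match(/\b\w+\b/g);

// \w does not include the apostrophe, so this pattern sees two words here
var contraction = "don't".match(/\b\w+\b/g); // ["don", "t"]
```

The last case shows that this simple definition of a word is an approximation; whether that matters depends on how exact the count needs to be.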

For a detailed introduction to JavaScript's regular expression syntax, you should consult the JavaScript Guide.

Putting it together

Now, we know how to retrieve all text nodes and create a string of them, as well as how to count the words. The next step is to create another function where we wrap the recursive function and the variables described above. This method also handles the pattern matching. It expects an element to start the search at and returns the number of words; see countWords().
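
What follows is a sketch of how such a function could look, pieced together from the snippets in this article; the published countWords() source may differ in its details. The sample tree is a plain-object stand-in for DOM nodes (an assumption made so that the code also runs outside a browser), which is why childNodes is indexed like an array.

```javascript
// Sketch of countWords(): collects all text under start_node,
// then counts the matches of the word pattern.
function countWords(start_node) {
  var words = "";   // initialized, so that += never starts from undefined

  function traverse(node) {
    if (node.nodeType == 3) { words += node.nodeValue; }
    if (node.childNodes != null) {
      for (var i = 0; i < node.childNodes.length; i++) {
        traverse(node.childNodes[i]);
      }
    }
  }

  traverse(start_node);
  var found = words.match(/\b\w+\b/g);   // null when no word is found
  return found ? found.length : 0;
}

// Plain-object stand-in for <p>One <b>two</b> three</p>
var sample_tree = {
  nodeType: 1, childNodes: [
    { nodeType: 3, nodeValue: "One ", childNodes: [] },
    { nodeType: 1, childNodes: [
      { nodeType: 3, nodeValue: "two", childNodes: [] }
    ] },
    { nodeType: 3, nodeValue: " three", childNodes: [] }
  ]
};

var total = countWords(sample_tree);                            // 3
var empty_total = countWords({ nodeType: 1, childNodes: [] });  // 0
```

One thing to keep in mind with this approach: the text nodes are joined without a separator, so two adjacent elements with no white space between them would have their last and first words run together.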

Display the result

Since we want the result to appear in the document, we need to define where it should appear. You can just use any HTML tag with its own ID. This ID is supplied to the method, which finds the appropriate element by document.getElementById(id). The correct way to insert new text into a DOM tree is creating a text node first. The document object has a method to create such a node (as well as for creating a lot of other node types):

var text_node = document.createTextNode( countWords(start_node) );

One more new method and we're done. text_node now holds the freshly created node with the result of countWords(). We don't want to destroy the element we're displaying the result in, so we append the text node to its child nodes, thus inserting the new (and then independent) node into the tree:

// counter holds the ID of the element that displays the result
var counter_element = document.getElementById(counter);
counter_element.appendChild(text_node);

We're finally done! To make this a handy real-world application, I've created another function, showWords(), which takes two arguments, an element to display the result in and an element in which to count the words. This function checks its arguments (and makes heavy use of JavaScript's loose typing), which means that you can supply an element's ID as well as an element. When no element to search is supplied, this function selects the document's root element. See the source code of showWords().
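
The sketch below shows one way such a function might be written; it is not the published showWords() source. countWords() is repeated so that the block stands on its own, and the third parameter, doc, is an assumption added only so the function can be tried with a stand-in document object; in a page you would leave it out.

```javascript
// Sketch of countWords(), repeated here to keep the block self-contained.
function countWords(start_node) {
  var words = "";
  function traverse(node) {
    if (node.nodeType == 3) { words += node.nodeValue; }
    if (node.childNodes != null) {
      for (var i = 0; i < node.childNodes.length; i++) {
        traverse(node.childNodes[i]);
      }
    }
  }
  traverse(start_node);
  var found = words.match(/\b\w+\b/g);
  return found ? found.length : 0;
}

// Sketch of showWords(). target and source may each be an element or an ID.
// doc is a hypothetical extra parameter; it defaults to the browser's document.
function showWords(target, source, doc) {
  doc = doc || document;
  if (typeof target == "string") { target = doc.getElementById(target); }
  if (typeof source == "string") { source = doc.getElementById(source); }
  if (!source) { source = doc.documentElement; }   // default: whole document
  if (!target) { return; }                         // nowhere to show the result

  var total = countWords(source);                  // count before inserting
  target.appendChild(doc.createTextNode(String(total)));
}

// Possible use in a page, once the document has been loaded:
//   window.onload = function () { showWords("counter", "article"); };
```

Counting before inserting matters when the display element sits inside the element being counted; otherwise the freshly inserted number would be counted as a word, too.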

Now, you can easily include this function into your own documents. Just include the script and call the method after the DOM tree has been built, i.e., after the document has been loaded. You can find the complete script and instructions in the O'Reilly JavaScript Library.

Some final words

As I already mentioned, one might use a different initial approach to create a working solution. Here are some other possible solutions and methods, together with an explanation of why I didn't use them.

Perhaps you looked at the ECMA Script Binding and found a method of Element I didn't talk about, getElementsByTagName(). This method takes a string argument and returns a NodeList with all elements found. Playing with some browser's DOM implementation, you might have also found that it gives you #text as the nodeName of text nodes. So, you might conclude that you can get all text nodes by saying:

document.getElementsByTagName("#text");

However, this does not work, because getElementsByTagName() returns only element nodes, and it would be a non-standard way to access the DOM anyway.

Internet Explorer 5.x for Windows contains a bug that deletes white space after inserting a new node into the tree. You can avoid this behavior by using the HTML entity (&nbsp;) instead of a simple white space.

There is another point to mention. The classical way to insert your own text into a document is by using document.writeln(). However, writeln() writes at the point where the script runs, and at that point only the text parsed so far exists, which means that you could print the result only after the elements being counted. On the other hand, you can't write into a document once it has finished loading without replacing its content -- which makes writeln() unusable for our purpose.

The DOM implementations of the two major players, Mozilla/Netscape 6 and Internet Explorer, also expose the content of an element through nonstandard properties of the element itself (innerHTML in both, innerText in Internet Explorer). I'm not even gonna mention these diabolic properties again, as this is not standards-conforming behavior. Well, it might be handy to use these properties, but on the other hand, you're now able to cope with the Document Object Model, because you can walk through any tree with just a simple function and some easy recursion. That's important knowledge for a lot of future topics to appear in Essential JavaScript.

Claus Augusti is O'Reilly Network's JavaScript editor.

