Document Mathematics: Count Your Words
03/30/2001
OK, finally they convinced you to use HTML as the document format of your choice. Sure, it's a great format, as more and more applications make use of it -- not to mention the Web. But, to be honest, in many cases working with HTML documents is like living in the pre-word-processing age.
OK, let's say we want to create an application that counts the number of words in a paragraph, table, or the whole document, and prints this number exactly where we want it to. We want to be able to format the result, of course. Naturally, we need a reusable solution.
This is what we basically need to solve this task:
The second sub-task is an easy one; a few lines of code and some regular expressions will do it. We'll talk about that later.
Since we want to be good guys, we're only using standards-compliant methods to work with documents when solving the first sub-task. We'll not use some (well, I have to admit it -- handy) property of
Instead, I'll show you how to climb any DOM tree. You'll find this much handier, because you can recycle the techniques described here for a lot of other tasks. I'll end our journey through the DOM with some final words on the drawbacks of alternative solutions and on some buggy browser behavior.
A quick note...
Sometimes the term DOM might seem confusing. Depending on the context in which it's used, it can refer to:
To keep things simple, I'll give only a very basic overview of the concepts behind the Document Object Model; we'll only talk about the things we need to solve our task.
Climb the DOM
The DOM is a (very) basic system to organize your elements in a set of parent-child relations, with every element having one parent element (except the root element) and exactly one element being the root element. To put it simply for our case, an element is created by some markup, indicated by the two angle brackets,
When you visualize such a model, you'll find it forms a flipped tree, i.e., the root is at the top. Visualizing this model on your own is pretty easy, as you're already familiar with this system. Every time you mark up a document with HTML you're using it, as you're nesting elements within elements, thus creating parent-child relations. Picture 1 shows some HTML code and the corresponding part of the DOM tree.
As you can see, the code translates into the graph almost automatically.
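As a rough illustration, the same nesting can be written down as plain JavaScript objects. The markup in the comment and the object layout are made up for this sketch. A real browser builds the tree for you, and its nodes carry many more properties.

```javascript
// Hypothetical markup: <p>Count <b>your</b> words</p>
// Modelled as plain objects that mimic the parent-child relations of a DOM tree.
var tree = {
  nodeName: "P",
  childNodes: [
    { nodeName: "#text", nodeValue: "Count ", childNodes: [] },
    {
      nodeName: "B",
      childNodes: [
        { nodeName: "#text", nodeValue: "your", childNodes: [] }
      ]
    },
    { nodeName: "#text", nodeValue: " words", childNodes: [] }
  ]
};

// The <b> element is the second child of <p>; its only child is a text node.
var bold = tree.childNodes[1];
console.log(bold.nodeName);                // "B"
console.log(bold.childNodes[0].nodeValue); // "your"
```

Every level of nesting in the markup becomes one more level of `childNodes` in the tree.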
The node object
Now that we have the description of the node object, how do we connect to a browser's live DOM tree?
The document object
To find an individual element for which the ID attribute is defined, use the
And access the tree:
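A minimal sketch of such a lookup, assuming the standard `getElementById` method and the node properties `nodeName`, `parentNode`, and `childNodes`. The function name `describeElement` and the ID `intro` are made up for this illustration. The document object is passed in as a parameter so the function can be tried outside a browser as well.

```javascript
// Look up an element by its ID and report a few facts about its place in the tree.
// `doc` is the document object; in a browser you would pass `document`.
function describeElement(doc, id) {
  var node = doc.getElementById(id);
  if (!node) {
    return null; // no element carries this ID
  }
  return {
    name: node.nodeName,
    parent: node.parentNode ? node.parentNode.nodeName : null,
    children: node.childNodes.length
  };
}

// Browser usage (assumes the page contains an element with id="intro"):
// var info = describeElement(document, "intro");
```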
Now things get a bit more complicated. Our solution must handle a DOM tree of any shape, because we want to count all words under a given element (remember the flipped tree model). The answer is actually simple: recursion.
Basically, a recursive function is a function that calls itself while it's executing. The path a recursive function takes through the code must be well defined, so that the chain of calls always comes to an end.
The interesting part is: What happens when a function interrupts itself at some point by calling itself? You get the same behavior that you'd expect from any other function call. After returning from some method, you'll find your local variables untouched. Even after returning from a call to the same method, you'll find them untouched, as every function call operates with its own local variables.
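A small sketch shows both points: the condition that stops the recursion, and the local variables that survive a nested call. The function name `countdown` and its parameters are made up for this illustration.

```javascript
// Each call gets its own copy of `n` and `label`.
function countdown(n, log) {
  var label = "call with n=" + n; // local to this particular call
  if (n > 0) {
    countdown(n - 1, log);        // the function interrupts itself here
  }
  // Back from the nested call: `label` and `n` still hold this call's values.
  log.push(label);
  return log;
}

// Logs the labels for n=0, n=1 and n=2, in that order.
console.log(countdown(2, []));
```

The test `n > 0` is what keeps the function from calling itself forever: every nested call works on a smaller number, and the call with `n` equal to 0 returns without recursing.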
Let's look at some non-recursive code to retrieve all (child) text nodes of some element:
To retrieve the text nodes at every level, you'd need some code that calls itself for every child node found, as each of these could contain more elements or text nodes. The solution is obvious: We wrap the code snippet above into a function:
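A minimal sketch of such a function, assuming nodes that expose `nodeType`, `nodeValue`, and `childNodes` as the W3C DOM describes them. The name `collectText` and the constant `TEXT_NODE` are made up for this illustration.

```javascript
var TEXT_NODE = 3; // value of Node.TEXT_NODE in the DOM specification
var words = "";    // collects the text; initialized to an empty string

function collectText(node) {
  // Add this node's own text (outside the loop, so every node is checked).
  if (node.nodeType === TEXT_NODE) {
    words += node.nodeValue;
  }
  // Then descend: the function calls itself for every child node.
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    collectText(children[i]);
  }
}

// Browser usage: collectText(document.body);
```

Text is concatenated exactly as it appears in the nodes, so words in neighboring elements stay separate only if the markup has white space between them.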
That's it: a recursive function that walks through any DOM tree, no matter which form it takes. Adding the text was moved outside the loop to catch all nodes; in its place, the function calls itself inside the loop. Because each open loop continues after a (nested) method call returns, this algorithm catches all nodes under the root node given in the first call to
You may wonder why I initialized the words variable. Take line 5 of the previous example without an initialized string variable. This code would add a node's value to an undefined variable for the first run, thus yielding
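The effect is easy to check in any JavaScript engine. In this small sketch (the variable names are made up), the only difference between the two halves is the initial value.

```javascript
var uninitialized;            // declared, but no value assigned yet
uninitialized += "some text"; // undefined is converted to the string "undefined"
console.log(uninitialized);   // "undefinedsome text"

var initialized = "";         // starts out as an empty string
initialized += "some text";
console.log(initialized);     // "some text"
```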
We're almost finished now. Before wrapping the new code into another function, let's crack the next nut.
As I mentioned before, finding words in a text is an easy task using regular expressions (sometimes called regexps). A regexp is a pattern to match against a string. This pattern contains literal characters, i.e., characters that are not interpreted, and special characters.
These special characters can be used to group and build classes of characters, but can also act as quantifiers for the character that precedes them. With quantifiers, you can write patterns that say: "I need at least one letter of that" (
In addition to that, the regular expression language defines a couple of special characters. A special character is indicated by a backslash (
You probably noticed that I already introduced everything we need. This is a simple definition of a word:
Find a word boundary, at least one letter, and then a word boundary.
Which translates into this regexp:
A regexp is written between a pair of slashes (/), so I've included them here. To find all occurrences of this pattern inside a string, we have to add a modifier to make the search global:
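As a sketch of that idea, the pattern below uses `\b` for a word boundary and `\w+` for "at least one" word character. Using `\w` for "letter" is an assumption of this sketch: it also accepts digits and the underscore, and it only knows unaccented Latin letters. The helper name `countWords` is made up for this illustration.

```javascript
var text = "Count your words, please.";

var firstOnly = text.match(/\b\w+\b/);  // without g: only the first match
var allWords = text.match(/\b\w+\b/g);  // with g: an array of every match

console.log(firstOnly[0]);    // "Count"
console.log(allWords.length); // 4

// Wrapped into a helper that also copes with text that contains no words.
function countWords(str) {
  var found = str.match(/\b\w+\b/g);
  return found ? found.length : 0; // match() returns null when nothing matches
}
```

With this pattern an apostrophe counts as a boundary, so a contraction such as "it's" is counted as two words.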
Putting it together
Now, we know how to retrieve all text nodes and create a string of them, as well as how to count the words. The next step is to create another function where we wrap the recursive function and the variables described above. This method also handles the pattern matching. It expects an element to start the search at and returns the number of words; see
Since we want the result to appear in the document, we need to define where it should appear. You can just use any HTML tag with its own ID. This ID is supplied to the method, which finds the appropriate element by
One more new method and we're done.
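Put together, the pieces might look like the following sketch. It assumes the standard DOM calls `getElementById`, `createTextNode`, and `appendChild`. The function names `countWordsIn` and `showWordCount`, the `doc` parameter, and the return value of -1 for a missing ID are made up for this illustration.

```javascript
var TEXT_NODE = 3; // value of Node.TEXT_NODE in the DOM specification

// Count the words in all text nodes at or below `root`.
function countWordsIn(root) {
  var words = ""; // local to this call, so repeated counts do not accumulate

  function collect(node) {
    if (node.nodeType === TEXT_NODE) {
      words += node.nodeValue;
    }
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
      collect(children[i]);
    }
  }

  collect(root);
  var found = words.match(/\b\w+\b/g);
  return found ? found.length : 0; // match() returns null when nothing matches
}

// Write the word count of the element `sourceId` into the element `targetId`.
// `doc` is the document object; in a browser, pass `document`.
function showWordCount(doc, sourceId, targetId) {
  var source = doc.getElementById(sourceId);
  var target = doc.getElementById(targetId);
  if (!source || !target) {
    return -1; // one of the IDs does not exist
  }
  var count = countWordsIn(source);
  target.appendChild(doc.createTextNode(String(count)));
  return count;
}

// Browser usage (assumes elements with these IDs exist in the page):
// showWordCount(document, "article", "wordcount");
```

The count is taken before the result is inserted, so the number itself is not counted even when the target element sits inside the source element.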
As I already mentioned, there are other approaches that lead to a working solution. Here are some possible solutions and methods, together with an explanation of why I didn't use them.
Perhaps you looked at the ECMA Script Binding and found a method of
However, this does not work and is a non-standard way to access the DOM anyway.
Internet Explorer 5.x for Windows contains a bug that deletes white space after inserting a new node into the tree. You can avoid this behavior by using the HTML entity (
There is another point to mention. The classical way to insert your own text into a document is by using
The DOM implementations of the two major players, Mozilla/Netscape 6 and Internet Explorer, also store the textual content of the DOM elements in properties of the