About a year ago I started a series of articles for O’Reilly on Artificial Intelligence topics with Python. The first of these articles was on decision trees and their uses in data mining and was rather well received. However, after getting off to a great start, my real life took over and my hobbies took a back seat and as a result this series stalled right out of the gate. Well, now that things have calmed down a bit I’m hoping to tack a few more articles onto this series and to start off I thought I might go over an email I recently received concerning the decision trees article, discuss a problem that was found in my implementation, go over a solution to the problem, and just try to give you all a better overall view of how this powerful technique works and maybe even give a little insight into its limitations as well.

To first sum up the email, the question within dealt with what happens in a situation where the decision tree built by the algorithm encounters a record that contains a value never before seen for one of its attributes. The example that was given in the email is described below.

given the following training dataset:

0, 0, LOW
0, 0, MEDIUM
0, 0, MEDIUM
1, 0, MEDIUM
1, 0, MEDIUM

classify the following record:

0, 1, LOW

The reason I wanted to share this question with everyone is that its a really good one for pointing out how a decision tree works and how the algorithm that builds one goes about its business. Look at the record we are tying to classify and see if you notice something about it that differs from the records in the training set. Do you see it? It’s in the second field of the record. Notice that, to date, there have only been zeroes in that field. Because of this, the decision tree has no idea how to classify the record, because it has no idea, from looking at the training set, that the value 1 could ever show up in the second attribute. This really brings to the surface two properties of decision trees that everyone should understand. First, its easy to see from this example that it is absolutely necessary to have a very large training set. Keep in mind that the decision tree algorithm uses probabilities to determine how well each attribute classifies data, and just like anything else in statistics, the larger the data set, the better the accuracy of the algorithm in determining these attributes. So, moral of the story, while small datasets are great for toy problems and examples, in the real world this technique is only trustworthy if there is a sufficiently large set of training data.

The second concept I wanted to point out was the general idea behind decision trees. Decision trees are classifiers and that’s it. The basic idea behind the algorithm is to take a look at a bunch of data and try to figure out which attributes of the data best classify each record in that data. That means if one attribute divides the data into two large groups and another divides the data into twenty small groups, the first is the better classifier since it does a better job of classifying large numbers of data points for each attribute value. So, what does this mean for our question? Well, what that means, essentially, is that the decision tree is no crystal ball. It can’t predict the classification of a record if it comes across one that has a value it’s never before seen and, therefore, has no idea of its existence. Makes since right? If our algorithm has never seen the value, how can it even know of the value’s existence?

So, where is all this talk leading us with respect to the question at hand? Well, there is actually a very simple answer to our problem of classifying a record with previously unseen attribute values–take a best guess. There are many different ways for doing this, some of which can be seen at the Missing Attributes section of this link (http://decisiontrees.net/node/34?PHPSESSID=b6335a10234cac9a895c31c6f139b8cc), however, the solution we are going to use is a very simple one. What we are going to do in our decision tree program is take a guess on the record’s classification based on the most frequently seen outcome from our training dataset (in the case of our sample data above, the default outcome would of course be MEDIUM). How do we go about altering our code to make this guess? Well that part is really easy. Take a look at line 150 of the dtree.py file. If you look at the original code from the article, you should see the following line:

tree = {best:{}}

This code creates a new node in the tree, which is a simple empty dict object. This means that when we search the tree, if we have a record that has no classification, meaning that we reach an empty node in our tree, we will get an error when we try to pass a key to this empty dict object. What we want our tree to do is return the default outcome whenever this situation is encountered. This can be easily accomplished by adding that default value to all of the empty nodes in our tree. Take a look at the following replacement code:

tree = {best:collections.defaultdict(lambda: default)}

What does this code do exactly? Well, essentially this code does exactly the same thing as before (i.e., creates an empty dict object as the next node in the tree), but it does so using the defaultdict method of the collections module. What does that do? It allows us to specify a function that can be ran whenever a key is passed to the dict object that does not already exist. In our code we simply create a lambda function that always returns the default CHOICE (i.e., most frequent outcome for our training dataset). This will ensure that whenever you try to classify a record that has no matching branch in the tree, you will get the tree’s best guess back as to how
it should be classified.

So that’s it, simple, eh? What we’ve done here is hopefully get a little better feel for how d-trees work and at the same time we’ve altered our original algorithm a bit to make it a bit more robust. I’ve attached to this post a new version of the original d-tree code from the article (get it here: Download file). In the tarball you’ll find the same files as in the original code, however, the dtree.py file will have the change we just discussed and the test.py file has also been altered to make a bit more usable. Now you can run the test and pass in a training dataset and a test dataset for classification, whereas it used to load these automatically and you needed to change the test.py file by hand if you wanted to run your own datasets. If you want to test out the new algorithm, just use the following line of code at the command prompt:

\$ python test.py training-data test-data

where training-data and test-data are the names of the datasets you wish to use with the algorithm. One more thing, I used the ‘with’ statement in the new test.py file, so if you’re using a version of Python before 2.5, you’ll either have to alter the new test code or use the test.py file from the original code.

Well, I think that’s it for this session. Hopefully everyone got a little bit out of this post and everyone reading this now has a little better overall understanding of this very important data mining technique. If you’re interested in learning some more artificial intelligence/machine learning techniques, keep your eyes peeled for a few more articles in the series and also keep an eye on this weblog, since I am planning on also posting some information here on these topics from time to time as well. I hope to see you all here again very soon.