Building Decision Trees in Python
by Christopher Roach
You have a great idea to start selling the most marvelous widget ever known. You're so sure of your enterprising idea that you decide to go into business for yourself and begin manufacturing said widgets. A few years pass and you're a success. However, lately you've noticed a slump in sales, and you decide that you need a better way of focusing your marketing dollars toward your target audience. How do you do this?
This article introduces a popular and easy-to-use data-mining tool called a decision tree that should help you solve your marketing dilemma.
Decision trees are a topic in artificial intelligence; more specifically, they belong to the subfield of machine learning, owing to their ability to learn, through example, to classify individual records in a data set.
This article will give you a closer look at what decision trees are, and how they can help you in your endeavors to more effectively target your marketing campaign toward your core customers. In so doing, I'll go over a few different uses for decision trees (aside from the aforementioned marketing scenario), and I'll discuss one popular heuristic that can be used during the learning process to create the most effective decision tree. Finally, I'll implement a simple decision tree program using the Python programming language.
If you've ever had any interest in machine learning or artificial intelligence; if you find the idea of writing a program that has the ability to learn simply fascinating; or if you just happen to manufacture the world's best widgets, then this article is for you.
Decision Tree ... What's That?
Decision trees fall under the subfield of machine learning within the larger field of artificial intelligence. They are mainly used for classification, but they are also helpful in uncovering features of data that would otherwise go unnoticed. Thus, they can be very useful in data-mining activities as well as in data classification. They work well in many different areas, from typical business scenarios (such as the widget example described previously) to airplane autopilots and medical diagnoses.
A decision tree is essentially a series of if-then statements that, when applied to a record in a data set, results in the classification of that record. Therefore, once you've created your decision tree, you'll be able to run a data set through the program and get a classification for each individual record within it. What this means to you, as a manufacturer of quality widgets, is that the program you create from this article will be able to predict the likelihood of each user in a data set purchasing your finely crafted product.
Though the classification of data is the driving force behind creating a decision tree program, it is not what makes a decision tree special. The beauty of decision trees lies in their ability to learn. When finished, you will be able to feed your program a test set of data, and it essentially will learn how to classify future sets of data from the examples.
Hopefully, if I've done my job well enough, you're champing at the bit to start coding. However, before you go any further with your code monkey endeavors, you need a good idea of what a decision tree looks like. It's one of those data structures that is easiest to understand with a good visual representation. Figure 1 contains a graphical depiction of a decision tree.
Figure 1. A decision tree
Notice that it actually is a true tree structure. Because of this, you can use recursive techniques to both create and traverse the decision tree. For that reason, you could use just about any tree representation that you remember from your Data Structures course to represent the decision tree. In this article, though, I'm going to keep everything very simple and create my decision tree out of only Python's built-in dictionary object.
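To make that concrete, here is one way a fragment of such a tree might look as nested dictionaries. This is an illustrative sketch rather than an exact reproduction of Figure 1: each interior node is a one-key dictionary mapping an attribute name to its branches, and each leaf is a plain string holding the classification.

```python
# An illustrative decision-tree fragment built from nested dictionaries.
# Interior node: {attribute: {value: subtree, ...}}; leaf: a classification.
tree = {
    "age": {
        "36-55": {
            "marital_status": {
                "single": "will buy",
                "married": "won't buy",
            }
        },
        "> 55": "will buy",
        "18-35": "won't buy",
        "< 18": {
            "income": {
                "low": "will buy",
                "high": "won't buy",
            }
        },
    }
}
```

Because subtrees and leaves are just dictionary values, testing `isinstance(node, dict)` is enough to tell an interior node from a leaf during a recursive traversal.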
Fleshing Out the Scenario
If you recall the example scenario, you are the manufacturer of the world's finest widgets and are currently looking for a way to focus your marketing efforts to more closely entice your target demographic. To do this, you need a set of sample data with which to train the decision tree program.
I assume that, being the concerned entrepreneur that you undoubtedly are, you've already gathered plenty of demographic information through, perhaps, anonymous email surveys. Now, what you need to do is organize all this data into a large set of user records. Here is a table containing a sampling of the information you collected during your email survey:
| Age   | Education   | Income | Marital Status | Purchase? |
|-------|-------------|--------|----------------|-----------|
| 18-35 | high school | low    | single         | won't buy |
| < 18  | high school | low    | single         | will buy  |
| > 55  | bachelor's  | high   | single         | will buy  |
| > 55  | master's    | low    | married        | will buy  |
| > 55  | master's    | high   | single         | will buy  |
| < 18  | high school | high   | single         | won't buy |
| 36-55 | high school | low    | single         | will buy  |
| < 18  | high school | low    | married        | will buy  |
| > 55  | high school | high   | married        | will buy  |
| > 55  | bachelor's  | low    | single         | will buy  |
| 36-55 | high school | high   | married        | won't buy |
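Before building the tree, you'll want this table in a form Python can manipulate. A minimal sketch, one dictionary per survey record (the field names here are my own illustrative choices, not fixed by the article):

```python
# The survey table above as a list of records, one dictionary per row.
# Field names are illustrative; pick whatever matches your own data.
fields = ("age", "education", "income", "marital_status", "purchase")
rows = [
    ("18-35", "high school", "low",  "single",  "won't buy"),
    ("< 18",  "high school", "low",  "single",  "will buy"),
    ("> 55",  "bachelor's",  "high", "single",  "will buy"),
    ("> 55",  "master's",    "low",  "married", "will buy"),
    ("> 55",  "master's",    "high", "single",  "will buy"),
    ("< 18",  "high school", "high", "single",  "won't buy"),
    ("36-55", "high school", "low",  "single",  "will buy"),
    ("< 18",  "high school", "low",  "married", "will buy"),
    ("> 55",  "high school", "high", "married", "will buy"),
    ("> 55",  "bachelor's",  "low",  "single",  "will buy"),
    ("36-55", "high school", "high", "married", "won't buy"),
]
data = [dict(zip(fields, row)) for row in rows]
```

Storing records as dictionaries keyed by attribute name means the tree-building code can look up any attribute's value by name rather than by column position.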
Given the decision tree in Figure 1 and the set of data, it should be fairly easy to see just how a decision tree can classify records in a data set. Starting with the top node (Age), check the first record's value for that node's attribute (in this case, 36-55). Then follow the matching link to the next node in the tree (Marital Status) and repeat the process until you finally reach a leaf node (a node with no children). This leaf node holds the answer to the question of whether the user will buy your product. (In this example, the user will buy, because the record's marital status is single.) It's also quite easy to see that this type of operation lends itself to a recursive process (not necessarily the most efficient way to program, I know, but a very elegant one).
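The recursive walk just described can be sketched in a few lines, assuming the nested-dictionary representation (one-key dicts for interior nodes, plain strings for leaves); the tree and names below are illustrative:

```python
def classify(tree, record):
    """Recursively follow the record's attribute values down to a leaf."""
    if not isinstance(tree, dict):   # leaf node: the classification itself
        return tree
    attribute = next(iter(tree))     # the attribute this node tests
    branches = tree[attribute]       # value -> subtree (or leaf)
    return classify(branches[record[attribute]], record)

# A small illustrative tree and the record traced in the text above:
tree = {"age": {"36-55": {"marital_status": {"single": "will buy",
                                             "married": "won't buy"}},
                "> 55": "will buy",
                "18-35": "won't buy",
                "< 18": "will buy"}}
record = {"age": "36-55", "marital_status": "single"}
print(classify(tree, record))  # -> will buy
```

Each recursive call descends one level, so the work per record is proportional to the depth of the tree.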
The decision tree in the figure is just one of many decision tree structures you could create to solve the marketing problem. The task of finding the optimal decision tree is an intractable problem. For those of you who have taken an analysis of algorithms course, you no doubt recognize this term. For those of you who haven't had this pleasure (he says, gritting his teeth), essentially what this means is that the time required to find the optimal tree grows exponentially with the size of the problem, so even modestly sized data sets put an exhaustive search out of reach. While it may be practically impossible to find the smallest (or, more fittingly, the shallowest) decision tree in a respectable amount of time, it is possible to find a decision tree that is "small enough" using special heuristics. The job of the heuristic you choose is to pick the "next best" attribute by which to divide the data set, according to some predefined criterion. There are many such splitting criteria (information gain, gain ratio, the Gini index, and others, used by algorithms such as C4.5, C5.0, and CART). For this article, however, I've used one of the more popular heuristics, based on ideas from information theory: the ID3 heuristic, which uses the concept of entropy to calculate which attribute is best to use for dividing the data into subgroups.
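As a preview of the math ahead: entropy measures how impure a set of records is, H(S) = -&sum; p_i log2(p_i) over the class proportions p_i, and ID3 picks the attribute whose split produces the largest drop in entropy (the information gain). A minimal sketch, with illustrative field names and a toy data set of my own:

```python
import math
from collections import Counter, defaultdict

def entropy(records, target="purchase"):
    """H(S) = -sum(p * log2(p)) over the proportions p of each class."""
    counts = Counter(r[target] for r in records)
    total = len(records)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(records, attr, target="purchase"):
    """Entropy before splitting on attr minus the weighted entropy after."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attr]].append(r)
    remainder = sum(len(g) / len(records) * entropy(g, target)
                    for g in groups.values())
    return entropy(records, target) - remainder

# Toy example: splitting on income separates some buyers from non-buyers.
records = [
    {"income": "low",  "purchase": "will buy"},
    {"income": "low",  "purchase": "will buy"},
    {"income": "high", "purchase": "won't buy"},
    {"income": "high", "purchase": "will buy"},
]
print(round(entropy(records), 3))                     # -> 0.811
print(round(information_gain(records, "income"), 3))  # -> 0.311
```

A perfectly pure set (all one class) has entropy 0, and a 50/50 split has entropy 1, so the attribute with the highest information gain is the one that leaves the purest subgroups.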
The next section quickly covers the basic idea behind how this heuristic works. Don't worry; it's not too much math. Following this discussion, you'll finally get a chance to get your hands dirty by writing the code that will create the decision tree and classify the users in your data set as a "will buy" or "won't buy," thereby making your company instantly more profitable.