Data Mining with Weka (3.4: Decision trees)

Hello everyone! This is lesson 3.4. We're continuing with simple classifiers, this time the classifier that generates a decision tree. Let's look at J48. We've used this classifier many times; now let's see how it works. J48 is based on a top-down strategy, a recursive divide-and-conquer strategy. Select an attribute and place it at the root node, then generate a branch for every possible value of that attribute. This divides the instances into subsets, one for each branch of the root node. Then repeat the process recursively on each branch, using only the instances that actually reach that branch when selecting the attribute for its node. Stop when all instances at a node have the same classification. The trick, or the question, is how to choose the attribute for the root node. If we look at the weather data, we can see that outlook is taken as the root node. There are four possibilities: outlook, windy, humidity and temperature. These are the results of splitting on each attribute. What we want to see is a pure split, that is, a split into pure nodes.
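The recursive procedure just described can be sketched in a few lines of Python. This is a minimal illustration of top-down divide-and-conquer induction, not Weka's actual J48 code; the names `build_tree` and `select_attribute` are mine, and the attribute-selection heuristic is passed in as a function:

```python
def build_tree(instances, attributes, select_attribute):
    """Top-down, recursive divide-and-conquer tree induction (a sketch,
    not Weka's J48 implementation).

    instances: list of (attribute_values_dict, class_label) pairs.
    select_attribute: heuristic that picks the attribute to split on,
    e.g. the one with the largest information gain.
    """
    classes = {label for _, label in instances}
    # Stop when all instances that reached this node have the same class
    # (or no attributes remain; a fuller version would take a majority vote).
    if len(classes) == 1 or not attributes:
        return classes.pop()

    # Place the chosen attribute at this node and branch on each of its values.
    attr = select_attribute(instances, attributes)
    tree = {attr: {}}
    remaining = [a for a in attributes if a != attr]
    for value in {vals[attr] for vals, _ in instances}:
        # Recurse using only the instances that reach this branch.
        subset = [(vals, label) for vals, label in instances
                  if vals[attr] == value]
        tree[attr][value] = build_tree(subset, remaining, select_attribute)
    return tree
```

The heuristic is deliberately left abstract here; the rest of the lesson is about choosing it.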

We hope to find an attribute where one of its nodes is all yes, another is all no, and maybe a third is all yes again. That would be the best case. We don't want a mixture of yes and no, because mixed nodes need to be split again. You can see that splitting on outlook is fairly neat: the sunny branch has two yes and three no, the overcast branch is all yes, and the rainy branch has three yes and two no. How do we quantify which attribute produces the purest child nodes? We need a measure of purity. The goal is to get the smallest decision tree, and top-down tree induction methods use heuristics for this. The most famous heuristic for producing pure nodes is based on information theory. We are not going to explain information theory here; that would be another online course, and a very interesting one. The founder of information theory was Claude Shannon, an American mathematician who passed away 12 years ago.

He was an amazing person who did many amazing things. One of them, I think, is that in his 80s he could juggle while riding a unicycle. Superb. He proposed information theory and quantified information as entropy, measured in bits. This is the formula for entropy: the negative sum, over all possible outcomes, of p log p. We won't explain the formula in detail. The terms all carry negative signs because the logarithm of a number less than 1 is negative, and probabilities are always at most 1, so the resulting entropy comes out positive. What we actually look at is information gain: how many bits of information do you gain by knowing the value of an attribute? It is the entropy of the distribution before the split minus the entropy of the distribution after the split.
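Shannon's formula is easy to compute directly. Here is a minimal sketch in Python (the function name is mine, not part of Weka):

```python
import math

def entropy(counts):
    """Shannon entropy, in bits, of a class distribution.
    counts: class counts, e.g. [9, 5] for 9 yes and 5 no."""
    total = sum(counts)
    # -sum of p * log2(p); outcomes with zero count contribute nothing
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)

# The full weather data set has 9 yes and 5 no instances:
print(round(entropy([9, 5]), 3))  # -> 0.94
```

A 50/50 split gives exactly 1 bit, and a pure node gives 0 bits, which is why pure nodes are what the heuristic rewards.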

Take the weather data as an example. These are the values in bits. The information gain of splitting on outlook is 0.247 bits. You may find it strange that these values are fractions of a bit. Usually we see 1 bit, 8 bits, 32 bits, but in information theory you get fractional bits. I won't go into the details here. As you can see, knowing the value of windy gives only 0.048 bits of information gain. Humidity is much better at 0.152 bits, and temperature is as low as 0.029 bits. We choose the attribute with the largest information gain: the outlook attribute. So at the top of the tree, the root node splits on outlook.
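These gains can be reproduced with a short calculation. The per-branch class counts below are the standard ones for the 14-instance nominal weather data (the lecture quotes the gains but not the counts); the function names are mine:

```python
import math

def entropy(counts):
    """Shannon entropy, in bits, of a class distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)

def info_gain(parent, children):
    """parent: [yes, no] counts before the split;
    children: [yes, no] counts in each branch after it."""
    n = sum(parent)
    after = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - after

# [yes, no] counts per branch for the nominal weather data (9 yes, 5 no overall)
splits = {
    'outlook':     [[2, 3], [4, 0], [3, 2]],  # sunny, overcast, rainy -> 0.247
    'temperature': [[2, 2], [4, 2], [3, 1]],  # hot, mild, cool        -> 0.029
    'humidity':    [[3, 4], [6, 1]],          # high, normal           -> 0.152
    'windy':       [[6, 2], [3, 3]],          # false, true            -> 0.048
}
for attr, children in splits.items():
    print(attr, round(info_gain([9, 5], children), 3))
```

Running this confirms that outlook has the largest gain, which is why it ends up at the root.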

After splitting on outlook, we look at each branch of outlook; they correspond to the three values of outlook, and we consider splitting each branch further. On the first branch we can choose to split on temperature, windy or humidity. We won't split on outlook again, because we know that all its instances here are sunny. So we do the same with the remaining three attributes: we evaluate the information gain of temperature, windy and humidity at this node and select the attribute with the largest gain. The gain for humidity is 0.971 bits. You can see that if we branch on humidity we get pure nodes: one with 3 no, the other with 2 yes. Once we get this result, there is no need to split again. We are looking for pure nodes. This is the principle of decision trees: splitting continues until all leaf nodes are pure. Open Weka and try it on the nominal weather data. Of course, we've done this before, but we'll do it again; it won't take long. J48 is one of the standard data mining algorithms. Here is the data.
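The same calculation, applied only to the five sunny instances, shows why humidity wins on that branch (again a sketch, with my own function names; the branch counts are the ones just described):

```python
import math

def entropy(counts):
    """Shannon entropy, in bits, of a class distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)

def info_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch)
                                 for ch in children)

# The 5 sunny instances: 2 yes, 3 no (entropy 0.971 bits).
# Splitting them on humidity gives two pure branches:
#   high   -> 0 yes, 3 no
#   normal -> 2 yes, 0 no
gain = info_gain([2, 3], [[0, 3], [2, 0]])
print(round(gain, 3))  # -> 0.971
```

Because both child nodes are pure, the entropy after the split is zero and the whole 0.971 bits is gained; no further splitting is needed on this branch.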

Let's choose J48, a tree classifier. After running it, we get a decision tree, the same tree we saw before: first it splits on outlook into sunny, overcast and rainy. Then, if it is sunny, it splits on humidity: the high branch is a no leaf with 3 instances, the normal branch a yes leaf with 2 instances, and so on. We can view the decision tree by choosing Visualize tree from the right-click menu. This is the decision tree, and these are the numbers of yes and no instances at each node. For this decision tree we do cross-validation; the 11th run is on the entire data set, and using that training data we get these numbers. In fact, this is a pure node. Sometimes a leaf shows two numbers, such as 3/2 or 3/1: the first is the number of instances that reach the leaf, and the second, when present, is the number of those instances the leaf classifies incorrectly. But in this simple data set there is no second number. Now we know that J48 builds a decision tree by top-down induction, based entirely on information theory, and it is a very good data mining algorithm.

Ten years ago I would have said this was the best data mining algorithm, but now there are some better ones. In any case, J48 is highly reliable, and most importantly its decision trees are simple and easy to understand. The output of J48 is very easy to interpret, which matters when applying data mining in practice. You can use many criteria to select attributes; here we used information gain, and in practice the results from different criteria do not differ much. To use this algorithm in the real world, you need some important modifications. I have only explained the most basic method; in fact, J48 also contains some more complex machinery to make it run smoothly in different scenarios.

We will look at those in the next lesson. Section 4.3 of the book, Divide-and-conquer: Constructing decision trees, covers basic J48. Please do the exercises for this lesson. Good luck! See you next time!