C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = {s_1, s_2, ...} of already classified samples. Each sample s_i consists of a p-dimensional vector (x_{1,i}, x_{2,i}, ..., x_{p,i}), where the x_j represent attribute values or features of the sample, as well as the class in which s_i falls.

At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision.
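To make the criterion concrete, here is a minimal Python sketch of the gain-ratio computation for a categorical attribute. The dict-per-sample data layout and the helper names (`entropy`, `gain_ratio`) are illustrative choices for this sketch, not part of C4.5 itself.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def gain_ratio(samples, labels, attribute):
    """Normalized information gain (gain ratio) for splitting on `attribute`.

    Each sample is assumed to be a dict mapping attribute names to values,
    with `labels` the parallel list of classes; this layout is illustrative.
    """
    total = len(labels)
    # Partition the class labels by the attribute's value.
    partitions = {}
    for sample, label in zip(samples, labels):
        partitions.setdefault(sample[attribute], []).append(label)
    # Information gain: parent entropy minus the weighted child entropies.
    gain = entropy(labels) - sum(
        len(part) / total * entropy(part) for part in partitions.values())
    # Split information: entropy of the partition sizes themselves.
    split_info = -sum((len(part) / total) * math.log2(len(part) / total)
                      for part in partitions.values())
    # Normalizing by split information penalizes many-valued attributes;
    # an attribute with a single value carries no information at all.
    return gain / split_info if split_info > 0 else 0.0
```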
The C4.5 algorithm then recurses on the partitioned sublists.

This algorithm has a few base cases:
• All the samples in the list belong to the same class. When this happens, C4.5 simply creates a leaf node for the decision tree that selects that class.
• None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
• An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.
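A compact check for these base cases might look as follows, reusing `gain_ratio` from the sketch above. Folding the second and third cases into a single majority-class ("expected value") fallback is a simplification of this sketch, not something C4.5 prescribes.

```python
from collections import Counter

def check_base_cases(samples, labels, attributes):
    """Return the class for a leaf node if a base case applies, else None.

    Assumes `labels` is non-empty; combining the last two base cases into
    one majority-class fallback is a simplification of this sketch.
    """
    # All samples in the list belong to the same class: leaf for that class.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes remain, or none of them provides any information gain:
    # fall back to the expected (majority) class.
    if not attributes or all(
            gain_ratio(samples, labels, a) <= 1e-12 for a in attributes):
        return Counter(labels).most_common(1)[0][0]
    return None
```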
==Pseudocode==

In pseudocode, the general algorithm for building decision trees is:
• Check for the above base cases.
• For each attribute a, find the normalized information gain ratio from splitting on a.
• Let a_best be the attribute with the highest normalized information gain.
• Create a decision node that splits on a_best.
• Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node.
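Putting the steps together, here is a sketch of the full recursive procedure, building on the `gain_ratio` and `check_base_cases` helpers above. The nested-dict tree layout and the toy weather data are made up for illustration.

```python
def build_tree(samples, labels, attributes):
    """Recursively build a decision tree following the pseudocode above."""
    # Check for the above base cases.
    leaf_class = check_base_cases(samples, labels, attributes)
    if leaf_class is not None:
        return {"leaf": leaf_class}
    # Let a_best be the attribute with the highest normalized information gain.
    a_best = max(attributes, key=lambda a: gain_ratio(samples, labels, a))
    # Create a decision node that splits on a_best.
    node = {"split_on": a_best, "children": {}}
    # Partition the samples by their value of a_best.
    partitions = {}
    for sample, label in zip(samples, labels):
        partitions.setdefault(sample[a_best], []).append((sample, label))
    remaining = [a for a in attributes if a != a_best]
    # Recurse on the sublists and add the subtrees as children of node.
    for value, part in partitions.items():
        node["children"][value] = build_tree(
            [s for s, _ in part], [l for _, l in part], remaining)
    return node

# Tiny worked example on made-up weather data:
samples = [{"outlook": "sunny", "windy": False},
           {"outlook": "rain", "windy": False},
           {"outlook": "rain", "windy": True}]
labels = ["play", "play", "no play"]
print(build_tree(samples, labels, ["outlook", "windy"]))
```

Because a_best is removed from the attribute list at each level, every recursive call works with strictly fewer attributes, so the recursion is guaranteed to terminate.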
==Improvements from ID3 algorithm==