Decision Tree (ID3 & C4.5)
Both ID3 and C4.5 are decision tree algorithms used for classification. They build a tree top-down by repeatedly choosing the best feature to split the dataset.
A decision tree has:
- root/internal nodes = feature tests
- branches = test outcomes
- leaf nodes = final class labels
The key difference is:
- ID3 chooses splits using Information Gain
- C4.5 improves ID3 by using Gain Ratio and can handle continuous features better
1. Decision Tree Idea
At each step, we want to ask the best question about the data. A good split should make the child groups more pure.
If a node contains both classes mixed together, it is impure. If a node contains only one class, it is pure.
Decision trees reduce impurity step by step until leaves are formed.
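The purity idea can be made concrete with a small sketch (`node_entropy` is an illustrative helper name, not from any library):

```python
import numpy as np

def node_entropy(labels):
    # Entropy in bits: 0 for a pure node, higher when classes are mixed.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return abs(float(-np.sum(p * np.log2(p))))  # abs() avoids -0.0 on pure nodes

print(node_entropy(["Yes", "Yes", "Yes"]))       # 0.0 -> pure node
print(node_entropy(["Yes", "No", "Yes", "No"]))  # 1.0 -> maximally impure
```

Each split the tree makes tries to drive this number toward zero in the children.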
Part A: ID3
What ID3 Does
ID3 stands for Iterative Dichotomiser 3. It builds the tree recursively:
- compute entropy of current dataset
- compute information gain of every feature
- choose the feature with the highest information gain
- split the data on that feature
- repeat for each subset
ID3 works best with categorical features.
ID3 Core Equations
Entropy
Entropy measures uncertainty in a dataset:

$$H(S) = -\sum_{i} p_i \log_2 p_i$$

where:
- $S$ = current dataset
- $p_i$ = proportion of class $i$ in $S$

For binary classification:

$$H(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$$

Meaning:
- entropy = 0 -> perfectly pure
- entropy is high -> classes are mixed
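A quick worked check of the binary formula: with 3 positive and 5 negative labels,

$$H(S) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} \approx 0.531 + 0.424 \approx 0.954$$

so the node is impure but not maximally mixed (which would be 1).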
Information Gain
If we split dataset $S$ using feature $A$:

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$

where:
- $S_v$ = subset of $S$ where feature $A$ takes value $v$

Meaning:
- higher information gain -> better split
- ID3 chooses the feature with maximum information gain
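As a sanity check of the formula, a perfect split recovers the parent's full entropy as gain (a minimal sketch; `entropy` is a local helper, not a library function):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# Parent node: 2 Yes, 2 No -> entropy 1.0
parent = ["Yes", "Yes", "No", "No"]
# A feature value that separates the classes perfectly:
left, right = ["Yes", "Yes"], ["No", "No"]

ig = entropy(parent) - (2/4) * entropy(left) - (2/4) * entropy(right)
print(ig)  # 1.0 -> the split removes all uncertainty
```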
Short and Clean ID3 Code
```python
import pandas as pd
import numpy as np

class ID3:
    def __init__(self):
        self.tree = None
        self.default_class = None

    def entropy(self, y):
        values, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p + 1e-12))

    def information_gain(self, X, y, feature):
        parent_entropy = self.entropy(y)
        values, counts = np.unique(X[feature], return_counts=True)
        child_entropy = 0
        for v, c in zip(values, counts):
            y_sub = y[X[feature] == v]
            child_entropy += (c / len(X)) * self.entropy(y_sub)
        return parent_entropy - child_entropy

    def best_feature(self, X, y):
        gains = {f: self.information_gain(X, y, f) for f in X.columns}
        return max(gains, key=gains.get)

    def build(self, X, y):
        if len(np.unique(y)) == 1:   # pure node -> leaf
            return y.iloc[0]
        if X.empty:                  # no features left -> majority class
            return y.mode()[0]
        best = self.best_feature(X, y)
        tree = {best: {}}
        for v in X[best].unique():
            mask = X[best] == v
            X_sub = X.loc[mask].drop(columns=[best])
            y_sub = y.loc[mask]
            if len(X_sub) == 0:
                tree[best][v] = y.mode()[0]
            else:
                tree[best][v] = self.build(X_sub, y_sub)
        return tree

    def fit(self, X, y):
        self.default_class = y.mode()[0]
        self.tree = self.build(X, y)
        return self

    def _predict_one(self, x, tree):
        if not isinstance(tree, dict):   # reached a leaf
            return tree
        feature = next(iter(tree))
        value = x.get(feature)
        if value not in tree[feature]:   # unseen value -> fall back to default
            return self.default_class
        return self._predict_one(x, tree[feature][value])

    def predict(self, X):
        return X.apply(lambda row: self._predict_one(row, self.tree), axis=1)
```
Small Example for ID3
We use a tiny categorical dataset:
```python
data = pd.DataFrame({
    "Weather": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak"],
    "Play": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"]
})

X = data[["Weather", "Wind"]]
y = data["Play"]

model = ID3().fit(X, y)
print("ID3 Tree:", model.tree)
print("Predictions:", model.predict(X).tolist())
```
Solving the ID3 Example Step by Step
Target classes in the dataset: Yes = 4, No = 4.

Root entropy:

$$H(S) = -\frac{4}{8}\log_2\frac{4}{8} - \frac{4}{8}\log_2\frac{4}{8} = 1$$

So the root is maximally mixed.
Step 1: Try splitting on Weather
Possible values:
- Sunny
- Overcast
- Rain
Subsets:
- Sunny -> [No, No, No]
- Overcast -> [Yes, Yes]
- Rain -> [Yes, Yes, No]

Entropies:

Sunny subset: all are No, so $H = 0$.

Overcast subset: all are Yes, so $H = 0$.

Rain subset: 2 Yes, 1 No:

$$H = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} \approx 0.918$$

Weighted entropy after splitting on Weather:

$$\frac{3}{8}(0) + \frac{2}{8}(0) + \frac{3}{8}(0.918) \approx 0.344$$

Information gain:

$$IG(\text{Weather}) = 1 - 0.344 = 0.656$$
Step 2: Try splitting on Wind
Possible values:
- Weak
- Strong
Subsets:
- Weak -> [No, Yes, Yes, Yes, No]
- Strong -> [No, No, Yes]

Entropies:

Weak subset: 3 Yes, 2 No:

$$H = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.971$$

Strong subset: 1 Yes, 2 No:

$$H = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} \approx 0.918$$

Weighted entropy:

$$\frac{5}{8}(0.971) + \frac{3}{8}(0.918) \approx 0.951$$

Information gain:

$$IG(\text{Wind}) = 1 - 0.951 = 0.049$$
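The two hand-computed gains can be verified numerically; this standalone sketch recomputes them with NumPy:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

y = np.array(["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"])
weather = np.array(["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny"])
wind = np.array(["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak"])

def info_gain(feature, y):
    # Parent entropy minus the size-weighted entropy of each child subset.
    total = entropy(y)
    for v in np.unique(feature):
        sub = y[feature == v]
        total -= len(sub) / len(y) * entropy(sub)
    return total

print(round(info_gain(weather, y), 3))  # 0.656
print(round(info_gain(wind, y), 3))     # 0.049
```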
Step 3: Choose the best feature
Since $IG(\text{Weather}) = 0.656 > IG(\text{Wind}) = 0.049$, ID3 chooses Weather as the root feature.
So the first tree becomes:

```
Weather
├── Sunny -> No
├── Overcast -> Yes
└── Rain -> split again
```
Step 4: Recurse on the Rain subset
Rain subset:
- Weak -> Yes
- Weak -> Yes
- Strong -> No
Now only one feature remains: Wind.

If we split on Wind:
- Weak -> all Yes
- Strong -> all No
So the final tree is:

```
Weather
├── Sunny -> No
├── Overcast -> Yes
└── Rain
    ├── Weak -> Yes
    └── Strong -> No
```
This is exactly how ID3 builds the tree recursively.
ID3 Concept -> Equation -> Code
1. Measure impurity
Concept: We first measure how mixed the labels are.
Equation:

$$H(S) = -\sum_{i} p_i \log_2 p_i$$

Code:

```python
def entropy(self, y):
    values, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p + 1e-12))
```
2. Score each feature
Concept: Check how much uncertainty reduces after splitting on a feature.
Equation:

$$IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} \, H(S_v)$$

Code:

```python
def information_gain(self, X, y, feature):
    parent_entropy = self.entropy(y)
    values, counts = np.unique(X[feature], return_counts=True)
    child_entropy = 0
    for v, c in zip(values, counts):
        y_sub = y[X[feature] == v]
        child_entropy += (c / len(X)) * self.entropy(y_sub)
    return parent_entropy - child_entropy
```
3. Pick the best feature
Concept: Choose the feature with maximum information gain.
Code:

```python
def best_feature(self, X, y):
    gains = {f: self.information_gain(X, y, f) for f in X.columns}
    return max(gains, key=gains.get)
```
4. Build tree recursively
Concept: After choosing the best feature, split data and build smaller trees.
Code:

```python
def build(self, X, y):
    if len(np.unique(y)) == 1:
        return y.iloc[0]
    if X.empty:
        return y.mode()[0]
    best = self.best_feature(X, y)
    tree = {best: {}}
    for v in X[best].unique():
        mask = X[best] == v
        X_sub = X.loc[mask].drop(columns=[best])
        y_sub = y.loc[mask]
        tree[best][v] = self.build(X_sub, y_sub)
    return tree
```
Meaning:
- if node is pure -> make leaf
- if no features left -> return majority class
- else split and recurse
ID3 Pseudocode
```
ID3(D, features):
    if all labels same:
        return leaf
    if no features left:
        return majority class
    best = feature with max information gain
    create node(best)
    for each value v of best:
        recurse on subset where best = v
```
ID3 Advantages
- simple and easy to understand
- tree rules are interpretable
- good for categorical data
- useful in exam explanations because steps are very clear
ID3 Limitations
- biased toward features with many distinct values
- not naturally suited for continuous features
- can overfit
- sensitive to noise
Part B: C4.5
What C4.5 Does
C4.5 is an improved version of ID3. It fixes major weaknesses of ID3.
Main improvements:
- uses Gain Ratio instead of pure Information Gain
- handles continuous features
- can handle missing values better
- usually produces more practical trees
So:
- ID3 = older, simpler
- C4.5 = smarter extension of ID3
Why ID3 Needs Improvement
Information Gain can favor a feature with many distinct values.
Example:
- a feature like StudentID may split every row separately
- this gives very high information gain
- but it does not generalize

So C4.5 divides Information Gain by a quantity called Split Information.
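The penalty can be seen directly: split information is just the entropy of the partition sizes, and a unique-ID column maximizes it (a small illustrative sketch; `split_info` is a local helper):

```python
import numpy as np

def split_info(values):
    # Split information: entropy of the partition sizes themselves.
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

ids = ["S1", "S2", "S3", "S4", "S5", "S6"]  # unique per row
weather = ["Sunny", "Sunny", "Rain", "Rain", "Overcast", "Overcast"]

print(split_info(ids))      # log2(6) ≈ 2.585 -> large denominator
print(split_info(weather))  # log2(3) ≈ 1.585 -> smaller denominator
```

Dividing information gain by these values shrinks the score of the unique-ID feature much more than that of a normal feature.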
C4.5 Core Equations
Entropy

Same as ID3:

$$H(S) = -\sum_{i} p_i \log_2 p_i$$

Information Gain

Same as ID3:

$$IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} \, H(S_v)$$

Split Information

$$SI(S, A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$

Gain Ratio

$$GR(S, A) = \frac{IG(S, A)}{SI(S, A)}$$
C4.5 chooses the feature with the highest gain ratio.
Short and Clean C4.5 Code
```python
import pandas as pd
import numpy as np

class C45:
    def __init__(self):
        self.tree = None
        self.default_class = None

    def entropy(self, y):
        values, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p + 1e-12))

    def information_gain(self, X, y, feature):
        parent_entropy = self.entropy(y)
        values, counts = np.unique(X[feature], return_counts=True)
        child_entropy = 0
        for v, c in zip(values, counts):
            y_sub = y[X[feature] == v]
            child_entropy += (c / len(X)) * self.entropy(y_sub)
        return parent_entropy - child_entropy

    def split_info(self, X, feature):
        values, counts = np.unique(X[feature], return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p + 1e-12))

    def gain_ratio(self, X, y, feature):
        ig = self.information_gain(X, y, feature)
        si = self.split_info(X, feature)
        return 0 if si == 0 else ig / si

    def best_feature(self, X, y):
        ratios = {f: self.gain_ratio(X, y, f) for f in X.columns}
        return max(ratios, key=ratios.get)

    def build(self, X, y):
        if len(np.unique(y)) == 1:
            return y.iloc[0]
        if X.empty:
            return y.mode()[0]
        best = self.best_feature(X, y)
        tree = {best: {}}
        for v in X[best].unique():
            mask = X[best] == v
            X_sub = X.loc[mask].drop(columns=[best])
            y_sub = y.loc[mask]
            if len(X_sub) == 0:
                tree[best][v] = y.mode()[0]
            else:
                tree[best][v] = self.build(X_sub, y_sub)
        return tree

    def fit(self, X, y):
        self.default_class = y.mode()[0]
        self.tree = self.build(X, y)
        return self

    def _predict_one(self, x, tree):
        if not isinstance(tree, dict):
            return tree
        feature = next(iter(tree))
        value = x.get(feature)
        if value not in tree[feature]:
            return self.default_class
        return self._predict_one(x, tree[feature][value])

    def predict(self, X):
        return X.apply(lambda row: self._predict_one(row, self.tree), axis=1)
```
Small Example for C4.5
We use a dataset with a high-cardinality feature to see why Gain Ratio helps:
```python
data = pd.DataFrame({
    "ID": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "Weather": ["Sunny", "Sunny", "Rain", "Rain", "Overcast", "Overcast"],
    "Play": ["No", "No", "Yes", "Yes", "Yes", "Yes"]
})

X = data[["ID", "Weather"]]
y = data["Play"]

model = C45().fit(X, y)
print("C4.5 Tree:", model.tree)
print("Predictions:", model.predict(X).tolist())
```
Solving the C4.5 Example Step by Step
Root labels: Yes = 4, No = 2.

Entropy:

$$H(S) = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} \approx 0.918$$
Feature 1: ID
Every ID is unique. So each split contains exactly 1 sample. That means every child subset is pure.
Weighted child entropy: 0. Thus:

$$IG(\text{ID}) = 0.918 - 0 = 0.918$$

This looks perfect for ID3, but it is misleading because ID is just a unique label.

Now compute split information. Since there are 6 equally-sized branches:

$$SI(\text{ID}) = -6 \cdot \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.585$$

So the gain ratio is:

$$GR(\text{ID}) = \frac{0.918}{2.585} \approx 0.355$$
Feature 2: Weather
Values:
- Sunny -> [No, No]
- Rain -> [Yes, Yes]
- Overcast -> [Yes, Yes]
All child subsets are pure, so:

$$IG(\text{Weather}) = 0.918 - 0 = 0.918$$

Split information: there are 3 equally-sized groups of size 2:

$$SI(\text{Weather}) = -3 \cdot \frac{2}{6}\log_2\frac{2}{6} = \log_2 3 \approx 1.585$$

Gain ratio:

$$GR(\text{Weather}) = \frac{0.918}{1.585} \approx 0.579$$
Choose the best feature
Although both features have equal information gain, the gain ratio is larger for Weather:

$$GR(\text{Weather}) = 0.579 > GR(\text{ID}) = 0.355$$

So C4.5 correctly chooses Weather instead of ID.
This is the key improvement over ID3.
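These hand-computed ratios can be double-checked with a short standalone sketch, using the fact that split information equals the entropy of the feature's own value distribution:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

y = np.array(["No", "No", "Yes", "Yes", "Yes", "Yes"])
ids = np.array(["S1", "S2", "S3", "S4", "S5", "S6"])
weather = np.array(["Sunny", "Sunny", "Rain", "Rain", "Overcast", "Overcast"])

def gain_ratio(feature, y):
    ig = entropy(y)
    for v in np.unique(feature):
        sub = y[feature == v]
        ig -= len(sub) / len(y) * entropy(sub)
    si = entropy(feature)  # split info = entropy of the feature's value distribution
    return ig / si

print(round(gain_ratio(ids, y), 3))      # 0.355
print(round(gain_ratio(weather, y), 3))  # 0.579
```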
C4.5 Concept -> Equation -> Code
1. Compute entropy
Concept: Measure uncertainty at the current node.
Equation:

$$H(S) = -\sum_{i} p_i \log_2 p_i$$

Code:

```python
def entropy(self, y):
    values, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p + 1e-12))
```
2. Compute information gain
Concept: Measure reduction in entropy after split.
Equation:

$$IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} \, H(S_v)$$

Code:

```python
def information_gain(self, X, y, feature):
    parent_entropy = self.entropy(y)
    ...
    return parent_entropy - child_entropy
```
3. Compute split information
Concept: Measure how broadly the split divides the data.
Equation:

$$SI(S, A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$

Code:

```python
def split_info(self, X, feature):
    values, counts = np.unique(X[feature], return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p + 1e-12))
```
4. Compute gain ratio
Concept: Normalize information gain so features with too many distinct values are not unfairly favored.
Equation:

$$GR(S, A) = \frac{IG(S, A)}{SI(S, A)}$$

Code:

```python
def gain_ratio(self, X, y, feature):
    ig = self.information_gain(X, y, feature)
    si = self.split_info(X, feature)
    return 0 if si == 0 else ig / si
```
5. Choose best feature and recurse
Concept: Build the decision tree exactly like ID3, but use Gain Ratio instead of Information Gain.
Code:

```python
def best_feature(self, X, y):
    ratios = {f: self.gain_ratio(X, y, f) for f in X.columns}
    return max(ratios, key=ratios.get)
```
Handling Continuous Features in C4.5
A major improvement of C4.5 is continuous-value handling. For a numeric feature $A$, C4.5 tries candidate thresholds $t$ (midpoints between sorted adjacent values) and chooses the threshold with the best gain ratio.

Example split: $A \le 120$ vs $A > 120$.

So unlike ID3, C4.5 can naturally work with continuous attributes by converting them into binary threshold splits.

A simplified threshold idea in code would be:

```python
threshold = 120
left = X[feature] <= threshold
right = X[feature] > threshold
```

Then entropy, IG, and GR are computed for that split.
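The threshold search can be sketched end to end. This simplified version scores candidate midpoints with plain information gain (C4.5 proper would score them with gain ratio); `best_threshold` and the `income` data are illustrative, not from any library:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_threshold(x, y):
    # Try midpoints between sorted adjacent values; keep the one with max gain.
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_gain = None, -1.0
    for i in range(len(x) - 1):
        if x[i] == x[i + 1]:
            continue
        t = (x[i] + x[i + 1]) / 2
        left, right = y[x <= t], y[x > t]
        gain = entropy(y) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

income = np.array([80, 95, 110, 130, 150, 170])
label = np.array(["No", "No", "No", "Yes", "Yes", "Yes"])
t, gain = best_threshold(income, label)
print(t, gain)  # 120.0 1.0 -> the midpoint between 110 and 130 separates the classes
```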
ID3 vs C4.5
Main Difference
ID3 chooses:

$$A^{*} = \arg\max_{A} IG(S, A)$$

C4.5 chooses:

$$A^{*} = \arg\max_{A} GR(S, A)$$
Comparison Table
| Aspect | ID3 | C4.5 |
|---|---|---|
| Split criterion | Information Gain | Gain Ratio |
| Handles categorical features | Yes | Yes |
| Handles continuous features | Poorly / not directly | Yes |
| Bias toward many-valued features | High | Reduced |
| Missing values | Weak | Better |
| Complexity | Simpler | Slightly more advanced |
Full Combined Example
```python
import pandas as pd

data = pd.DataFrame({
    "Weather": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak"],
    "Play": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"]
})

X = data[["Weather", "Wind"]]
y = data["Play"]

id3_model = ID3().fit(X, y)
c45_model = C45().fit(X, y)

print("ID3 Tree:", id3_model.tree)
print("C4.5 Tree:", c45_model.tree)
print("ID3 Predictions:", id3_model.predict(X).tolist())
print("C4.5 Predictions:", c45_model.predict(X).tolist())
```
How the Tree Predicts
Suppose the tree is:

```
Weather
├── Sunny -> No
├── Overcast -> Yes
└── Rain
    ├── Weak -> Yes
    └── Strong -> No
```
For a new row with Weather = Rain and Wind = Weak, the path is:
- check Weather -> move to the Rain branch
- check Wind -> Weak -> leaf = Yes

So the predicted class is Yes.
Why These Algorithms Work
Both algorithms work by repeatedly reducing uncertainty. They ask:
which feature makes the class labels as pure as possible after splitting?
ID3 answers this using information gain. C4.5 answers this using gain ratio.
The recursion stops when:
- node becomes pure
- no features remain
- or no useful split is possible
So a large classification problem becomes a sequence of small rule-based decisions.
Advantages of ID3 and C4.5
- easy to interpret
- produces human-readable rules
- no heavy math during prediction
- useful for categorical classification tasks
- good for exam answers because the logic is visual and recursive
Limitations
ID3
- biased toward attributes with many values
- struggles with continuous data
- can overfit
- sensitive to noise
C4.5
- more complex than ID3
- deeper trees can still overfit without pruning
- threshold search for continuous features adds computation
Exam-Oriented Summary
ID3 Definition
ID3 is a top-down greedy decision tree algorithm that chooses the feature with the highest information gain at each step.
C4.5 Definition
C4.5 is an improved decision tree algorithm that extends ID3 by using gain ratio and supporting continuous features.
ID3 Formula
Entropy:

$$H(S) = -\sum_{i} p_i \log_2 p_i$$

Information Gain:

$$IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} \, H(S_v)$$
C4.5 Formula
Split Information:

$$SI(S, A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$

Gain Ratio:

$$GR(S, A) = \frac{IG(S, A)}{SI(S, A)}$$
Common Steps
- compute impurity of current dataset
- score every feature
- choose best feature
- split dataset
- repeat recursively
- stop when node is pure or features are exhausted
Very Short Revision
ID3
- uses entropy
- computes information gain
- picks feature with max gain
- recursive tree construction
C4.5
- starts like ID3
- adds split information
- uses gain ratio
- handles continuous features better
Final Takeaway
ID3 and C4.5 both build decision trees by recursively selecting the best splitting feature. ID3 uses Information Gain, while C4.5 improves it using Gain Ratio to avoid unfair preference for features with many distinct values. So:
- use ID3 to understand the core decision tree idea
- use C4.5 as the more practical and improved version