Linear Regression
Simple Linear Regression predicts a continuous value using one input feature by fitting a straight line:

$$\hat{y} = \beta_0 + \beta_1 x$$

where:
- $\beta_0$ = intercept
- $\beta_1$ = slope
- $\hat{y}$ = predicted output

The goal is to find the line that minimizes the sum of squared errors:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Intuition
We want the “best” straight line through the data.
- If $\beta_1 > 0$, the line goes upward
- If $\beta_1 < 0$, the line goes downward
- $\beta_0$ tells where the line cuts the $y$-axis

For a dataset with one feature, we add a bias column of 1s:

$$X_b = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$

Then the parameters are computed using the Normal Equation:

$$\theta = (X_b^T X_b)^{-1} X_b^T y$$

where $\theta = [\beta_0, \beta_1]^T$.
Short and Clean Code
import numpy as np
import matplotlib.pyplot as plt

class SimpleLinearRegression:
    def __init__(self):
        self.intercept_ = 0.0
        self.coef_ = 0.0
        self.r2_ = 0.0

    def fit(self, X, y):
        X = np.asarray(X).reshape(-1, 1)
        y = np.asarray(y).reshape(-1, 1)
        Xb = np.c_[np.ones((len(X), 1)), X]
        theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
        self.intercept_ = theta[0, 0]
        self.coef_ = theta[1, 0]
        y_pred = Xb @ theta
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        self.r2_ = 1 - ss_res / ss_tot
        return self

    def predict(self, X):
        X = np.asarray(X).reshape(-1, 1)
        return self.intercept_ + self.coef_ * X

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12])

model = SimpleLinearRegression().fit(X, y)
y_pred = model.predict(X)

print("Intercept:", round(model.intercept_, 4))
print("Slope:", round(model.coef_, 4))
print("R^2:", round(model.r2_, 4))

plt.scatter(X, y, label="Data")
plt.plot(X, y_pred, label="Regression line")
plt.xlabel("X")
plt.ylabel("y")
plt.title("Simple Linear Regression")
plt.legend()
plt.show()
Output Meaning
From this code, the fitted line is approximately:

$$\hat{y} \approx 1.0667 + 1.0061x$$

So:
- intercept $\beta_0 \approx 1.0667$
- slope $\beta_1 \approx 1.0061$

This means:
- when $x = 0$, predicted $\hat{y} \approx 1.07$
- for every increase of 1 in $x$, $\hat{y}$ increases by about $1.01$

The $R^2$ score is approximately $0.9446$, which means the model explains about 94.46% of the variance in the target.
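These coefficients can be cross-checked independently. As a quick sketch, `np.polyfit` with degree 1 fits the same least-squares line:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12], dtype=float)

# polyfit(X, y, 1) returns [slope, intercept] for the least-squares line
slope, intercept = np.polyfit(X, y, 1)
print(round(slope, 4), round(intercept, 4))  # -> 1.0061 1.0667
```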
Step-by-Step Algorithm
Step 1: Prepare the data
We start with:
X = np.array([1,2,3,4,5,6,7,8,9,10])
y = np.array([2,4,5,4,5,7,8,9,10,12])
Concept:
- $x$ is the input feature
- $y$ is the actual output

Dataset pairs: $(1, 2), (2, 4), (3, 5), \ldots, (10, 12)$
Step 2: Add the bias column
To learn both intercept and slope together, we transform $X$ into:

$$X_b = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$
Code:
Xb = np.c_[np.ones((len(X), 1)), X.reshape(-1, 1)]
Concept:
- first column of 1s handles the intercept
- second column stores the feature values
Step 3: Compute parameters using the Normal Equation

Equation:

$$\theta = (X_b^T X_b)^{-1} X_b^T y$$

Code:
theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y.reshape(-1, 1)

This gives:

$$\theta \approx \begin{bmatrix} 1.0667 \\ 1.0061 \end{bmatrix}$$

So $\beta_0 \approx 1.0667$ and $\beta_1 \approx 1.0061$.

Meaning: the fitted line is $\hat{y} \approx 1.0667 + 1.0061x$.
Step 4: Predict values
For each input, substitute into:

$$\hat{y} = \beta_0 + \beta_1 x \approx 1.0667 + 1.0061x$$

Code:
y_pred = self.intercept_ + self.coef_ * X.reshape(-1, 1)

Let us compute a few predictions manually.

For $x = 1$: $\hat{y} \approx 1.0667 + 1.0061 \cdot 1 \approx 2.073$

For $x = 2$: $\hat{y} \approx 1.0667 + 1.0061 \cdot 2 \approx 3.079$

For $x = 5$: $\hat{y} \approx 1.0667 + 1.0061 \cdot 5 \approx 6.097$

For $x = 10$: $\hat{y} \approx 1.0667 + 1.0061 \cdot 10 \approx 11.127$

So the model predictions are approximately $2.07, 3.08, \ldots, 11.13$, close to the actual values.
Step 5: Measure performance using $R^2$

The coefficient of determination is:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Where:
- $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ = residual sum of squares
- $SS_{tot} = \sum_i (y_i - \bar{y})^2$ = total sum of squares
- $\bar{y}$ = mean of actual $y$
Code:
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
self.r2_ = 1 - ss_res / ss_tot
For this dataset: $R^2 \approx 0.9446$

Interpretation:
- $R^2$ close to 1 means strong fit
- $R^2$ close to 0 means poor fit
Code Explanation: Concept -> Equation -> Code
1. Store model parameters
Concept: We need to remember the learned intercept, slope, and model score.
Code:
class SimpleLinearRegression:
    def __init__(self):
        self.intercept_ = 0.0
        self.coef_ = 0.0
        self.r2_ = 0.0
Meaning:
- intercept_ stores $\beta_0$
- coef_ stores $\beta_1$
- r2_ stores the $R^2$ score
2. Convert input into column form
Concept: Matrix equations require $X$ and $y$ in proper shapes.
Code:
X = np.asarray(X).reshape(-1, 1)
y = np.asarray(y).reshape(-1, 1)
Meaning:
- reshape(-1, 1) makes data a column vector

Example: [1, 2, 3] becomes [[1], [2], [3]]
3. Add the bias term
Concept: The intercept must be part of the matrix multiplication.
Equation: $X_b = [\mathbf{1} \;\; X]$
Code:
Xb = np.c_[np.ones((len(X), 1)), X]
This builds a matrix whose first column is all 1s and whose second column holds the feature values.
4. Learn the best-fit line
Concept: Choose parameters that minimize squared error.
Equation: $\theta = (X_b^T X_b)^{-1} X_b^T y$
Code:
theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
This is the heart of the algorithm.
Then:
self.intercept_ = theta[0, 0]
self.coef_ = theta[1, 0]
This maps:
- theta[0, 0] -> intercept $\beta_0$
- theta[1, 0] -> slope $\beta_1$
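A practical aside (not from the original code): explicitly inverting $X_b^T X_b$ can be numerically fragile when features are highly correlated. `np.linalg.lstsq` solves the same least-squares problem more robustly; a minimal sketch on this dataset:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12], dtype=float).reshape(-1, 1)
Xb = np.c_[np.ones((len(X), 1)), X]

# Normal Equation via explicit inverse
theta_inv = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
# Same least-squares solution via a dedicated solver (numerically safer)
theta_lstsq, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(np.allclose(theta_inv, theta_lstsq))  # True
```

Both approaches return the same $\theta$ here; the solver route simply avoids forming and inverting $X_b^T X_b$.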
5. Predict outputs
Concept: Once parameters are known, plug them into the line equation.
Equation: $\hat{y} = \beta_0 + \beta_1 x$
Code:
y_pred = Xb @ theta
or in predict():
return self.intercept_ + self.coef_ * X
Both do the same thing.
6. Evaluate model fit
Concept: Compare predictions with actual values.
Equation: $R^2 = 1 - \dfrac{SS_{res}}{SS_{tot}}$
Code:
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
self.r2_ = 1 - ss_res / ss_tot
Worked Example

Using $x = 3$:

Prediction: $\hat{y} \approx 1.0667 + 1.0061 \cdot 3 \approx 4.085$

Actual value in dataset: $y = 5$

Error: $5 - 4.085 = 0.915$

Squared error: $0.915^2 \approx 0.838$

This is how the model measures how far a prediction is from the truth.
Why This Algorithm Works
Simple Linear Regression assumes:
- the relationship is approximately linear
- one feature is enough to explain the target
- the best line is the one with minimum squared error
By minimizing squared error, the algorithm finds a line that stays as close as possible to all points overall.
Final Summary
Simple Linear Regression learns:

$$\hat{y} = \beta_0 + \beta_1 x$$

using the Normal Equation $\theta = (X_b^T X_b)^{-1} X_b^T y$.

For this dataset, the learned model is $\hat{y} \approx 1.0667 + 1.0061x$ with $R^2 \approx 0.9446$.
So the model fits the data well and captures a strong positive linear relationship.
Exam-Oriented Points
- Used for predicting a continuous value
- Works with one input feature
- Equation of model: $\hat{y} = \beta_0 + \beta_1 x$
- Parameters are found using the Normal Equation $\theta = (X^T X)^{-1} X^T y$
- Performance is commonly measured with $R^2$
- Best when the relationship between feature and target is approximately linear
Very Short Revision
- Add bias column
- Compute parameters using Normal Equation
- Form regression line
- Predict output
- Evaluate using $R^2$

Formula: $\theta = (X_b^T X_b)^{-1} X_b^T y$. Model: $\hat{y} = \beta_0 + \beta_1 x$.
Logistic Regression
Logistic Regression is a binary classification algorithm used when the output belongs to one of two classes: $y \in \{0, 1\}$.

It does not predict the class directly using a line. Instead, it predicts a probability using the sigmoid function, then converts that probability into class 0 or 1.

The model is:

$$p = \sigma(z), \quad z = w^T x + b$$

where:
- $z$ = linear score
- $\sigma(z) = \dfrac{1}{1 + e^{-z}}$ = sigmoid function
- $p$ = predicted probability that class is 1

If $p \geq 0.5$, predict class 1; otherwise predict class 0.
Main Idea
Linear Regression gives any real number as output, but classification needs a value between 0 and 1. So Logistic Regression first computes a linear combination:

$$z = w^T x + b$$

then applies the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This maps any real number into $(0, 1)$, so the output can be interpreted as a probability.
Short and Clean Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class LogisticRegressionScratch:
    def __init__(self, lr=0.1, epochs=1000):
        self.lr = lr
        self.epochs = epochs
        self.w = None
        self.b = 0.0
        self.loss_history = []

    def _sigmoid(self, z):
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def _loss(self, y, p):
        eps = 1e-9
        p = np.clip(p, eps, 1 - eps)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        m, n = X.shape
        self.w = np.zeros(n)
        for _ in range(self.epochs):
            z = X @ self.w + self.b
            p = self._sigmoid(z)
            dw = (X.T @ (p - y)) / m
            db = np.mean(p - y)
            self.w -= self.lr * dw
            self.b -= self.lr * db
            self.loss_history.append(self._loss(y, p))
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)

np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegressionScratch(lr=0.1, epochs=1000)
model.fit(X_train, y_train)
pred = model.predict(X_test)
acc = np.mean(pred == y_test)

print("Weights:", np.round(model.w, 4))
print("Bias:", round(model.b, 4))
print("Accuracy:", round(acc, 4))

plt.plot(model.loss_history)
plt.title("Loss Convergence")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.grid(True)
plt.show()
What This Code Does
This example creates 2D points $(x_1, x_2)$ with each coordinate in $[0, 10]$ and labels them with the rule $y = 1$ if $x_1 + x_2 > 10$, else $y = 0$. So the true decision boundary is the line $x_1 + x_2 = 10$. This is a binary classification problem.
Step-by-Step Algorithm
Step 1: Create the dataset
Code:
np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)
Concept:
- X contains 200 samples
- each sample has 2 features: $(x_1, x_2)$
- the class label depends on whether the sum is greater than 10

Equation:

$$y = \begin{cases} 1 & \text{if } x_1 + x_2 > 10 \\ 0 & \text{otherwise} \end{cases}$$

Meaning:
- points above the line $x_1 + x_2 = 10$ belong to class 1
- points below the line belong to class 0
Step 2: Split into train and test sets
Code:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
Concept:
- training data is used to learn parameters
- test data is used to check performance on unseen data
Here:
- 80% for training
- 20% for testing
Step 3: Standardize features
Code:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Concept: Gradient descent works better when features are on similar scales.
Standardization formula:

$$x' = \frac{x - \mu}{\sigma}$$

where:
- $\mu$ = mean of the feature
- $\sigma$ = standard deviation
Why it helps:
- faster convergence
- more stable updates
- one feature does not dominate another due to large magnitude
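The effect of the formula can be verified directly; after standardization every column has mean 0 and standard deviation 1. A minimal sketch with made-up data on the same 0-10 scale as the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 2)) * 10  # illustrative features on a 0-10 scale

# z = (x - mu) / sigma, applied column by column
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Xs = (X - mu) / sigma

# Each standardized column now has mean ~0 and std ~1
print(np.allclose(Xs.mean(axis=0), 0), np.allclose(Xs.std(axis=0), 1))
```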
Step 4: Compute the linear score
For each sample, Logistic Regression first computes:

$$z = w_1 x_1 + w_2 x_2 + b$$

In vector form: $z = Xw + b$
Code:
z = X @ self.w + self.b
Concept: This is the same linear part used in linear models. But here it is not the final output. It is only the input to the sigmoid function.
Step 5: Apply sigmoid to get probability
Equation: $p = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
Code:
p = self._sigmoid(z)
Concept: The sigmoid compresses any real number into a value between 0 and 1.
Examples:
- if $z = 0$: $\sigma(0) = 0.5$
- if $z$ is large positive, probability is close to 1
- if $z$ is large negative, probability is close to 0

So: $p$ can be read as $P(y = 1 \mid x)$.
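These three cases can be checked in a couple of lines:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0.0))   # exactly 0.5
print(sigmoid(6.0))   # close to 1
print(sigmoid(-6.0))  # close to 0
```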
Step 6: Measure error using cross-entropy loss
For Logistic Regression, we do not use mean squared error. We use cross-entropy loss:

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
Code:
def _loss(self, y, p):
    eps = 1e-9
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
Concept:
- if the actual class is 1, we want $p$ close to 1
- if the actual class is 0, we want $p$ close to 0
- wrong confident predictions are penalized heavily

Why clipping is used:
- log(0) is undefined
- so probabilities are clipped slightly away from 0 and 1
Step 7: Compute gradients
To reduce loss, we update weights and bias using gradient descent.
Gradient formulas:

$$\frac{\partial L}{\partial w} = \frac{1}{m} X^T (p - y), \qquad \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)$$
Code:
dw = (X.T @ (p - y)) / m
db = np.mean(p - y)
Concept:
- dw tells how the weights should change
- db tells how the bias should change
- if a prediction is too large, parameters are pushed downward
- if a prediction is too small, parameters are pushed upward
Step 8: Update parameters
Gradient descent update rule:

$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \, \frac{\partial L}{\partial b}$$

where $\eta$ is the learning rate.
Code:
self.w -= self.lr * dw
self.b -= self.lr * db
Concept:
- move parameters in the direction that reduces loss
- repeat many times until learning stabilizes
Step 9: Convert probabilities to classes
After training, predicted probability is converted to class label.
Rule: predict class 1 if $p \geq 0.5$, otherwise class 0.
Code:
return (self.predict_proba(X) >= 0.5).astype(int)
Concept:
- probabilities are continuous
- classification needs discrete labels
Step 10: Measure accuracy
Code:
pred = model.predict(X_test)
acc = np.mean(pred == y_test)
Equation: $\text{accuracy} = \dfrac{\text{number of correct predictions}}{\text{total predictions}}$
Concept: Accuracy tells what fraction of test samples were classified correctly.
Concept -> Equation -> Code Mapping
1. Model parameters
Concept: The model must learn weights $w$ and bias $b$.

Equation: $z = Xw + b$, with $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$
Code:
self.w = np.zeros(n)
self.b = 0.0
Meaning:
- start with all weights as 0
- start with bias as 0
2. Probability model
Concept: Turn linear score into probability.
Equation: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
Code:
def _sigmoid(self, z):
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))
Why clip:
- avoids overflow for very large positive or negative values
3. Forward pass
Concept: Compute predictions from current parameters.
Equation: $z = Xw + b$, $\; p = \sigma(z)$
Code:
z = X @ self.w + self.b
p = self._sigmoid(z)
Meaning:
- z = raw score
- p = predicted probability
4. Loss calculation
Concept: See how wrong the predictions are.
Equation: $L = -\dfrac{1}{m} \sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$
Code:
self.loss_history.append(self._loss(y, p))
Meaning:
- each iteration stores loss
- useful for checking whether training is improving
5. Backward pass
Concept: Find how parameters affect the loss.
Equation: $\dfrac{\partial L}{\partial w} = \dfrac{1}{m} X^T (p - y)$, $\; \dfrac{\partial L}{\partial b} = \dfrac{1}{m} \sum_i (p_i - y_i)$
Code:
dw = (X.T @ (p - y)) / m
db = np.mean(p - y)
Meaning: These gradients guide the update step.
6. Learning step
Concept: Improve the model gradually.
Equation: $w \leftarrow w - \eta \, dw$, $\; b \leftarrow b - \eta \, db$
Code:
self.w -= self.lr * dw
self.b -= self.lr * db
Meaning: Repeated updates make the model better at classification.
Worked Example on One Sample
Suppose after some training, one sample has features $x = (1.0, 0.5)$ and the model has $w = (2.0, 1.0)$, $b = -1.5$ (illustrative values).

Step 1: Compute score

$$z = 2.0 \cdot 1.0 + 1.0 \cdot 0.5 - 1.5 = 1.0$$

Step 2: Apply sigmoid

$$p = \sigma(1.0) \approx 0.731$$

Step 3: Classify

Since $0.731 \geq 0.5$, the prediction is $\hat{y} = 1$.

So this sample is classified as class 1.
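As a numeric sanity check, the same forward pass with illustrative values $w = (2.0, 1.0)$, $b = -1.5$, $x = (1.0, 0.5)$ (chosen for this sketch, not taken from a trained model):

```python
import numpy as np

w = np.array([2.0, 1.0])   # illustrative weights
b = -1.5                   # illustrative bias
x = np.array([1.0, 0.5])   # one sample

z = x @ w + b                # 2*1.0 + 1*0.5 - 1.5 = 1.0
p = 1 / (1 + np.exp(-z))     # sigmoid(1.0) ~ 0.7311
label = int(p >= 0.5)
print(z, round(p, 4), label)
```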
Why Logistic Regression Works
The model learns a boundary where the probability changes from class 0 to class 1. For two features, the decision boundary is:

$$w_1 x_1 + w_2 x_2 + b = 0$$

because $p = 0.5$ exactly when $z = 0$. So:
- if $z > 0$, the class tends toward 1
- if $z < 0$, the class tends toward 0
This creates a linear decision boundary.
For your dataset, true labels come from the rule $x_1 + x_2 > 10$, which is a linear condition. So Logistic Regression is a good fit because the classes are separable by a line.
What the Loss Curve Means
Code:
plt.plot(model.loss_history)
plt.title("Loss Convergence")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.grid(True)
plt.show()
Concept:
- at the beginning, loss is high
- during learning, loss should decrease
- a downward curve means gradient descent is working
If the curve:
- decreases smoothly -> learning is stable
- oscillates wildly -> learning rate may be too high
- decreases very slowly -> learning rate may be too low
Practical Notes
1. Feature scaling matters
Because Logistic Regression uses gradient descent, features with large values can slow down convergence. That is why standardization $x' = (x - \mu)/\sigma$ is important.
2. Learning rate matters
If learning rate is:
- too high -> training may diverge
- too low -> training becomes very slow
Typical starting values: $\eta = 0.1$, $0.01$, or $0.001$.
3. Linear boundary limitation
Logistic Regression assumes the boundary is linear: $w^T x + b = 0$. If data is non-linear, you may need:
- feature engineering
- polynomial features
- another model
4. Correlated features can cause issues
If features are strongly correlated, learning may become unstable. This is called multicollinearity.
Exam-Oriented Summary
Definition
Logistic Regression is a supervised learning algorithm used for binary classification.
Model

$$p = \sigma(w^T x + b), \quad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Decision Rule

Predict class 1 if $p \geq 0.5$, else class 0.

Loss Function

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

Gradient Descent Updates

$$w \leftarrow w - \eta \, \frac{1}{m} X^T (p - y), \qquad b \leftarrow b - \eta \, \frac{1}{m} \sum_i (p_i - y_i)$$
Uses
- spam detection
- disease prediction
- pass/fail prediction
- yes/no classification tasks
Very Short Revision
- Compute linear score: $z = w^T x + b$
- Apply sigmoid: $p = \sigma(z) = \frac{1}{1 + e^{-z}}$
- Compute cross-entropy loss
- Update weights and bias using gradient descent
- Convert probability to class using threshold 0.5
Final Takeaway
Logistic Regression is a simple but powerful classification algorithm. It learns a linear decision boundary, uses sigmoid to output probabilities, and improves itself by minimizing cross-entropy loss using gradient descent. For your example, it works well because the classes are generated by a rule that is approximately linearly separable.
Principal Component Analysis (PCA)
PCA is a dimensionality reduction algorithm. It creates new features called principal components that keep as much information as possible from the original dataset. The main idea is:
- find directions where the data varies the most
- rank those directions
- keep only the most important ones
If the original data has $d$ features, PCA can produce up to $d$ principal components.
Why PCA is Needed
High-dimensional data causes problems such as:
- harder visualization
- more computation
- more difficult learning
- curse of dimensionality
PCA solves this by projecting the data onto fewer dimensions while preserving maximum variance.
Core Idea
Suppose the data matrix is $X \in \mathbb{R}^{n \times d}$, where:
- $n$ = number of samples
- $d$ = number of features

PCA finds a new set of orthogonal directions $v_1, v_2, \ldots, v_d$ such that:
- $v_1$ captures the maximum variance
- $v_2$ captures the next maximum variance
- and so on
These directions are the eigenvectors of the covariance matrix. Their importance is given by the eigenvalues.
Main Equations
1. Standardization

Each feature is standardized using Z-score:

$$x' = \frac{x - \mu}{\sigma}$$

2. Covariance matrix

For centered data:

$$C = \frac{1}{n} X_s^T X_s$$

3. Eigen decomposition

$$C v = \lambda v$$

where:
- $v$ = eigenvector
- $\lambda$ = eigenvalue

4. Projection

If $W_k$ contains the top $k$ principal components, then:

$$X_{proj} = X_s W_k$$
Intuition
Each principal component is a direction in feature space. If data spreads a lot along a direction, that direction contains a lot of information. So PCA keeps the directions with the largest variance.
That is why PCA chooses eigenvectors with the largest eigenvalues.
Short and Clean Code
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA as SklearnPCA

class PCAFromScratch:
    def __init__(self, n_components):
        self.n_components = n_components
        self.mean_ = None
        self.std_ = None
        self.components_ = None
        self.eigenvalues_ = None
        self.explained_variance_ratio_ = None

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        Xs = (X - self.mean_) / self.std_
        C = (Xs.T @ Xs) / len(Xs)
        eigenvalues, eigenvectors = np.linalg.eigh(C)
        order = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[order]
        eigenvectors = eigenvectors[:, order]
        self.eigenvalues_ = eigenvalues[:self.n_components]
        self.components_ = eigenvectors[:, :self.n_components]
        self.explained_variance_ratio_ = eigenvalues / eigenvalues.sum()
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        Xs = (X - self.mean_) / self.std_
        return Xs @ self.components_

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

iris = load_iris()
X = iris.data

pca = PCAFromScratch(n_components=2)
X_proj = pca.fit_transform(X)

print("Top eigenvalues:", np.round(pca.eigenvalues_, 4))
print("Principal components:\n", np.round(pca.components_, 4))
print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 4))

sk_pca = SklearnPCA(n_components=2)
X_sk = sk_pca.fit_transform((X - X.mean(axis=0)) / X.std(axis=0))
print("Sklearn components:\n", np.round(sk_pca.components_.T, 4))
What This Code Does
This code:
- loads the Iris dataset
- standardizes all features
- computes the covariance matrix
- finds eigenvalues and eigenvectors
- sorts them from largest to smallest
- selects the top n_components
So 4-dimensional Iris data becomes 2-dimensional.
Dataset
The Iris dataset has:
- 150 samples
- 4 features
- 3 flower classes
The four features are:
- sepal length
- sepal width
- petal length
- petal width
PCA uses only the feature matrix $X \in \mathbb{R}^{150 \times 4}$.
Since PCA is unsupervised, labels are not needed.
Step-by-Step Algorithm
Step 1: Standardize the dataset
PCA is sensitive to scale. If one feature has larger values, it can dominate the variance.
So each column is standardized:

$$x' = \frac{x - \mu}{\sigma}$$
Code:
self.mean_ = X.mean(axis=0)
self.std_ = X.std(axis=0)
Xs = (X - self.mean_) / self.std_
Concept:
- subtract column mean
- divide by column standard deviation
- now every feature has roughly comparable scale
Why important:
- PCA is variance-based
- variance depends on feature scale
- standardization ensures fairness among features
Step 2: Compute covariance matrix
The covariance matrix measures how features vary together.
Equation: $C = \dfrac{1}{n} X_s^T X_s$
Code:
C = (Xs.T @ Xs) / len(Xs)
Concept:
- diagonal entries = variance of each standardized feature
- off-diagonal entries = covariance between pairs of features
For Iris: $C \in \mathbb{R}^{4 \times 4}$, because there are 4 original features.
Step 3: Find eigenvalues and eigenvectors
PCA solves: $C v = \lambda v$
Code:
eigenvalues, eigenvectors = np.linalg.eigh(C)
Concept:
- each eigenvector gives a direction
- each eigenvalue tells how much variance is captured in that direction
Why eigh and not eig:
- the covariance matrix is symmetric
- np.linalg.eigh is better suited to symmetric matrices

Interpretation:
- large eigenvalue -> important component
- small eigenvalue -> less important component
Step 4: Sort eigenvalues and eigenvectors
The most useful components are the ones with the largest eigenvalues.
Code:
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
Concept:
- argsort sorts indices
- [::-1] reverses the order to descending
- now the first column of eigenvectors is the first principal component
Step 5: Select top principal components

If we want only $k$ components:
Code:
self.eigenvalues_ = eigenvalues[:self.n_components]
self.components_ = eigenvectors[:, :self.n_components]
Concept:
- keep only the dominant directions
- discard weaker directions
- dimension reduces from $d$ to $k$

For example:
- original data: 4 features
- choose 2 components
- reduced data: 2 features
Step 6: Project data onto the new space
Projection formula:

$$X_{proj} = X_s W_k$$
Code:
return Xs @ self.components_
Concept:
- each sample is re-expressed in terms of principal components
- this gives lower-dimensional data
- information loss is minimized as much as possible for the chosen number of components
If $X_s \in \mathbb{R}^{150 \times 4}$ and $W_k \in \mathbb{R}^{4 \times 2}$, then $X_{proj} \in \mathbb{R}^{150 \times 2}$.
Concept -> Equation -> Code Mapping
1. Equalize feature scales
Concept: All features should contribute fairly.
Equation: $x' = \dfrac{x - \mu}{\sigma}$
Code:
self.mean_ = X.mean(axis=0)
self.std_ = X.std(axis=0)
Xs = (X - self.mean_) / self.std_
2. Measure variance structure
Concept: We need a matrix that summarizes how features vary together.
Equation: $C = \dfrac{1}{n} X_s^T X_s$
Code:
C = (Xs.T @ Xs) / len(Xs)
3. Find important directions
Concept: The best projection directions are the eigenvectors of the covariance matrix.
Equation: $C v = \lambda v$
Code:
eigenvalues, eigenvectors = np.linalg.eigh(C)
4. Rank the directions
Concept: Directions with larger variance are more useful.
Equation: sort so that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$
Code:
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
5. Keep only the top directions
Concept: Dimensionality reduction means retaining only the most informative directions.
Equation: $W_k = [v_1 \; v_2 \; \cdots \; v_k]$
Code:
self.components_ = eigenvectors[:, :self.n_components]
6. Project data
Concept: Convert old features into principal component coordinates.
Equation: $X_{proj} = X_s W_k$
Code:
return Xs @ self.components_
Why Eigenvectors and Eigenvalues Appear
PCA wants to maximize the variance of the projected data. If we project onto a unit vector $w$, the projected variance is:

$$\text{Var}(X_s w) = w^T C w \quad \text{subject to} \quad w^T w = 1$$

This is an optimization problem. Using Lagrange multipliers:

$$\mathcal{L}(w, \lambda) = w^T C w - \lambda (w^T w - 1)$$

Taking the derivative and setting it to zero gives:

$$C w = \lambda w$$

So:
- the best directions are eigenvectors of $C$
- the amount of retained variance is given by eigenvalues

This is the mathematical reason behind PCA.

Why the Largest Eigenvalue Matters

The projected variance along direction $w$ is $w^T C w$. For an eigenvector, $C w = \lambda w$, so:

$$w^T C w = w^T (\lambda w) = \lambda \, w^T w$$

Since $w^T w = 1$, we get:

$$w^T C w = \lambda$$

Therefore:
- the projected variance equals the eigenvalue
- maximizing variance means choosing the largest eigenvalue

That is why PCA selects top eigenvalues first.
Explained Variance Ratio
A useful quantity is:

$$\text{explained variance ratio}_i = \frac{\lambda_i}{\sum_j \lambda_j}$$
Code:
self.explained_variance_ratio_ = eigenvalues / eigenvalues.sum()
Concept: This tells how much total information each principal component retains.
Example interpretation:
- PC1 = 72%
- PC2 = 23%
- then first two PCs keep 95% of the total variance
This helps decide how many components to keep.
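The decision is usually made from the cumulative ratio. A sketch with hypothetical eigenvalues for a 4-feature covariance matrix (illustrative numbers, not computed from Iris here):

```python
import numpy as np

# Hypothetical eigenvalues, largest first (illustrative only)
eigenvalues = np.array([2.91, 0.92, 0.15, 0.02])

ratio = eigenvalues / eigenvalues.sum()  # share of variance per component
cumulative = np.cumsum(ratio)            # variance kept by the first k components
print(np.round(ratio, 4))
print(np.round(cumulative, 4))
```

Here the first two components keep about 95.75% of the variance, so keeping $k = 2$ would be a reasonable choice.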
Worked Mini Example
Suppose after covariance computation, the eigenvalues are (illustrative values):

$$\lambda_1 = 2.91, \quad \lambda_2 = 0.92, \quad \lambda_3 = 0.15, \quad \lambda_4 = 0.02$$

Then:
- the first principal component captures the most variance
- the second captures the next most
- the third and fourth contribute little

Total variance: $2.91 + 0.92 + 0.15 + 0.02 = 4.00$

Explained variance ratios: $2.91/4 = 0.7275$, $\; 0.92/4 = 0.23$, $\; 0.15/4 = 0.0375$, $\; 0.02/4 = 0.005$

So:
- PC1 keeps about 72.75%
- PC2 keeps about 23%
- the first two together keep about 95.75%
This means reducing from 4D to 2D is very reasonable.
Understanding the Components Matrix
If the selected components are $W_2 = [v_1 \; v_2]$, then:
- first column = first principal component
- second column = second principal component

Each column shows how the original features combine to form the new axis.

For example, the first component is:

$$\text{PC}_1 = v_{11} x_1 + v_{21} x_2 + v_{31} x_3 + v_{41} x_4$$

So a principal component is a linear combination of the original features.
Comparing with Scikit-Learn
Code:
sk_pca = SklearnPCA(n_components=2)
X_sk = sk_pca.fit_transform((X - X.mean(axis=0)) / X.std(axis=0))
print(np.round(sk_pca.components_.T, 4))
Concept: This checks whether the scratch implementation gives similar principal components.
Important note: Principal components may differ by sign. That means if one library gives $v$, another may give $-v$. This is still correct because both represent the same direction.
So when comparing PCA outputs, sign flips are normal.
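One common trick for comparing components across implementations is to flip each vector so its largest-magnitude entry is positive before comparing (a sketch with made-up component values):

```python
import numpy as np

v1 = np.array([0.52, -0.27, 0.58, 0.56])  # component from one implementation (made up)
v2 = -v1                                  # same direction, opposite sign

def align(v):
    # Flip the vector so its largest-magnitude entry is positive
    return v * np.sign(v[np.argmax(np.abs(v))])

print(np.allclose(align(v1), align(v2)))  # True
```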
Why PCA Works
PCA works because:
- it identifies directions of maximum spread
- those directions preserve the most information
- the directions are orthogonal, so they do not duplicate information
- low-variance directions can often be removed with minimal information loss
So PCA compresses data while keeping the most useful structure.
Limitations
1. PCA is linear
PCA only finds linear combinations of features. If structure is highly non-linear, PCA may not capture it well.
2. PCA is sensitive to scale
Without standardization, large-scale features dominate.
3. Components may be hard to interpret
The new axes are combinations of original features, so they may not be as interpretable as raw columns.
4. Variance does not always mean usefulness
PCA keeps directions with high variance, but high variance does not always mean high predictive importance for a target variable.
Exam-Oriented Summary
Definition
PCA is an unsupervised dimensionality reduction technique that transforms correlated features into orthogonal principal components.
Goal
Reduce the number of features while retaining maximum variance.
Steps
- standardize data
- compute covariance matrix
- find eigenvalues and eigenvectors
- sort them in descending order
- select top components
- project data onto them
Important Equations
Standardization: $x' = (x - \mu)/\sigma$

Covariance: $C = \frac{1}{n} X_s^T X_s$

Eigen equation: $C v = \lambda v$

Projection: $X_{proj} = X_s W_k$

Explained variance ratio: $\lambda_i / \sum_j \lambda_j$
Interpretation
- eigenvectors = directions of principal components
- eigenvalues = variance captured by those directions
- larger eigenvalue = more important component
Very Short Revision
PCA reduces dimensions by:
- standardizing data
- computing covariance matrix
- finding eigenvectors/eigenvalues
- sorting by largest eigenvalue
- keeping top components
- projecting data onto them
Main idea: maximize the projected variance $w^T C w$ subject to $w^T w = 1$; the solutions are the eigenvectors of the covariance matrix.
Final Takeaway
PCA transforms high-dimensional data into a lower-dimensional form by finding the most informative orthogonal directions. These directions are the eigenvectors of the covariance matrix, and their importance is measured by eigenvalues. The larger the eigenvalue, the more variance that principal component preserves, so the better it is for dimensionality reduction.
K-Nearest Neighbor
KNN is an instance-based or lazy learning algorithm. It does not learn an explicit equation during training. Instead, it:
- stores the training data
- waits until a new test point comes
- finds the nearest stored examples
- predicts using those neighbors
There are two common versions:
- KNN Classification -> predict the most common class
- KNN Regression -> predict the mean of neighbor target values
Main Idea
Each sample is a point in $d$-dimensional space: $x = (x_1, x_2, \ldots, x_d)$. To predict for a new point, KNN measures the distance from that point to all training points.

The most common distance used is Euclidean distance:

$$d(a, b) = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2}$$
Then:
- in classification, choose the majority class among the $k$ nearest neighbors
- in regression, choose the average target among the $k$ nearest neighbors
Part A: KNN Classification
What It Does
Given a new test point:
- compute its distance to every training point
- sort distances
- pick the nearest $k$
- return the majority class

So the prediction rule is:

$$\hat{y} = \text{mode}\{y_{(1)}, y_{(2)}, \ldots, y_{(k)}\}$$
Short and Clean KNN Classification Code
import numpy as np
from collections import Counter

class KNNClassifier:
    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)

    def _distance(self, a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def _predict_one(self, x):
        distances = [self._distance(x, x_train) for x_train in self.X_train]
        k_idx = np.argsort(distances)[:self.k]
        k_labels = self.y_train[k_idx]
        return Counter(k_labels).most_common(1)[0][0]

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return np.array([self._predict_one(x) for x in X])

X_train = np.array([
    [1, 1],
    [2, 1],
    [4, 3],
    [5, 4],
    [6, 5]
], dtype=float)
y_train = np.array([0, 0, 1, 1, 1])

X_test = np.array([
    [3, 2],
    [5, 5]
], dtype=float)

clf = KNNClassifier(k=3)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("Predictions:", pred)
Small Classification Example
Training data: $(1,1) \to 0$, $(2,1) \to 0$, $(4,3) \to 1$, $(5,4) \to 1$, $(6,5) \to 1$

Test point: $(3, 2)$

Take: $k = 3$
Solving the Classification Example Step by Step
We compute the distance from $(3, 2)$ to every training point.

Distance to $(1,1)$: $\sqrt{(3-1)^2 + (2-1)^2} = \sqrt{5} \approx 2.236$

Distance to $(2,1)$: $\sqrt{(3-2)^2 + (2-1)^2} = \sqrt{2} \approx 1.414$

Distance to $(4,3)$: $\sqrt{(3-4)^2 + (2-3)^2} = \sqrt{2} \approx 1.414$

Distance to $(5,4)$: $\sqrt{(3-5)^2 + (2-4)^2} = \sqrt{8} \approx 2.828$

Distance to $(6,5)$: $\sqrt{(3-6)^2 + (2-5)^2} = \sqrt{18} \approx 4.243$

Now sort the distances:

| Point | Label | Distance |
|---|---|---|
| (2, 1) | 0 | 1.414 |
| (4, 3) | 1 | 1.414 |
| (1, 1) | 0 | 2.236 |
| (5, 4) | 1 | 2.828 |
| (6, 5) | 1 | 4.243 |

Nearest 3 labels: $\{0, 1, 0\}$

Majority class: $0$

So: $\hat{y} = 0$ for the test point $(3, 2)$.
That is exactly how KNN classification works.
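The worked example above can be reproduced with a few lines of numpy, using the same training points as the code earlier in this section:

```python
import numpy as np
from collections import Counter

X_train = np.array([[1, 1], [2, 1], [4, 3], [5, 4], [6, 5]], dtype=float)
y_train = np.array([0, 0, 1, 1, 1])
x = np.array([3, 2], dtype=float)

# Euclidean distance from the test point to every training point
distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
k_idx = np.argsort(distances)[:3]   # indices of the 3 nearest points
k_labels = y_train[k_idx]
pred = Counter(k_labels).most_common(1)[0][0]
print(np.round(distances, 3), k_labels, pred)
```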
KNN Classification Algorithm
Step 1: Store training data
Unlike linear models, KNN does not compute weights during training.
Code:
def fit(self, X, y):
    self.X_train = np.asarray(X, dtype=float)
    self.y_train = np.asarray(y)
Concept: Training in KNN simply means memorizing the dataset.
Step 2: Compute distance
Equation: $d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$
Code:
def _distance(self, a, b):
    return np.sqrt(np.sum((a - b) ** 2))
Concept: This measures how close a training sample is to the test sample.
Smaller distance means more similar.
Step 3: Rank neighbors
Code:
distances = [self._distance(x, x_train) for x_train in self.X_train]
k_idx = np.argsort(distances)[:self.k]
Concept:
- compute all distances
- sort them
- keep the indices of the nearest $k$
Step 4: Vote by majority
Code:
k_labels = self.y_train[k_idx]
return Counter(k_labels).most_common(1)[0][0]
Concept: Among the $k$ nearest neighbors, whichever class occurs most is chosen.

Equation: $\hat{y} = \text{mode}\{y_{(1)}, \ldots, y_{(k)}\}$
Concept -> Equation -> Code Mapping for Classification
1. Represent samples as points
Concept: Each row is a point in feature space.
Equation: $x = (x_1, x_2, \ldots, x_d)$
Code:
self.X_train = np.asarray(X, dtype=float)
2. Measure closeness
Concept: Similarity is measured using Euclidean distance.
Equation: $d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$
Code:
def _distance(self, a, b):
    return np.sqrt(np.sum((a - b) ** 2))
3. Select nearest neighbors
Concept: Prediction depends only on nearby training samples.
Code:
k_idx = np.argsort(distances)[:self.k]
4. Use majority vote
Concept: Classification chooses the most frequent class.
Equation: $\hat{y} = \text{mode}\{y_{(1)}, \ldots, y_{(k)}\}$
Code:
return Counter(k_labels).most_common(1)[0][0]
Why KNN Classification Works
The assumption is:
points that are close in feature space tend to have the same class
So instead of learning a global rule, KNN makes a local decision around the test point.
That is why it is called instance-based learning.
Part B: KNN Regression
What It Does
KNN regression follows the same steps as classification:
- compute distance to all training points
- choose the nearest $k$
- average their target values

Prediction rule:

$$\hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_{(i)}$$

where $y_{(i)}$ are the target values of the $k$ nearest neighbors.
Short and Clean KNN Regression Code
import numpy as np

class KNNRegressor:
    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y, dtype=float)

    def _distance(self, a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def _predict_one(self, x):
        distances = [self._distance(x, x_train) for x_train in self.X_train]
        k_idx = np.argsort(distances)[:self.k]
        k_values = self.y_train[k_idx]
        return np.mean(k_values)

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return np.array([self._predict_one(x) for x in X])

X_train = np.array([[1], [2], [3], [6], [7]], dtype=float)
y_train = np.array([30000, 35000, 40000, 70000, 75000], dtype=float)
X_test = np.array([[4]], dtype=float)

reg = KNNRegressor(k=3)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
print("Prediction:", pred)
Small Regression Example
Suppose training data is:
| YearsExperience | Salary |
|---|---|
| 1 | 30000 |
| 2 | 35000 |
| 3 | 40000 |
| 6 | 70000 |
| 7 | 75000 |
Test point: YearsExperience $= 4$

Take: $k = 3$
Solving the Regression Example Step by Step
We compute the distance from 4 to each training point.
Distance to 1
Distance to 2
Distance to 3
Distance to 6
Distance to 7
Sorted distances:
| Point | Target | Distance |
|---|---|---|
| 3 | 40000 | 1 |
| 2 | 35000 | 2 |
| 6 | 70000 | 2 |
| 1 | 30000 | 3 |
| 7 | 75000 | 3 |
Nearest 3 target values: $40000, 35000, 70000$
Prediction is their mean: $\hat{y} = \frac{40000 + 35000 + 70000}{3} = \frac{145000}{3} \approx 48333.33$
So the predicted salary is approximately $48333.33$.
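The arithmetic above can be checked in a few lines of numpy (a small standalone sketch, not part of the class):

```python
import numpy as np

# distances from the test point x = 4 to each training value
X_train = np.array([1, 2, 3, 6, 7])
y_train = np.array([30000, 35000, 40000, 70000, 75000])
dist = np.abs(X_train - 4)

# indices of the k = 3 nearest neighbors, then the mean of their targets
k_idx = np.argsort(dist)[:3]
pred = y_train[k_idx].mean()
print(round(float(pred), 2))  # -> 48333.33
```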
KNN Regression Algorithm
Step 1: Store training data
Code:
```python
def fit(self, X, y):
    self.X_train = np.asarray(X, dtype=float)
    self.y_train = np.asarray(y, dtype=float)
```
Concept: Just store examples and target values.
Step 2: Compute all distances
Equation: $d(x, x_i) = \sqrt{\sum_j (x_j - x_{ij})^2}$
Code:
```python
distances = [self._distance(x, x_train) for x_train in self.X_train]
```
Concept: Measure how close the new point is to all training samples.
Step 3: Pick the $k$ nearest
Code:
```python
k_idx = np.argsort(distances)[:self.k]
```
Concept: Keep the $k$ closest neighbors only.
Step 4: Average their target values
Equation: $\hat{y} = \frac{1}{k}\sum_{i=1}^{k} y_{(i)}$
Code:
```python
k_values = self.y_train[k_idx]
return np.mean(k_values)
```
Concept: Regression prediction is the average of nearby outputs.
Concept -> Equation -> Code Mapping for Regression
1. Use the same distance idea
Concept: Near points should have similar target values.
Equation: $d(a, b) = \sqrt{\sum_j (a_j - b_j)^2}$
Code:
```python
def _distance(self, a, b):
    return np.sqrt(np.sum((a - b) ** 2))
```
2. Find local neighborhood
Concept: Prediction uses local samples rather than a global fitted line.
Code:
```python
k_idx = np.argsort(distances)[:self.k]
```
3. Average local outputs
Concept: Regression uses the mean of neighbor target values.
Equation: $\hat{y} = \frac{1}{k}\sum_{i=1}^{k} y_{(i)}$
Code:
```python
return np.mean(k_values)
```
KNN Classification vs KNN Regression
| Aspect | KNN Classification | KNN Regression |
|---|---|---|
| Output type | class label | continuous value |
| Final rule | majority vote | mean of neighbors |
| Formula | mode of nearest labels | average of nearest targets |
Unified Intuition
KNN does not learn a global formula like $\hat{y} = \theta_0 + \theta_1 x$ or a fixed set of rules.
Instead, it says:
for this new point, let me look around nearby training points and decide locally
So every test sample gets its own local prediction rule.
That is why KNN is called:
- lazy learning
- instance-based learning
Choosing the Value of $k$
The value of $k$ controls smoothness.
Small $k$
- sensitive to noise
- more flexible
- may overfit
Large $k$
- smoother decision
- may underfit
- local details may be lost
A common rough rule of thumb: $k \approx \sqrt{n}$, where $n$ is the number of training samples.
But in practice, $k$ is usually chosen using validation.
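A minimal hold-out validation loop for choosing $k$ might look like this; the data, the split, and the candidate values of $k$ are all illustrative assumptions:

```python
import numpy as np

def knn_predict(X_tr, y_tr, x, k):
    # prediction = mean target of the k nearest training points
    idx = np.argsort(np.abs(X_tr - x))[:k]
    return y_tr[idx].mean()

# toy 1-D data: y = 5x plus noise (illustrative)
rng = np.random.default_rng(0)
X = np.arange(1.0, 21.0)
y = 5 * X + rng.normal(0, 2, size=X.shape)

# random hold-out split: 15 points for training, 5 for validation
perm = rng.permutation(len(X))
tr, val = perm[:15], perm[15:]

# pick the k with the lowest validation mean squared error
best_k, best_mse = None, float("inf")
for k in (1, 3, 5, 7):
    preds = np.array([knn_predict(X[tr], y[tr], x, k) for x in X[val]])
    mse = np.mean((preds - y[val]) ** 2)
    if mse < best_mse:
        best_k, best_mse = k, mse
print("best k:", best_k)
```

A k-fold cross-validation would be more robust, but the single hold-out split keeps the sketch short.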
Why Feature Scaling Matters
KNN depends entirely on distance. So features with larger numeric ranges dominate the distance.
Example:
- Age may range from 20 to 80
- Glucose may range from 0 to 200
Then Glucose can dominate Euclidean distance.
So scaling is often important, for example standardization: $x' = \frac{x - \mu}{\sigma}$
Without scaling, nearest neighbors may be misleading.
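A common fix is z-score standardization before computing distances; a minimal sketch with assumed Age and Glucose columns:

```python
import numpy as np

# two features on very different scales: Age and Glucose (illustrative values)
X = np.array([[25.0, 180.0],
              [60.0,  90.0],
              [30.0, 170.0]])

# z-score scaling: subtract each column's mean, divide by its std
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_scaled = (X - mu) / sigma

# after scaling, every column has mean 0 and std 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, both features contribute comparably to the Euclidean distance.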
Time Cost of KNN
KNN is cheap during training but expensive during prediction.
Training
- just store the data
Prediction
For every test point:
- compute distance to all training points
- sort them
- then predict
So prediction can be costly when the training set is large.
This is one of the main disadvantages of KNN.
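In practice the per-query cost is usually reduced by vectorizing the distance computation (and, beyond that, by index structures such as KD-trees or ball trees). This sketch shows that a single broadcasted numpy expression matches the per-point Python loop:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))  # 1000 training points, 2 features
x = np.array([0.5, -0.5])             # one query point

# loop version: one Python-level distance computation per training point
d_loop = np.array([np.sqrt(np.sum((x - p) ** 2)) for p in X_train])

# vectorized version: one broadcasted operation over all points at once
d_vec = np.sqrt(np.sum((X_train - x) ** 2, axis=1))

assert np.allclose(d_loop, d_vec)
```

Vectorization does not change the asymptotic cost, but it removes the Python interpreter overhead from the inner loop.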
Advantages of KNN
- very simple
- no training optimization needed
- easy to understand
- works for both classification and regression
- naturally handles multi-class classification
Limitations of KNN
- prediction is slow on large datasets
- sensitive to feature scale
- sensitive to irrelevant features
- choice of $k$ matters a lot
- can perform poorly in very high dimensions
This last issue is related to the curse of dimensionality.
Exam-Oriented Summary
Definition
KNN is an instance-based supervised learning algorithm that predicts a new sample using the nearest stored training samples.
Distance Formula
$d(x, x_i) = \sqrt{\sum_j (x_j - x_{ij})^2}$
Classification Rule
$\hat{y} = \operatorname{mode}(y_{(1)}, \dots, y_{(k)})$
Regression Rule
$\hat{y} = \frac{1}{k}\sum_{i=1}^{k} y_{(i)}$
Steps
- store training data
- compute distance from test point to all training points
- sort by distance
- select the $k$ nearest
- classify by majority vote or regress by mean
Very Short Revision
KNN Classification
- find the $k$ nearest
- take majority class
- output class label
KNN Regression
- find the $k$ nearest
- take mean of target values
- output continuous value
Main formula: $d(x, x_i) = \sqrt{\sum_j (x_j - x_{ij})^2}$
Final Takeaway
KNN is one of the simplest machine learning algorithms. It does not build a model during training, but predicts by comparing a new point to stored training examples. For classification, it uses the majority class of the nearest neighbors. For regression, it uses the mean target value of the nearest neighbors. Its simplicity makes it excellent for understanding local learning, but its prediction cost and sensitivity to scaling are important limitations.
Decision Tree (ID3 & C4.5)
Both ID3 and C4.5 are decision tree algorithms used for classification. They build a tree top-down by repeatedly choosing the best feature to split the dataset.
A decision tree has:
- root/internal nodes = feature tests
- branches = test outcomes
- leaf nodes = final class labels
The key difference is:
- ID3 chooses splits using Information Gain
- C4.5 improves ID3 by using Gain Ratio and can handle continuous features better
1. Decision Tree Idea
At each step, we want to ask the best question about the data. A good split should make the child groups more pure.
If a node contains both classes mixed together, it is impure. If a node contains only one class, it is pure.
Decision trees reduce impurity step by step until leaves are formed.
Part A: ID3
What ID3 Does
ID3 = Iterative Dichotomiser 3. It builds the tree recursively:
- compute entropy of current dataset
- compute information gain of every feature
- choose the feature with the highest information gain
- split the data on that feature
- repeat for each subset
ID3 mainly works best with categorical features.
ID3 Core Equations
Entropy
Entropy measures uncertainty in a dataset: $H(S) = -\sum_i p_i \log_2 p_i$, where:
- $S$ = current dataset
- $p_i$ = proportion of class $i$
For binary classification: $H(S) = -p \log_2 p - (1 - p) \log_2 (1 - p)$
Meaning:
- entropy = 0 -> perfectly pure
- high entropy -> classes are mixed
Information Gain
If we split dataset $S$ using feature $A$: $IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v)$, where:
- $S_v$ = subset of $S$ where feature $A$ takes value $v$
Meaning:
- higher information gain = better split
- ID3 chooses the feature with maximum information gain
Short and Clean ID3 Code
```python
import pandas as pd
import numpy as np

class ID3:
    def __init__(self):
        self.tree = None
        self.default_class = None

    def entropy(self, y):
        values, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p + 1e-12))

    def information_gain(self, X, y, feature):
        parent_entropy = self.entropy(y)
        values, counts = np.unique(X[feature], return_counts=True)
        child_entropy = 0
        for v, c in zip(values, counts):
            y_sub = y[X[feature] == v]
            child_entropy += (c / len(X)) * self.entropy(y_sub)
        return parent_entropy - child_entropy

    def best_feature(self, X, y):
        gains = {f: self.information_gain(X, y, f) for f in X.columns}
        return max(gains, key=gains.get)

    def build(self, X, y):
        if len(np.unique(y)) == 1:
            return y.iloc[0]
        if X.empty:
            return y.mode()[0]
        best = self.best_feature(X, y)
        tree = {best: {}}
        for v in X[best].unique():
            mask = X[best] == v
            X_sub = X.loc[mask].drop(columns=[best])
            y_sub = y.loc[mask]
            if len(X_sub) == 0:
                tree[best][v] = y.mode()[0]
            else:
                tree[best][v] = self.build(X_sub, y_sub)
        return tree

    def fit(self, X, y):
        self.default_class = y.mode()[0]
        self.tree = self.build(X, y)
        return self

    def _predict_one(self, x, tree):
        if not isinstance(tree, dict):
            return tree
        feature = next(iter(tree))
        value = x.get(feature)
        if value not in tree[feature]:
            return self.default_class
        return self._predict_one(x, tree[feature][value])

    def predict(self, X):
        return X.apply(lambda row: self._predict_one(row, self.tree), axis=1)
```
Small Example for ID3
We use a tiny categorical dataset:
```python
data = pd.DataFrame({
    "Weather": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak"],
    "Play": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"]
})
X = data[["Weather", "Wind"]]
y = data["Play"]

model = ID3().fit(X, y)
print("ID3 Tree:", model.tree)
print("Predictions:", model.predict(X).tolist())
```
Solving the ID3 Example Step by Step
Dataset target classes:
- Yes = 4
- No = 4
So root entropy: $H(S) = -\frac{4}{8}\log_2\frac{4}{8} - \frac{4}{8}\log_2\frac{4}{8} = 1.0$
So the root is maximally mixed.
Step 1: Try splitting on Weather
Possible values:
- Sunny
- Overcast
- Rain
Subsets:
- Sunny -> [No, No, No]
- Overcast -> [Yes, Yes]
- Rain -> [Yes, Yes, No]
Entropies:
Sunny subset
All are No, so: $H = 0$
Overcast subset
All are Yes, so: $H = 0$
Rain subset
2 Yes, 1 No: $H = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} \approx 0.918$
Weighted entropy after splitting on Weather: $\frac{3}{8}(0) + \frac{2}{8}(0) + \frac{3}{8}(0.918) \approx 0.344$
Information gain: $IG(\text{Weather}) = 1.0 - 0.344 = 0.656$
Step 2: Try splitting on Wind
Possible values:
- Weak
- Strong
Subsets:
- Weak -> [No, Yes, Yes, Yes, No]
- Strong -> [No, No, Yes]
Entropies:
Weak
3 Yes, 2 No: $H = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.971$
Strong
1 Yes, 2 No: $H = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} \approx 0.918$
Weighted entropy: $\frac{5}{8}(0.971) + \frac{3}{8}(0.918) \approx 0.951$
Information gain: $IG(\text{Wind}) = 1.0 - 0.951 = 0.049$
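These hand calculations can be verified numerically; this is a standalone sketch (the helper functions are my own, not methods of the ID3 class above):

```python
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, labels):
    # parent entropy minus the size-weighted entropy of each subset
    parent = entropy(labels)
    child = 0.0
    for v in np.unique(feature):
        mask = feature == v
        child += mask.mean() * entropy(labels[mask])
    return parent - child

weather = np.array(["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny"])
wind = np.array(["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak"])
play = np.array(["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"])

print(round(float(info_gain(weather, play)), 3))  # -> 0.656
print(round(float(info_gain(wind, play)), 3))     # -> 0.049
```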
Step 3: Choose the best feature
Since $IG(\text{Weather}) = 0.656 > IG(\text{Wind}) = 0.049$, ID3 chooses Weather as the root feature.
So the first tree becomes:
```
Weather
├── Sunny -> No
├── Overcast -> Yes
└── Rain -> split again
```
Step 4: Recurse on the Rain subset
Rain subset:
- Weak -> Yes
- Weak -> Yes
- Strong -> No
Now only one feature remains: Wind
If split on Wind:
- Weak -> all Yes
- Strong -> all No
So final tree:
```
Weather
├── Sunny -> No
├── Overcast -> Yes
└── Rain
    ├── Weak -> Yes
    └── Strong -> No
```
This is exactly how ID3 builds the tree recursively.
ID3 Concept -> Equation -> Code
1. Measure impurity
Concept: We first measure how mixed the labels are.
Equation: $H(S) = -\sum_i p_i \log_2 p_i$
Code:
```python
def entropy(self, y):
    values, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p + 1e-12))
```
2. Score each feature
Concept: Check how much uncertainty reduces after splitting on a feature.
Equation: $IG(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$
Code:
```python
def information_gain(self, X, y, feature):
    parent_entropy = self.entropy(y)
    values, counts = np.unique(X[feature], return_counts=True)
    child_entropy = 0
    for v, c in zip(values, counts):
        y_sub = y[X[feature] == v]
        child_entropy += (c / len(X)) * self.entropy(y_sub)
    return parent_entropy - child_entropy
```
3. Pick the best feature
Concept: Choose the feature with maximum information gain.
Code:
```python
def best_feature(self, X, y):
    gains = {f: self.information_gain(X, y, f) for f in X.columns}
    return max(gains, key=gains.get)
```
4. Build tree recursively
Concept: After choosing the best feature, split data and build smaller trees.
Code:
```python
def build(self, X, y):
    if len(np.unique(y)) == 1:
        return y.iloc[0]
    if X.empty:
        return y.mode()[0]
    best = self.best_feature(X, y)
    tree = {best: {}}
    for v in X[best].unique():
        mask = X[best] == v
        X_sub = X.loc[mask].drop(columns=[best])
        y_sub = y.loc[mask]
        tree[best][v] = self.build(X_sub, y_sub)
    return tree
```
Meaning:
- if node is pure -> make leaf
- if no features left -> return majority class
- else split and recurse
ID3 Pseudocode
```
ID3(D, features):
    if all labels same:
        return leaf
    if no features left:
        return majority class
    best = feature with max information gain
    create node(best)
    for each value v of best:
        recurse on subset where best = v
```
ID3 Advantages
- simple and easy to understand
- tree rules are interpretable
- good for categorical data
- useful in exam explanations because steps are very clear
ID3 Limitations
- biased toward features with many distinct values
- not naturally suited for continuous features
- can overfit
- sensitive to noise
Part B: C4.5
What C4.5 Does
C4.5 is an improved version of ID3. It fixes major weaknesses of ID3.
Main improvements:
- uses Gain Ratio instead of pure Information Gain
- handles continuous features
- can handle missing values better
- usually produces more practical trees
So:
- ID3 = older, simpler
- C4.5 = smarter extension of ID3
Why ID3 Needs Improvement
Information Gain can favor a feature with many distinct values.
Example:
- a feature like StudentID may split every row into its own branch
- this gives very high information gain
- but it does not generalize
So C4.5 divides Information Gain by a quantity called Split Information.
C4.5 Core Equations
Entropy
Same as ID3: $H(S) = -\sum_i p_i \log_2 p_i$
Information Gain
Same as ID3: $IG(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$
Split Information
$SplitInfo(S, A) = -\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$
Gain Ratio
$GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)}$
C4.5 chooses the feature with the highest gain ratio.
Short and Clean C4.5 Code
```python
import pandas as pd
import numpy as np

class C45:
    def __init__(self):
        self.tree = None
        self.default_class = None

    def entropy(self, y):
        values, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p + 1e-12))

    def information_gain(self, X, y, feature):
        parent_entropy = self.entropy(y)
        values, counts = np.unique(X[feature], return_counts=True)
        child_entropy = 0
        for v, c in zip(values, counts):
            y_sub = y[X[feature] == v]
            child_entropy += (c / len(X)) * self.entropy(y_sub)
        return parent_entropy - child_entropy

    def split_info(self, X, feature):
        values, counts = np.unique(X[feature], return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p + 1e-12))

    def gain_ratio(self, X, y, feature):
        ig = self.information_gain(X, y, feature)
        si = self.split_info(X, feature)
        return 0 if si == 0 else ig / si

    def best_feature(self, X, y):
        ratios = {f: self.gain_ratio(X, y, f) for f in X.columns}
        return max(ratios, key=ratios.get)

    def build(self, X, y):
        if len(np.unique(y)) == 1:
            return y.iloc[0]
        if X.empty:
            return y.mode()[0]
        best = self.best_feature(X, y)
        tree = {best: {}}
        for v in X[best].unique():
            mask = X[best] == v
            X_sub = X.loc[mask].drop(columns=[best])
            y_sub = y.loc[mask]
            if len(X_sub) == 0:
                tree[best][v] = y.mode()[0]
            else:
                tree[best][v] = self.build(X_sub, y_sub)
        return tree

    def fit(self, X, y):
        self.default_class = y.mode()[0]
        self.tree = self.build(X, y)
        return self

    def _predict_one(self, x, tree):
        if not isinstance(tree, dict):
            return tree
        feature = next(iter(tree))
        value = x.get(feature)
        if value not in tree[feature]:
            return self.default_class
        return self._predict_one(x, tree[feature][value])

    def predict(self, X):
        return X.apply(lambda row: self._predict_one(row, self.tree), axis=1)
```
Small Example for C4.5
We use a dataset with a high-cardinality feature to see why Gain Ratio helps:
```python
data = pd.DataFrame({
    "ID": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "Weather": ["Sunny", "Sunny", "Rain", "Rain", "Overcast", "Overcast"],
    "Play": ["No", "No", "Yes", "Yes", "Yes", "Yes"]
})
X = data[["ID", "Weather"]]
y = data["Play"]

model = C45().fit(X, y)
print("C4.5 Tree:", model.tree)
print("Predictions:", model.predict(X).tolist())
```
Solving the C4.5 Example Step by Step
Root labels:
- Yes = 4
- No = 2
Entropy: $H(S) = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} \approx 0.918$
Feature 1: ID
Every ID is unique. So each split contains exactly 1 sample. That means every child subset is pure.
Weighted child entropy: $0$. Thus: $IG(\text{ID}) = 0.918 - 0 = 0.918$
This looks perfect for ID3.
But it is misleading because ID is just a unique label.
Now compute split information. Since there are 6 equally-sized branches: $SplitInfo(\text{ID}) = -6 \cdot \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.585$
So gain ratio: $GainRatio(\text{ID}) = \frac{0.918}{2.585} \approx 0.355$
Feature 2: Weather
Values:
- Sunny -> [No, No]
- Rain -> [Yes, Yes]
- Overcast -> [Yes, Yes]
All child subsets are pure, so: $IG(\text{Weather}) = 0.918 - 0 = 0.918$
Split information: there are 3 equally-sized groups of size 2: $SplitInfo(\text{Weather}) = -3 \cdot \frac{1}{3}\log_2\frac{1}{3} = \log_2 3 \approx 1.585$
Gain ratio: $GainRatio(\text{Weather}) = \frac{0.918}{1.585} \approx 0.579$
Choose the best feature
Although both features have equal information gain, gain ratio is larger for Weather: $0.579 > 0.355$
So C4.5 correctly chooses Weather instead of ID.
This is the key improvement over ID3.
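The two gain ratios can be verified numerically; this is a standalone sketch (the helper functions are my own, not methods of the C45 class above):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels):
    # gain ratio = (parent entropy - weighted child entropy) / split information
    parent = entropy(labels)
    child, split_info = 0.0, 0.0
    for v in np.unique(feature):
        w = np.mean(feature == v)
        child += w * entropy(labels[feature == v])
        split_info -= w * np.log2(w)
    return (parent - child) / split_info

ids = np.array(["S1", "S2", "S3", "S4", "S5", "S6"])
weather = np.array(["Sunny", "Sunny", "Rain", "Rain", "Overcast", "Overcast"])
play = np.array(["No", "No", "Yes", "Yes", "Yes", "Yes"])

print(round(float(gain_ratio(ids, play)), 3))      # -> 0.355
print(round(float(gain_ratio(weather, play)), 3))  # -> 0.579
```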
C4.5 Concept -> Equation -> Code
1. Compute entropy
Concept: Measure uncertainty at the current node.
Equation: $H(S) = -\sum_i p_i \log_2 p_i$
Code:
```python
def entropy(self, y):
    values, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p + 1e-12))
```
2. Compute information gain
Concept: Measure reduction in entropy after split.
Equation: $IG(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$
Code:
```python
def information_gain(self, X, y, feature):
    parent_entropy = self.entropy(y)
    ...
    return parent_entropy - child_entropy
```
3. Compute split information
Concept: Measure how broadly the split divides the data.
Equation: $SplitInfo(S, A) = -\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$
Code:
```python
def split_info(self, X, feature):
    values, counts = np.unique(X[feature], return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p + 1e-12))
```
4. Compute gain ratio
Concept: Normalize information gain so features with too many distinct values are not unfairly favored.
Equation: $GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)}$
Code:
```python
def gain_ratio(self, X, y, feature):
    ig = self.information_gain(X, y, feature)
    si = self.split_info(X, feature)
    return 0 if si == 0 else ig / si
```
5. Choose best feature and recurse
Concept: Build the decision tree exactly like ID3, but use Gain Ratio instead of Information Gain.
Code:
```python
def best_feature(self, X, y):
    ratios = {f: self.gain_ratio(X, y, f) for f in X.columns}
    return max(ratios, key=ratios.get)
```
Handling Continuous Features in C4.5
A major improvement of C4.5 is continuous-value handling. For a numeric feature, C4.5 sorts the values and tries candidate thresholds $t$ (typically midpoints between consecutive values), turning the feature into a binary test $x \le t$ vs $x > t$, and chooses the threshold with the best gain ratio.
Example split: $\text{Glucose} \le 120$ vs $\text{Glucose} > 120$
So unlike ID3, C4.5 can naturally work with continuous attributes by converting them into binary threshold splits.
A simplified threshold idea in code would be:
```python
threshold = 120
left = X[feature] <= threshold
right = X[feature] > threshold
```
Then entropy, IG, and GR are computed for that split.
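The threshold search can be sketched as scanning the midpoints between consecutive sorted values and keeping the best split. For brevity this version scores splits by information gain (C4.5 itself uses gain ratio), and `best_threshold` is an illustrative helper, not part of the C45 class:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(values, labels):
    # candidate thresholds: midpoints between consecutive sorted unique values
    vs = np.unique(values)
    best_t, best_gain = None, -1.0
    for t in (vs[:-1] + vs[1:]) / 2:
        left = labels[values <= t]
        right = labels[values > t]
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = entropy(labels) - child
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# illustrative data: Glucose values with a clean class boundary near 120
glucose = np.array([85, 90, 110, 130, 150, 160])
label = np.array(["No", "No", "No", "Yes", "Yes", "Yes"])
t, g = best_threshold(glucose, label)
print(t, round(float(g), 3))  # -> 120.0 1.0
```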
ID3 vs C4.5
Main Difference
ID3 chooses the feature with maximum $IG$. C4.5 chooses the feature with maximum $GainRatio = \frac{IG}{SplitInfo}$.
Comparison Table
| Aspect | ID3 | C4.5 |
|---|---|---|
| Split criterion | Information Gain | Gain Ratio |
| Handles categorical features | Yes | Yes |
| Handles continuous features | Poorly / not directly | Yes |
| Bias toward many-valued features | High | Reduced |
| Missing values | Weak | Better |
| Complexity | Simpler | Slightly more advanced |
Full Combined Example
```python
import pandas as pd

data = pd.DataFrame({
    "Weather": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak"],
    "Play": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"]
})
X = data[["Weather", "Wind"]]
y = data["Play"]

id3_model = ID3().fit(X, y)
c45_model = C45().fit(X, y)
print("ID3 Tree:", id3_model.tree)
print("C4.5 Tree:", c45_model.tree)
print("ID3 Predictions:", id3_model.predict(X).tolist())
print("C4.5 Predictions:", c45_model.predict(X).tolist())
```
How the Tree Predicts
Suppose tree is:
```
Weather
├── Sunny -> No
├── Overcast -> Yes
└── Rain
    ├── Weak -> Yes
    └── Strong -> No
```
For a new row:
Weather = Rain, Wind = Weak
Path:
- check Weather
- move to the Rain branch
- check Wind
- Weak -> leaf = Yes
So the predicted class is: Yes
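This walk can be written directly against the nested-dict tree format produced by the ID3/C4.5 code above; `predict_one` is an illustrative helper:

```python
tree = {"Weather": {"Sunny": "No",
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}}}}

def predict_one(row, node):
    # descend until a leaf (a plain label, not a dict) is reached
    while isinstance(node, dict):
        feature = next(iter(node))
        node = node[feature][row[feature]]
    return node

print(predict_one({"Weather": "Rain", "Wind": "Weak"}, tree))  # -> Yes
```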
Why These Algorithms Work
Both algorithms work by repeatedly reducing uncertainty. They ask:
which feature makes the class labels as pure as possible after splitting?
ID3 answers this using information gain. C4.5 answers this using gain ratio.
The recursion stops when:
- node becomes pure
- no features remain
- or no useful split is possible
So a large classification problem becomes a sequence of small rule-based decisions.
Advantages of ID3 and C4.5
- easy to interpret
- produces human-readable rules
- no heavy math during prediction
- useful for categorical classification tasks
- good for exam answers because the logic is visual and recursive
Limitations
ID3
- biased toward attributes with many values
- struggles with continuous data
- can overfit
- sensitive to noise
C4.5
- more complex than ID3
- deeper trees can still overfit without pruning
- threshold search for continuous features adds computation
Exam-Oriented Summary
ID3 Definition
ID3 is a top-down greedy decision tree algorithm that chooses the feature with the highest information gain at each step.
C4.5 Definition
C4.5 is an improved decision tree algorithm that extends ID3 by using gain ratio and supporting continuous features.
ID3 Formula
Entropy: $H(S) = -\sum_i p_i \log_2 p_i$
Information Gain: $IG(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$
C4.5 Formula
Split Information: $SplitInfo(S, A) = -\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$
Gain Ratio: $GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)}$
Common Steps
- compute impurity of current dataset
- score every feature
- choose best feature
- split dataset
- repeat recursively
- stop when node is pure or features are exhausted
Very Short Revision
ID3
- uses entropy
- computes information gain
- picks feature with max gain
- recursive tree construction
C4.5
- starts like ID3
- adds split information
- uses gain ratio
- handles continuous features better
Final Takeaway
ID3 and C4.5 both build decision trees by recursively selecting the best splitting feature. ID3 uses Information Gain, while C4.5 improves it using Gain Ratio to avoid unfair preference for features with many distinct values. So:
- use ID3 to understand the core decision tree idea
- use C4.5 as the more practical and improved version
Artificial Neural Network (ANN)
An Artificial Neural Network is a model made of layers of neurons. A basic ANN has:
- input layer
- hidden layer(s)
- output layer
- weights and biases
- activation functions
A neuron computes: $z = \sum_j w_j x_j + b$ and then applies an activation function: $a = f(z)$
For this example, we build a 2-layer neural network for binary classification:
- hidden layer uses ReLU
- output layer uses Sigmoid
The final output is a probability: $\hat{y} = \sigma(z) \in (0, 1)$, and the prediction is: class $1$ if $\hat{y} \ge 0.5$, else class $0$.
What This Network Learns
For input $x$, the network computes: $Z_1 = W_1 x + b_1$, $A_1 = \mathrm{ReLU}(Z_1)$, $Z_2 = W_2 A_1 + b_2$, $A_2 = \sigma(Z_2)$
Where:
- $Z_1, Z_2$ = linear outputs
- $A_1$ = hidden layer activations
- $A_2$ = final predicted probability
Short and Clean Code
```python
import numpy as np

class SimpleANN:
    def __init__(self, input_size, hidden_size, lr=0.1, epochs=10000):
        np.random.seed(42)
        self.lr = lr
        self.epochs = epochs
        self.W1 = np.random.randn(hidden_size, input_size) * 0.1
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(1, hidden_size) * 0.1
        self.b2 = np.zeros((1, 1))
        self.costs = []

    def relu(self, Z):
        return np.maximum(0, Z)

    def relu_deriv(self, Z):
        return (Z > 0).astype(float)

    def sigmoid(self, Z):
        Z = np.clip(Z, -500, 500)
        return 1 / (1 + np.exp(-Z))

    def forward(self, X):
        Z1 = self.W1 @ X + self.b1
        A1 = self.relu(Z1)
        Z2 = self.W2 @ A1 + self.b2
        A2 = self.sigmoid(Z2)
        cache = (X, Z1, A1, Z2, A2)
        return A2, cache

    def compute_cost(self, Y, A2):
        eps = 1e-9
        A2 = np.clip(A2, eps, 1 - eps)
        return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

    def backward(self, Y, cache):
        X, Z1, A1, Z2, A2 = cache
        m = X.shape[1]
        dZ2 = A2 - Y
        dW2 = (dZ2 @ A1.T) / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (self.W2.T @ dZ2) * self.relu_deriv(Z1)
        dW1 = (dZ1 @ X.T) / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        return dW1, db1, dW2, db2

    def fit(self, X, Y):
        for i in range(self.epochs):
            A2, cache = self.forward(X)
            cost = self.compute_cost(Y, A2)
            dW1, db1, dW2, db2 = self.backward(Y, cache)
            self.W1 -= self.lr * dW1
            self.b1 -= self.lr * db1
            self.W2 -= self.lr * dW2
            self.b2 -= self.lr * db2
            self.costs.append(cost)
            if i % 1000 == 0:
                print(f"Epoch {i}: Cost = {cost:.6f}")

    def predict(self, X):
        A2, _ = self.forward(X)
        return (A2 >= 0.5).astype(int)

X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 0, 0, 1]], dtype=float)

model = SimpleANN(input_size=2, hidden_size=4, lr=0.1, epochs=10000)
model.fit(X, Y)
pred = model.predict(X)
print("Predictions:", pred)
```
Dataset Used: AND Gate
The network is trained on the AND truth table:

| $x_1$ | $x_2$ | $y$ |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
Input matrix:
```python
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
```
Target matrix:
```python
Y = np.array([[0, 0, 0, 1]], dtype=float)
```
Shape meaning:
- $X$ has shape $(2, 4)$
- 2 input features
- 4 training examples
So each column of $X$ is one training example.
Network Architecture
This ANN has:
- 2 input neurons
- 4 hidden neurons
- 1 output neuron
So parameter shapes are:
- $W_1$: $(4, 2)$, $b_1$: $(4, 1)$
- $W_2$: $(1, 4)$, $b_2$: $(1, 1)$
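These shapes can be confirmed with a few standalone lines that mirror the initialization in the class above:

```python
import numpy as np

input_size, hidden_size = 2, 4
W1 = np.random.randn(hidden_size, input_size) * 0.1  # (4, 2)
b1 = np.zeros((hidden_size, 1))                      # (4, 1)
W2 = np.random.randn(1, hidden_size) * 0.1           # (1, 4)
b2 = np.zeros((1, 1))                                # (1, 1)

X = np.zeros((2, 4))  # 2 features, 4 examples: one column per example
Z1 = W1 @ X + b1                   # (4, 4): hidden outputs, one column per example
Z2 = W2 @ np.maximum(0, Z1) + b2   # (1, 4): one output score per example

print(Z1.shape, Z2.shape)  # -> (4, 4) (1, 4)
```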
Step-by-Step Algorithm
Step 1: Initialize weights and biases
We begin with small random weights and zero biases.
Code:
```python
self.W1 = np.random.randn(hidden_size, input_size) * 0.1
self.b1 = np.zeros((hidden_size, 1))
self.W2 = np.random.randn(1, hidden_size) * 0.1
self.b2 = np.zeros((1, 1))
```
Concept:
- weights decide how strongly neurons influence the next layer
- biases shift the activation
- small random values break symmetry
- if all weights start the same, neurons learn the same thing
Why not large random weights:
- large values can make training unstable
- small values help smoother learning at the start
Step 2: Hidden layer linear transformation
Each hidden neuron computes a weighted sum of the inputs: $Z_1 = W_1 X + b_1$
Code:
```python
Z1 = self.W1 @ X + self.b1
```
Concept: This is the weighted sum of inputs plus bias.
For one hidden neuron: $z = w_1 x_1 + w_2 x_2 + b$
Since there are 4 hidden neurons, this is done 4 times in parallel.
Step 3: Apply ReLU activation
ReLU function is: $\mathrm{ReLU}(z) = \max(0, z)$
Code:
```python
A1 = self.relu(Z1)
```
and:
```python
def relu(self, Z):
    return np.maximum(0, Z)
```
Concept:
- negative values become 0
- positive values remain unchanged
Why ReLU:
- introduces non-linearity
- lets the network learn more complex patterns
- simple and efficient
Without activation, multiple layers would collapse into just one linear transformation.
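This collapse is easy to demonstrate: two stacked linear layers with no activation are equivalent to a single linear layer whose weight matrix is the product of the two:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 2))
W2 = rng.normal(size=(1, 4))
x = rng.normal(size=(2, 1))

# two stacked linear layers with no activation in between...
two_layers = W2 @ (W1 @ x)

# ...equal one linear layer with the combined weight W2 @ W1
one_layer = (W2 @ W1) @ x

assert np.allclose(two_layers, one_layer)
```

A nonlinearity such as ReLU between the layers breaks this equivalence, which is what gives depth its extra expressive power.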
Step 4: Output layer linear transformation
Now the hidden activations are passed to the output neuron: $Z_2 = W_2 A_1 + b_2$
Code:
```python
Z2 = self.W2 @ A1 + self.b2
```
Concept: This combines the hidden-layer outputs into one final score.
Step 5: Apply Sigmoid to get probability
Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Code:
```python
A2 = self.sigmoid(Z2)
```
and:
```python
def sigmoid(self, Z):
    Z = np.clip(Z, -500, 500)
    return 1 / (1 + np.exp(-Z))
```
Concept:
- converts raw score into probability
- output is between 0 and 1
- suitable for binary classification
Meaning: $A_2 \approx P(y = 1 \mid x)$
Why clip is used:
- prevents overflow in exp
- improves numerical stability
Step 6: Compute the cost
For binary classification, we use binary cross-entropy loss: $J = -\frac{1}{m}\sum_{j=1}^{m}\left[y^{(j)}\log \hat{y}^{(j)} + (1 - y^{(j)})\log(1 - \hat{y}^{(j)})\right]$
Code:
```python
def compute_cost(self, Y, A2):
    eps = 1e-9
    A2 = np.clip(A2, eps, 1 - eps)
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
```
Concept:
- if actual label is 1, we want output close to 1
- if actual label is 0, we want output close to 0
- confident wrong predictions get heavily penalized
Why clip again:
- avoids $\log(0)$, which is undefined
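The "confident wrong predictions are heavily penalized" behaviour is visible directly from the loss values; a small standalone sketch:

```python
import numpy as np

def bce(y, p):
    # binary cross-entropy for a single prediction p against label y
    eps = 1e-9
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(round(float(bce(1, 0.9)), 3))  # confident and right -> 0.105
print(round(float(bce(1, 0.1)), 3))  # confident and wrong -> 2.303
```

The loss grows without bound as a confident prediction moves toward the wrong label, which is exactly the pressure that drives the weights toward correct outputs.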
Step 7: Backpropagation for output layer
The error at the output layer is: $dZ_2 = A_2 - Y$
Code:
```python
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
```
Equations: $dW_2 = \frac{1}{m}\, dZ_2 A_1^T$, $db_2 = \frac{1}{m}\sum_{j} dZ_2^{(j)}$
Concept: This tells how much the output weights and bias contributed to the error.
Step 8: Backpropagation for hidden layer
The hidden layer error is: $dZ_1 = (W_2^T dZ_2) \odot \mathrm{ReLU}'(Z_1)$
Code:
```python
dZ1 = (self.W2.T @ dZ2) * self.relu_deriv(Z1)
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m
```
Equations: $dW_1 = \frac{1}{m}\, dZ_1 X^T$, $db_1 = \frac{1}{m}\sum_{j} dZ_1^{(j)}$
ReLU derivative: $\mathrm{ReLU}'(z) = 1$ if $z > 0$, else $0$
Code:
```python
def relu_deriv(self, Z):
    return (Z > 0).astype(float)
```
Concept:
- output error is sent backward into the hidden layer
- only active ReLU neurons pass gradient
- this is how the network learns internal representations
Step 9: Update parameters
Gradient descent update rule: $\theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta}$
Code:
```python
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
```
Concept:
- move parameters in the direction that reduces loss
- repeat this many times
- gradually improve predictions
Here:
- $\alpha$ is the learning rate (`lr` in the code)
- a higher learning rate updates faster, but may overshoot
- a lower learning rate is safer, but slower
Step 10: Make predictions
After training, the network outputs probabilities. Convert them into classes using threshold 0.5:
Code:
```python
def predict(self, X):
    A2, _ = self.forward(X)
    return (A2 >= 0.5).astype(int)
```
Concept -> Equation -> Code Mapping
1. Weighted input
Concept: Each neuron forms a weighted sum of inputs.
Equation: $z = \sum_j w_j x_j + b$
Code:
```python
Z1 = self.W1 @ X + self.b1
Z2 = self.W2 @ A1 + self.b2
```
2. Non-linearity
Concept: Activation functions allow the network to learn beyond straight-line relationships.
Equations: $\mathrm{ReLU}(z) = \max(0, z)$, $\sigma(z) = \frac{1}{1 + e^{-z}}$
Code:
```python
A1 = self.relu(Z1)
A2 = self.sigmoid(Z2)
```
3. Forward propagation
Concept: Data flows from input to hidden to output.
Equations: $Z_1 = W_1 X + b_1$, $A_1 = \mathrm{ReLU}(Z_1)$, $Z_2 = W_2 A_1 + b_2$, $A_2 = \sigma(Z_2)$
Code:
```python
def forward(self, X):
    Z1 = self.W1 @ X + self.b1
    A1 = self.relu(Z1)
    Z2 = self.W2 @ A1 + self.b2
    A2 = self.sigmoid(Z2)
```
4. Loss measurement
Concept: We need to measure how wrong predictions are.
Equation: $J = -\frac{1}{m}\sum_{j}\left[y^{(j)}\log \hat{y}^{(j)} + (1 - y^{(j)})\log(1 - \hat{y}^{(j)})\right]$
Code:
```python
cost = self.compute_cost(Y, A2)
```
5. Error propagation backward
Concept: The network computes gradients layer by layer from output back to input.
Equations: $dZ_2 = A_2 - Y$, $dW_2 = \frac{1}{m}\, dZ_2 A_1^T$, $dZ_1 = (W_2^T dZ_2) \odot \mathrm{ReLU}'(Z_1)$, $dW_1 = \frac{1}{m}\, dZ_1 X^T$
Code:
```python
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
dZ1 = (self.W2.T @ dZ2) * self.relu_deriv(Z1)
dW1 = (dZ1 @ X.T) / m
```
6. Learning
Concept: Use gradients to improve parameters.
Equation: $\theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta}$
Code:
```python
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
```
Solving the AND Gate Example
The AND gate outputs 1 only when both inputs are 1:
During training:
- the network starts with random weights
- predictions are poor at first
- after many epochs, weights and biases adjust
- the cost decreases
- final outputs approach the correct AND values
Expected final prediction: $[0, 0, 0, 1]$
Code:
```python
pred = model.predict(X)
print("Predictions:", pred)
```
If training succeeds, output becomes:
```
Predictions: [[0 0 0 1]]
```
One Forward Pass Example
Suppose for one sample: $x = (1, 1)$
Assume one hidden neuron has (illustrative values): $w = (0.5, 0.5)$, $b = -0.2$
Then: $z = 0.5 \cdot 1 + 0.5 \cdot 1 - 0.2 = 0.8$
Apply ReLU: $a = \max(0, 0.8) = 0.8$
Then the output neuron combines the hidden activations and passes the result through the sigmoid. If the final output score is $z_2 = 2.0$, then: $\sigma(2.0) \approx 0.88$
Since $0.88 \ge 0.5$, the prediction is: class $1$
This is how the ANN converts inputs into a class decision.
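A pass like this can be checked numerically; all the weights and scores below are illustrative assumptions, not learned values:

```python
import numpy as np

# illustrative numbers: input x = [1, 1], one hidden neuron with
# weights [0.5, 0.5] and bias -0.2 (assumed for demonstration)
x = np.array([1.0, 1.0])
w_hidden, b_hidden = np.array([0.5, 0.5]), -0.2

z1 = w_hidden @ x + b_hidden  # 0.5 + 0.5 - 0.2 = 0.8
a1 = max(0.0, float(z1))      # ReLU keeps the positive value

# assumed final output-layer score for this sample
z2 = 2.0
a2 = 1 / (1 + np.exp(-z2))    # sigmoid squashes the score into (0, 1)

pred = int(a2 >= 0.5)
print(round(float(z1), 2), round(float(a2), 2), pred)  # -> 0.8 0.88 1
```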
Why ANN Works
A neural network works because:
- weights learn which inputs matter
- biases shift decision boundaries
- activation functions add non-linearity
- backpropagation tells each parameter how it contributed to the error
- gradient descent improves the parameters repeatedly
So the network gradually learns a function that maps input to output.
Why Hidden Layers Matter
A single linear model can only learn a linear boundary. A hidden layer with activation allows:
- combinations of features
- piecewise linear transformations
- more expressive decision boundaries
Even though AND is simple, this example demonstrates the full learning pipeline of an ANN.
Cost Curve Meaning
The printed cost every 1000 epochs tells whether learning is working.
If cost decreases:
- predictions are improving
- gradients are useful
- parameter updates are moving in the correct direction
If cost does not decrease:
- learning rate may be wrong
- architecture may be unsuitable
- initialization may be poor
Practical Notes
1. Initialization matters
Bad initialization can slow or break learning.
2. Learning rate matters
If learning rate is:
- too high -> unstable training
- too low -> very slow training
3. Activation choice matters
- ReLU is common in hidden layers
- Sigmoid is common for binary output
4. More layers increase capacity
Deeper networks can learn more complex patterns, but are also harder to train.
Exam-Oriented Summary
Definition
An ANN is a layered network of neurons that learns by adjusting weights and biases using backpropagation and gradient descent.
Architecture Used
- 2 input neurons
- 1 hidden layer with 4 neurons
- 1 output neuron
Important Equations
Hidden layer: $Z_1 = W_1 X + b_1$, $A_1 = \mathrm{ReLU}(Z_1)$
Output layer: $Z_2 = W_2 A_1 + b_2$, $A_2 = \sigma(Z_2)$
Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
ReLU: $\mathrm{ReLU}(z) = \max(0, z)$
Loss: $J = -\frac{1}{m}\sum_{j}\left[y^{(j)}\log \hat{y}^{(j)} + (1 - y^{(j)})\log(1 - \hat{y}^{(j)})\right]$
Gradient descent: $\theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta}$
Training Steps
- initialize parameters
- perform forward propagation
- compute loss
- perform backpropagation
- update parameters
- repeat for many epochs
Very Short Revision
- input passes through weights and bias
- ReLU activates hidden layer
- sigmoid gives output probability
- cross-entropy measures error
- backpropagation computes gradients
- gradient descent updates weights
- repeat until cost decreases and predictions improve
Final Takeaway
This ANN from scratch shows the complete neural-network learning process:
- forward propagation computes predictions
- loss measures error
- backpropagation computes gradients
- gradient descent updates parameters
For the AND gate dataset, the network learns the correct truth table ($[0, 0, 0, 1]$), which shows that it has successfully learned the mapping from inputs to outputs.