Linear Regression
Simple Linear Regression predicts a continuous value using one input feature by fitting a straight line:

$$\hat{y} = \beta_0 + \beta_1 x$$

where:
- $\beta_0$ = intercept
- $\beta_1$ = slope
- $\hat{y}$ = predicted output

The goal is to find the line that minimizes the sum of squared errors:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Intuition
We want the “best” straight line through the data.
- If $\beta_1 > 0$, the line goes upward
- If $\beta_1 < 0$, the line goes downward
- $\beta_0$ tells where the line cuts the $y$-axis

For a dataset with one feature, we add a bias column of 1s:

$$X_b = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$

Then the parameters are computed using the Normal Equation:

$$\theta = (X_b^T X_b)^{-1} X_b^T y$$

where $\theta = [\beta_0, \beta_1]^T$.
Short and Clean Code
import numpy as np
import matplotlib.pyplot as plt

class SimpleLinearRegression:
    def __init__(self):
        self.intercept_ = 0.0
        self.coef_ = 0.0
        self.r2_ = 0.0

    def fit(self, X, y):
        X = np.asarray(X).reshape(-1, 1)
        y = np.asarray(y).reshape(-1, 1)
        Xb = np.c_[np.ones((len(X), 1)), X]
        theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
        self.intercept_ = theta[0, 0]
        self.coef_ = theta[1, 0]
        y_pred = Xb @ theta
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        self.r2_ = 1 - ss_res / ss_tot
        return self

    def predict(self, X):
        X = np.asarray(X).reshape(-1, 1)
        return self.intercept_ + self.coef_ * X

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12])

model = SimpleLinearRegression().fit(X, y)
y_pred = model.predict(X)

print("Intercept:", round(model.intercept_, 4))
print("Slope:", round(model.coef_, 4))
print("R^2:", round(model.r2_, 4))

plt.scatter(X, y, label="Data")
plt.plot(X, y_pred, label="Regression line")
plt.xlabel("X")
plt.ylabel("y")
plt.title("Simple Linear Regression")
plt.legend()
plt.show()
Output Meaning
From this code, the fitted line is approximately:

$$\hat{y} \approx 1.0667 + 1.0061x$$

So:
- intercept $\beta_0 \approx 1.0667$
- slope $\beta_1 \approx 1.0061$

This means:
- when $x = 0$, predicted $\hat{y} \approx 1.07$
- for every increase of 1 in $x$, $\hat{y}$ increases by about $1.01$

The $R^2$ score is approximately $0.9446$, which means the model explains about 94.46% of the variance in the target.
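These coefficients can be cross-checked independently. As a quick sketch, `np.polyfit` with degree 1 fits the same least-squares line:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12], dtype=float)

# polyfit(X, y, 1) returns [slope, intercept] for the least-squares line
slope, intercept = np.polyfit(X, y, 1)
print(round(slope, 4), round(intercept, 4))  # -> 1.0061 1.0667
```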
Step-by-Step Algorithm
Step 1: Prepare the data
We start with:
X = np.array([1,2,3,4,5,6,7,8,9,10])
y = np.array([2,4,5,4,5,7,8,9,10,12])
Concept:
- $x$ is the input feature
- $y$ is the actual output

Dataset pairs: $(1, 2), (2, 4), (3, 5), \ldots, (10, 12)$
Step 2: Add the bias column
To learn both intercept and slope together, we transform $X$ into:

$$X_b = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$
Code:
Xb = np.c_[np.ones((len(X), 1)), X.reshape(-1, 1)]
Concept:
- first column of 1s handles the intercept
- second column stores the feature values
Step 3: Compute parameters using the Normal Equation

Equation:

$$\theta = (X_b^T X_b)^{-1} X_b^T y$$

Code:
theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y.reshape(-1, 1)

This gives:

$$\theta \approx \begin{bmatrix} 1.0667 \\ 1.0061 \end{bmatrix}$$

So $\beta_0 \approx 1.0667$ and $\beta_1 \approx 1.0061$.

Meaning: the fitted line is $\hat{y} \approx 1.0667 + 1.0061x$.
Step 4: Predict values
For each input, substitute into:

$$\hat{y} = \beta_0 + \beta_1 x \approx 1.0667 + 1.0061x$$

Code:
y_pred = self.intercept_ + self.coef_ * X.reshape(-1, 1)

Let us compute a few predictions manually.

For $x = 1$: $\hat{y} \approx 1.0667 + 1.0061 \cdot 1 \approx 2.073$

For $x = 2$: $\hat{y} \approx 1.0667 + 1.0061 \cdot 2 \approx 3.079$

For $x = 5$: $\hat{y} \approx 1.0667 + 1.0061 \cdot 5 \approx 6.097$

For $x = 10$: $\hat{y} \approx 1.0667 + 1.0061 \cdot 10 \approx 11.127$

So the model predictions are approximately $2.07, 3.08, \ldots, 11.13$, close to the actual values.
Step 5: Measure performance using $R^2$

The coefficient of determination is:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Where:
- $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ = residual sum of squares
- $SS_{tot} = \sum_i (y_i - \bar{y})^2$ = total sum of squares
- $\bar{y}$ = mean of actual $y$
Code:
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
self.r2_ = 1 - ss_res / ss_tot
For this dataset: $R^2 \approx 0.9446$

Interpretation:
- $R^2$ close to 1 means strong fit
- $R^2$ close to 0 means poor fit
Code Explanation: Concept -> Equation -> Code
1. Store model parameters
Concept: We need to remember the learned intercept, slope, and model score.
Code:
class SimpleLinearRegression:
    def __init__(self):
        self.intercept_ = 0.0
        self.coef_ = 0.0
        self.r2_ = 0.0
Meaning:
- intercept_ stores $\beta_0$
- coef_ stores $\beta_1$
- r2_ stores the $R^2$ score
2. Convert input into column form
Concept: Matrix equations require $X$ and $y$ in proper shapes.
Code:
X = np.asarray(X).reshape(-1, 1)
y = np.asarray(y).reshape(-1, 1)
Meaning:
- reshape(-1, 1) makes data a column vector

Example: [1, 2, 3] becomes [[1], [2], [3]]
3. Add the bias term
Concept: The intercept must be part of the matrix multiplication.
Equation: $X_b = [\mathbf{1} \;\; X]$
Code:
Xb = np.c_[np.ones((len(X), 1)), X]
This builds a matrix whose first column is all 1s and whose second column holds the feature values.
4. Learn the best-fit line
Concept: Choose parameters that minimize squared error.
Equation: $\theta = (X_b^T X_b)^{-1} X_b^T y$
Code:
theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
This is the heart of the algorithm.
Then:
self.intercept_ = theta[0, 0]
self.coef_ = theta[1, 0]
This maps:
- theta[0, 0] -> intercept $\beta_0$
- theta[1, 0] -> slope $\beta_1$
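A practical aside (not from the original code): explicitly inverting $X_b^T X_b$ can be numerically fragile when features are highly correlated. `np.linalg.lstsq` solves the same least-squares problem more robustly; a minimal sketch on this dataset:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12], dtype=float).reshape(-1, 1)
Xb = np.c_[np.ones((len(X), 1)), X]

# Normal Equation via explicit inverse
theta_inv = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
# Same least-squares solution via a dedicated solver (numerically safer)
theta_lstsq, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(np.allclose(theta_inv, theta_lstsq))  # True
```

Both approaches return the same $\theta$ here; the solver route simply avoids forming and inverting $X_b^T X_b$.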
5. Predict outputs
Concept: Once parameters are known, plug them into the line equation.
Equation: $\hat{y} = \beta_0 + \beta_1 x$
Code:
y_pred = Xb @ theta
or in predict():
return self.intercept_ + self.coef_ * X
Both do the same thing.
6. Evaluate model fit
Concept: Compare predictions with actual values.
Equation: $R^2 = 1 - \dfrac{SS_{res}}{SS_{tot}}$
Code:
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
self.r2_ = 1 - ss_res / ss_tot
Worked Example

Using $x = 3$:

Prediction: $\hat{y} \approx 1.0667 + 1.0061 \cdot 3 \approx 4.085$

Actual value in dataset: $y = 5$

Error: $5 - 4.085 = 0.915$

Squared error: $0.915^2 \approx 0.838$

This is how the model measures how far a prediction is from the truth.
Why This Algorithm Works
Simple Linear Regression assumes:
- the relationship is approximately linear
- one feature is enough to explain the target
- the best line is the one with minimum squared error
By minimizing squared error, the algorithm finds a line that stays as close as possible to all points overall.
Final Summary
Simple Linear Regression learns:

$$\hat{y} = \beta_0 + \beta_1 x$$

using the Normal Equation $\theta = (X_b^T X_b)^{-1} X_b^T y$.

For this dataset, the learned model is $\hat{y} \approx 1.0667 + 1.0061x$ with $R^2 \approx 0.9446$.
So the model fits the data well and captures a strong positive linear relationship.
Exam-Oriented Points
- Used for predicting a continuous value
- Works with one input feature
- Equation of model: $\hat{y} = \beta_0 + \beta_1 x$
- Parameters are found using the Normal Equation $\theta = (X^T X)^{-1} X^T y$
- Performance is commonly measured with $R^2$
- Best when the relationship between feature and target is approximately linear
Very Short Revision
- Add bias column
- Compute parameters using Normal Equation
- Form regression line
- Predict output
- Evaluate using $R^2$

Formula: $\theta = (X_b^T X_b)^{-1} X_b^T y$. Model: $\hat{y} = \beta_0 + \beta_1 x$.
Logistic Regression
Logistic Regression is a binary classification algorithm used when the output belongs to one of two classes: $y \in \{0, 1\}$.

It does not predict the class directly using a line. Instead, it predicts a probability using the sigmoid function, then converts that probability into class 0 or 1.

The model is:

$$p = \sigma(z), \quad z = w^T x + b$$

where:
- $z$ = linear score
- $\sigma(z) = \dfrac{1}{1 + e^{-z}}$ = sigmoid function
- $p$ = predicted probability that class is 1

If $p \geq 0.5$, predict class 1; otherwise predict class 0.
Main Idea
Linear Regression gives any real number as output, but classification needs a value between 0 and 1. So Logistic Regression first computes a linear combination:

$$z = w^T x + b$$

then applies the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This maps any real number into $(0, 1)$, so the output can be interpreted as a probability.
Short and Clean Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class LogisticRegressionScratch:
    def __init__(self, lr=0.1, epochs=1000):
        self.lr = lr
        self.epochs = epochs
        self.w = None
        self.b = 0.0
        self.loss_history = []

    def _sigmoid(self, z):
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def _loss(self, y, p):
        eps = 1e-9
        p = np.clip(p, eps, 1 - eps)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        m, n = X.shape
        self.w = np.zeros(n)
        for _ in range(self.epochs):
            z = X @ self.w + self.b
            p = self._sigmoid(z)
            dw = (X.T @ (p - y)) / m
            db = np.mean(p - y)
            self.w -= self.lr * dw
            self.b -= self.lr * db
            self.loss_history.append(self._loss(y, p))
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)

np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegressionScratch(lr=0.1, epochs=1000)
model.fit(X_train, y_train)
pred = model.predict(X_test)
acc = np.mean(pred == y_test)

print("Weights:", np.round(model.w, 4))
print("Bias:", round(model.b, 4))
print("Accuracy:", round(acc, 4))

plt.plot(model.loss_history)
plt.title("Loss Convergence")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.grid(True)
plt.show()
What This Code Does
This example creates 2D points $(x_1, x_2)$ with each coordinate in $[0, 10]$ and labels them with the rule $y = 1$ if $x_1 + x_2 > 10$, else $y = 0$. So the true decision boundary is the line $x_1 + x_2 = 10$. This is a binary classification problem.
Step-by-Step Algorithm
Step 1: Create the dataset
Code:
np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)
Concept:
- X contains 200 samples
- each sample has 2 features: $(x_1, x_2)$
- the class label depends on whether the sum is greater than 10

Equation:

$$y = \begin{cases} 1 & \text{if } x_1 + x_2 > 10 \\ 0 & \text{otherwise} \end{cases}$$

Meaning:
- points above the line $x_1 + x_2 = 10$ belong to class 1
- points below the line belong to class 0
Step 2: Split into train and test sets
Code:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
Concept:
- training data is used to learn parameters
- test data is used to check performance on unseen data
Here:
- 80% for training
- 20% for testing
Step 3: Standardize features
Code:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Concept: Gradient descent works better when features are on similar scales.
Standardization formula:

$$x' = \frac{x - \mu}{\sigma}$$

where:
- $\mu$ = mean of the feature
- $\sigma$ = standard deviation
Why it helps:
- faster convergence
- more stable updates
- one feature does not dominate another due to large magnitude
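The effect of the formula can be verified directly; after standardization every column has mean 0 and standard deviation 1. A minimal sketch with made-up data on the same 0-10 scale as the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 2)) * 10  # illustrative features on a 0-10 scale

# z = (x - mu) / sigma, applied column by column
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Xs = (X - mu) / sigma

# Each standardized column now has mean ~0 and std ~1
print(np.allclose(Xs.mean(axis=0), 0), np.allclose(Xs.std(axis=0), 1))
```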
Step 4: Compute the linear score
For each sample, Logistic Regression first computes:

$$z = w_1 x_1 + w_2 x_2 + b$$

In vector form: $z = Xw + b$
Code:
z = X @ self.w + self.b
Concept: This is the same linear part used in linear models. But here it is not the final output. It is only the input to the sigmoid function.
Step 5: Apply sigmoid to get probability
Equation: $p = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
Code:
p = self._sigmoid(z)
Concept: The sigmoid compresses any real number into a value between 0 and 1.
Examples:
- if $z = 0$: $\sigma(0) = 0.5$
- if $z$ is large positive, probability is close to 1
- if $z$ is large negative, probability is close to 0

So: $p$ can be read as $P(y = 1 \mid x)$.
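These three cases can be checked in a couple of lines:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0.0))   # exactly 0.5
print(sigmoid(6.0))   # close to 1
print(sigmoid(-6.0))  # close to 0
```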
Step 6: Measure error using cross-entropy loss
For Logistic Regression, we do not use mean squared error. We use cross-entropy loss:

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
Code:
def _loss(self, y, p):
    eps = 1e-9
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
Concept:
- if the actual class is 1, we want $p$ close to 1
- if the actual class is 0, we want $p$ close to 0
- wrong confident predictions are penalized heavily

Why clipping is used:
- log(0) is undefined
- so probabilities are clipped slightly away from 0 and 1
Step 7: Compute gradients
To reduce loss, we update weights and bias using gradient descent.
Gradient formulas:

$$\frac{\partial L}{\partial w} = \frac{1}{m} X^T (p - y), \qquad \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)$$
Code:
dw = (X.T @ (p - y)) / m
db = np.mean(p - y)
Concept:
- dw tells how the weights should change
- db tells how the bias should change
- if a prediction is too large, parameters are pushed downward
- if a prediction is too small, parameters are pushed upward
Step 8: Update parameters
Gradient descent update rule:

$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \, \frac{\partial L}{\partial b}$$

where $\eta$ is the learning rate.
Code:
self.w -= self.lr * dw
self.b -= self.lr * db
Concept:
- move parameters in the direction that reduces loss
- repeat many times until learning stabilizes
Step 9: Convert probabilities to classes
After training, predicted probability is converted to class label.
Rule: predict class 1 if $p \geq 0.5$, otherwise class 0.
Code:
return (self.predict_proba(X) >= 0.5).astype(int)
Concept:
- probabilities are continuous
- classification needs discrete labels
Step 10: Measure accuracy
Code:
pred = model.predict(X_test)
acc = np.mean(pred == y_test)
Equation: $\text{accuracy} = \dfrac{\text{number of correct predictions}}{\text{total predictions}}$
Concept: Accuracy tells what fraction of test samples were classified correctly.
Concept -> Equation -> Code Mapping
1. Model parameters
Concept: The model must learn weights $w$ and bias $b$.

Equation: $z = Xw + b$, with $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$
Code:
self.w = np.zeros(n)
self.b = 0.0
Meaning:
- start with all weights as 0
- start with bias as 0
2. Probability model
Concept: Turn linear score into probability.
Equation: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
Code:
def _sigmoid(self, z):
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))
Why clip:
- avoids overflow for very large positive or negative values
3. Forward pass
Concept: Compute predictions from current parameters.
Equation: $z = Xw + b$, $\; p = \sigma(z)$
Code:
z = X @ self.w + self.b
p = self._sigmoid(z)
Meaning:
- z = raw score
- p = predicted probability
4. Loss calculation
Concept: See how wrong the predictions are.
Equation: $L = -\dfrac{1}{m} \sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$
Code:
self.loss_history.append(self._loss(y, p))
Meaning:
- each iteration stores loss
- useful for checking whether training is improving
5. Backward pass
Concept: Find how parameters affect the loss.
Equation: $\dfrac{\partial L}{\partial w} = \dfrac{1}{m} X^T (p - y)$, $\; \dfrac{\partial L}{\partial b} = \dfrac{1}{m} \sum_i (p_i - y_i)$
Code:
dw = (X.T @ (p - y)) / m
db = np.mean(p - y)
Meaning: These gradients guide the update step.
6. Learning step
Concept: Improve the model gradually.
Equation: $w \leftarrow w - \eta \, dw$, $\; b \leftarrow b - \eta \, db$
Code:
self.w -= self.lr * dw
self.b -= self.lr * db
Meaning: Repeated updates make the model better at classification.
Worked Example on One Sample
Suppose after some training, one sample has features $x = (1.0, 0.5)$ and the model has $w = (2.0, 1.0)$, $b = -1.5$ (illustrative values).

Step 1: Compute score

$$z = 2.0 \cdot 1.0 + 1.0 \cdot 0.5 - 1.5 = 1.0$$

Step 2: Apply sigmoid

$$p = \sigma(1.0) \approx 0.731$$

Step 3: Classify

Since $0.731 \geq 0.5$, the prediction is $\hat{y} = 1$.

So this sample is classified as class 1.
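As a numeric sanity check, the same forward pass with illustrative values $w = (2.0, 1.0)$, $b = -1.5$, $x = (1.0, 0.5)$ (chosen for this sketch, not taken from a trained model):

```python
import numpy as np

w = np.array([2.0, 1.0])   # illustrative weights
b = -1.5                   # illustrative bias
x = np.array([1.0, 0.5])   # one sample

z = x @ w + b                # 2*1.0 + 1*0.5 - 1.5 = 1.0
p = 1 / (1 + np.exp(-z))     # sigmoid(1.0) ~ 0.7311
label = int(p >= 0.5)
print(z, round(p, 4), label)
```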
Why Logistic Regression Works
The model learns a boundary where the probability changes from class 0 to class 1. For two features, the decision boundary is:

$$w_1 x_1 + w_2 x_2 + b = 0$$

because $p = 0.5$ exactly when $z = 0$. So:
- if $z > 0$, the class tends toward 1
- if $z < 0$, the class tends toward 0
This creates a linear decision boundary.
For your dataset, true labels come from the rule $x_1 + x_2 > 10$, which is a linear condition. So Logistic Regression is a good fit because the classes are separable by a line.
What the Loss Curve Means
Code:
plt.plot(model.loss_history)
plt.title("Loss Convergence")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.grid(True)
plt.show()
Concept:
- at the beginning, loss is high
- during learning, loss should decrease
- a downward curve means gradient descent is working
If the curve:
- decreases smoothly -> learning is stable
- oscillates wildly -> learning rate may be too high
- decreases very slowly -> learning rate may be too low
Practical Notes
1. Feature scaling matters
Because Logistic Regression uses gradient descent, features with large values can slow down convergence. That is why standardization $x' = (x - \mu)/\sigma$ is important.
2. Learning rate matters
If learning rate is:
- too high -> training may diverge
- too low -> training becomes very slow
Typical starting values: $\eta = 0.1$, $0.01$, or $0.001$.
3. Linear boundary limitation
Logistic Regression assumes the boundary is linear: $w^T x + b = 0$. If data is non-linear, you may need:
- feature engineering
- polynomial features
- another model
4. Correlated features can cause issues
If features are strongly correlated, learning may become unstable. This is called multicollinearity.
Exam-Oriented Summary
Definition
Logistic Regression is a supervised learning algorithm used for binary classification.
Model

$$p = \sigma(w^T x + b), \quad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Decision Rule

Predict class 1 if $p \geq 0.5$, else class 0.

Loss Function

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

Gradient Descent Updates

$$w \leftarrow w - \eta \, \frac{1}{m} X^T (p - y), \qquad b \leftarrow b - \eta \, \frac{1}{m} \sum_i (p_i - y_i)$$
Uses
- spam detection
- disease prediction
- pass/fail prediction
- yes/no classification tasks
Very Short Revision
- Compute linear score: $z = w^T x + b$
- Apply sigmoid: $p = \sigma(z) = \frac{1}{1 + e^{-z}}$
- Compute cross-entropy loss
- Update weights and bias using gradient descent
- Convert probability to class using threshold 0.5
Final Takeaway
Logistic Regression is a simple but powerful classification algorithm. It learns a linear decision boundary, uses sigmoid to output probabilities, and improves itself by minimizing cross-entropy loss using gradient descent. For your example, it works well because the classes are generated by a rule that is approximately linearly separable.
Principal Component Analysis (PCA)
PCA is a dimensionality reduction algorithm. It creates new features called principal components that keep as much information as possible from the original dataset. The main idea is:
- find directions where the data varies the most
- rank those directions
- keep only the most important ones
If the original data has $d$ features, PCA can produce up to $d$ principal components.
Why PCA is Needed
High-dimensional data causes problems such as:
- harder visualization
- more computation
- more difficult learning
- curse of dimensionality
PCA solves this by projecting the data onto fewer dimensions while preserving maximum variance.
Core Idea
Suppose the data matrix is $X \in \mathbb{R}^{n \times d}$, where:
- $n$ = number of samples
- $d$ = number of features

PCA finds a new set of orthogonal directions $v_1, v_2, \ldots, v_d$ such that:
- $v_1$ captures the maximum variance
- $v_2$ captures the next maximum variance
- and so on
These directions are the eigenvectors of the covariance matrix. Their importance is given by the eigenvalues.
Main Equations
1. Standardization

Each feature is standardized using Z-score:

$$x' = \frac{x - \mu}{\sigma}$$

2. Covariance matrix

For centered data:

$$C = \frac{1}{n} X_s^T X_s$$

3. Eigen decomposition

$$C v = \lambda v$$

where:
- $v$ = eigenvector
- $\lambda$ = eigenvalue

4. Projection

If $W_k$ contains the top $k$ principal components, then:

$$X_{proj} = X_s W_k$$
Intuition
Each principal component is a direction in feature space. If data spreads a lot along a direction, that direction contains a lot of information. So PCA keeps the directions with the largest variance.
That is why PCA chooses eigenvectors with the largest eigenvalues.
Short and Clean Code
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA as SklearnPCA

class PCAFromScratch:
    def __init__(self, n_components):
        self.n_components = n_components
        self.mean_ = None
        self.std_ = None
        self.components_ = None
        self.eigenvalues_ = None
        self.explained_variance_ratio_ = None

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        Xs = (X - self.mean_) / self.std_
        C = (Xs.T @ Xs) / len(Xs)
        eigenvalues, eigenvectors = np.linalg.eigh(C)
        order = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[order]
        eigenvectors = eigenvectors[:, order]
        self.eigenvalues_ = eigenvalues[:self.n_components]
        self.components_ = eigenvectors[:, :self.n_components]
        self.explained_variance_ratio_ = eigenvalues / eigenvalues.sum()
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        Xs = (X - self.mean_) / self.std_
        return Xs @ self.components_

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

iris = load_iris()
X = iris.data

pca = PCAFromScratch(n_components=2)
X_proj = pca.fit_transform(X)

print("Top eigenvalues:", np.round(pca.eigenvalues_, 4))
print("Principal components:\n", np.round(pca.components_, 4))
print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 4))

sk_pca = SklearnPCA(n_components=2)
X_sk = sk_pca.fit_transform((X - X.mean(axis=0)) / X.std(axis=0))
print("Sklearn components:\n", np.round(sk_pca.components_.T, 4))
What This Code Does
This code:
- loads the Iris dataset
- standardizes all features
- computes the covariance matrix
- finds eigenvalues and eigenvectors
- sorts them from largest to smallest
- selects the top n_components
So 4-dimensional Iris data becomes 2-dimensional.
Dataset
The Iris dataset has:
- 150 samples
- 4 features
- 3 flower classes
The four features are:
- sepal length
- sepal width
- petal length
- petal width
PCA uses only the feature matrix $X \in \mathbb{R}^{150 \times 4}$.
Since PCA is unsupervised, labels are not needed.
Step-by-Step Algorithm
Step 1: Standardize the dataset
PCA is sensitive to scale. If one feature has larger values, it can dominate the variance.
So each column is standardized:

$$x' = \frac{x - \mu}{\sigma}$$
Code:
self.mean_ = X.mean(axis=0)
self.std_ = X.std(axis=0)
Xs = (X - self.mean_) / self.std_
Concept:
- subtract column mean
- divide by column standard deviation
- now every feature has roughly comparable scale
Why important:
- PCA is variance-based
- variance depends on feature scale
- standardization ensures fairness among features
Step 2: Compute covariance matrix
The covariance matrix measures how features vary together.
Equation: $C = \dfrac{1}{n} X_s^T X_s$
Code:
C = (Xs.T @ Xs) / len(Xs)
Concept:
- diagonal entries = variance of each standardized feature
- off-diagonal entries = covariance between pairs of features
For Iris: $C \in \mathbb{R}^{4 \times 4}$, because there are 4 original features.
Step 3: Find eigenvalues and eigenvectors
PCA solves: $C v = \lambda v$
Code:
eigenvalues, eigenvectors = np.linalg.eigh(C)
Concept:
- each eigenvector gives a direction
- each eigenvalue tells how much variance is captured in that direction
Why eigh and not eig:
- the covariance matrix is symmetric
- np.linalg.eigh is better suited to symmetric matrices

Interpretation:
- large eigenvalue -> important component
- small eigenvalue -> less important component
Step 4: Sort eigenvalues and eigenvectors
The most useful components are the ones with the largest eigenvalues.
Code:
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
Concept:
- argsort sorts indices
- [::-1] reverses the order to descending
- now the first column of eigenvectors is the first principal component
Step 5: Select top principal components

If we want only $k$ components:
Code:
self.eigenvalues_ = eigenvalues[:self.n_components]
self.components_ = eigenvectors[:, :self.n_components]
Concept:
- keep only the dominant directions
- discard weaker directions
- dimension reduces from $d$ to $k$

For example:
- original data: 4 features
- choose 2 components
- reduced data: 2 features
Step 6: Project data onto the new space
Projection formula:

$$X_{proj} = X_s W_k$$
Code:
return Xs @ self.components_
Concept:
- each sample is re-expressed in terms of principal components
- this gives lower-dimensional data
- information loss is minimized as much as possible for the chosen number of components
If $X_s \in \mathbb{R}^{150 \times 4}$ and $W_k \in \mathbb{R}^{4 \times 2}$, then $X_{proj} \in \mathbb{R}^{150 \times 2}$.
Concept -> Equation -> Code Mapping
1. Equalize feature scales
Concept: All features should contribute fairly.
Equation: $x' = \dfrac{x - \mu}{\sigma}$
Code:
self.mean_ = X.mean(axis=0)
self.std_ = X.std(axis=0)
Xs = (X - self.mean_) / self.std_
2. Measure variance structure
Concept: We need a matrix that summarizes how features vary together.
Equation: $C = \dfrac{1}{n} X_s^T X_s$
Code:
C = (Xs.T @ Xs) / len(Xs)
3. Find important directions
Concept: The best projection directions are the eigenvectors of the covariance matrix.
Equation: $C v = \lambda v$
Code:
eigenvalues, eigenvectors = np.linalg.eigh(C)
4. Rank the directions
Concept: Directions with larger variance are more useful.
Equation: sort so that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$
Code:
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
5. Keep only the top directions
Concept: Dimensionality reduction means retaining only the most informative directions.
Equation: $W_k = [v_1 \; v_2 \; \cdots \; v_k]$
Code:
self.components_ = eigenvectors[:, :self.n_components]
6. Project data
Concept: Convert old features into principal component coordinates.
Equation: $X_{proj} = X_s W_k$
Code:
return Xs @ self.components_
Why Eigenvectors and Eigenvalues Appear
PCA wants to maximize the variance of the projected data. If we project onto a unit vector $w$, the projected variance is:

$$\text{Var}(X_s w) = w^T C w \quad \text{subject to} \quad w^T w = 1$$

This is an optimization problem. Using Lagrange multipliers:

$$\mathcal{L}(w, \lambda) = w^T C w - \lambda (w^T w - 1)$$

Taking the derivative and setting it to zero gives:

$$C w = \lambda w$$

So:
- the best directions are eigenvectors of $C$
- the amount of retained variance is given by eigenvalues

This is the mathematical reason behind PCA.

Why the Largest Eigenvalue Matters

The projected variance along direction $w$ is $w^T C w$. For an eigenvector, $C w = \lambda w$, so:

$$w^T C w = w^T (\lambda w) = \lambda \, w^T w$$

Since $w^T w = 1$, we get:

$$w^T C w = \lambda$$

Therefore:
- the projected variance equals the eigenvalue
- maximizing variance means choosing the largest eigenvalue

That is why PCA selects top eigenvalues first.
Explained Variance Ratio
A useful quantity is:

$$\text{explained variance ratio}_i = \frac{\lambda_i}{\sum_j \lambda_j}$$
Code:
self.explained_variance_ratio_ = eigenvalues / eigenvalues.sum()
Concept: This tells how much total information each principal component retains.
Example interpretation:
- PC1 = 72%
- PC2 = 23%
- then first two PCs keep 95% of the total variance
This helps decide how many components to keep.
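The decision is usually made from the cumulative ratio. A sketch with hypothetical eigenvalues for a 4-feature covariance matrix (illustrative numbers, not computed from Iris here):

```python
import numpy as np

# Hypothetical eigenvalues, largest first (illustrative only)
eigenvalues = np.array([2.91, 0.92, 0.15, 0.02])

ratio = eigenvalues / eigenvalues.sum()  # share of variance per component
cumulative = np.cumsum(ratio)            # variance kept by the first k components
print(np.round(ratio, 4))
print(np.round(cumulative, 4))
```

Here the first two components keep about 95.75% of the variance, so keeping $k = 2$ would be a reasonable choice.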
Worked Mini Example
Suppose after covariance computation, the eigenvalues are (illustrative values):

$$\lambda_1 = 2.91, \quad \lambda_2 = 0.92, \quad \lambda_3 = 0.15, \quad \lambda_4 = 0.02$$

Then:
- the first principal component captures the most variance
- the second captures the next most
- the third and fourth contribute little

Total variance: $2.91 + 0.92 + 0.15 + 0.02 = 4.00$

Explained variance ratios: $2.91/4 = 0.7275$, $\; 0.92/4 = 0.23$, $\; 0.15/4 = 0.0375$, $\; 0.02/4 = 0.005$

So:
- PC1 keeps about 72.75%
- PC2 keeps about 23%
- the first two together keep about 95.75%
This means reducing from 4D to 2D is very reasonable.
Understanding the Components Matrix
If the selected components are $W_2 = [v_1 \; v_2]$, then:
- first column = first principal component
- second column = second principal component

Each column shows how the original features combine to form the new axis.

For example, the first component is:

$$\text{PC}_1 = v_{11} x_1 + v_{21} x_2 + v_{31} x_3 + v_{41} x_4$$

So a principal component is a linear combination of the original features.
Comparing with Scikit-Learn
Code:
sk_pca = SklearnPCA(n_components=2)
X_sk = sk_pca.fit_transform((X - X.mean(axis=0)) / X.std(axis=0))
print(np.round(sk_pca.components_.T, 4))
Concept: This checks whether the scratch implementation gives similar principal components.
Important note: Principal components may differ by sign. That means if one library gives $v$, another may give $-v$. This is still correct because both represent the same direction.
So when comparing PCA outputs, sign flips are normal.
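One common trick for comparing components across implementations is to flip each vector so its largest-magnitude entry is positive before comparing (a sketch with made-up component values):

```python
import numpy as np

v1 = np.array([0.52, -0.27, 0.58, 0.56])  # component from one implementation (made up)
v2 = -v1                                  # same direction, opposite sign

def align(v):
    # Flip the vector so its largest-magnitude entry is positive
    return v * np.sign(v[np.argmax(np.abs(v))])

print(np.allclose(align(v1), align(v2)))  # True
```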
Why PCA Works
PCA works because:
- it identifies directions of maximum spread
- those directions preserve the most information
- the directions are orthogonal, so they do not duplicate information
- low-variance directions can often be removed with minimal information loss
So PCA compresses data while keeping the most useful structure.
Limitations
1. PCA is linear
PCA only finds linear combinations of features. If structure is highly non-linear, PCA may not capture it well.
2. PCA is sensitive to scale
Without standardization, large-scale features dominate.
3. Components may be hard to interpret
The new axes are combinations of original features, so they may not be as interpretable as raw columns.
4. Variance does not always mean usefulness
PCA keeps directions with high variance, but high variance does not always mean high predictive importance for a target variable.
Exam-Oriented Summary
Definition
PCA is an unsupervised dimensionality reduction technique that transforms correlated features into orthogonal principal components.
Goal
Reduce the number of features while retaining maximum variance.
Steps
- standardize data
- compute covariance matrix
- find eigenvalues and eigenvectors
- sort them in descending order
- select top components
- project data onto them
Important Equations
Standardization: $x' = (x - \mu)/\sigma$

Covariance: $C = \frac{1}{n} X_s^T X_s$

Eigen equation: $C v = \lambda v$

Projection: $X_{proj} = X_s W_k$

Explained variance ratio: $\lambda_i / \sum_j \lambda_j$
Interpretation
- eigenvectors = directions of principal components
- eigenvalues = variance captured by those directions
- larger eigenvalue = more important component
Very Short Revision
PCA reduces dimensions by:
- standardizing data
- computing covariance matrix
- finding eigenvectors/eigenvalues
- sorting by largest eigenvalue
- keeping top components
- projecting data onto them
Main idea: maximize the projected variance $w^T C w$ subject to $w^T w = 1$; the solutions are the eigenvectors of the covariance matrix.
Final Takeaway
PCA transforms high-dimensional data into a lower-dimensional form by finding the most informative orthogonal directions. These directions are the eigenvectors of the covariance matrix, and their importance is measured by eigenvalues. The larger the eigenvalue, the more variance that principal component preserves, so the better it is for dimensionality reduction.
K-Nearest Neighbor
KNN is an instance-based or lazy learning algorithm. It does not learn an explicit equation during training. Instead, it:
- stores the training data
- waits until a new test point comes
- finds the nearest stored examples
- predicts using those neighbors
There are two common versions:
- KNN Classification -> predict the most common class
- KNN Regression -> predict the mean of neighbor target values
Main Idea
Each sample is a point in $d$-dimensional space: $x = (x_1, x_2, \ldots, x_d)$. To predict for a new point, KNN measures the distance from that point to all training points.

The most common distance used is Euclidean distance:

$$d(a, b) = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2}$$
Then:
- in classification, choose the majority class among the $k$ nearest neighbors
- in regression, choose the average target among the $k$ nearest neighbors
Part A: KNN Classification
What It Does
Given a new test point:
- compute its distance to every training point
- sort distances
- pick the nearest $k$
- return the majority class

So the prediction rule is:

$$\hat{y} = \text{mode}\{y_{(1)}, y_{(2)}, \ldots, y_{(k)}\}$$
Short and Clean KNN Classification Code
import numpy as np
from collections import Counter

class KNNClassifier:
    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)

    def _distance(self, a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def _predict_one(self, x):
        distances = [self._distance(x, x_train) for x_train in self.X_train]
        k_idx = np.argsort(distances)[:self.k]
        k_labels = self.y_train[k_idx]
        return Counter(k_labels).most_common(1)[0][0]

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return np.array([self._predict_one(x) for x in X])

X_train = np.array([
    [1, 1],
    [2, 1],
    [4, 3],
    [5, 4],
    [6, 5]
], dtype=float)
y_train = np.array([0, 0, 1, 1, 1])

X_test = np.array([
    [3, 2],
    [5, 5]
], dtype=float)

clf = KNNClassifier(k=3)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("Predictions:", pred)
Small Classification Example
Training data: $(1,1) \to 0$, $(2,1) \to 0$, $(4,3) \to 1$, $(5,4) \to 1$, $(6,5) \to 1$

Test point: $(3, 2)$

Take: $k = 3$
Solving the Classification Example Step by Step
We compute the distance from $(3, 2)$ to every training point.

Distance to $(1,1)$: $\sqrt{(3-1)^2 + (2-1)^2} = \sqrt{5} \approx 2.236$

Distance to $(2,1)$: $\sqrt{(3-2)^2 + (2-1)^2} = \sqrt{2} \approx 1.414$

Distance to $(4,3)$: $\sqrt{(3-4)^2 + (2-3)^2} = \sqrt{2} \approx 1.414$

Distance to $(5,4)$: $\sqrt{(3-5)^2 + (2-4)^2} = \sqrt{8} \approx 2.828$

Distance to $(6,5)$: $\sqrt{(3-6)^2 + (2-5)^2} = \sqrt{18} \approx 4.243$

Now sort the distances:

| Point | Label | Distance |
|---|---|---|
| (2, 1) | 0 | 1.414 |
| (4, 3) | 1 | 1.414 |
| (1, 1) | 0 | 2.236 |
| (5, 4) | 1 | 2.828 |
| (6, 5) | 1 | 4.243 |

Nearest 3 labels: $\{0, 1, 0\}$

Majority class: $0$

So: $\hat{y} = 0$ for the test point $(3, 2)$.
That is exactly how KNN classification works.
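The worked example above can be reproduced with a few lines of numpy, using the same training points as the code earlier in this section:

```python
import numpy as np
from collections import Counter

X_train = np.array([[1, 1], [2, 1], [4, 3], [5, 4], [6, 5]], dtype=float)
y_train = np.array([0, 0, 1, 1, 1])
x = np.array([3, 2], dtype=float)

# Euclidean distance from the test point to every training point
distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
k_idx = np.argsort(distances)[:3]   # indices of the 3 nearest points
k_labels = y_train[k_idx]
pred = Counter(k_labels).most_common(1)[0][0]
print(np.round(distances, 3), k_labels, pred)
```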
KNN Classification Algorithm
Step 1: Store training data
Unlike linear models, KNN does not compute weights during training.
Code:
def fit(self, X, y):
    self.X_train = np.asarray(X, dtype=float)
    self.y_train = np.asarray(y)
Concept: Training in KNN simply means memorizing the dataset.
Step 2: Compute distance
Equation: $d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$
Code:
def _distance(self, a, b):
    return np.sqrt(np.sum((a - b) ** 2))
Concept: This measures how close a training sample is to the test sample.
Smaller distance means more similar.
Step 3: Rank neighbors
Code:
distances = [self._distance(x, x_train) for x_train in self.X_train]
k_idx = np.argsort(distances)[:self.k]
Concept:
- compute all distances
- sort them
- keep the indices of the nearest $k$
Step 4: Vote by majority
Code:
k_labels = self.y_train[k_idx]
return Counter(k_labels).most_common(1)[0][0]
Concept: Among the $k$ nearest neighbors, whichever class occurs most is chosen.

Equation: $\hat{y} = \text{mode}\{y_{(1)}, \ldots, y_{(k)}\}$
Concept -> Equation -> Code Mapping for Classification
1. Represent samples as points
Concept: Each row is a point in feature space.
Equation: $x = (x_1, x_2, \ldots, x_d)$
Code:
self.X_train = np.asarray(X, dtype=float)
2. Measure closeness
Concept: Similarity is measured using Euclidean distance.
Equation: $d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$
Code:
def _distance(self, a, b):
    return np.sqrt(np.sum((a - b) ** 2))
3. Select nearest neighbors
Concept: Prediction depends only on nearby training samples.
Code:
k_idx = np.argsort(distances)[:self.k]
4. Use majority vote
Concept: Classification chooses the most frequent class.
Equation: $\hat{y} = \text{mode}\{y_{(1)}, \ldots, y_{(k)}\}$
Code:
return Counter(k_labels).most_common(1)[0][0]
Why KNN Classification Works
The assumption is:
points that are close in feature space tend to have the same class
So instead of learning a global rule, KNN makes a local decision around the test point.
That is why it is called instance-based learning.
Part B: KNN Regression
What It Does
KNN regression follows the same steps as classification:
- compute distance to all training points
- choose the nearest $k$
- average their target values

Prediction rule:

$$\hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_{(i)}$$

where $y_{(i)}$ are the target values of the $k$ nearest neighbors.
Short and Clean KNN Regression Code
import numpy as np

class KNNRegressor:
    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y, dtype=float)

    def _distance(self, a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def _predict_one(self, x):
        distances = [self._distance(x, x_train) for x_train in self.X_train]
        k_idx = np.argsort(distances)[:self.k]
        k_values = self.y_train[k_idx]
        return np.mean(k_values)

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return np.array([self._predict_one(x) for x in X])

X_train = np.array([[1], [2], [3], [6], [7]], dtype=float)
y_train = np.array([30000, 35000, 40000, 70000, 75000], dtype=float)
X_test = np.array([[4]], dtype=float)

reg = KNNRegressor(k=3)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
print("Prediction:", pred)
Small Regression Example
Suppose training data is:
| YearsExperience | Salary |
|---|---|
| 1 | 30000 |
| 2 | 35000 |
| 3 | 40000 |
| 6 | 70000 |
| 7 | 75000 |
Test point: YearsExperience $= 4$

Take: $k = 3$
Solving the Regression Example Step by Step
We compute the distance from 4 to each training point.
Distance to 1
Distance to 2
Distance to 3
Distance to 6
Distance to 7
Sorted distances:
| Point | Target | Distance |
|---|---|---|
| 3 | 40000 | 1 |
| 2 | 35000 | 2 |
| 6 | 70000 | 2 |
| 1 | 30000 | 3 |
| 7 | 75000 | 3 |
Nearest 3 target values: $40000, 35000, 70000$
Prediction is their mean: $\hat{y} = \frac{40000 + 35000 + 70000}{3} = \frac{145000}{3} \approx 48333.33$
So the predicted salary is approximately $48333.33$.
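The arithmetic above can be checked in a few lines of numpy (a small standalone sketch, not part of the class):

```python
import numpy as np

# distances from the test point x = 4 to each training value
X_train = np.array([1, 2, 3, 6, 7])
y_train = np.array([30000, 35000, 40000, 70000, 75000])
dist = np.abs(X_train - 4)

# indices of the k = 3 nearest neighbors, then the mean of their targets
k_idx = np.argsort(dist)[:3]
pred = y_train[k_idx].mean()
print(round(float(pred), 2))  # -> 48333.33
```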
KNN Regression Algorithm
Step 1: Store training data
Code:
```python
def fit(self, X, y):
    self.X_train = np.asarray(X, dtype=float)
    self.y_train = np.asarray(y, dtype=float)
```
Concept: Just store examples and target values.
Step 2: Compute all distances
Equation: $d(x, x_i) = \sqrt{\sum_j (x_j - x_{ij})^2}$
Code:
```python
distances = [self._distance(x, x_train) for x_train in self.X_train]
```
Concept: Measure how close the new point is to all training samples.
Step 3: Pick the $k$ nearest
Code:
```python
k_idx = np.argsort(distances)[:self.k]
```
Concept: Keep the $k$ closest neighbors only.
Step 4: Average their target values
Equation: $\hat{y} = \frac{1}{k}\sum_{i=1}^{k} y_{(i)}$
Code:
```python
k_values = self.y_train[k_idx]
return np.mean(k_values)
```
Concept: Regression prediction is the average of nearby outputs.
Concept -> Equation -> Code Mapping for Regression
1. Use the same distance idea
Concept: Near points should have similar target values.
Equation: $d(a, b) = \sqrt{\sum_j (a_j - b_j)^2}$
Code:
```python
def _distance(self, a, b):
    return np.sqrt(np.sum((a - b) ** 2))
```
2. Find local neighborhood
Concept: Prediction uses local samples rather than a global fitted line.
Code:
```python
k_idx = np.argsort(distances)[:self.k]
```
3. Average local outputs
Concept: Regression uses the mean of neighbor target values.
Equation: $\hat{y} = \frac{1}{k}\sum_{i=1}^{k} y_{(i)}$
Code:
```python
return np.mean(k_values)
```
KNN Classification vs KNN Regression
| Aspect | KNN Classification | KNN Regression |
|---|---|---|
| Output type | class label | continuous value |
| Final rule | majority vote | mean of neighbors |
| Formula | mode of nearest labels | average of nearest targets |
Unified Intuition
KNN does not learn a global formula like $\hat{y} = \theta_0 + \theta_1 x$ or a fixed set of rules.
Instead, it says:
for this new point, let me look around nearby training points and decide locally
So every test sample gets its own local prediction rule.
That is why KNN is called:
- lazy learning
- instance-based learning
Choosing the Value of $k$
The value of $k$ controls smoothness.
Small $k$
- sensitive to noise
- more flexible
- may overfit
Large $k$
- smoother decision
- may underfit
- local details may be lost
A common rough rule of thumb: $k \approx \sqrt{n}$, where $n$ is the number of training samples.
But in practice, $k$ is usually chosen using validation.
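A minimal hold-out validation loop for choosing $k$ might look like this; the data, the split, and the candidate values of $k$ are all illustrative assumptions:

```python
import numpy as np

def knn_predict(X_tr, y_tr, x, k):
    # prediction = mean target of the k nearest training points
    idx = np.argsort(np.abs(X_tr - x))[:k]
    return y_tr[idx].mean()

# toy 1-D data: y = 5x plus noise (illustrative)
rng = np.random.default_rng(0)
X = np.arange(1.0, 21.0)
y = 5 * X + rng.normal(0, 2, size=X.shape)

# random hold-out split: 15 points for training, 5 for validation
perm = rng.permutation(len(X))
tr, val = perm[:15], perm[15:]

# pick the k with the lowest validation mean squared error
best_k, best_mse = None, float("inf")
for k in (1, 3, 5, 7):
    preds = np.array([knn_predict(X[tr], y[tr], x, k) for x in X[val]])
    mse = np.mean((preds - y[val]) ** 2)
    if mse < best_mse:
        best_k, best_mse = k, mse
print("best k:", best_k)
```

A k-fold cross-validation would be more robust, but the single hold-out split keeps the sketch short.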
Why Feature Scaling Matters
KNN depends entirely on distance. So features with larger numeric ranges dominate the distance.
Example:
- Age may range from 20 to 80
- Glucose may range from 0 to 200
Then Glucose can dominate Euclidean distance.
So scaling is often important, for example standardization: $x' = \frac{x - \mu}{\sigma}$
Without scaling, nearest neighbors may be misleading.
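A common fix is z-score standardization before computing distances; a minimal sketch with assumed Age and Glucose columns:

```python
import numpy as np

# two features on very different scales: Age and Glucose (illustrative values)
X = np.array([[25.0, 180.0],
              [60.0,  90.0],
              [30.0, 170.0]])

# z-score scaling: subtract each column's mean, divide by its std
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_scaled = (X - mu) / sigma

# after scaling, every column has mean 0 and std 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, both features contribute comparably to the Euclidean distance.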
Time Cost of KNN
KNN is cheap during training but expensive during prediction.
Training
- just store the data
Prediction
For every test point:
- compute distance to all training points
- sort them
- then predict
So prediction can be costly when the training set is large.
This is one of the main disadvantages of KNN.
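In practice the per-query cost is usually reduced by vectorizing the distance computation (and, beyond that, by index structures such as KD-trees or ball trees). This sketch shows that a single broadcasted numpy expression matches the per-point Python loop:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))  # 1000 training points, 2 features
x = np.array([0.5, -0.5])             # one query point

# loop version: one Python-level distance computation per training point
d_loop = np.array([np.sqrt(np.sum((x - p) ** 2)) for p in X_train])

# vectorized version: one broadcasted operation over all points at once
d_vec = np.sqrt(np.sum((X_train - x) ** 2, axis=1))

assert np.allclose(d_loop, d_vec)
```

Vectorization does not change the asymptotic cost, but it removes the Python interpreter overhead from the inner loop.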
Advantages of KNN
- very simple
- no training optimization needed
- easy to understand
- works for both classification and regression
- naturally handles multi-class classification
Limitations of KNN
- prediction is slow on large datasets
- sensitive to feature scale
- sensitive to irrelevant features
- choice of $k$ matters a lot
- can perform poorly in very high dimensions
This last issue is related to the curse of dimensionality.
Exam-Oriented Summary
Definition
KNN is an instance-based supervised learning algorithm that predicts a new sample using the nearest stored training samples.
Distance Formula
$d(x, x_i) = \sqrt{\sum_j (x_j - x_{ij})^2}$
Classification Rule
$\hat{y} = \operatorname{mode}(y_{(1)}, \dots, y_{(k)})$
Regression Rule
$\hat{y} = \frac{1}{k}\sum_{i=1}^{k} y_{(i)}$
Steps
- store training data
- compute distance from test point to all training points
- sort by distance
- select the $k$ nearest
- classify by majority vote or regress by mean
Very Short Revision
KNN Classification
- find the $k$ nearest
- take majority class
- output class label
KNN Regression
- find the $k$ nearest
- take mean of target values
- output continuous value
Main formula: $d(x, x_i) = \sqrt{\sum_j (x_j - x_{ij})^2}$
Final Takeaway
KNN is one of the simplest machine learning algorithms. It does not build a model during training, but predicts by comparing a new point to stored training examples. For classification, it uses the majority class of the nearest neighbors. For regression, it uses the mean target value of the nearest neighbors. Its simplicity makes it excellent for understanding local learning, but its prediction cost and sensitivity to scaling are important limitations.
Decision Tree (ID3 & C4.5)
Both ID3 and C4.5 are decision tree algorithms used for classification. They build a tree top-down by repeatedly choosing the best feature to split the dataset.
A decision tree has:
- root/internal nodes = feature tests
- branches = test outcomes
- leaf nodes = final class labels
The key difference is:
- ID3 chooses splits using Information Gain
- C4.5 improves ID3 by using Gain Ratio and can handle continuous features better
1. Decision Tree Idea
At each step, we want to ask the best question about the data. A good split should make the child groups more pure.
If a node contains both classes mixed together, it is impure. If a node contains only one class, it is pure.
Decision trees reduce impurity step by step until leaves are formed.
Part A: ID3
What ID3 Does
ID3 = Iterative Dichotomiser 3. It builds the tree recursively:
- compute entropy of current dataset
- compute information gain of every feature
- choose the feature with the highest information gain
- split the data on that feature
- repeat for each subset
ID3 mainly works best with categorical features.
ID3 Core Equations
Entropy
Entropy measures uncertainty in a dataset: $H(S) = -\sum_i p_i \log_2 p_i$, where:
- $S$ = current dataset
- $p_i$ = proportion of class $i$
For binary classification: $H(S) = -p \log_2 p - (1 - p) \log_2 (1 - p)$
Meaning:
- entropy = 0 -> perfectly pure
- high entropy -> classes are mixed
Information Gain
If we split dataset $S$ using feature $A$: $IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v)$, where:
- $S_v$ = subset of $S$ where feature $A$ takes value $v$
Meaning:
- higher information gain = better split
- ID3 chooses the feature with maximum information gain
Short and Clean ID3 Code
```python
import pandas as pd
import numpy as np

class ID3:
    def __init__(self):
        self.tree = None
        self.default_class = None

    def entropy(self, y):
        values, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p + 1e-12))

    def information_gain(self, X, y, feature):
        parent_entropy = self.entropy(y)
        values, counts = np.unique(X[feature], return_counts=True)
        child_entropy = 0
        for v, c in zip(values, counts):
            y_sub = y[X[feature] == v]
            child_entropy += (c / len(X)) * self.entropy(y_sub)
        return parent_entropy - child_entropy

    def best_feature(self, X, y):
        gains = {f: self.information_gain(X, y, f) for f in X.columns}
        return max(gains, key=gains.get)

    def build(self, X, y):
        if len(np.unique(y)) == 1:
            return y.iloc[0]
        if X.empty:
            return y.mode()[0]
        best = self.best_feature(X, y)
        tree = {best: {}}
        for v in X[best].unique():
            mask = X[best] == v
            X_sub = X.loc[mask].drop(columns=[best])
            y_sub = y.loc[mask]
            if len(X_sub) == 0:
                tree[best][v] = y.mode()[0]
            else:
                tree[best][v] = self.build(X_sub, y_sub)
        return tree

    def fit(self, X, y):
        self.default_class = y.mode()[0]
        self.tree = self.build(X, y)
        return self

    def _predict_one(self, x, tree):
        if not isinstance(tree, dict):
            return tree
        feature = next(iter(tree))
        value = x.get(feature)
        if value not in tree[feature]:
            return self.default_class
        return self._predict_one(x, tree[feature][value])

    def predict(self, X):
        return X.apply(lambda row: self._predict_one(row, self.tree), axis=1)
```
Small Example for ID3
We use a tiny categorical dataset:
```python
data = pd.DataFrame({
    "Weather": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak"],
    "Play": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"]
})
X = data[["Weather", "Wind"]]
y = data["Play"]

model = ID3().fit(X, y)
print("ID3 Tree:", model.tree)
print("Predictions:", model.predict(X).tolist())
```
Solving the ID3 Example Step by Step
Dataset target classes:
- Yes = 4
- No = 4
So root entropy: $H(S) = -\frac{4}{8}\log_2\frac{4}{8} - \frac{4}{8}\log_2\frac{4}{8} = 1.0$
So the root is maximally mixed.
Step 1: Try splitting on Weather
Possible values:
- Sunny
- Overcast
- Rain
Subsets:
- Sunny -> [No, No, No]
- Overcast -> [Yes, Yes]
- Rain -> [Yes, Yes, No]
Entropies:
Sunny subset
All are No, so: $H = 0$
Overcast subset
All are Yes, so: $H = 0$
Rain subset
2 Yes, 1 No: $H = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} \approx 0.918$
Weighted entropy after splitting on Weather: $\frac{3}{8}(0) + \frac{2}{8}(0) + \frac{3}{8}(0.918) \approx 0.344$
Information gain: $IG(\text{Weather}) = 1.0 - 0.344 = 0.656$
Step 2: Try splitting on Wind
Possible values:
- Weak
- Strong
Subsets:
- Weak -> [No, Yes, Yes, Yes, No]
- Strong -> [No, No, Yes]
Entropies:
Weak
3 Yes, 2 No: $H = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.971$
Strong
1 Yes, 2 No: $H = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} \approx 0.918$
Weighted entropy: $\frac{5}{8}(0.971) + \frac{3}{8}(0.918) \approx 0.951$
Information gain: $IG(\text{Wind}) = 1.0 - 0.951 = 0.049$
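These hand calculations can be verified numerically; this is a standalone sketch (the helper functions are my own, not methods of the ID3 class above):

```python
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, labels):
    # parent entropy minus the size-weighted entropy of each subset
    parent = entropy(labels)
    child = 0.0
    for v in np.unique(feature):
        mask = feature == v
        child += mask.mean() * entropy(labels[mask])
    return parent - child

weather = np.array(["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny"])
wind = np.array(["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak"])
play = np.array(["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"])

print(round(float(info_gain(weather, play)), 3))  # -> 0.656
print(round(float(info_gain(wind, play)), 3))     # -> 0.049
```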
Step 3: Choose the best feature
Since $IG(\text{Weather}) = 0.656 > IG(\text{Wind}) = 0.049$, ID3 chooses Weather as the root feature.
So the first tree becomes:
```
Weather
├── Sunny -> No
├── Overcast -> Yes
└── Rain -> split again
```
Step 4: Recurse on the Rain subset
Rain subset:
- Weak -> Yes
- Weak -> Yes
- Strong -> No
Now only one feature remains: Wind
If split on Wind:
- Weak -> all Yes
- Strong -> all No
So final tree:
```
Weather
├── Sunny -> No
├── Overcast -> Yes
└── Rain
    ├── Weak -> Yes
    └── Strong -> No
```
This is exactly how ID3 builds the tree recursively.
ID3 Concept -> Equation -> Code
1. Measure impurity
Concept: We first measure how mixed the labels are.
Equation: $H(S) = -\sum_i p_i \log_2 p_i$
Code:
```python
def entropy(self, y):
    values, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p + 1e-12))
```
2. Score each feature
Concept: Check how much uncertainty reduces after splitting on a feature.
Equation: $IG(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$
Code:
```python
def information_gain(self, X, y, feature):
    parent_entropy = self.entropy(y)
    values, counts = np.unique(X[feature], return_counts=True)
    child_entropy = 0
    for v, c in zip(values, counts):
        y_sub = y[X[feature] == v]
        child_entropy += (c / len(X)) * self.entropy(y_sub)
    return parent_entropy - child_entropy
```
3. Pick the best feature
Concept: Choose the feature with maximum information gain.
Code:
```python
def best_feature(self, X, y):
    gains = {f: self.information_gain(X, y, f) for f in X.columns}
    return max(gains, key=gains.get)
```
4. Build tree recursively
Concept: After choosing the best feature, split data and build smaller trees.
Code:
```python
def build(self, X, y):
    if len(np.unique(y)) == 1:
        return y.iloc[0]
    if X.empty:
        return y.mode()[0]
    best = self.best_feature(X, y)
    tree = {best: {}}
    for v in X[best].unique():
        mask = X[best] == v
        X_sub = X.loc[mask].drop(columns=[best])
        y_sub = y.loc[mask]
        tree[best][v] = self.build(X_sub, y_sub)
    return tree
```
Meaning:
- if node is pure -> make leaf
- if no features left -> return majority class
- else split and recurse
ID3 Pseudocode
```
ID3(D, features):
    if all labels same:
        return leaf
    if no features left:
        return majority class
    best = feature with max information gain
    create node(best)
    for each value v of best:
        recurse on subset where best = v
```
ID3 Advantages
- simple and easy to understand
- tree rules are interpretable
- good for categorical data
- useful in exam explanations because steps are very clear
ID3 Limitations
- biased toward features with many distinct values
- not naturally suited for continuous features
- can overfit
- sensitive to noise
Part B: C4.5
What C4.5 Does
C4.5 is an improved version of ID3. It fixes major weaknesses of ID3.
Main improvements:
- uses Gain Ratio instead of pure Information Gain
- handles continuous features
- can handle missing values better
- usually produces more practical trees
So:
- ID3 = older, simpler
- C4.5 = smarter extension of ID3
Why ID3 Needs Improvement
Information Gain can favor a feature with many distinct values.
Example:
- a feature like StudentID may split every row into its own branch
- this gives very high information gain
- but it does not generalize
So C4.5 divides Information Gain by a quantity called Split Information.
C4.5 Core Equations
Entropy
Same as ID3: $H(S) = -\sum_i p_i \log_2 p_i$
Information Gain
Same as ID3: $IG(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$
Split Information
$SplitInfo(S, A) = -\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$
Gain Ratio
$GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)}$
C4.5 chooses the feature with the highest gain ratio.
Short and Clean C4.5 Code
```python
import pandas as pd
import numpy as np

class C45:
    def __init__(self):
        self.tree = None
        self.default_class = None

    def entropy(self, y):
        values, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p + 1e-12))

    def information_gain(self, X, y, feature):
        parent_entropy = self.entropy(y)
        values, counts = np.unique(X[feature], return_counts=True)
        child_entropy = 0
        for v, c in zip(values, counts):
            y_sub = y[X[feature] == v]
            child_entropy += (c / len(X)) * self.entropy(y_sub)
        return parent_entropy - child_entropy

    def split_info(self, X, feature):
        values, counts = np.unique(X[feature], return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p + 1e-12))

    def gain_ratio(self, X, y, feature):
        ig = self.information_gain(X, y, feature)
        si = self.split_info(X, feature)
        return 0 if si == 0 else ig / si

    def best_feature(self, X, y):
        ratios = {f: self.gain_ratio(X, y, f) for f in X.columns}
        return max(ratios, key=ratios.get)

    def build(self, X, y):
        if len(np.unique(y)) == 1:
            return y.iloc[0]
        if X.empty:
            return y.mode()[0]
        best = self.best_feature(X, y)
        tree = {best: {}}
        for v in X[best].unique():
            mask = X[best] == v
            X_sub = X.loc[mask].drop(columns=[best])
            y_sub = y.loc[mask]
            if len(X_sub) == 0:
                tree[best][v] = y.mode()[0]
            else:
                tree[best][v] = self.build(X_sub, y_sub)
        return tree

    def fit(self, X, y):
        self.default_class = y.mode()[0]
        self.tree = self.build(X, y)
        return self

    def _predict_one(self, x, tree):
        if not isinstance(tree, dict):
            return tree
        feature = next(iter(tree))
        value = x.get(feature)
        if value not in tree[feature]:
            return self.default_class
        return self._predict_one(x, tree[feature][value])

    def predict(self, X):
        return X.apply(lambda row: self._predict_one(row, self.tree), axis=1)
```
Small Example for C4.5
We use a dataset with a high-cardinality feature to see why Gain Ratio helps:
```python
data = pd.DataFrame({
    "ID": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "Weather": ["Sunny", "Sunny", "Rain", "Rain", "Overcast", "Overcast"],
    "Play": ["No", "No", "Yes", "Yes", "Yes", "Yes"]
})
X = data[["ID", "Weather"]]
y = data["Play"]

model = C45().fit(X, y)
print("C4.5 Tree:", model.tree)
print("Predictions:", model.predict(X).tolist())
```
Solving the C4.5 Example Step by Step
Root labels:
- Yes = 4
- No = 2
Entropy: $H(S) = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} \approx 0.918$
Feature 1: ID
Every ID is unique. So each split contains exactly 1 sample. That means every child subset is pure.
Weighted child entropy: $0$. Thus: $IG(\text{ID}) = 0.918 - 0 = 0.918$
This looks perfect for ID3.
But it is misleading because ID is just a unique label.
Now compute split information. Since there are 6 equally-sized branches: $SplitInfo(\text{ID}) = -6 \cdot \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.585$
So gain ratio: $GainRatio(\text{ID}) = \frac{0.918}{2.585} \approx 0.355$
Feature 2: Weather
Values:
- Sunny -> [No, No]
- Rain -> [Yes, Yes]
- Overcast -> [Yes, Yes]
All child subsets are pure, so: $IG(\text{Weather}) = 0.918 - 0 = 0.918$
Split information: there are 3 equally-sized groups of size 2: $SplitInfo(\text{Weather}) = -3 \cdot \frac{1}{3}\log_2\frac{1}{3} = \log_2 3 \approx 1.585$
Gain ratio: $GainRatio(\text{Weather}) = \frac{0.918}{1.585} \approx 0.579$
Choose the best feature
Although both features have equal information gain, gain ratio is larger for Weather: $0.579 > 0.355$
So C4.5 correctly chooses Weather instead of ID.
This is the key improvement over ID3.
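The two gain ratios can be verified numerically; this is a standalone sketch (the helper functions are my own, not methods of the C45 class above):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels):
    # gain ratio = (parent entropy - weighted child entropy) / split information
    parent = entropy(labels)
    child, split_info = 0.0, 0.0
    for v in np.unique(feature):
        w = np.mean(feature == v)
        child += w * entropy(labels[feature == v])
        split_info -= w * np.log2(w)
    return (parent - child) / split_info

ids = np.array(["S1", "S2", "S3", "S4", "S5", "S6"])
weather = np.array(["Sunny", "Sunny", "Rain", "Rain", "Overcast", "Overcast"])
play = np.array(["No", "No", "Yes", "Yes", "Yes", "Yes"])

print(round(float(gain_ratio(ids, play)), 3))      # -> 0.355
print(round(float(gain_ratio(weather, play)), 3))  # -> 0.579
```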
C4.5 Concept -> Equation -> Code
1. Compute entropy
Concept: Measure uncertainty at the current node.
Equation: $H(S) = -\sum_i p_i \log_2 p_i$
Code:
```python
def entropy(self, y):
    values, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p + 1e-12))
```
2. Compute information gain
Concept: Measure reduction in entropy after split.
Equation: $IG(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$
Code:
```python
def information_gain(self, X, y, feature):
    parent_entropy = self.entropy(y)
    ...
    return parent_entropy - child_entropy
```
3. Compute split information
Concept: Measure how broadly the split divides the data.
Equation: $SplitInfo(S, A) = -\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$
Code:
```python
def split_info(self, X, feature):
    values, counts = np.unique(X[feature], return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p + 1e-12))
```
4. Compute gain ratio
Concept: Normalize information gain so features with too many distinct values are not unfairly favored.
Equation: $GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)}$
Code:
```python
def gain_ratio(self, X, y, feature):
    ig = self.information_gain(X, y, feature)
    si = self.split_info(X, feature)
    return 0 if si == 0 else ig / si
```
5. Choose best feature and recurse
Concept: Build the decision tree exactly like ID3, but use Gain Ratio instead of Information Gain.
Code:
```python
def best_feature(self, X, y):
    ratios = {f: self.gain_ratio(X, y, f) for f in X.columns}
    return max(ratios, key=ratios.get)
```
Handling Continuous Features in C4.5
A major improvement of C4.5 is continuous-value handling. For a numeric feature, C4.5 sorts the values and tries candidate thresholds $t$ (typically midpoints between consecutive values), turning the feature into a binary test $x \le t$ vs $x > t$, and chooses the threshold with the best gain ratio.
Example split: $\text{Glucose} \le 120$ vs $\text{Glucose} > 120$
So unlike ID3, C4.5 can naturally work with continuous attributes by converting them into binary threshold splits.
A simplified threshold idea in code would be:
```python
threshold = 120
left = X[feature] <= threshold
right = X[feature] > threshold
```
Then entropy, IG, and GR are computed for that split.
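The threshold search can be sketched as scanning the midpoints between consecutive sorted values and keeping the best split. For brevity this version scores splits by information gain (C4.5 itself uses gain ratio), and `best_threshold` is an illustrative helper, not part of the C45 class:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(values, labels):
    # candidate thresholds: midpoints between consecutive sorted unique values
    vs = np.unique(values)
    best_t, best_gain = None, -1.0
    for t in (vs[:-1] + vs[1:]) / 2:
        left = labels[values <= t]
        right = labels[values > t]
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = entropy(labels) - child
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# illustrative data: Glucose values with a clean class boundary near 120
glucose = np.array([85, 90, 110, 130, 150, 160])
label = np.array(["No", "No", "No", "Yes", "Yes", "Yes"])
t, g = best_threshold(glucose, label)
print(t, round(float(g), 3))  # -> 120.0 1.0
```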
ID3 vs C4.5
Main Difference
ID3 chooses the feature with maximum $IG$. C4.5 chooses the feature with maximum $GainRatio = \frac{IG}{SplitInfo}$.
Comparison Table
| Aspect | ID3 | C4.5 |
|---|---|---|
| Split criterion | Information Gain | Gain Ratio |
| Handles categorical features | Yes | Yes |
| Handles continuous features | Poorly / not directly | Yes |
| Bias toward many-valued features | High | Reduced |
| Missing values | Weak | Better |
| Complexity | Simpler | Slightly more advanced |
Full Combined Example
```python
import pandas as pd

data = pd.DataFrame({
    "Weather": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak"],
    "Play": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"]
})
X = data[["Weather", "Wind"]]
y = data["Play"]

id3_model = ID3().fit(X, y)
c45_model = C45().fit(X, y)
print("ID3 Tree:", id3_model.tree)
print("C4.5 Tree:", c45_model.tree)
print("ID3 Predictions:", id3_model.predict(X).tolist())
print("C4.5 Predictions:", c45_model.predict(X).tolist())
```
How the Tree Predicts
Suppose tree is:
```
Weather
├── Sunny -> No
├── Overcast -> Yes
└── Rain
    ├── Weak -> Yes
    └── Strong -> No
```
For a new row:
Weather = Rain, Wind = Weak
Path:
- check Weather
- move to the Rain branch
- check Wind
- Weak -> leaf = Yes
So the predicted class is: Yes
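This walk can be written directly against the nested-dict tree format produced by the ID3/C4.5 code above; `predict_one` is an illustrative helper:

```python
tree = {"Weather": {"Sunny": "No",
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}}}}

def predict_one(row, node):
    # descend until a leaf (a plain label, not a dict) is reached
    while isinstance(node, dict):
        feature = next(iter(node))
        node = node[feature][row[feature]]
    return node

print(predict_one({"Weather": "Rain", "Wind": "Weak"}, tree))  # -> Yes
```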
Why These Algorithms Work
Both algorithms work by repeatedly reducing uncertainty. They ask:
which feature makes the class labels as pure as possible after splitting?
ID3 answers this using information gain. C4.5 answers this using gain ratio.
The recursion stops when:
- node becomes pure
- no features remain
- or no useful split is possible
So a large classification problem becomes a sequence of small rule-based decisions.
Advantages of ID3 and C4.5
- easy to interpret
- produces human-readable rules
- no heavy math during prediction
- useful for categorical classification tasks
- good for exam answers because the logic is visual and recursive
Limitations
ID3
- biased toward attributes with many values
- struggles with continuous data
- can overfit
- sensitive to noise
C4.5
- more complex than ID3
- deeper trees can still overfit without pruning
- threshold search for continuous features adds computation
Exam-Oriented Summary
ID3 Definition
ID3 is a top-down greedy decision tree algorithm that chooses the feature with the highest information gain at each step.
C4.5 Definition
C4.5 is an improved decision tree algorithm that extends ID3 by using gain ratio and supporting continuous features.
ID3 Formula
Entropy: $H(S) = -\sum_i p_i \log_2 p_i$
Information Gain: $IG(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$
C4.5 Formula
Split Information: $SplitInfo(S, A) = -\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$
Gain Ratio: $GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)}$
Common Steps
- compute impurity of current dataset
- score every feature
- choose best feature
- split dataset
- repeat recursively
- stop when node is pure or features are exhausted
Very Short Revision
ID3
- uses entropy
- computes information gain
- picks feature with max gain
- recursive tree construction
C4.5
- starts like ID3
- adds split information
- uses gain ratio
- handles continuous features better
Final Takeaway
ID3 and C4.5 both build decision trees by recursively selecting the best splitting feature. ID3 uses Information Gain, while C4.5 improves it using Gain Ratio to avoid unfair preference for features with many distinct values. So:
- use ID3 to understand the core decision tree idea
- use C4.5 as the more practical and improved version
Artificial Neural Network (ANN)
An Artificial Neural Network is a model made of layers of neurons. A basic ANN has:
- input layer
- hidden layer(s)
- output layer
- weights and biases
- activation functions
A neuron computes: $z = \sum_j w_j x_j + b$ and then applies an activation function: $a = f(z)$
For this example, we build a 2-layer neural network for binary classification:
- hidden layer uses ReLU
- output layer uses Sigmoid
The final output is a probability: $\hat{y} = \sigma(z) \in (0, 1)$, and the prediction is: class $1$ if $\hat{y} \ge 0.5$, else class $0$.
What This Network Learns
For input $x$, the network computes: $Z_1 = W_1 x + b_1$, $A_1 = \mathrm{ReLU}(Z_1)$, $Z_2 = W_2 A_1 + b_2$, $A_2 = \sigma(Z_2)$
Where:
- $Z_1, Z_2$ = linear outputs
- $A_1$ = hidden layer activations
- $A_2$ = final predicted probability
Short and Clean Code
```python
import numpy as np

class SimpleANN:
    def __init__(self, input_size, hidden_size, lr=0.1, epochs=10000):
        np.random.seed(42)
        self.lr = lr
        self.epochs = epochs
        self.W1 = np.random.randn(hidden_size, input_size) * 0.1
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(1, hidden_size) * 0.1
        self.b2 = np.zeros((1, 1))
        self.costs = []

    def relu(self, Z):
        return np.maximum(0, Z)

    def relu_deriv(self, Z):
        return (Z > 0).astype(float)

    def sigmoid(self, Z):
        Z = np.clip(Z, -500, 500)
        return 1 / (1 + np.exp(-Z))

    def forward(self, X):
        Z1 = self.W1 @ X + self.b1
        A1 = self.relu(Z1)
        Z2 = self.W2 @ A1 + self.b2
        A2 = self.sigmoid(Z2)
        cache = (X, Z1, A1, Z2, A2)
        return A2, cache

    def compute_cost(self, Y, A2):
        eps = 1e-9
        A2 = np.clip(A2, eps, 1 - eps)
        return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

    def backward(self, Y, cache):
        X, Z1, A1, Z2, A2 = cache
        m = X.shape[1]
        dZ2 = A2 - Y
        dW2 = (dZ2 @ A1.T) / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (self.W2.T @ dZ2) * self.relu_deriv(Z1)
        dW1 = (dZ1 @ X.T) / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        return dW1, db1, dW2, db2

    def fit(self, X, Y):
        for i in range(self.epochs):
            A2, cache = self.forward(X)
            cost = self.compute_cost(Y, A2)
            dW1, db1, dW2, db2 = self.backward(Y, cache)
            self.W1 -= self.lr * dW1
            self.b1 -= self.lr * db1
            self.W2 -= self.lr * dW2
            self.b2 -= self.lr * db2
            self.costs.append(cost)
            if i % 1000 == 0:
                print(f"Epoch {i}: Cost = {cost:.6f}")

    def predict(self, X):
        A2, _ = self.forward(X)
        return (A2 >= 0.5).astype(int)

X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 0, 0, 1]], dtype=float)

model = SimpleANN(input_size=2, hidden_size=4, lr=0.1, epochs=10000)
model.fit(X, Y)
pred = model.predict(X)
print("Predictions:", pred)
```
Dataset Used: AND Gate
The network is trained on the AND truth table:

| $x_1$ | $x_2$ | $y$ |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
Input matrix:
```python
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
```
Target matrix:
```python
Y = np.array([[0, 0, 0, 1]], dtype=float)
```
Shape meaning:
- $X$ has shape $(2, 4)$
- 2 input features
- 4 training examples
So each column of $X$ is one training example.
Network Architecture
This ANN has:
- 2 input neurons
- 4 hidden neurons
- 1 output neuron
So parameter shapes are:
- $W_1$: $(4, 2)$, $b_1$: $(4, 1)$
- $W_2$: $(1, 4)$, $b_2$: $(1, 1)$
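These shapes can be confirmed with a few standalone lines that mirror the initialization in the class above:

```python
import numpy as np

input_size, hidden_size = 2, 4
W1 = np.random.randn(hidden_size, input_size) * 0.1  # (4, 2)
b1 = np.zeros((hidden_size, 1))                      # (4, 1)
W2 = np.random.randn(1, hidden_size) * 0.1           # (1, 4)
b2 = np.zeros((1, 1))                                # (1, 1)

X = np.zeros((2, 4))  # 2 features, 4 examples: one column per example
Z1 = W1 @ X + b1                   # (4, 4): hidden outputs, one column per example
Z2 = W2 @ np.maximum(0, Z1) + b2   # (1, 4): one output score per example

print(Z1.shape, Z2.shape)  # -> (4, 4) (1, 4)
```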
Step-by-Step Algorithm
Step 1: Initialize weights and biases
We begin with small random weights and zero biases.
Code:
```python
self.W1 = np.random.randn(hidden_size, input_size) * 0.1
self.b1 = np.zeros((hidden_size, 1))
self.W2 = np.random.randn(1, hidden_size) * 0.1
self.b2 = np.zeros((1, 1))
```
Concept:
- weights decide how strongly neurons influence the next layer
- biases shift the activation
- small random values break symmetry
- if all weights start the same, neurons learn the same thing
Why not large random weights:
- large values can make training unstable
- small values help smoother learning at the start
Step 2: Hidden layer linear transformation
Each hidden neuron computes a weighted sum of the inputs: $Z_1 = W_1 X + b_1$
Code:
```python
Z1 = self.W1 @ X + self.b1
```
Concept: This is the weighted sum of inputs plus bias.
For one hidden neuron: $z = w_1 x_1 + w_2 x_2 + b$
Since there are 4 hidden neurons, this is done 4 times in parallel.
Step 3: Apply ReLU activation
ReLU function is: $\mathrm{ReLU}(z) = \max(0, z)$
Code:
```python
A1 = self.relu(Z1)
```
and:
```python
def relu(self, Z):
    return np.maximum(0, Z)
```
Concept:
- negative values become 0
- positive values remain unchanged
Why ReLU:
- introduces non-linearity
- lets the network learn more complex patterns
- simple and efficient
Without activation, multiple layers would collapse into just one linear transformation.
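This collapse is easy to demonstrate: two stacked linear layers with no activation are equivalent to a single linear layer whose weight matrix is the product of the two:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 2))
W2 = rng.normal(size=(1, 4))
x = rng.normal(size=(2, 1))

# two stacked linear layers with no activation in between...
two_layers = W2 @ (W1 @ x)

# ...equal one linear layer with the combined weight W2 @ W1
one_layer = (W2 @ W1) @ x

assert np.allclose(two_layers, one_layer)
```

A nonlinearity such as ReLU between the layers breaks this equivalence, which is what gives depth its extra expressive power.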
Step 4: Output layer linear transformation
Now the hidden activations are passed to the output neuron: $Z_2 = W_2 A_1 + b_2$
Code:
```python
Z2 = self.W2 @ A1 + self.b2
```
Concept: This combines the hidden-layer outputs into one final score.
Step 5: Apply Sigmoid to get probability
Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Code:
```python
A2 = self.sigmoid(Z2)
```
and:
```python
def sigmoid(self, Z):
    Z = np.clip(Z, -500, 500)
    return 1 / (1 + np.exp(-Z))
```
Concept:
- converts raw score into probability
- output is between 0 and 1
- suitable for binary classification
Meaning: $A_2 \approx P(y = 1 \mid x)$
Why clip is used:
- prevents overflow in exp
- improves numerical stability
Step 6: Compute the cost
For binary classification, we use binary cross-entropy loss: $J = -\frac{1}{m}\sum_{j=1}^{m}\left[y^{(j)}\log \hat{y}^{(j)} + (1 - y^{(j)})\log(1 - \hat{y}^{(j)})\right]$
Code:
```python
def compute_cost(self, Y, A2):
    eps = 1e-9
    A2 = np.clip(A2, eps, 1 - eps)
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
```
Concept:
- if actual label is 1, we want output close to 1
- if actual label is 0, we want output close to 0
- confident wrong predictions get heavily penalized
Why clip again:
- avoids $\log(0)$, which is undefined
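The "confident wrong predictions are heavily penalized" behaviour is visible directly from the loss values; a small standalone sketch:

```python
import numpy as np

def bce(y, p):
    # binary cross-entropy for a single prediction p against label y
    eps = 1e-9
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(round(float(bce(1, 0.9)), 3))  # confident and right -> 0.105
print(round(float(bce(1, 0.1)), 3))  # confident and wrong -> 2.303
```

The loss grows without bound as a confident prediction moves toward the wrong label, which is exactly the pressure that drives the weights toward correct outputs.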
Step 7: Backpropagation for output layer
The error at the output layer is: $dZ_2 = A_2 - Y$
Code:
```python
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
```
Equations: $dW_2 = \frac{1}{m}\, dZ_2 A_1^T$, $db_2 = \frac{1}{m}\sum_{j} dZ_2^{(j)}$
Concept: This tells how much the output weights and bias contributed to the error.
Step 8: Backpropagation for hidden layer
The hidden layer error is: $dZ_1 = (W_2^T dZ_2) \odot \mathrm{ReLU}'(Z_1)$
Code:
```python
dZ1 = (self.W2.T @ dZ2) * self.relu_deriv(Z1)
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m
```
Equations: $dW_1 = \frac{1}{m}\, dZ_1 X^T$, $db_1 = \frac{1}{m}\sum_{j} dZ_1^{(j)}$
ReLU derivative: $\mathrm{ReLU}'(z) = 1$ if $z > 0$, else $0$
Code:
```python
def relu_deriv(self, Z):
    return (Z > 0).astype(float)
```
Concept:
- output error is sent backward into the hidden layer
- only active ReLU neurons pass gradient
- this is how the network learns internal representations
Step 9: Update parameters
Gradient descent update rule: $\theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta}$
Code:
```python
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
```
Concept:
- move parameters in the direction that reduces loss
- repeat this many times
- gradually improve predictions
Here:
- $\alpha$ is the learning rate (`lr` in the code)
- a higher learning rate updates faster, but may overshoot
- a lower learning rate is safer, but slower
Step 10: Make predictions
After training, the network outputs probabilities. Convert them into classes using threshold 0.5:
Code:
```python
def predict(self, X):
    A2, _ = self.forward(X)
    return (A2 >= 0.5).astype(int)
```
Concept -> Equation -> Code Mapping
1. Weighted input
Concept: Each neuron forms a weighted sum of inputs.
Equation: $z = \sum_j w_j x_j + b$
Code:
```python
Z1 = self.W1 @ X + self.b1
Z2 = self.W2 @ A1 + self.b2
```
2. Non-linearity
Concept: Activation functions allow the network to learn beyond straight-line relationships.
Equations: $\mathrm{ReLU}(z) = \max(0, z)$, $\sigma(z) = \frac{1}{1 + e^{-z}}$
Code:
```python
A1 = self.relu(Z1)
A2 = self.sigmoid(Z2)
```
3. Forward propagation
Concept: Data flows from input to hidden to output.
Equations: $Z_1 = W_1 X + b_1$, $A_1 = \mathrm{ReLU}(Z_1)$, $Z_2 = W_2 A_1 + b_2$, $A_2 = \sigma(Z_2)$
Code:
```python
def forward(self, X):
    Z1 = self.W1 @ X + self.b1
    A1 = self.relu(Z1)
    Z2 = self.W2 @ A1 + self.b2
    A2 = self.sigmoid(Z2)
```
4. Loss measurement
Concept: We need to measure how wrong predictions are.
Equation: $J = -\frac{1}{m}\sum_{j}\left[y^{(j)}\log \hat{y}^{(j)} + (1 - y^{(j)})\log(1 - \hat{y}^{(j)})\right]$
Code:
```python
cost = self.compute_cost(Y, A2)
```
5. Error propagation backward
Concept: The network computes gradients layer by layer from output back to input.
Equations: $dZ_2 = A_2 - Y$, $dW_2 = \frac{1}{m}\, dZ_2 A_1^T$, $dZ_1 = (W_2^T dZ_2) \odot \mathrm{ReLU}'(Z_1)$, $dW_1 = \frac{1}{m}\, dZ_1 X^T$
Code:
```python
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
dZ1 = (self.W2.T @ dZ2) * self.relu_deriv(Z1)
dW1 = (dZ1 @ X.T) / m
```
6. Learning
Concept: Use gradients to improve parameters.
Equation: $\theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta}$
Code:
```python
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
```
Solving the AND Gate Example
The AND gate outputs 1 only when both inputs are 1:
During training:
- the network starts with random weights
- predictions are poor at first
- after many epochs, weights and biases adjust
- the cost decreases
- final outputs approach the correct AND values
Expected final prediction: $[0, 0, 0, 1]$
Code:
```python
pred = model.predict(X)
print("Predictions:", pred)
```
If training succeeds, output becomes:
```
Predictions: [[0 0 0 1]]
```
One Forward Pass Example
Suppose for one sample: $x = (1, 1)$
Assume one hidden neuron has (illustrative values): $w = (0.5, 0.5)$, $b = -0.2$
Then: $z = 0.5 \cdot 1 + 0.5 \cdot 1 - 0.2 = 0.8$
Apply ReLU: $a = \max(0, 0.8) = 0.8$
Then the output neuron combines the hidden activations and passes the result through the sigmoid. If the final output score is $z_2 = 2.0$, then: $\sigma(2.0) \approx 0.88$
Since $0.88 \ge 0.5$, the prediction is: class $1$
This is how the ANN converts inputs into a class decision.
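A pass like this can be checked numerically; all the weights and scores below are illustrative assumptions, not learned values:

```python
import numpy as np

# illustrative numbers: input x = [1, 1], one hidden neuron with
# weights [0.5, 0.5] and bias -0.2 (assumed for demonstration)
x = np.array([1.0, 1.0])
w_hidden, b_hidden = np.array([0.5, 0.5]), -0.2

z1 = w_hidden @ x + b_hidden  # 0.5 + 0.5 - 0.2 = 0.8
a1 = max(0.0, float(z1))      # ReLU keeps the positive value

# assumed final output-layer score for this sample
z2 = 2.0
a2 = 1 / (1 + np.exp(-z2))    # sigmoid squashes the score into (0, 1)

pred = int(a2 >= 0.5)
print(round(float(z1), 2), round(float(a2), 2), pred)  # -> 0.8 0.88 1
```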
Why ANN Works
A neural network works because:
- weights learn which inputs matter
- biases shift decision boundaries
- activation functions add non-linearity
- backpropagation tells each parameter how it contributed to the error
- gradient descent improves the parameters repeatedly
So the network gradually learns a function that maps input to output.
Why Hidden Layers Matter
A single linear model can only learn a linear boundary. A hidden layer with activation allows:
- combinations of features
- piecewise linear transformations
- more expressive decision boundaries
Even though AND is simple, this example demonstrates the full learning pipeline of an ANN.
Cost Curve Meaning
The printed cost every 1000 epochs tells whether learning is working.
If cost decreases:
- predictions are improving
- gradients are useful
- parameter updates are moving in the correct direction
If cost does not decrease:
- learning rate may be wrong
- architecture may be unsuitable
- initialization may be poor
Practical Notes
1. Initialization matters
Bad initialization can slow or break learning.
2. Learning rate matters
If learning rate is:
- too high -> unstable training
- too low -> very slow training
3. Activation choice matters
- ReLU is common in hidden layers
- Sigmoid is common for binary output
4. More layers increase capacity
Deeper networks can learn more complex patterns, but are also harder to train.
Exam-Oriented Summary
Definition
An ANN is a layered network of neurons that learns by adjusting weights and biases using backpropagation and gradient descent.
Architecture Used
- 2 input neurons
- 1 hidden layer with 4 neurons
- 1 output neuron
Important Equations
Hidden layer: $Z_1 = W_1 X + b_1$, $A_1 = \mathrm{ReLU}(Z_1)$
Output layer: $Z_2 = W_2 A_1 + b_2$, $A_2 = \sigma(Z_2)$
Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
ReLU: $\mathrm{ReLU}(z) = \max(0, z)$
Loss: $J = -\frac{1}{m}\sum_{j}\left[y^{(j)}\log \hat{y}^{(j)} + (1 - y^{(j)})\log(1 - \hat{y}^{(j)})\right]$
Gradient descent: $\theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta}$
Training Steps
- initialize parameters
- perform forward propagation
- compute loss
- perform backpropagation
- update parameters
- repeat for many epochs
Very Short Revision
- input passes through weights and bias
- ReLU activates hidden layer
- sigmoid gives output probability
- cross-entropy measures error
- backpropagation computes gradients
- gradient descent updates weights
- repeat until cost decreases and predictions improve
Final Takeaway
This ANN from scratch shows the complete neural-network learning process:
- forward propagation computes predictions
- loss measures error
- backpropagation computes gradients
- gradient descent updates parameters
For the AND gate dataset, the network learns the correct truth table ($[0, 0, 0, 1]$), which shows that it has successfully learned the mapping from inputs to outputs.