
Logistic Regression

Logistic Regression is a binary classification algorithm used when the output belongs to one of two classes. It does not predict the class directly using a line. Instead, it predicts a probability using the sigmoid function, then converts that probability into class 0 or 1. The model is:

$$\hat{p} = \sigma(z), \qquad z = \mathbf{w}^\top \mathbf{x} + b, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

where:

  • $z = \mathbf{w}^\top \mathbf{x} + b$ = linear score
  • $\sigma$ = sigmoid function
  • $\hat{p} = \sigma(z)$ = predicted probability that the class is 1

If $\hat{p} \ge 0.5$, the model predicts class 1; otherwise it predicts class 0.


Main Idea

Linear Regression gives any real number as output, but classification needs a value between 0 and 1. So Logistic Regression first computes a linear combination:

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$

then applies the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This maps any real number into the interval $(0, 1)$, so the output can be interpreted as a probability.


Short and Clean Code

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class LogisticRegressionScratch:
    def __init__(self, lr=0.1, epochs=1000):
        self.lr = lr
        self.epochs = epochs
        self.w = None
        self.b = 0.0
        self.loss_history = []

    def _sigmoid(self, z):
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def _loss(self, y, p):
        eps = 1e-9
        p = np.clip(p, eps, 1 - eps)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        m, n = X.shape
        self.w = np.zeros(n)

        for _ in range(self.epochs):
            z = X @ self.w + self.b
            p = self._sigmoid(z)

            dw = (X.T @ (p - y)) / m
            db = np.mean(p - y)

            self.w -= self.lr * dw
            self.b -= self.lr * db

            self.loss_history.append(self._loss(y, p))
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)

np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegressionScratch(lr=0.1, epochs=1000)
model.fit(X_train, y_train)

pred = model.predict(X_test)
acc = np.mean(pred == y_test)

print("Weights:", np.round(model.w, 4))
print("Bias:", round(model.b, 4))
print("Accuracy:", round(acc, 4))

plt.plot(model.loss_history)
plt.title("Loss Convergence")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.grid(True)
plt.show()

What This Code Does

This example creates 200 two-dimensional points with $x_1, x_2 \in [0, 10)$ and labels them using the rule $y = 1$ if $x_1 + x_2 > 10$, else $y = 0$. So the true decision boundary is the line $x_1 + x_2 = 10$. This is a binary classification problem.


Step-by-Step Algorithm

Step 1: Create the dataset

Code:

np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)

Concept:

  • X contains 200 samples
  • each sample has 2 features: $x_1$ and $x_2$
  • the class label depends on whether the sum is greater than 10

Equation:

$$y = \begin{cases} 1 & \text{if } x_1 + x_2 > 10 \\ 0 & \text{otherwise} \end{cases}$$

Meaning:

  • points above the line $x_1 + x_2 = 10$ belong to class 1
  • points below the line belong to class 0
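The labeling rule can be verified directly on the generated data (a quick sketch reusing the same seed as the example):

```python
import numpy as np

np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)

# Every label agrees with the geometric rule x1 + x2 > 10
assert np.array_equal(y, (X.sum(axis=1) > 10).astype(int))

# Since x1 + x2 is symmetric around 10, roughly half the
# points fall on each side of the line
print("class 1 fraction:", y.mean())
```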

Step 2: Split into train and test sets

Code:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Concept:

  • training data is used to learn parameters
  • test data is used to check performance on unseen data

Here:

  • 80% for training
  • 20% for testing

Step 3: Standardize features

Code:

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Concept: Gradient descent works better when features are on similar scales.

Standardization formula:

$$x' = \frac{x - \mu}{\sigma}$$

where:

  • $\mu$ = mean of the feature
  • $\sigma$ = standard deviation of the feature

Why it helps:

  • faster convergence
  • more stable updates
  • one feature does not dominate another due to large magnitude
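The transform can be reproduced by hand and checked against `StandardScaler` (a small sketch; assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((100, 2)) * 10

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Manual version of x' = (x - mu) / sigma, applied per feature (column)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
assert np.allclose(X_scaled, X_manual)

# After scaling, each feature has mean ~0 and std ~1
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```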

Step 4: Compute the linear score

For each sample, Logistic Regression first computes:

$$z = w_1 x_1 + w_2 x_2 + b$$

In vector form:

$$z = X\mathbf{w} + b$$

Code:

z = X @ self.w + self.b

Concept: This is the same linear part used in linear models. But here it is not the final output. It is only the input to the sigmoid function.


Step 5: Apply sigmoid to get probability

Equation:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Code:

p = self._sigmoid(z)

Concept: The sigmoid compresses any real number into a value between 0 and 1.

Examples:

  • if $z = 0$: $\sigma(0) = 0.5$
  • if $z$ is large and positive, the probability is close to 1
  • if $z$ is large and negative, the probability is close to 0

So:

$$0 < \sigma(z) < 1$$

Step 6: Measure error using cross-entropy loss

For Logistic Regression, we do not use mean squared error. We use cross-entropy loss:

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

Code:

def _loss(self, y, p):
    eps = 1e-9
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

Concept:

  • if the actual class is 1, we want $p$ close to 1
  • if the actual class is 0, we want $p$ close to 0
  • wrong, confident predictions are penalized heavily

Why clipping is used:

  • log(0) is undefined
  • so probabilities are clipped slightly away from 0 and 1
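Both properties can be seen numerically (a small sketch of the same loss function; the true label here is 1):

```python
import numpy as np

def bce(y, p):
    # Binary cross-entropy with clipping so log(0) never occurs
    eps = 1e-9
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0])

# The more confidently wrong the prediction, the larger the loss
for p in [0.9, 0.5, 0.1, 0.01]:
    print(p, round(float(bce(y, np.array([p]))), 4))

# Without clipping, p = 0 for a true label of 1 would mean log(0);
# clipping keeps the loss large but finite
print(float(bce(y, np.array([0.0]))))
```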

Step 7: Compute gradients

To reduce loss, we update weights and bias using gradient descent.

Gradient formulas:

$$\frac{\partial L}{\partial \mathbf{w}} = \frac{1}{m} X^\top (\mathbf{p} - \mathbf{y}), \qquad \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)$$
Code:

dw = (X.T @ (p - y)) / m
db = np.mean(p - y)

Concept:

  • dw tells how weights should change
  • db tells how bias should change
  • if prediction is too large, parameters are pushed downward
  • if prediction is too small, parameters are pushed upward
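The analytic gradients can be checked against a finite-difference approximation of the loss (a sketch; the names here are local to this example):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def loss(w, b, X, y):
    p = np.clip(sigmoid(X @ w + b), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(float)
w, b = np.array([0.3, -0.2]), 0.1
m = len(y)

# Analytic gradients from the formulas above
p = sigmoid(X @ w + b)
dw = X.T @ (p - y) / m
db = np.mean(p - y)

# Central finite-difference check on the first weight
h = 1e-6
num = (loss(w + np.array([h, 0.0]), b, X, y)
       - loss(w - np.array([h, 0.0]), b, X, y)) / (2 * h)
print(dw[0], num)
assert abs(dw[0] - num) < 1e-6
```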

Step 8: Update parameters

Gradient descent update rule:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \, d\mathbf{w}, \qquad b \leftarrow b - \eta \, db$$

where $\eta$ is the learning rate.

Code:

self.w -= self.lr * dw
self.b -= self.lr * db

Concept:

  • move parameters in the direction that reduces loss
  • repeat many times until learning stabilizes

Step 9: Convert probabilities to classes

After training, predicted probability is converted to class label.

Rule:

$$\hat{y} = \begin{cases} 1 & \text{if } p \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$$
Code:

return (self.predict_proba(X) >= 0.5).astype(int)

Concept:

  • probabilities are continuous
  • classification needs discrete labels

Step 10: Measure accuracy

Code:

pred = model.predict(X_test)
acc = np.mean(pred == y_test)

Equation:

$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$
Concept: Accuracy tells what fraction of test samples were classified correctly.


Concept -> Equation -> Code Mapping

1. Model parameters

Concept: The model must learn weights and bias.

Equation:

$$\mathbf{w} = \mathbf{0} \in \mathbb{R}^n, \qquad b = 0$$

Code:

self.w = np.zeros(n)
self.b = 0.0

Meaning:

  • start with all weights as 0
  • start with bias as 0

2. Probability model

Concept: Turn linear score into probability.

Equation:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Code:

def _sigmoid(self, z):
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))

Why clip:

  • avoids overflow for very large positive or negative values

3. Forward pass

Concept: Compute predictions from current parameters.

Equation:

$$z = X\mathbf{w} + b, \qquad \mathbf{p} = \sigma(z)$$

Code:

z = X @ self.w + self.b
p = self._sigmoid(z)

Meaning:

  • z = raw score
  • p = predicted probability

4. Loss calculation

Concept: See how wrong the predictions are.

Equation:

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

Code:

self.loss_history.append(self._loss(y, p))

Meaning:

  • each iteration stores loss
  • useful for checking whether training is improving

5. Backward pass

Concept: Find how parameters affect the loss.

Equation:

$$d\mathbf{w} = \frac{1}{m} X^\top (\mathbf{p} - \mathbf{y}), \qquad db = \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)$$

Code:

dw = (X.T @ (p - y)) / m
db = np.mean(p - y)

Meaning: These gradients guide the update step.


6. Learning step

Concept: Improve the model gradually.

Equation:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \, d\mathbf{w}, \qquad b \leftarrow b - \eta \, db$$

Code:

self.w -= self.lr * dw
self.b -= self.lr * db

Meaning: Repeated updates make the model better at classification.


Worked Example on One Sample

Suppose after some training, for one sample (the numbers below are purely illustrative):

$$\mathbf{x} = (1.0,\ 2.0), \qquad \mathbf{w} = (0.8,\ 0.5), \qquad b = -1.0$$

Step 1: Compute score

$$z = 0.8 \cdot 1.0 + 0.5 \cdot 2.0 - 1.0 = 0.8$$

Step 2: Apply sigmoid

$$p = \sigma(0.8) = \frac{1}{1 + e^{-0.8}} \approx 0.69$$

Step 3: Classify

Since:

$$p \approx 0.69 \ge 0.5$$

prediction is:

$$\hat{y} = 1$$

So this sample is classified as class 1.
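The same hand calculation can be reproduced in code (using the illustrative values above):

```python
import numpy as np

x = np.array([1.0, 2.0])
w = np.array([0.8, 0.5])
b = -1.0

z = w @ x + b                 # linear score: 0.8
p = 1 / (1 + np.exp(-z))      # sigmoid: approximately 0.69
pred = int(p >= 0.5)          # threshold at 0.5 -> class 1

print(z, round(p, 4), pred)
```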


Why Logistic Regression Works

The model learns a boundary where the probability changes from class 0 to class 1. For two features, the decision boundary is:

$$w_1 x_1 + w_2 x_2 + b = 0$$

Because:

$$\sigma(z) = 0.5 \quad \text{exactly when} \quad z = 0$$

So:

  • if $z > 0$, the class tends toward 1
  • if $z < 0$, the class tends toward 0

This creates a linear decision boundary.

For this dataset, the true labels come from $x_1 + x_2 > 10$, so Logistic Regression is a good fit because the classes are separable by a line.
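The "probability is exactly 0.5 on the boundary" claim holds for any parameters, which a short sketch can confirm (the weights here are hypothetical, chosen only for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical learned parameters (illustration only, not trained values)
w = np.array([1.2, 1.1])
b = -11.5

# Pick any x1, then solve w1*x1 + w2*x2 + b = 0 for x2
# so the point lies exactly on the decision boundary
x1 = 4.0
x2 = -(w[0] * x1 + b) / w[1]

p = sigmoid(w @ np.array([x1, x2]) + b)
print(round(float(p), 6))   # 0.5: on the boundary, z = 0
```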


What the Loss Curve Means

Code:

plt.plot(model.loss_history)
plt.title("Loss Convergence")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.grid(True)
plt.show()

Concept:

  • at the beginning, loss is high
  • during learning, loss should decrease
  • a downward curve means gradient descent is working

If the curve:

  • decreases smoothly -> learning is stable
  • oscillates wildly -> learning rate may be too high
  • decreases very slowly -> learning rate may be too low

Practical Notes

1. Feature scaling matters

Because Logistic Regression uses gradient descent, features with large values can slow down convergence. That is why standardization, $x' = (x - \mu)/\sigma$, is important.

2. Learning rate matters

If learning rate is:

  • too high -> training may diverge
  • too low -> training becomes very slow

Typical starting values: $\eta = 0.001$, $0.01$, or $0.1$.
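The effect of the learning rate can be demonstrated by running the same training loop with different values of $\eta$ (a sketch using a tiny logistic regression on synthetic data):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def final_loss(lr, epochs=200):
    rng = np.random.default_rng(42)
    X = rng.random((200, 2)) * 2 - 1          # already roughly standardized
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    w, b = np.zeros(2), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)      # gradient descent step
        b -= lr * np.mean(p - y)
    p = np.clip(sigmoid(X @ w + b), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# A tiny rate barely moves the loss in the same number of epochs
print("lr=0.0001:", final_loss(0.0001))
print("lr=0.1:   ", final_loss(0.1))
```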

3. Linear boundary limitation

Logistic Regression assumes the boundary is linear:

$$\mathbf{w}^\top \mathbf{x} + b = 0$$

If the data is non-linear, you may need:

  • feature engineering
  • polynomial features
  • another model

4. Correlated features can cause issues

If features are strongly correlated, learning may become unstable. This is called multicollinearity.


Exam-Oriented Summary

Definition

Logistic Regression is a supervised learning algorithm used for binary classification.

Model

$$\hat{p} = \sigma(\mathbf{w}^\top \mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Decision Rule

$$\hat{y} = \begin{cases} 1 & \text{if } \hat{p} \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$$

Loss Function

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

Gradient Descent Updates

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \, \frac{1}{m} X^\top (\mathbf{p} - \mathbf{y}), \qquad b \leftarrow b - \eta \, \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)$$

Uses

  • spam detection
  • disease prediction
  • pass/fail prediction
  • yes/no classification tasks

Very Short Revision

  • Compute linear score: $z = \mathbf{w}^\top \mathbf{x} + b$
  • Apply sigmoid: $p = \sigma(z)$
  • Compute cross-entropy loss
  • Update weights and bias using gradient descent
  • Convert probability to class using threshold 0.5

Final Takeaway

Logistic Regression is a simple but powerful classification algorithm. It learns a linear decision boundary, uses the sigmoid to output probabilities, and improves itself by minimizing cross-entropy loss with gradient descent. For this example, it works well because the classes are generated by a rule that is linearly separable.
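As a sanity check, the from-scratch approach can be compared with scikit-learn's built-in `LogisticRegression` on the same data pipeline (a sketch; assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same synthetic dataset as in the example above
np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression().fit(X_train, y_train)
print("sklearn accuracy:", clf.score(X_test, y_test))
# Expected to be high, since the classes are linearly separable
```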