Logistic Regression
Logistic Regression is a binary classification algorithm used when the output belongs to one of two classes, labeled 0 and 1.
It does not predict the class directly using a line. Instead, it predicts a probability using the sigmoid function, then converts that probability into class 0 or 1.
The model is:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^T x + b$$

where:
- $z = w^T x + b$ = linear score
- $\sigma(z)$ = sigmoid function
- $p$ = predicted probability that the class is 1
If $p \ge 0.5$, predict class 1; otherwise predict class 0.
Main Idea
Linear Regression gives any real number as output, but classification needs a value between 0 and 1. So Logistic Regression first computes a linear combination:

$$z = w^T x + b$$

then applies the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This maps any real number into the interval $(0, 1)$, so the output can be interpreted as a probability.
Short and Clean Code
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class LogisticRegressionScratch:
    def __init__(self, lr=0.1, epochs=1000):
        self.lr = lr                # learning rate
        self.epochs = epochs        # number of gradient descent iterations
        self.w = None
        self.b = 0.0
        self.loss_history = []

    def _sigmoid(self, z):
        z = np.clip(z, -500, 500)   # avoid overflow in exp
        return 1 / (1 + np.exp(-z))

    def _loss(self, y, p):
        eps = 1e-9
        p = np.clip(p, eps, 1 - eps)  # avoid log(0)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        m, n = X.shape
        self.w = np.zeros(n)
        for _ in range(self.epochs):
            z = X @ self.w + self.b       # linear score
            p = self._sigmoid(z)          # predicted probability
            dw = (X.T @ (p - y)) / m      # gradient w.r.t. weights
            db = np.mean(p - y)           # gradient w.r.t. bias
            self.w -= self.lr * dw
            self.b -= self.lr * db
            self.loss_history.append(self._loss(y, p))
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)


# Synthetic dataset: class 1 when x1 + x2 > 10
np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegressionScratch(lr=0.1, epochs=1000)
model.fit(X_train, y_train)

pred = model.predict(X_test)
acc = np.mean(pred == y_test)

print("Weights:", np.round(model.w, 4))
print("Bias:", round(model.b, 4))
print("Accuracy:", round(acc, 4))

plt.plot(model.loss_history)
plt.title("Loss Convergence")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.grid(True)
plt.show()
```
What This Code Does
This example creates 2D points $x = (x_1, x_2)$ with $x_1, x_2 \in [0, 10)$ and labels them using:

$$y = \begin{cases} 1 & \text{if } x_1 + x_2 > 10 \\ 0 & \text{otherwise} \end{cases}$$

So the true decision boundary is the line $x_1 + x_2 = 10$. This is a binary classification problem.
Step-by-Step Algorithm
Step 1: Create the dataset
Code:

```python
np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)
```

Concept:
- `X` contains 200 samples; each sample has 2 features, $x_1, x_2 \in [0, 10)$
- the class label depends on whether the sum is greater than 10
Equation:

$$y = \begin{cases} 1 & \text{if } x_1 + x_2 > 10 \\ 0 & \text{otherwise} \end{cases}$$

Meaning:
- points above the line $x_1 + x_2 = 10$ belong to class 1
- points below the line belong to class 0
Step 2: Split into train and test sets
Code:

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
Concept:
- training data is used to learn parameters
- test data is used to check performance on unseen data
Here:
- 80% for training
- 20% for testing
Step 3: Standardize features
Code:

```python
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

Concept: Gradient descent works better when features are on similar scales.
Standardization formula:

$$x' = \frac{x - \mu}{\sigma}$$

where:
- $\mu$ = mean of the feature
- $\sigma$ = standard deviation
Why it helps:
- faster convergence
- more stable updates
- one feature does not dominate another due to large magnitude
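As a quick illustration (with a made-up feature column), standardization recenters a feature to mean 0 and rescales it to standard deviation 1:

```python
import numpy as np

# Hypothetical feature with a large scale
x = np.array([100.0, 150.0, 200.0, 250.0, 300.0])

mu = x.mean()       # mean of the feature
sigma = x.std()     # standard deviation

x_scaled = (x - mu) / sigma

print(np.round(x_scaled.mean(), 6))  # ~0
print(np.round(x_scaled.std(), 6))   # ~1
```

This is exactly what `StandardScaler` does per column, with the means and standard deviations learned from the training set only.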
Step 4: Compute the linear score
For each sample, Logistic Regression first computes:

$$z = w_1 x_1 + w_2 x_2 + b$$

In vector form: $z = w^T x + b$.
Code:

```python
z = X @ self.w + self.b
```

Concept: This is the same linear part used in linear models. But here it is not the final output; it is only the input to the sigmoid function.
Step 5: Apply sigmoid to get probability
Equation:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Code:

```python
p = self._sigmoid(z)
```

Concept: The sigmoid compresses any real number into a value between 0 and 1.
Examples:
- if $z = 0$: $\sigma(0) = 0.5$
- if $z$ is large and positive, the probability is close to 1
- if $z$ is large and negative, the probability is close to 0
So the output can be read as $p = P(y = 1 \mid x)$.
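These values are easy to verify numerically with a standalone sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))     # 0.5 exactly
print(sigmoid(10))    # ~0.99995, close to 1
print(sigmoid(-10))   # ~0.000045, close to 0
```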
Step 6: Measure error using cross-entropy loss
For Logistic Regression, we do not use mean squared error. We use cross-entropy loss:

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

Code:

```python
def _loss(self, y, p):
    eps = 1e-9
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Concept:
- if the actual class is 1, we want $p$ close to 1
- if the actual class is 0, we want $p$ close to 0
- wrong, confident predictions are penalized heavily
Why clipping is used:
- `log(0)` is undefined
- so probabilities are clipped slightly away from 0 and 1
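A small standalone sketch (mirroring the `_loss` logic for a single sample) shows how confident mistakes dominate the loss:

```python
import numpy as np

def ce(y, p):
    # cross-entropy for one sample, with the same clipping trick
    eps = 1e-9
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(ce(1, 0.9))    # ~0.105  confident and correct -> small loss
print(ce(1, 0.5))    # ~0.693  unsure -> moderate loss
print(ce(1, 0.01))   # ~4.605  confident and wrong -> large loss
```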
Step 7: Compute gradients
To reduce loss, we update weights and bias using gradient descent.
Gradient formulas:

$$\frac{\partial L}{\partial w} = \frac{1}{m} X^T (p - y), \qquad \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)$$

Code:

```python
dw = (X.T @ (p - y)) / m
db = np.mean(p - y)
```

Concept:
- `dw` tells how the weights should change
- `db` tells how the bias should change
- if the prediction is too large, parameters are pushed downward
- if the prediction is too small, parameters are pushed upward
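One way to gain confidence in these formulas is a finite-difference check on made-up data; this is a sanity-check sketch, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = rng.normal(size=2)
b = 0.1

def loss(w, b):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradients, as in the fit loop
p = 1 / (1 + np.exp(-(X @ w + b)))
dw = X.T @ (p - y) / len(y)

# Numerical gradient for the first weight
h = 1e-6
w_plus = w.copy()
w_plus[0] += h
num_dw0 = (loss(w_plus, b) - loss(w, b)) / h

print(abs(num_dw0 - dw[0]))  # should be tiny (~1e-7 or less)
```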
Step 8: Update parameters
Gradient descent update rule:

$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \, \frac{\partial L}{\partial b}$$

where $\eta$ is the learning rate.
Code:

```python
self.w -= self.lr * dw
self.b -= self.lr * db
```
Concept:
- move parameters in the direction that reduces loss
- repeat many times until learning stabilizes
Step 9: Convert probabilities to classes
After training, predicted probability is converted to class label.
Rule:

$$\hat{y} = \begin{cases} 1 & \text{if } p \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$$

Code:

```python
return (self.predict_proba(X) >= 0.5).astype(int)
```
Concept:
- probabilities are continuous
- classification needs discrete labels
Step 10: Measure accuracy
Code:

```python
pred = model.predict(X_test)
acc = np.mean(pred == y_test)
```

Equation:

$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$
Concept: Accuracy tells what fraction of test samples were classified correctly.
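As a cross-check, scikit-learn's built-in `LogisticRegression` can be trained on the same data; its accuracy should be in the same range (its weights will differ slightly because it applies L2 regularization by default):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same synthetic data as above
np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression().fit(X_train, y_train)
print("sklearn accuracy:", clf.score(X_test, y_test))
```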
Concept -> Equation -> Code Mapping
1. Model parameters
Concept: The model must learn weights and bias.
Equation:

$$w \in \mathbb{R}^n, \quad b \in \mathbb{R}, \quad \text{initialized as } w = 0,\ b = 0$$

Code:

```python
self.w = np.zeros(n)
self.b = 0.0
```
Meaning:
- start with all weights as 0
- start with bias as 0
2. Probability model
Concept: Turn linear score into probability.
Equation:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Code:

```python
def _sigmoid(self, z):
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))
```
Why clip:
- avoids overflow for very large positive or negative values
3. Forward pass
Concept: Compute predictions from current parameters.
Equation:

$$z = Xw + b, \qquad p = \sigma(z)$$

Code:

```python
z = X @ self.w + self.b
p = self._sigmoid(z)
```

Meaning:
- `z` = raw score
- `p` = predicted probability
4. Loss calculation
Concept: See how wrong the predictions are.
Equation:

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

Code:

```python
self.loss_history.append(self._loss(y, p))
```
Meaning:
- each iteration stores loss
- useful for checking whether training is improving
5. Backward pass
Concept: Find how parameters affect the loss.
Equation:

$$\frac{\partial L}{\partial w} = \frac{1}{m} X^T (p - y), \qquad \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)$$

Code:

```python
dw = (X.T @ (p - y)) / m
db = np.mean(p - y)
```
Meaning: These gradients guide the update step.
6. Learning step
Concept: Improve the model gradually.
Equation:

$$w \leftarrow w - \eta \, dw, \qquad b \leftarrow b - \eta \, db$$

Code:

```python
self.w -= self.lr * dw
self.b -= self.lr * db
```
Meaning: Repeated updates make the model better at classification.
Worked Example on One Sample
Suppose, after some training, one (hypothetical) standardized sample is $x = (1.0, 0.5)$ and the model has $w = (1.2, 1.1)$, $b = 0.3$.
Step 1: Compute score

$$z = w_1 x_1 + w_2 x_2 + b = 1.2 \cdot 1.0 + 1.1 \cdot 0.5 + 0.3 = 2.05$$

Step 2: Apply sigmoid

$$p = \sigma(2.05) \approx 0.886$$

Step 3: Classify
Since $p \approx 0.886 \ge 0.5$, the prediction is $\hat{y} = 1$.
So this sample is classified as class 1.
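The same arithmetic can be checked in a few lines, using the hypothetical values $x = (1.0, 0.5)$, $w = (1.2, 1.1)$, $b = 0.3$:

```python
import numpy as np

# Hypothetical sample and parameters, for illustration only
x = np.array([1.0, 0.5])
w = np.array([1.2, 1.1])
b = 0.3

z = w @ x + b                  # linear score
p = 1 / (1 + np.exp(-z))       # sigmoid
pred = int(p >= 0.5)

print(round(z, 2), round(p, 3), pred)  # 2.05 0.886 1
```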
Why Logistic Regression Works
The model learns a boundary where the probability changes from class 0 to class 1. For two features, the decision boundary is:

$$w_1 x_1 + w_2 x_2 + b = 0$$

Because $\sigma(0) = 0.5$, the boundary is exactly where the model is undecided. So:
- if $z > 0$, the class tends toward 1
- if $z < 0$, the class tends toward 0
This creates a linear decision boundary.
For this dataset, the true labels come from $x_1 + x_2 > 10$, so Logistic Regression is a good fit because the classes are separable by a line.
What the Loss Curve Means
Code:

```python
plt.plot(model.loss_history)
plt.title("Loss Convergence")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.grid(True)
plt.show()
```
Concept:
- at the beginning, loss is high
- during learning, loss should decrease
- a downward curve means gradient descent is working
If the curve:
- decreases smoothly -> learning is stable
- oscillates wildly -> learning rate may be too high
- decreases very slowly -> learning rate may be too low
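The effect of the learning rate can be seen even on a toy one-dimensional problem, minimizing $f(w) = w^2$ (gradient $2w$) as a stand-in for the loss:

```python
# Gradient descent on f(w) = w^2, starting from w = 1
def run(lr, steps=20):
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w   # update rule: w <- w - lr * f'(w)
    return abs(w)

print(run(1.1))    # grows each step: diverges (lr too high)
print(run(0.001))  # barely moved after 20 steps (lr too low)
print(run(0.3))    # near 0: converges smoothly
```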
Practical Notes
1. Feature scaling matters
Because Logistic Regression uses gradient descent, features with large values can slow down convergence. That is why standardization, $x' = (x - \mu)/\sigma$, is important.
2. Learning rate matters
If learning rate is:
- too high -> training may diverge
- too low -> training becomes very slow
Typical starting values: $\eta = 0.001$, $0.01$, or $0.1$.
3. Linear boundary limitation
Logistic Regression assumes the boundary is linear: $w^T x + b = 0$. If the data is non-linear, you may need:
- feature engineering
- polynomial features
- another model
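A sketch of the polynomial-features option, using a made-up circular dataset and scikit-learn's `PolynomialFeatures` (training accuracy only, for brevity):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear problem: class depends on distance from the origin
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

# Plain linear features: a line cannot separate a circle
linear_acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

# Degree-2 features add x1^2, x1*x2, x2^2, making the boundary learnable
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
poly_acc = LogisticRegression(max_iter=1000).fit(X_poly, y).score(X_poly, y)

print(round(linear_acc, 3), round(poly_acc, 3))
```

With the squared terms available, the linear model can represent the circular boundary $x_1^2 + x_2^2 = 1$, so accuracy jumps sharply.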
4. Correlated features can cause issues
If features are strongly correlated, learning may become unstable. This is called multicollinearity.
Exam-Oriented Summary
Definition
Logistic Regression is a supervised learning algorithm used for binary classification.
Model

$$p = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

Decision Rule

$$\hat{y} = \begin{cases} 1 & \text{if } p \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$$

Loss Function

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

Gradient Descent Updates

$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \, \frac{\partial L}{\partial b}$$
Uses
- spam detection
- disease prediction
- pass/fail prediction
- yes/no classification tasks
Very Short Revision
- Compute linear score: $z = w^T x + b$
- Apply sigmoid: $p = \sigma(z)$
- Compute cross-entropy loss
- Update weights and bias using gradient descent
- Convert probability to class using threshold 0.5
Final Takeaway
Logistic Regression is a simple but powerful classification algorithm. It learns a linear decision boundary, uses the sigmoid to output probabilities, and improves by minimizing cross-entropy loss with gradient descent. For this example, it works well because the classes are generated by a rule that is linearly separable.