Logistic Regression
Logistic Regression is a binary classification algorithm used when the output belongs to one of two classes, labeled 0 and 1.
It does not predict the class directly using a line. Instead, it predicts a probability using the sigmoid function, then converts that probability into class 0 or 1.
The model is:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^T x + b$$

where:
- $z = w^T x + b$ = linear score
- $\sigma(z)$ = sigmoid function
- $p$ = predicted probability that the class is 1
If $p \ge 0.5$, predict class 1; otherwise predict class 0.
Main Idea
Linear Regression gives any real number as output, but classification needs a value between 0 and 1. So Logistic Regression first computes a linear combination:

$$z = w^T x + b$$

then applies the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This maps any real number into the interval $(0, 1)$, so the output can be interpreted as a probability.
Short and Clean Code
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class LogisticRegressionScratch:
    def __init__(self, lr=0.1, epochs=1000):
        self.lr = lr                # learning rate
        self.epochs = epochs        # number of gradient descent iterations
        self.w = None
        self.b = 0.0
        self.loss_history = []

    def _sigmoid(self, z):
        z = np.clip(z, -500, 500)   # avoid overflow in exp
        return 1 / (1 + np.exp(-z))

    def _loss(self, y, p):
        eps = 1e-9
        p = np.clip(p, eps, 1 - eps)  # avoid log(0)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        m, n = X.shape
        self.w = np.zeros(n)
        for _ in range(self.epochs):
            z = X @ self.w + self.b       # linear score
            p = self._sigmoid(z)          # predicted probability
            dw = (X.T @ (p - y)) / m      # gradient w.r.t. weights
            db = np.mean(p - y)           # gradient w.r.t. bias
            self.w -= self.lr * dw
            self.b -= self.lr * db
            self.loss_history.append(self._loss(y, p))
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)


# Synthetic dataset: class 1 when x1 + x2 > 10
np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegressionScratch(lr=0.1, epochs=1000)
model.fit(X_train, y_train)

pred = model.predict(X_test)
acc = np.mean(pred == y_test)

print("Weights:", np.round(model.w, 4))
print("Bias:", round(model.b, 4))
print("Accuracy:", round(acc, 4))

plt.plot(model.loss_history)
plt.title("Loss Convergence")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.grid(True)
plt.show()
```
What This Code Does
This example creates 2D points $x = (x_1, x_2)$ with $x_1, x_2 \in [0, 10)$ and labels them using:

$$y = \begin{cases} 1 & \text{if } x_1 + x_2 > 10 \\ 0 & \text{otherwise} \end{cases}$$

So the true decision boundary is the line $x_1 + x_2 = 10$. This is a binary classification problem.
Step-by-Step Algorithm
Step 1: Create the dataset
Code:

```python
np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)
```

Concept:
- `X` contains 200 samples; each sample has 2 features, $x_1, x_2 \in [0, 10)$
- the class label depends on whether the sum is greater than 10
Equation:

$$y = \begin{cases} 1 & \text{if } x_1 + x_2 > 10 \\ 0 & \text{otherwise} \end{cases}$$

Meaning:
- points above the line $x_1 + x_2 = 10$ belong to class 1
- points below the line belong to class 0
Step 2: Split into train and test sets
Code:

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
Concept:
- training data is used to learn parameters
- test data is used to check performance on unseen data
Here:
- 80% for training
- 20% for testing
Step 3: Standardize features
Code:

```python
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

Concept: Gradient descent works better when features are on similar scales.
Standardization formula:

$$x' = \frac{x - \mu}{\sigma}$$

where:
- $\mu$ = mean of the feature
- $\sigma$ = standard deviation
Why it helps:
- faster convergence
- more stable updates
- one feature does not dominate another due to large magnitude
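As a quick illustration (with a made-up feature column), standardization recenters a feature to mean 0 and rescales it to standard deviation 1:

```python
import numpy as np

# Hypothetical feature with a large scale
x = np.array([100.0, 150.0, 200.0, 250.0, 300.0])

mu = x.mean()       # mean of the feature
sigma = x.std()     # standard deviation

x_scaled = (x - mu) / sigma

print(np.round(x_scaled.mean(), 6))  # ~0
print(np.round(x_scaled.std(), 6))   # ~1
```

This is exactly what `StandardScaler` does per column, with the means and standard deviations learned from the training set only.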
Step 4: Compute the linear score
For each sample, Logistic Regression first computes:

$$z = w_1 x_1 + w_2 x_2 + b$$

In vector form: $z = w^T x + b$.
Code:

```python
z = X @ self.w + self.b
```

Concept: This is the same linear part used in linear models. But here it is not the final output; it is only the input to the sigmoid function.
Step 5: Apply sigmoid to get probability
Equation:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Code:

```python
p = self._sigmoid(z)
```

Concept: The sigmoid compresses any real number into a value between 0 and 1.
Examples:
- if $z = 0$: $\sigma(0) = 0.5$
- if $z$ is large and positive, the probability is close to 1
- if $z$ is large and negative, the probability is close to 0
So the output can be read as $p = P(y = 1 \mid x)$.
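These values are easy to verify numerically with a standalone sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))     # 0.5 exactly
print(sigmoid(10))    # ~0.99995, close to 1
print(sigmoid(-10))   # ~0.000045, close to 0
```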
Step 6: Measure error using cross-entropy loss
For Logistic Regression, we do not use mean squared error. We use cross-entropy loss:

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

Code:

```python
def _loss(self, y, p):
    eps = 1e-9
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Concept:
- if the actual class is 1, we want $p$ close to 1
- if the actual class is 0, we want $p$ close to 0
- wrong, confident predictions are penalized heavily
Why clipping is used:
- `log(0)` is undefined
- so probabilities are clipped slightly away from 0 and 1
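A small standalone sketch (mirroring the `_loss` logic for a single sample) shows how confident mistakes dominate the loss:

```python
import numpy as np

def ce(y, p):
    # cross-entropy for one sample, with the same clipping trick
    eps = 1e-9
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(ce(1, 0.9))    # ~0.105  confident and correct -> small loss
print(ce(1, 0.5))    # ~0.693  unsure -> moderate loss
print(ce(1, 0.01))   # ~4.605  confident and wrong -> large loss
```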
Step 7: Compute gradients
To reduce loss, we update weights and bias using gradient descent.
Gradient formulas:

$$\frac{\partial L}{\partial w} = \frac{1}{m} X^T (p - y), \qquad \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)$$

Code:

```python
dw = (X.T @ (p - y)) / m
db = np.mean(p - y)
```

Concept:
- `dw` tells how the weights should change
- `db` tells how the bias should change
- if the prediction is too large, parameters are pushed downward
- if the prediction is too small, parameters are pushed upward
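One way to gain confidence in these formulas is a finite-difference check on made-up data; this is a sanity-check sketch, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = rng.normal(size=2)
b = 0.1

def loss(w, b):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradients, as in the fit loop
p = 1 / (1 + np.exp(-(X @ w + b)))
dw = X.T @ (p - y) / len(y)

# Numerical gradient for the first weight
h = 1e-6
w_plus = w.copy()
w_plus[0] += h
num_dw0 = (loss(w_plus, b) - loss(w, b)) / h

print(abs(num_dw0 - dw[0]))  # should be tiny (~1e-7 or less)
```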
Step 8: Update parameters
Gradient descent update rule:

$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \, \frac{\partial L}{\partial b}$$

where $\eta$ is the learning rate.
Code:

```python
self.w -= self.lr * dw
self.b -= self.lr * db
```
Concept:
- move parameters in the direction that reduces loss
- repeat many times until learning stabilizes
Step 9: Convert probabilities to classes
After training, predicted probability is converted to class label.
Rule:

$$\hat{y} = \begin{cases} 1 & \text{if } p \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$$

Code:

```python
return (self.predict_proba(X) >= 0.5).astype(int)
```
Concept:
- probabilities are continuous
- classification needs discrete labels
Step 10: Measure accuracy
Code:

```python
pred = model.predict(X_test)
acc = np.mean(pred == y_test)
```

Equation:

$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$
Concept: Accuracy tells what fraction of test samples were classified correctly.
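As a cross-check, scikit-learn's built-in `LogisticRegression` can be trained on the same data; its accuracy should be in the same range (its weights will differ slightly because it applies L2 regularization by default):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same synthetic data as above
np.random.seed(42)
X = np.random.rand(200, 2) * 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression().fit(X_train, y_train)
print("sklearn accuracy:", clf.score(X_test, y_test))
```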
Concept -> Equation -> Code Mapping
1. Model parameters
Concept: The model must learn weights and bias.
Equation:

$$w \in \mathbb{R}^n, \quad b \in \mathbb{R}, \quad \text{initialized as } w = 0,\ b = 0$$

Code:

```python
self.w = np.zeros(n)
self.b = 0.0
```
Meaning:
- start with all weights as 0
- start with bias as 0
2. Probability model
Concept: Turn linear score into probability.
Equation:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Code:

```python
def _sigmoid(self, z):
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))
```
Why clip:
- avoids overflow for very large positive or negative values
3. Forward pass
Concept: Compute predictions from current parameters.
Equation:

$$z = Xw + b, \qquad p = \sigma(z)$$

Code:

```python
z = X @ self.w + self.b
p = self._sigmoid(z)
```

Meaning:
- `z` = raw score
- `p` = predicted probability
4. Loss calculation
Concept: See how wrong the predictions are.
Equation:

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

Code:

```python
self.loss_history.append(self._loss(y, p))
```
Meaning:
- each iteration stores loss
- useful for checking whether training is improving
5. Backward pass
Concept: Find how parameters affect the loss.
Equation:

$$\frac{\partial L}{\partial w} = \frac{1}{m} X^T (p - y), \qquad \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)$$

Code:

```python
dw = (X.T @ (p - y)) / m
db = np.mean(p - y)
```
Meaning: These gradients guide the update step.
6. Learning step
Concept: Improve the model gradually.
Equation:

$$w \leftarrow w - \eta \, dw, \qquad b \leftarrow b - \eta \, db$$

Code:

```python
self.w -= self.lr * dw
self.b -= self.lr * db
```
Meaning: Repeated updates make the model better at classification.
Worked Example on One Sample
Suppose, after some training, one (hypothetical) standardized sample is $x = (1.0, 0.5)$ and the model has $w = (1.2, 1.1)$, $b = 0.3$.
Step 1: Compute score

$$z = w_1 x_1 + w_2 x_2 + b = 1.2 \cdot 1.0 + 1.1 \cdot 0.5 + 0.3 = 2.05$$

Step 2: Apply sigmoid

$$p = \sigma(2.05) \approx 0.886$$

Step 3: Classify
Since $p \approx 0.886 \ge 0.5$, the prediction is $\hat{y} = 1$.
So this sample is classified as class 1.
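The same arithmetic can be checked in a few lines, using the hypothetical values $x = (1.0, 0.5)$, $w = (1.2, 1.1)$, $b = 0.3$:

```python
import numpy as np

# Hypothetical sample and parameters, for illustration only
x = np.array([1.0, 0.5])
w = np.array([1.2, 1.1])
b = 0.3

z = w @ x + b                  # linear score
p = 1 / (1 + np.exp(-z))       # sigmoid
pred = int(p >= 0.5)

print(round(z, 2), round(p, 3), pred)  # 2.05 0.886 1
```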
Why Logistic Regression Works
The model learns a boundary where the probability changes from class 0 to class 1. For two features, the decision boundary is:

$$w_1 x_1 + w_2 x_2 + b = 0$$

Because $\sigma(0) = 0.5$, the boundary is exactly where the model is undecided. So:
- if $z > 0$, the class tends toward 1
- if $z < 0$, the class tends toward 0
This creates a linear decision boundary.
For this dataset, the true labels come from $x_1 + x_2 > 10$, so Logistic Regression is a good fit because the classes are separable by a line.
What the Loss Curve Means
Code:

```python
plt.plot(model.loss_history)
plt.title("Loss Convergence")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.grid(True)
plt.show()
```
Concept:
- at the beginning, loss is high
- during learning, loss should decrease
- a downward curve means gradient descent is working
If the curve:
- decreases smoothly -> learning is stable
- oscillates wildly -> learning rate may be too high
- decreases very slowly -> learning rate may be too low
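The effect of the learning rate can be seen even on a toy one-dimensional problem, minimizing $f(w) = w^2$ (gradient $2w$) as a stand-in for the loss:

```python
# Gradient descent on f(w) = w^2, starting from w = 1
def run(lr, steps=20):
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w   # update rule: w <- w - lr * f'(w)
    return abs(w)

print(run(1.1))    # grows each step: diverges (lr too high)
print(run(0.001))  # barely moved after 20 steps (lr too low)
print(run(0.3))    # near 0: converges smoothly
```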
Practical Notes
1. Feature scaling matters
Because Logistic Regression uses gradient descent, features with large values can slow down convergence. That is why standardization, $x' = (x - \mu)/\sigma$, is important.
2. Learning rate matters
If learning rate is:
- too high -> training may diverge
- too low -> training becomes very slow
Typical starting values: $\eta = 0.001$, $0.01$, or $0.1$.
3. Linear boundary limitation
Logistic Regression assumes the boundary is linear: $w^T x + b = 0$. If the data is non-linear, you may need:
- feature engineering
- polynomial features
- another model
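A sketch of the polynomial-features option, using a made-up circular dataset and scikit-learn's `PolynomialFeatures` (training accuracy only, for brevity):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear problem: class depends on distance from the origin
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

# Plain linear features: a line cannot separate a circle
linear_acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

# Degree-2 features add x1^2, x1*x2, x2^2, making the boundary learnable
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
poly_acc = LogisticRegression(max_iter=1000).fit(X_poly, y).score(X_poly, y)

print(round(linear_acc, 3), round(poly_acc, 3))
```

With the squared terms available, the linear model can represent the circular boundary $x_1^2 + x_2^2 = 1$, so accuracy jumps sharply.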
4. Correlated features can cause issues
If features are strongly correlated, learning may become unstable. This is called multicollinearity.
Exam-Oriented Summary
Definition
Logistic Regression is a supervised learning algorithm used for binary classification.
Model

$$p = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

Decision Rule

$$\hat{y} = \begin{cases} 1 & \text{if } p \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$$

Loss Function

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

Gradient Descent Updates

$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \, \frac{\partial L}{\partial b}$$
Uses
- spam detection
- disease prediction
- pass/fail prediction
- yes/no classification tasks
Very Short Revision
- Compute linear score: $z = w^T x + b$
- Apply sigmoid: $p = \sigma(z)$
- Compute cross-entropy loss
- Update weights and bias using gradient descent
- Convert probability to class using threshold 0.5
Final Takeaway
Logistic Regression is a simple but powerful classification algorithm. It learns a linear decision boundary, uses the sigmoid to output probabilities, and improves by minimizing cross-entropy loss with gradient descent. For this example, it works well because the classes are generated by a rule that is linearly separable.