K-Nearest Neighbor
KNN is an instance-based or lazy learning algorithm. It does not learn an explicit equation during training. Instead, it:
- stores the training data
- waits until a new test point comes
- finds the nearest stored examples
- predicts using those neighbors
There are two common versions:
- KNN Classification -> predict the most common class
- KNN Regression -> predict the mean of neighbor target values
Main Idea
Each sample is a point in -dimensional space: To predict for a new point, KNN measures the distance from that point to all training points.
The most common distance used is Euclidean distance:
Then:
- in classification, choose the majority class among the nearest neighbors
- in regression, choose the average target among the nearest neighbors
Part A: KNN Classification
What It Does
Given a new test point:
- compute its distance to every training point
- sort distances
- pick the nearest
- return the majority class
So the prediction rule is:
Short and Clean KNN Classification Code
import numpy as np
from collections import Counter
class KNNClassifier:
def __init__(self, k=3):
self.k = k
self.X_train = None
self.y_train = None
def fit(self, X, y):
self.X_train = np.asarray(X, dtype=float)
self.y_train = np.asarray(y)
def _distance(self, a, b):
return np.sqrt(np.sum((a - b) ** 2))
def _predict_one(self, x):
distances = [self._distance(x, x_train) for x_train in self.X_train]
k_idx = np.argsort(distances)[:self.k]
k_labels = self.y_train[k_idx]
return Counter(k_labels).most_common(1)[0][0]
def predict(self, X):
X = np.asarray(X, dtype=float)
return np.array([self._predict_one(x) for x in X])
X_train = np.array([
[1, 1],
[2, 1],
[4, 3],
[5, 4],
[6, 5]
], dtype=float)
y_train = np.array([0, 0, 1, 1, 1])
X_test = np.array([
[3, 2],
[5, 5]
], dtype=float)
clf = KNNClassifier(k=3)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("Predictions:", pred)
Small Classification Example
Training data:
Test point:
Take:
Solving the Classification Example Step by Step
We compute distance from to every training point.
Distance to
Distance to
Distance to
Distance to
Distance to
Now sort the distances:
| Point | Label | Distance |
|---|---|---|
| 0 | 1.414 | |
| 1 | 1.414 | |
| 0 | 2.236 | |
| 1 | 2.828 | |
| 1 | 4.243 |
Nearest 3 labels:
Majority class:
So:
That is exactly how KNN classification works.
KNN Classification Algorithm
Step 1: Store training data
Unlike linear models, KNN does not compute weights during training.
Code:
def fit(self, X, y):
self.X_train = np.asarray(X, dtype=float)
self.y_train = np.asarray(y)
Concept: Training in KNN simply means memorizing the dataset.
Step 2: Compute distance
Equation:
Code:
def _distance(self, a, b):
return np.sqrt(np.sum((a - b) ** 2))
Concept: This measures how close a training sample is to the test sample.
Smaller distance means more similar.
Step 3: Rank neighbors
Code:
distances = [self._distance(x, x_train) for x_train in self.X_train]
k_idx = np.argsort(distances)[:self.k]
Concept:
- compute all distances
- sort them
- keep the indices of the nearest
Step 4: Vote by majority
Code:
k_labels = self.y_train[k_idx]
return Counter(k_labels).most_common(1)[0][0]
Concept: Among the nearest neighbors, whichever class occurs most is chosen.
Equation:
Concept -> Equation -> Code Mapping for Classification
1. Represent samples as points
Concept: Each row is a point in feature space.
Equation:
Code:
self.X_train = np.asarray(X, dtype=float)
2. Measure closeness
Concept: Similarity is measured using Euclidean distance.
Equation:
Code:
def _distance(self, a, b):
return np.sqrt(np.sum((a - b) ** 2))
3. Select nearest neighbors
Concept: Prediction depends only on nearby training samples.
Code:
k_idx = np.argsort(distances)[:self.k]
4. Use majority vote
Concept: Classification chooses the most frequent class.
Equation:
Code:
return Counter(k_labels).most_common(1)[0][0]
Why KNN Classification Works
The assumption is:
points that are close in feature space tend to have the same class
So instead of learning a global rule, KNN makes a local decision around the test point.
That is why it is called instance-based learning.
Part B: KNN Regression
What It Does
KNN regression follows the same steps as classification:
- compute distance to all training points
- choose the nearest
- average their target values
Prediction rule: where are the target values of the nearest neighbors.
Short and Clean KNN Regression Code
import numpy as np
class KNNRegressor:
def __init__(self, k=3):
self.k = k
self.X_train = None
self.y_train = None
def fit(self, X, y):
self.X_train = np.asarray(X, dtype=float)
self.y_train = np.asarray(y, dtype=float)
def _distance(self, a, b):
return np.sqrt(np.sum((a - b) ** 2))
def _predict_one(self, x):
distances = [self._distance(x, x_train) for x_train in self.X_train]
k_idx = np.argsort(distances)[:self.k]
k_values = self.y_train[k_idx]
return np.mean(k_values)
def predict(self, X):
X = np.asarray(X, dtype=float)
return np.array([self._predict_one(x) for x in X])
X_train = np.array([[1], [2], [3], [6], [7]], dtype=float)
y_train = np.array([30000, 35000, 40000, 70000, 75000], dtype=float)
X_test = np.array([[4]], dtype=float)
reg = KNNRegressor(k=3)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
print("Prediction:", pred)
Small Regression Example
Suppose training data is:
| YearsExperience | Salary |
|---|---|
| 1 | 30000 |
| 2 | 35000 |
| 3 | 40000 |
| 6 | 70000 |
| 7 | 75000 |
Test point:
Take:
Solving the Regression Example Step by Step
We compute the distance from 4 to each training point.
Distance to 1
Distance to 2
Distance to 3
Distance to 6
Distance to 7
Sorted distances:
| Point | Target | Distance |
|---|---|---|
| 3 | 40000 | 1 |
| 2 | 35000 | 2 |
| 6 | 70000 | 2 |
| 1 | 30000 | 3 |
| 7 | 75000 | 3 |
Nearest 3 target values:
Prediction is their mean:
So the predicted salary is:
KNN Regression Algorithm
Step 1: Store training data
Code:
def fit(self, X, y):
self.X_train = np.asarray(X, dtype=float)
self.y_train = np.asarray(y, dtype=float)
Concept: Just store examples and target values.
Step 2: Compute all distances
Equation:
Code:
distances = [self._distance(x, x_train) for x_train in self.X_train]
Concept: Measure how close the new point is to all training samples.
Step 3: Pick nearest
Code:
k_idx = np.argsort(distances)[:self.k]
Concept: Keep the closest neighbors only.
Step 4: Average their target values
Equation:
Code:
k_values = self.y_train[k_idx]
return np.mean(k_values)
Concept: Regression prediction is the average of nearby outputs.
Concept -> Equation -> Code Mapping for Regression
1. Use the same distance idea
Concept: Near points should have similar target values.
Equation:
Code:
def _distance(self, a, b):
return np.sqrt(np.sum((a - b) ** 2))
2. Find local neighborhood
Concept: Prediction uses local samples rather than a global fitted line.
Code:
k_idx = np.argsort(distances)[:self.k]
3. Average local outputs
Concept: Regression uses the mean of neighbor target values.
Equation:
Code:
return np.mean(k_values)
KNN Classification vs KNN Regression
| Aspect | KNN Classification | KNN Regression |
|---|---|---|
| Output type | class label | continuous value |
| Final rule | majority vote | mean of neighbors |
| Formula | mode of nearest labels | average of nearest targets |
Unified Intuition
KNN does not learn a formula like: or:
Instead, it says:
for this new point, let me look around nearby training points and decide locally
So every test sample gets its own local prediction rule.
That is why KNN is called:
- lazy learning
- instance-based learning
Choosing the Value of
The value of controls smoothness.
Small
- sensitive to noise
- more flexible
- may overfit
Large
- smoother decision
- may underfit
- local details may be lost
A common rough idea: where is the number of training samples
But in practice, is usually chosen using validation.
Why Feature Scaling Matters
KNN depends entirely on distance. So features with larger numeric ranges dominate the distance.
Example:
- Age may range from 20 to 80
- Glucose may range from 0 to 200
Then Glucose can dominate Euclidean distance.
So scaling is often important:
Without scaling, nearest neighbors may be misleading.
Time Cost of KNN
KNN is cheap during training but expensive during prediction.
Training
- just store the data
Prediction
For every test point:
- compute distance to all training points
- sort them
- then predict
So prediction can be costly when the training set is large.
This is one of the main disadvantages of KNN.
Advantages of KNN
- very simple
- no training optimization needed
- easy to understand
- works for both classification and regression
- naturally handles multi-class classification
Limitations of KNN
- prediction is slow on large datasets
- sensitive to feature scale
- sensitive to irrelevant features
- choice of matters a lot
- can perform poorly in very high dimensions
This last issue is related to the curse of dimensionality.
Exam-Oriented Summary
Definition
KNN is an instance-based supervised learning algorithm that predicts a new sample using the nearest stored training samples.
Distance Formula
Classification Rule
Regression Rule
Steps
- store training data
- compute distance from test point to all training points
- sort by distance
- select nearest
- classify by majority vote or regress by mean
Very Short Revision
KNN Classification
- find nearest
- take majority class
- output class label
KNN Regression
- find nearest
- take mean of target values
- output continuous value
Main formula:
Final Takeaway
KNN is one of the simplest machine learning algorithms. It does not build a model during training, but predicts by comparing a new point to stored training examples. For classification, it uses the majority class of the nearest neighbors. For regression, it uses the mean target value of the nearest neighbors. Its simplicity makes it excellent for understanding local learning, but its prediction cost and sensitivity to scaling are important limitations.