Principal Component Analysis (PCA)
PCA is a dimensionality reduction algorithm. It creates new features called principal components that keep as much information as possible from the original dataset. The main idea is:
- find directions where the data varies the most
- rank those directions
- keep only the most important ones
If the original data has $n$ features, PCA can produce up to $n$ principal components.
Why PCA is Needed
High-dimensional data causes problems such as:
- harder visualization
- more computation
- more difficult learning
- curse of dimensionality
PCA solves this by projecting the data onto fewer dimensions while preserving maximum variance.
Core Idea
Suppose the data matrix is:
$$X \in \mathbb{R}^{m \times n}$$
where:
- $m$ = number of samples
- $n$ = number of features
PCA finds a new set of orthogonal directions:
$$w_1, w_2, \dots, w_n$$
such that:
- $w_1$ captures the maximum variance
- $w_2$ captures the next maximum variance
- and so on
These directions are the eigenvectors of the covariance matrix. Their importance is given by the eigenvalues.
Main Equations
1. Standardization
Each feature is standardized using the Z-score:
$$z = \frac{x - \mu}{\sigma}$$
2. Covariance matrix
For the standardized (centered) data $X_s$ with $m$ samples:
$$C = \frac{1}{m} X_s^\top X_s$$
3. Eigen decomposition
$$C w = \lambda w$$
where:
- $w$ = eigenvector
- $\lambda$ = eigenvalue
4. Projection
If $W_k$ contains the top $k$ principal components, then:
$$Z = X_s W_k$$
Intuition
Each principal component is a direction in feature space. If data spreads a lot along a direction, that direction contains a lot of information. So PCA keeps the directions with the largest variance.
That is why PCA chooses eigenvectors with the largest eigenvalues.
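A quick numeric illustration of this intuition (toy 2D data I am assuming here, not the Iris example): for a cloud of points stretched along the diagonal, the variance along the spread direction is far larger than across it.

```python
import numpy as np

# Toy 2D data (assumed for illustration): points spread mostly along the diagonal
rng = np.random.default_rng(0)
t = rng.normal(size=500)
noise = rng.normal(scale=0.1, size=500)
X = np.column_stack([t + noise, t - noise])

d_along = np.array([1.0, 1.0]) / np.sqrt(2)    # direction of the spread
d_across = np.array([1.0, -1.0]) / np.sqrt(2)  # perpendicular direction

var_along = np.var(X @ d_along)    # large: this direction carries most information
var_across = np.var(X @ d_across)  # small: little information here
print(var_along, var_across)
```

PCA would pick `d_along` as the first principal component because projecting onto it preserves almost all the variance.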
Short and Clean Code
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA as SklearnPCA


class PCAFromScratch:
    def __init__(self, n_components):
        self.n_components = n_components
        self.mean_ = None
        self.std_ = None
        self.components_ = None
        self.eigenvalues_ = None
        self.explained_variance_ratio_ = None

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Step 1: standardize each feature (Z-score)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        Xs = (X - self.mean_) / self.std_
        # Step 2: covariance matrix of the standardized data
        C = (Xs.T @ Xs) / len(Xs)
        # Step 3: eigendecomposition (eigh, since C is symmetric)
        eigenvalues, eigenvectors = np.linalg.eigh(C)
        # Step 4: sort from largest to smallest eigenvalue
        order = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[order]
        eigenvectors = eigenvectors[:, order]
        # Step 5: keep only the top n_components directions
        self.eigenvalues_ = eigenvalues[:self.n_components]
        self.components_ = eigenvectors[:, :self.n_components]
        # Ratios for all components, not only the kept ones
        self.explained_variance_ratio_ = eigenvalues / eigenvalues.sum()
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        Xs = (X - self.mean_) / self.std_
        # Step 6: project onto the principal components
        return Xs @ self.components_

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)


iris = load_iris()
X = iris.data
pca = PCAFromScratch(n_components=2)
X_proj = pca.fit_transform(X)
print("Top eigenvalues:", np.round(pca.eigenvalues_, 4))
print("Principal components:\n", np.round(pca.components_, 4))
print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 4))

# Sanity check against scikit-learn (components may differ by sign)
sk_pca = SklearnPCA(n_components=2)
X_sk = sk_pca.fit_transform((X - X.mean(axis=0)) / X.std(axis=0))
print("Sklearn components:\n", np.round(sk_pca.components_.T, 4))
```
What This Code Does
This code:
- loads the Iris dataset
- standardizes all features
- computes the covariance matrix
- finds eigenvalues and eigenvectors
- sorts them from largest to smallest
- selects the top `n_components`
- projects the original data onto those components
So 4-dimensional Iris data becomes 2-dimensional.
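A quick sanity check of the shapes involved, using scikit-learn's PCA directly on standardized Iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
print(X.shape)   # 150 samples, 4 features

# Standardize, then project onto 2 components
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
X_proj = PCA(n_components=2).fit_transform(Xs)
print(X_proj.shape)  # same 150 samples, now 2 features
```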
Dataset
The Iris dataset has:
- 150 samples
- 4 features
- 3 flower classes
The four features are:
- sepal length
- sepal width
- petal length
- petal width
PCA uses only the feature matrix $X \in \mathbb{R}^{150 \times 4}$.
Since PCA is unsupervised, labels are not needed.
Step-by-Step Algorithm
Step 1: Standardize the dataset
PCA is sensitive to scale. If one feature has larger values, it can dominate the variance.
So each column is standardized:
Code:
```python
self.mean_ = X.mean(axis=0)
self.std_ = X.std(axis=0)
Xs = (X - self.mean_) / self.std_
```
Concept:
- subtract column mean
- divide by column standard deviation
- now every feature has roughly comparable scale
Why important:
- PCA is variance-based
- variance depends on feature scale
- standardization ensures fairness among features
Step 2: Compute covariance matrix
The covariance matrix measures how features vary together.
Equation:
$$C = \frac{1}{m} X_s^\top X_s$$
Code:
```python
C = (Xs.T @ Xs) / len(Xs)
```
Concept:
- diagonal entries = variance of each standardized feature
- off-diagonal entries = covariance between pairs of features
For Iris, $C$ is a $4 \times 4$ matrix, because there are 4 original features.
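This is easy to check on Iris: the covariance matrix is $4 \times 4$, and because every standardized feature has unit variance, its diagonal is all ones (it is in fact the correlation matrix):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
C = (Xs.T @ Xs) / len(Xs)

print(C.shape)                  # one row and one column per feature
print(np.round(np.diag(C), 4))  # unit variances on the diagonal
```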
Step 3: Find eigenvalues and eigenvectors
PCA solves the eigenvalue equation:
$$C w = \lambda w$$
Code:
```python
eigenvalues, eigenvectors = np.linalg.eigh(C)
```
Concept:
- each eigenvector gives a direction
- each eigenvalue tells how much variance is captured in that direction
Why `eigh` and not `eig`:
- the covariance matrix is symmetric
- `np.linalg.eigh` is designed for symmetric matrices: it is faster, more numerically stable, and guarantees real eigenvalues
Interpretation:
- large eigenvalue -> important component
- small eigenvalue -> less important component
Step 4: Sort eigenvalues and eigenvectors
The most useful components are the ones with the largest eigenvalues.
Code:
```python
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
```
Concept:
- `argsort` sorts indices in ascending order
- `[::-1]` reverses the order to descending
- now the first column of `eigenvectors` is the first principal component
Step 5: Select top principal components
If we want only $k$ components:
Code:
```python
self.eigenvalues_ = eigenvalues[:self.n_components]
self.components_ = eigenvectors[:, :self.n_components]
```
Concept:
- keep only the dominant directions
- discard weaker directions
- the dimension reduces from $n$ to $k$
For example:
- original data: 4 features
- choose 2 components
- reduced data: 2 features
Step 6: Project data onto the new space
Projection formula:
$$Z = X_s W_k$$
Code:
```python
return Xs @ self.components_
```
Concept:
- each sample is re-expressed in terms of principal components
- this gives lower-dimensional data
- information loss is minimized as much as possible for the chosen number of components
If $X_s \in \mathbb{R}^{150 \times 4}$ and $W_k \in \mathbb{R}^{4 \times 2}$, then $Z \in \mathbb{R}^{150 \times 2}$.
Concept -> Equation -> Code Mapping
1. Equalize feature scales
Concept: All features should contribute fairly.
Equation:
$$z = \frac{x - \mu}{\sigma}$$
Code:
```python
self.mean_ = X.mean(axis=0)
self.std_ = X.std(axis=0)
Xs = (X - self.mean_) / self.std_
```
2. Measure variance structure
Concept: We need a matrix that summarizes how features vary together.
Equation:
$$C = \frac{1}{m} X_s^\top X_s$$
Code:
```python
C = (Xs.T @ Xs) / len(Xs)
```
3. Find important directions
Concept: The best projection directions are the eigenvectors of the covariance matrix.
Equation:
$$C w = \lambda w$$
Code:
```python
eigenvalues, eigenvectors = np.linalg.eigh(C)
```
4. Rank the directions
Concept: Directions with larger variance are more useful.
Equation:
$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$$
Code:
```python
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
```
5. Keep only the top directions
Concept: Dimensionality reduction means retaining only the most informative directions.
Equation:
$$W_k = [w_1 \; w_2 \; \dots \; w_k]$$
Code:
```python
self.components_ = eigenvectors[:, :self.n_components]
```
6. Project data
Concept: Convert old features into principal component coordinates.
Equation:
$$Z = X_s W_k$$
Code:
```python
return Xs @ self.components_
```
Why Eigenvectors and Eigenvalues Appear
PCA wants to maximize the variance of the projected data. If we project onto a unit vector $w$, the projected variance is:
$$\mathrm{Var}(X_s w) = w^\top C w$$
subject to:
$$w^\top w = 1$$
This is a constrained optimization problem. Using a Lagrange multiplier:
$$L(w, \lambda) = w^\top C w - \lambda (w^\top w - 1)$$
Taking the derivative with respect to $w$ and setting it to zero gives:
$$C w = \lambda w$$
So:
- the best directions are eigenvectors of $C$
- the amount of retained variance is given by eigenvalues
This is the mathematical reason behind PCA.
Why the Largest Eigenvalue Matters
The projected variance along direction $w$ is:
$$\mathrm{Var}(X_s w) = w^\top C w$$
For an eigenvector:
$$C w = \lambda w$$
So:
$$w^\top C w = w^\top (\lambda w) = \lambda \, w^\top w$$
Since:
$$w^\top w = 1$$
we get:
$$w^\top C w = \lambda$$
Therefore:
- projected variance equals the eigenvalue
- maximizing variance means choosing the largest eigenvalue
That is why PCA selects top eigenvalues first.
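The identity "projected variance = eigenvalue" can be verified numerically on the Iris covariance matrix (note that `eigh` returns eigenvalues in ascending order, so the last entry is the largest):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
C = (Xs.T @ Xs) / len(Xs)

eigenvalues, eigenvectors = np.linalg.eigh(C)
w = eigenvectors[:, -1]   # top eigenvector (eigh sorts ascending)
lam = eigenvalues[-1]     # its eigenvalue

projected_var = np.var(Xs @ w)  # variance of the data projected onto w
print(projected_var, lam)       # the two values match
```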
Explained Variance Ratio
A useful quantity is the explained variance ratio:
$$\mathrm{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}$$
Code:
```python
self.explained_variance_ratio_ = eigenvalues / eigenvalues.sum()
```
Concept: This tells how much total information each principal component retains.
Example interpretation:
- PC1 = 72%
- PC2 = 23%
- then first two PCs keep 95% of the total variance
This helps decide how many components to keep.
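A common way to use this in practice is a cumulative sum: keep the smallest $k$ whose cumulative explained variance passes a chosen threshold. A sketch on Iris, with an assumed 95% threshold (the threshold is a convention, not part of the algorithm):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
C = (Xs.T @ Xs) / len(Xs)
eigenvalues = np.linalg.eigvalsh(C)[::-1]  # descending order

ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratio)
print(np.round(cumulative, 4))

# smallest k whose cumulative ratio reaches the (assumed) 95% threshold
k = int(np.searchsorted(cumulative, 0.95) + 1)
print("components to keep:", k)
```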
Worked Mini Example
Suppose after the covariance computation the eigenvalues are $\lambda_1 = 2.91$, $\lambda_2 = 0.92$, $\lambda_3 = 0.15$, $\lambda_4 = 0.02$.
Then:
- first principal component captures the most variance
- second captures the next most
- third and fourth contribute little
Total variance: $2.91 + 0.92 + 0.15 + 0.02 = 4.00$
Explained variance ratios: $2.91/4 = 72.75\%$, $0.92/4 = 23\%$, $0.15/4 = 3.75\%$, $0.02/4 = 0.5\%$
So:
- PC1 keeps about 72.75%
- PC2 keeps about 23%
- first two together keep about 95.75%
This means reducing from 4D to 2D is very reasonable.
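The arithmetic above can be checked in a few lines. The eigenvalues here are assumed example values, chosen to be consistent with the percentages quoted:

```python
import numpy as np

# Example eigenvalues (assumed, consistent with the ratios quoted above)
lam = np.array([2.91, 0.92, 0.15, 0.02])
ratio = lam / lam.sum()
print(np.round(ratio * 100, 2))         # per-component percentages
print(round(ratio[:2].sum() * 100, 2))  # first two components together
```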
Understanding the Components Matrix
If the selected components are $W_k = [w_1 \; w_2] \in \mathbb{R}^{4 \times 2}$, then:
- first column = first principal component
- second column = second principal component
Each column shows how the original features combine to form the new axis.
For example, the first component is:
$$\mathrm{PC}_1 = w_{11}\,x_1 + w_{21}\,x_2 + w_{31}\,x_3 + w_{41}\,x_4$$
where $x_1, \dots, x_4$ are the standardized original features. So a principal component is a linear combination of the original features.
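To see these weights concretely, one can print each feature's contribution to the first two components (using scikit-learn here; the signs of the weights may differ between implementations):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
Xs = (iris.data - iris.data.mean(axis=0)) / iris.data.std(axis=0)
pca = PCA(n_components=2).fit(Xs)

# Each row of components_ holds one principal component's feature weights
for name, w1, w2 in zip(iris.feature_names, pca.components_[0], pca.components_[1]):
    print(f"{name:20s} PC1={w1:+.3f} PC2={w2:+.3f}")
```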
Comparing with Scikit-Learn
Code:
```python
sk_pca = SklearnPCA(n_components=2)
X_sk = sk_pca.fit_transform((X - X.mean(axis=0)) / X.std(axis=0))
print(np.round(sk_pca.components_.T, 4))
```
Concept: This checks whether the scratch implementation gives similar principal components.
Important note: Principal components may differ by sign. If one library gives $w$, another may give $-w$. This is still correct, because both vectors represent the same direction (axis) in feature space.
So when comparing PCA outputs, sign flips are normal.
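A sign-aware comparison can be sketched like this: flip any column whose dot product with the reference column is negative, then compare the matrices directly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Components from two routes: sklearn's SVD vs. a direct eigendecomposition
W_sk = PCA(n_components=2).fit(Xs).components_.T  # shape (4, 2)
eigenvalues, eigenvectors = np.linalg.eigh((Xs.T @ Xs) / len(Xs))
W_eig = eigenvectors[:, np.argsort(eigenvalues)[::-1]][:, :2]

# Align signs column by column before comparing
for j in range(W_eig.shape[1]):
    if W_sk[:, j] @ W_eig[:, j] < 0:
        W_eig[:, j] *= -1

print(np.allclose(W_sk, W_eig, atol=1e-6))
```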
Why PCA Works
PCA works because:
- it identifies directions of maximum spread
- those directions preserve the most information
- the directions are orthogonal, so they do not duplicate information
- low-variance directions can often be removed with minimal information loss
So PCA compresses data while keeping the most useful structure.
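The "no duplicated information" claim is easy to verify: the eigenvector matrix satisfies $W^\top W = I$, and the projected features are uncorrelated (their covariance matrix is diagonal):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
C = (Xs.T @ Xs) / len(Xs)
_, W = np.linalg.eigh(C)

# Orthonormal directions: W.T @ W is the identity matrix
print(np.round(W.T @ W, 6))

# Projected features are uncorrelated: their covariance matrix is diagonal
Z = Xs @ W
C_z = (Z.T @ Z) / len(Z)
print(np.round(C_z, 6))
```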
Limitations
1. PCA is linear
PCA only finds linear combinations of features. If structure is highly non-linear, PCA may not capture it well.
2. PCA is sensitive to scale
Without standardization, large-scale features dominate.
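A small demonstration of this, on synthetic data I am assuming here (two correlated features, the first on a 1000x larger scale): without standardization the first principal component locks onto the large-scale feature, while after standardization both features contribute equally.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumed): two correlated features, the first on a 1000x scale
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([1000 * t + rng.normal(size=200),
                     t + rng.normal(scale=0.5, size=200)])

# Without standardization, PC1 locks onto the large-scale feature
pc1_raw = PCA(n_components=1).fit(X).components_[0]
print(np.round(np.abs(pc1_raw), 4))   # close to [1, 0]

# After standardization, both features contribute equally
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
pc1_std = PCA(n_components=1).fit(Xs).components_[0]
print(np.round(np.abs(pc1_std), 4))   # close to [0.7071, 0.7071]
```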
3. Components may be hard to interpret
The new axes are combinations of original features, so they may not be as interpretable as raw columns.
4. Variance does not always mean usefulness
PCA keeps directions with high variance, but high variance does not always mean high predictive importance for a target variable.
Exam-Oriented Summary
Definition
PCA is an unsupervised dimensionality reduction technique that transforms correlated features into orthogonal principal components.
Goal
Reduce the number of features while retaining maximum variance.
Steps
- standardize data
- compute covariance matrix
- find eigenvalues and eigenvectors
- sort them in descending order
- select top components
- project data onto them
Important Equations
Standardization:
$$z = \frac{x - \mu}{\sigma}$$
Covariance:
$$C = \frac{1}{m} X_s^\top X_s$$
Eigen equation:
$$C w = \lambda w$$
Projection:
$$Z = X_s W_k$$
Explained variance ratio:
$$\mathrm{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}$$
Interpretation
- eigenvectors = directions of principal components
- eigenvalues = variance captured by those directions
- larger eigenvalue = more important component
Very Short Revision
PCA reduces dimensions by:
- standardizing data
- computing covariance matrix
- finding eigenvectors/eigenvalues
- sorting by largest eigenvalue
- keeping top components
- projecting data onto them
Main idea: maximize the projected variance $w^\top C w$ subject to $w^\top w = 1$; the solutions are the eigenvectors of $C$ with the largest eigenvalues.
Final Takeaway
PCA transforms high-dimensional data into a lower-dimensional form by finding the most informative orthogonal directions. These directions are the eigenvectors of the covariance matrix, and their importance is measured by eigenvalues. The larger the eigenvalue, the more variance that principal component preserves, so the better it is for dimensionality reduction.