Artificial Neural Network (ANN)
An Artificial Neural Network is a model made of layers of neurons. A basic ANN has:
- input layer
- hidden layer(s)
- output layer
- weights and biases
- activation functions
A neuron computes a weighted sum and then applies an activation function:
$z = w^\top x + b, \qquad a = g(z)$
For this example, we build a 2-layer neural network for binary classification:
- hidden layer uses ReLU
- output layer uses Sigmoid
The final output is a probability $\hat{y} = A_2 \in (0, 1)$, and the prediction is $1$ if $\hat{y} \ge 0.5$, otherwise $0$.
What This Network Learns
For input $X$, the network computes:
$Z_1 = W_1 X + b_1, \quad A_1 = \mathrm{ReLU}(Z_1), \quad Z_2 = W_2 A_1 + b_2, \quad A_2 = \sigma(Z_2)$
Where:
- $Z_1, Z_2$ = linear outputs
- $A_1$ = hidden layer activations
- $A_2$ = final predicted probability
Short and Clean Code
import numpy as np

class SimpleANN:
    def __init__(self, input_size, hidden_size, lr=0.1, epochs=10000):
        np.random.seed(42)
        self.lr = lr
        self.epochs = epochs
        self.W1 = np.random.randn(hidden_size, input_size) * 0.1
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(1, hidden_size) * 0.1
        self.b2 = np.zeros((1, 1))
        self.costs = []

    def relu(self, Z):
        return np.maximum(0, Z)

    def relu_deriv(self, Z):
        return (Z > 0).astype(float)

    def sigmoid(self, Z):
        Z = np.clip(Z, -500, 500)
        return 1 / (1 + np.exp(-Z))

    def forward(self, X):
        Z1 = self.W1 @ X + self.b1
        A1 = self.relu(Z1)
        Z2 = self.W2 @ A1 + self.b2
        A2 = self.sigmoid(Z2)
        cache = (X, Z1, A1, Z2, A2)
        return A2, cache

    def compute_cost(self, Y, A2):
        eps = 1e-9
        A2 = np.clip(A2, eps, 1 - eps)
        return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

    def backward(self, Y, cache):
        X, Z1, A1, Z2, A2 = cache
        m = X.shape[1]
        dZ2 = A2 - Y
        dW2 = (dZ2 @ A1.T) / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (self.W2.T @ dZ2) * self.relu_deriv(Z1)
        dW1 = (dZ1 @ X.T) / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        return dW1, db1, dW2, db2

    def fit(self, X, Y):
        for i in range(self.epochs):
            A2, cache = self.forward(X)
            cost = self.compute_cost(Y, A2)
            dW1, db1, dW2, db2 = self.backward(Y, cache)
            self.W1 -= self.lr * dW1
            self.b1 -= self.lr * db1
            self.W2 -= self.lr * dW2
            self.b2 -= self.lr * db2
            self.costs.append(cost)
            if i % 1000 == 0:
                print(f"Epoch {i}: Cost = {cost:.6f}")

    def predict(self, X):
        A2, _ = self.forward(X)
        return (A2 >= 0.5).astype(int)

X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 0, 0, 1]], dtype=float)

model = SimpleANN(input_size=2, hidden_size=4, lr=0.1, epochs=10000)
model.fit(X, Y)
pred = model.predict(X)
print("Predictions:", pred)
Dataset Used: AND Gate
The network is trained on the AND truth table:
Input matrix:
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
Target matrix:
Y = np.array([[0, 0, 0, 1]], dtype=float)
Shape meaning:
- $X$ has shape $(2, 4)$
- 2 input features (one per row)
- 4 training examples (one per column)
So each column is one example: for instance, the last column $(1, 1)^\top$ carries the label $1$.
Network Architecture
This ANN has:
- 2 input neurons
- 4 hidden neurons
- 1 output neuron
So parameter shapes are:
- $W_1$: $(4, 2)$, $b_1$: $(4, 1)$
- $W_2$: $(1, 4)$, $b_2$: $(1, 1)$
Step-by-Step Algorithm
Step 1: Initialize weights and biases
We begin with small random weights and zero biases.
Code:
self.W1 = np.random.randn(hidden_size, input_size) * 0.1
self.b1 = np.zeros((hidden_size, 1))
self.W2 = np.random.randn(1, hidden_size) * 0.1
self.b2 = np.zeros((1, 1))
Concept:
- weights decide how strongly neurons influence the next layer
- biases shift the activation
- small random values break symmetry
- if all weights start the same, neurons learn the same thing
Why not large random weights:
- large values can make training unstable
- small values help smoother learning at the start
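To see why identical starting weights are a problem, here is a minimal sketch (the tiny network and values below are illustrative, not the model above): two hidden neurons initialized identically receive identical gradients, so no update can ever make them different.

```python
import numpy as np

# Illustrative sketch: two hidden neurons initialized identically
# receive identical gradients, so they stay identical forever.
X = np.array([[1.0], [2.0]])        # one sample with 2 features
Y = np.array([[1.0]])
W1 = np.full((2, 2), 0.5)           # both hidden neurons identical
W2 = np.full((1, 2), 0.5)

Z1 = W1 @ X                         # both rows of Z1 are equal
A1 = np.maximum(0, Z1)
A2 = 1 / (1 + np.exp(-(W2 @ A1)))

dZ2 = A2 - Y
dZ1 = (W2.T @ dZ2) * (Z1 > 0)
dW1 = dZ1 @ X.T                     # gradient rows are identical too

print(np.allclose(dW1[0], dW1[1]))  # -> True
```

Because the gradient rows match, both neurons would remain clones after every update; small random initialization avoids this.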
Step 2: Hidden layer linear transformation
Each hidden neuron computes a linear combination of the inputs: $Z_1 = W_1 X + b_1$
Code:
Z1 = self.W1 @ X + self.b1
Concept: This is the weighted sum of inputs plus bias.
For one hidden neuron: $z_j = w_{j1} x_1 + w_{j2} x_2 + b_j$
Since there are 4 hidden neurons, this is done 4 times in parallel.
Step 3: Apply ReLU activation
The ReLU function is: $\mathrm{ReLU}(z) = \max(0, z)$
Code:
A1 = self.relu(Z1)
and:
def relu(self, Z):
    return np.maximum(0, Z)
Concept:
- negative values become 0
- positive values remain unchanged
Why ReLU:
- introduces non-linearity
- lets the network learn more complex patterns
- simple and efficient
Without activation, multiple layers would collapse into just one linear transformation.
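As a quick numeric check (the input values here are arbitrary):

```python
import numpy as np

# ReLU zeroes out negatives and passes positives through unchanged
Z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
A = np.maximum(0, Z)
print(A.tolist())  # -> [0.0, 0.0, 0.0, 1.5, 3.0]
```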
Step 4: Output layer linear transformation
Now hidden activations are passed to the output neuron:
Code:
Z2 = self.W2 @ A1 + self.b2
Concept: This combines the hidden-layer outputs into one final score.
Step 5: Apply Sigmoid to get probability
Sigmoid function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
Code:
A2 = self.sigmoid(Z2)
and:
def sigmoid(self, Z):
    Z = np.clip(Z, -500, 500)
    return 1 / (1 + np.exp(-Z))
Concept:
- converts raw score into probability
- output is between 0 and 1
- suitable for binary classification
Meaning: $A_2$ is the model's estimate of $P(y = 1 \mid x)$.
Why clip is used:
- prevents overflow in $e^{-z}$ for large $|z|$
- improves numerical stability
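A few sample values illustrate the squashing and the effect of clipping (the inputs below are arbitrary):

```python
import numpy as np

def sigmoid(Z):
    Z = np.clip(Z, -500, 500)   # clip keeps np.exp finite even for extreme inputs
    return 1 / (1 + np.exp(-Z))

vals = sigmoid(np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0]))
print(np.round(vals, 4))        # extremes saturate near 0 and 1, center is 0.5
```

Without the clip, `np.exp(1000)` would overflow; with it, the extreme inputs simply saturate.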
Step 6: Compute the cost
For binary classification, we use binary cross-entropy loss:
$J = -\frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \big]$
Code:
def compute_cost(self, Y, A2):
    eps = 1e-9
    A2 = np.clip(A2, eps, 1 - eps)
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
Concept:
- if actual label is 1, we want output close to 1
- if actual label is 0, we want output close to 0
- confident wrong predictions get heavily penalized
Why clip again:
- avoids $\log(0)$, which is undefined
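To see how confident mistakes are penalized, consider a true label of 1 and a few predicted probabilities (the values are illustrative):

```python
import math

# binary cross-entropy contribution when the true label is 1: -log(p)
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:<5} loss = {-math.log(p):.3f}")
```

The loss is tiny for a confident correct prediction (p = 0.99) and explodes as the prediction moves confidently toward the wrong label (p = 0.01).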
Step 7: Backpropagation for output layer
The error at the output layer is: $dZ_2 = A_2 - Y$
Code:
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
Equations: $dW_2 = \frac{1}{m} dZ_2 A_1^\top, \quad db_2 = \frac{1}{m} \sum dZ_2$
Concept: This tells how much the output weights and bias contributed to the error.
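The compact form $dZ_2 = A_2 - Y$ is not an approximation; as a sketch, it follows from chaining the cross-entropy and sigmoid derivatives:

```latex
\frac{\partial J}{\partial A_2} = -\frac{Y}{A_2} + \frac{1 - Y}{1 - A_2},
\qquad
\frac{\partial A_2}{\partial Z_2} = A_2 (1 - A_2)
\;\Rightarrow\;
\frac{\partial J}{\partial Z_2}
  = \left(-\frac{Y}{A_2} + \frac{1 - Y}{1 - A_2}\right) A_2 (1 - A_2)
  = A_2 - Y
```

This cancellation is one reason sigmoid outputs pair so naturally with cross-entropy loss.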
Step 8: Backpropagation for hidden layer
The hidden layer error is: $dZ_1 = (W_2^\top dZ_2) \odot \mathrm{ReLU}'(Z_1)$
Code:
dZ1 = (self.W2.T @ dZ2) * self.relu_deriv(Z1)
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m
Equations: $dW_1 = \frac{1}{m} dZ_1 X^\top, \quad db_1 = \frac{1}{m} \sum dZ_1$
ReLU derivative: $\mathrm{ReLU}'(z) = 1$ if $z > 0$, else $0$
Code:
def relu_deriv(self, Z):
    return (Z > 0).astype(float)
Concept:
- output error is sent backward into the hidden layer
- only active ReLU neurons pass gradient
- this is how the network learns internal representations
Step 9: Update parameters
Gradient descent update rule: $\theta \leftarrow \theta - \alpha \, d\theta$ for each parameter $\theta \in \{W_1, b_1, W_2, b_2\}$
Code:
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
Concept:
- move parameters in the direction that reduces loss
- repeat this many times
- gradually improve predictions
Here:
- $\alpha$ (lr in the code) is the learning rate
- a higher learning rate updates faster, but may overshoot
- a lower learning rate is safer, but slower
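A one-dimensional sketch makes both failure modes concrete. The function $f(w) = w^2$ below is illustrative, not part of the network; its gradient is $2w$.

```python
# gradient descent on f(w) = w^2, whose gradient is 2w
def descend(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(abs(descend(0.1)))   # small lr: shrinks toward the minimum at 0
print(abs(descend(1.1)))   # too-large lr: each step overshoots, |w| grows
```

With lr = 0.1 each step multiplies w by 0.8, so it converges; with lr = 1.1 each step multiplies w by -1.2, so it diverges.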
Step 10: Make predictions
After training, the network outputs probabilities. Convert them into classes using threshold 0.5:
Code:
def predict(self, X):
    A2, _ = self.forward(X)
    return (A2 >= 0.5).astype(int)
Concept -> Equation -> Code Mapping
1. Weighted input
Concept: Each neuron forms a weighted sum of inputs.
Equation: $Z_l = W_l A_{l-1} + b_l$ (with $A_0 = X$)
Code:
Z1 = self.W1 @ X + self.b1
Z2 = self.W2 @ A1 + self.b2
2. Non-linearity
Concept: Activation functions allow the network to learn beyond straight-line relationships.
Equations: $A_1 = \max(0, Z_1), \quad A_2 = \sigma(Z_2)$
Code:
A1 = self.relu(Z1)
A2 = self.sigmoid(Z2)
3. Forward propagation
Concept: Data flows from input to hidden to output.
Equations: $Z_1 = W_1 X + b_1 \rightarrow A_1 = \mathrm{ReLU}(Z_1) \rightarrow Z_2 = W_2 A_1 + b_2 \rightarrow A_2 = \sigma(Z_2)$
Code:
def forward(self, X):
    Z1 = self.W1 @ X + self.b1
    A1 = self.relu(Z1)
    Z2 = self.W2 @ A1 + self.b2
    A2 = self.sigmoid(Z2)
4. Loss measurement
Concept: We need to measure how wrong predictions are.
Equation: $J = -\frac{1}{m} \sum \big[ Y \log A_2 + (1 - Y) \log(1 - A_2) \big]$
Code:
cost = self.compute_cost(Y, A2)
5. Error propagation backward
Concept: The network computes gradients layer by layer from output back to input.
Equations: $dZ_2 = A_2 - Y, \quad dW_2 = \frac{1}{m} dZ_2 A_1^\top, \quad dZ_1 = (W_2^\top dZ_2) \odot \mathrm{ReLU}'(Z_1), \quad dW_1 = \frac{1}{m} dZ_1 X^\top$
Code:
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
dZ1 = (self.W2.T @ dZ2) * self.relu_deriv(Z1)
dW1 = (dZ1 @ X.T) / m
6. Learning
Concept: Use gradients to improve parameters.
Equation: $\theta \leftarrow \theta - \alpha \, d\theta$
Code:
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
Solving the AND Gate Example
The AND gate outputs 1 only when both inputs are 1:
- (0, 0) → 0
- (0, 1) → 0
- (1, 0) → 0
- (1, 1) → 1
During training:
- the network starts with random weights
- predictions are poor at first
- after many epochs, weights and biases adjust
- the cost decreases
- final outputs approach the correct AND values
Expected final prediction:
Code:
pred = model.predict(X)
print("Predictions:", pred)
If training succeeds, output becomes:
Predictions: [[0 0 0 1]]
One Forward Pass Example
Suppose for one sample:
Assume one hidden neuron has:
Then:
Apply ReLU:
Then output neuron may combine hidden activations and pass through sigmoid. If final output score is: then:
Since: prediction is:
This is how the ANN converts inputs into a class decision.
Why ANN Works
A neural network works because:
- weights learn which inputs matter
- biases shift decision boundaries
- activation functions add non-linearity
- backpropagation tells each parameter how it contributed to the error
- gradient descent improves the parameters repeatedly
So the network gradually learns a function that maps input to output.
Why Hidden Layers Matter
A single linear model can only learn a linear boundary. A hidden layer with activation allows:
- combinations of features
- piecewise linear transformations
- more expressive decision boundaries
Even though AND is simple, this example demonstrates the full learning pipeline of an ANN.
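As an illustrative extension (not part of the original example), the same 2-4-1 architecture can also fit XOR, which no single linear boundary separates. The seed, weight scale, learning rate, and epoch count below are assumed values that typically, but not always, converge:

```python
import numpy as np

# Self-contained sketch: training a 2-4-1 ReLU/sigmoid network on XOR.
# Seed, weight scale, learning rate and epoch count are illustrative choices.
np.random.seed(0)
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float)   # XOR labels

W1 = np.random.randn(4, 2) * 0.5
b1 = np.zeros((4, 1))
W2 = np.random.randn(1, 4) * 0.5
b2 = np.zeros((1, 1))
lr, m = 0.5, X.shape[1]

for _ in range(20000):
    Z1 = W1 @ X + b1                          # forward pass
    A1 = np.maximum(0, Z1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
    dZ2 = A2 - Y                              # backward pass
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    W1 -= lr * dW1; b1 -= lr * db1            # gradient descent step
    W2 -= lr * dW2; b2 -= lr * db2

A2c = np.clip(A2, 1e-9, 1 - 1e-9)             # same clipping as compute_cost
loss = -np.mean(Y * np.log(A2c) + (1 - Y) * np.log(1 - A2c))
print("final loss:", round(float(loss), 4))
print("predictions:", (A2 >= 0.5).astype(int))
```

With these settings the network usually recovers the XOR table [[0 1 1 0]], but ReLU networks can occasionally stall with dead units, in which case a different seed or initialization scale helps.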
Cost Curve Meaning
The cost printed every 1000 epochs shows whether learning is working.
If cost decreases:
- predictions are improving
- gradients are useful
- parameter updates are moving in the correct direction
If cost does not decrease:
- learning rate may be wrong
- architecture may be unsuitable
- initialization may be poor
Practical Notes
1. Initialization matters
Bad initialization can slow or break learning.
2. Learning rate matters
If learning rate is:
- too high -> unstable training
- too low -> very slow training
3. Activation choice matters
- ReLU is common in hidden layers
- Sigmoid is common for binary output
4. More layers increase capacity
Deeper networks can learn more complex patterns, but are also harder to train.
Exam-Oriented Summary
Definition
An ANN is a layered network of neurons that learns by adjusting weights and biases using backpropagation and gradient descent.
Architecture Used
- 2 input neurons
- 1 hidden layer with 4 neurons
- 1 output neuron
Important Equations
- Hidden layer: $Z_1 = W_1 X + b_1$, $A_1 = \mathrm{ReLU}(Z_1)$
- Output layer: $Z_2 = W_2 A_1 + b_2$, $A_2 = \sigma(Z_2)$
- Sigmoid: $\sigma(z) = 1 / (1 + e^{-z})$
- ReLU: $\mathrm{ReLU}(z) = \max(0, z)$
- Loss: $J = -\frac{1}{m} \sum \big[ Y \log A_2 + (1 - Y) \log(1 - A_2) \big]$
- Gradient descent: $\theta \leftarrow \theta - \alpha \, d\theta$
Training Steps
- initialize parameters
- perform forward propagation
- compute loss
- perform backpropagation
- update parameters
- repeat for many epochs
Very Short Revision
- input passes through weights and bias
- ReLU activates hidden layer
- sigmoid gives output probability
- cross-entropy measures error
- backpropagation computes gradients
- gradient descent updates weights
- repeat until cost decreases and predictions improve
Final Takeaway
This ANN from scratch shows the complete neural-network learning process:
- forward propagation computes predictions
- loss measures error
- backpropagation computes gradients
- gradient descent updates parameters
For the AND gate dataset, the network learns the correct truth table, predicting [[0 0 0 1]], which shows that it has successfully learned the mapping from inputs to outputs.