Artificial Neural Network (ANN)
An Artificial Neural Network is a model made of layers of neurons. A basic ANN has:
- input layer
- hidden layer(s)
- output layer
- weights and biases
- activation functions
A neuron computes a weighted sum and then applies an activation function:
$z = w^\top x + b, \qquad a = g(z)$
For this example, we build a 2-layer neural network for binary classification:
- hidden layer uses ReLU
- output layer uses Sigmoid
The final output is a probability $\hat{y} = A_2 \in (0, 1)$, and the prediction is $1$ if $\hat{y} \ge 0.5$, otherwise $0$.
What This Network Learns
For input $X$, the network computes:
$Z_1 = W_1 X + b_1, \quad A_1 = \mathrm{ReLU}(Z_1), \quad Z_2 = W_2 A_1 + b_2, \quad A_2 = \sigma(Z_2)$
Where:
- $Z_1, Z_2$ = linear outputs
- $A_1$ = hidden layer activations
- $A_2$ = final predicted probability
Short and Clean Code
import numpy as np

class SimpleANN:
    def __init__(self, input_size, hidden_size, lr=0.1, epochs=10000):
        np.random.seed(42)
        self.lr = lr
        self.epochs = epochs
        self.W1 = np.random.randn(hidden_size, input_size) * 0.1
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(1, hidden_size) * 0.1
        self.b2 = np.zeros((1, 1))
        self.costs = []

    def relu(self, Z):
        return np.maximum(0, Z)

    def relu_deriv(self, Z):
        return (Z > 0).astype(float)

    def sigmoid(self, Z):
        Z = np.clip(Z, -500, 500)
        return 1 / (1 + np.exp(-Z))

    def forward(self, X):
        Z1 = self.W1 @ X + self.b1
        A1 = self.relu(Z1)
        Z2 = self.W2 @ A1 + self.b2
        A2 = self.sigmoid(Z2)
        cache = (X, Z1, A1, Z2, A2)
        return A2, cache

    def compute_cost(self, Y, A2):
        eps = 1e-9
        A2 = np.clip(A2, eps, 1 - eps)
        return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

    def backward(self, Y, cache):
        X, Z1, A1, Z2, A2 = cache
        m = X.shape[1]
        dZ2 = A2 - Y
        dW2 = (dZ2 @ A1.T) / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (self.W2.T @ dZ2) * self.relu_deriv(Z1)
        dW1 = (dZ1 @ X.T) / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        return dW1, db1, dW2, db2

    def fit(self, X, Y):
        for i in range(self.epochs):
            A2, cache = self.forward(X)
            cost = self.compute_cost(Y, A2)
            dW1, db1, dW2, db2 = self.backward(Y, cache)
            self.W1 -= self.lr * dW1
            self.b1 -= self.lr * db1
            self.W2 -= self.lr * dW2
            self.b2 -= self.lr * db2
            self.costs.append(cost)
            if i % 1000 == 0:
                print(f"Epoch {i}: Cost = {cost:.6f}")

    def predict(self, X):
        A2, _ = self.forward(X)
        return (A2 >= 0.5).astype(int)

X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 0, 0, 1]], dtype=float)

model = SimpleANN(input_size=2, hidden_size=4, lr=0.1, epochs=10000)
model.fit(X, Y)
pred = model.predict(X)
print("Predictions:", pred)
Dataset Used: AND Gate
The network is trained on the AND truth table:
Input matrix:
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
Target matrix:
Y = np.array([[0, 0, 0, 1]], dtype=float)
Shape meaning:
- $X$ has shape $(2, 4)$
- 2 input features (one per row)
- 4 training examples (one per column)
So each column is one example: for instance, the last column $(1, 1)^\top$ carries the label $1$.
Network Architecture
This ANN has:
- 2 input neurons
- 4 hidden neurons
- 1 output neuron
So parameter shapes are:
- $W_1$: $(4, 2)$, $b_1$: $(4, 1)$
- $W_2$: $(1, 4)$, $b_2$: $(1, 1)$
Step-by-Step Algorithm
Step 1: Initialize weights and biases
We begin with small random weights and zero biases.
Code:
self.W1 = np.random.randn(hidden_size, input_size) * 0.1
self.b1 = np.zeros((hidden_size, 1))
self.W2 = np.random.randn(1, hidden_size) * 0.1
self.b2 = np.zeros((1, 1))
Concept:
- weights decide how strongly neurons influence the next layer
- biases shift the activation
- small random values break symmetry
- if all weights start the same, neurons learn the same thing
Why not large random weights:
- large values can make training unstable
- small values help smoother learning at the start
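To see why identical starting weights are a problem, here is a minimal sketch (the tiny network and values below are illustrative, not the model above): two hidden neurons initialized identically receive identical gradients, so no update can ever make them different.

```python
import numpy as np

# Illustrative sketch: two hidden neurons initialized identically
# receive identical gradients, so they stay identical forever.
X = np.array([[1.0], [2.0]])        # one sample with 2 features
Y = np.array([[1.0]])
W1 = np.full((2, 2), 0.5)           # both hidden neurons identical
W2 = np.full((1, 2), 0.5)

Z1 = W1 @ X                         # both rows of Z1 are equal
A1 = np.maximum(0, Z1)
A2 = 1 / (1 + np.exp(-(W2 @ A1)))

dZ2 = A2 - Y
dZ1 = (W2.T @ dZ2) * (Z1 > 0)
dW1 = dZ1 @ X.T                     # gradient rows are identical too

print(np.allclose(dW1[0], dW1[1]))  # -> True
```

Because the gradient rows match, both neurons would remain clones after every update; small random initialization avoids this.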
Step 2: Hidden layer linear transformation
Each hidden neuron computes a linear combination of the inputs: $Z_1 = W_1 X + b_1$
Code:
Z1 = self.W1 @ X + self.b1
Concept: This is the weighted sum of inputs plus bias.
For one hidden neuron: $z_j = w_{j1} x_1 + w_{j2} x_2 + b_j$
Since there are 4 hidden neurons, this is done 4 times in parallel.
Step 3: Apply ReLU activation
The ReLU function is: $\mathrm{ReLU}(z) = \max(0, z)$
Code:
A1 = self.relu(Z1)
and:
def relu(self, Z):
    return np.maximum(0, Z)
Concept:
- negative values become 0
- positive values remain unchanged
Why ReLU:
- introduces non-linearity
- lets the network learn more complex patterns
- simple and efficient
Without activation, multiple layers would collapse into just one linear transformation.
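As a quick numeric check (the input values here are arbitrary):

```python
import numpy as np

# ReLU zeroes out negatives and passes positives through unchanged
Z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
A = np.maximum(0, Z)
print(A.tolist())  # -> [0.0, 0.0, 0.0, 1.5, 3.0]
```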
Step 4: Output layer linear transformation
Now hidden activations are passed to the output neuron:
Code:
Z2 = self.W2 @ A1 + self.b2
Concept: This combines the hidden-layer outputs into one final score.
Step 5: Apply Sigmoid to get probability
Sigmoid function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
Code:
A2 = self.sigmoid(Z2)
and:
def sigmoid(self, Z):
    Z = np.clip(Z, -500, 500)
    return 1 / (1 + np.exp(-Z))
Concept:
- converts raw score into probability
- output is between 0 and 1
- suitable for binary classification
Meaning: $A_2$ is the model's estimate of $P(y = 1 \mid x)$.
Why clip is used:
- prevents overflow in $e^{-z}$ for large $|z|$
- improves numerical stability
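A few sample values illustrate the squashing and the effect of clipping (the inputs below are arbitrary):

```python
import numpy as np

def sigmoid(Z):
    Z = np.clip(Z, -500, 500)   # clip keeps np.exp finite even for extreme inputs
    return 1 / (1 + np.exp(-Z))

vals = sigmoid(np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0]))
print(np.round(vals, 4))        # extremes saturate near 0 and 1, center is 0.5
```

Without the clip, `np.exp(1000)` would overflow; with it, the extreme inputs simply saturate.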
Step 6: Compute the cost
For binary classification, we use binary cross-entropy loss:
$J = -\frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \big]$
Code:
def compute_cost(self, Y, A2):
    eps = 1e-9
    A2 = np.clip(A2, eps, 1 - eps)
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
Concept:
- if actual label is 1, we want output close to 1
- if actual label is 0, we want output close to 0
- confident wrong predictions get heavily penalized
Why clip again:
- avoids $\log(0)$, which is undefined
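To see how confident mistakes are penalized, consider a true label of 1 and a few predicted probabilities (the values are illustrative):

```python
import math

# binary cross-entropy contribution when the true label is 1: -log(p)
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:<5} loss = {-math.log(p):.3f}")
```

The loss is tiny for a confident correct prediction (p = 0.99) and explodes as the prediction moves confidently toward the wrong label (p = 0.01).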
Step 7: Backpropagation for output layer
The error at the output layer is: $dZ_2 = A_2 - Y$
Code:
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
Equations: $dW_2 = \frac{1}{m} dZ_2 A_1^\top, \quad db_2 = \frac{1}{m} \sum dZ_2$
Concept: This tells how much the output weights and bias contributed to the error.
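The compact form $dZ_2 = A_2 - Y$ is not an approximation; as a sketch, it follows from chaining the cross-entropy and sigmoid derivatives:

```latex
\frac{\partial J}{\partial A_2} = -\frac{Y}{A_2} + \frac{1 - Y}{1 - A_2},
\qquad
\frac{\partial A_2}{\partial Z_2} = A_2 (1 - A_2)
\;\Rightarrow\;
\frac{\partial J}{\partial Z_2}
  = \left(-\frac{Y}{A_2} + \frac{1 - Y}{1 - A_2}\right) A_2 (1 - A_2)
  = A_2 - Y
```

This cancellation is one reason sigmoid outputs pair so naturally with cross-entropy loss.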
Step 8: Backpropagation for hidden layer
The hidden layer error is: $dZ_1 = (W_2^\top dZ_2) \odot \mathrm{ReLU}'(Z_1)$
Code:
dZ1 = (self.W2.T @ dZ2) * self.relu_deriv(Z1)
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m
Equations: $dW_1 = \frac{1}{m} dZ_1 X^\top, \quad db_1 = \frac{1}{m} \sum dZ_1$
ReLU derivative: $\mathrm{ReLU}'(z) = 1$ if $z > 0$, else $0$
Code:
def relu_deriv(self, Z):
    return (Z > 0).astype(float)
Concept:
- output error is sent backward into the hidden layer
- only active ReLU neurons pass gradient
- this is how the network learns internal representations
Step 9: Update parameters
Gradient descent update rule: $\theta \leftarrow \theta - \alpha \, d\theta$ for each parameter $\theta \in \{W_1, b_1, W_2, b_2\}$
Code:
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
Concept:
- move parameters in the direction that reduces loss
- repeat this many times
- gradually improve predictions
Here:
- $\alpha$ (lr in the code) is the learning rate
- a higher learning rate updates faster, but may overshoot
- a lower learning rate is safer, but slower
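A one-dimensional sketch makes both failure modes concrete. The function $f(w) = w^2$ below is illustrative, not part of the network; its gradient is $2w$.

```python
# gradient descent on f(w) = w^2, whose gradient is 2w
def descend(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(abs(descend(0.1)))   # small lr: shrinks toward the minimum at 0
print(abs(descend(1.1)))   # too-large lr: each step overshoots, |w| grows
```

With lr = 0.1 each step multiplies w by 0.8, so it converges; with lr = 1.1 each step multiplies w by -1.2, so it diverges.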
Step 10: Make predictions
After training, the network outputs probabilities. Convert them into classes using threshold 0.5:
Code:
def predict(self, X):
    A2, _ = self.forward(X)
    return (A2 >= 0.5).astype(int)
Concept -> Equation -> Code Mapping
1. Weighted input
Concept: Each neuron forms a weighted sum of inputs.
Equation: $Z_l = W_l A_{l-1} + b_l$ (with $A_0 = X$)
Code:
Z1 = self.W1 @ X + self.b1
Z2 = self.W2 @ A1 + self.b2
2. Non-linearity
Concept: Activation functions allow the network to learn beyond straight-line relationships.
Equations: $A_1 = \max(0, Z_1), \quad A_2 = \sigma(Z_2)$
Code:
A1 = self.relu(Z1)
A2 = self.sigmoid(Z2)
3. Forward propagation
Concept: Data flows from input to hidden to output.
Equations: $Z_1 = W_1 X + b_1 \rightarrow A_1 = \mathrm{ReLU}(Z_1) \rightarrow Z_2 = W_2 A_1 + b_2 \rightarrow A_2 = \sigma(Z_2)$
Code:
def forward(self, X):
    Z1 = self.W1 @ X + self.b1
    A1 = self.relu(Z1)
    Z2 = self.W2 @ A1 + self.b2
    A2 = self.sigmoid(Z2)
4. Loss measurement
Concept: We need to measure how wrong predictions are.
Equation: $J = -\frac{1}{m} \sum \big[ Y \log A_2 + (1 - Y) \log(1 - A_2) \big]$
Code:
cost = self.compute_cost(Y, A2)
5. Error propagation backward
Concept: The network computes gradients layer by layer from output back to input.
Equations: $dZ_2 = A_2 - Y, \quad dW_2 = \frac{1}{m} dZ_2 A_1^\top, \quad dZ_1 = (W_2^\top dZ_2) \odot \mathrm{ReLU}'(Z_1), \quad dW_1 = \frac{1}{m} dZ_1 X^\top$
Code:
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
dZ1 = (self.W2.T @ dZ2) * self.relu_deriv(Z1)
dW1 = (dZ1 @ X.T) / m
6. Learning
Concept: Use gradients to improve parameters.
Equation: $\theta \leftarrow \theta - \alpha \, d\theta$
Code:
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
Solving the AND Gate Example
The AND gate outputs 1 only when both inputs are 1:
- (0, 0) → 0
- (0, 1) → 0
- (1, 0) → 0
- (1, 1) → 1
During training:
- the network starts with random weights
- predictions are poor at first
- after many epochs, weights and biases adjust
- the cost decreases
- final outputs approach the correct AND values
Expected final prediction:
Code:
pred = model.predict(X)
print("Predictions:", pred)
If training succeeds, output becomes:
Predictions: [[0 0 0 1]]
One Forward Pass Example
Suppose for one sample:
Assume one hidden neuron has:
Then:
Apply ReLU:
Then output neuron may combine hidden activations and pass through sigmoid. If final output score is: then:
Since: prediction is:
This is how the ANN converts inputs into a class decision.
Why ANN Works
A neural network works because:
- weights learn which inputs matter
- biases shift decision boundaries
- activation functions add non-linearity
- backpropagation tells each parameter how it contributed to the error
- gradient descent improves the parameters repeatedly
So the network gradually learns a function that maps input to output.
Why Hidden Layers Matter
A single linear model can only learn a linear boundary. A hidden layer with activation allows:
- combinations of features
- piecewise linear transformations
- more expressive decision boundaries
Even though AND is simple, this example demonstrates the full learning pipeline of an ANN.
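As an illustrative extension (not part of the original example), the same 2-4-1 architecture can also fit XOR, which no single linear boundary separates. The seed, weight scale, learning rate, and epoch count below are assumed values that typically, but not always, converge:

```python
import numpy as np

# Self-contained sketch: training a 2-4-1 ReLU/sigmoid network on XOR.
# Seed, weight scale, learning rate and epoch count are illustrative choices.
np.random.seed(0)
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float)   # XOR labels

W1 = np.random.randn(4, 2) * 0.5
b1 = np.zeros((4, 1))
W2 = np.random.randn(1, 4) * 0.5
b2 = np.zeros((1, 1))
lr, m = 0.5, X.shape[1]

for _ in range(20000):
    Z1 = W1 @ X + b1                          # forward pass
    A1 = np.maximum(0, Z1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
    dZ2 = A2 - Y                              # backward pass
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    W1 -= lr * dW1; b1 -= lr * db1            # gradient descent step
    W2 -= lr * dW2; b2 -= lr * db2

A2c = np.clip(A2, 1e-9, 1 - 1e-9)             # same clipping as compute_cost
loss = -np.mean(Y * np.log(A2c) + (1 - Y) * np.log(1 - A2c))
print("final loss:", round(float(loss), 4))
print("predictions:", (A2 >= 0.5).astype(int))
```

With these settings the network usually recovers the XOR table [[0 1 1 0]], but ReLU networks can occasionally stall with dead units, in which case a different seed or initialization scale helps.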
Cost Curve Meaning
The cost printed every 1000 epochs shows whether learning is working.
If cost decreases:
- predictions are improving
- gradients are useful
- parameter updates are moving in the correct direction
If cost does not decrease:
- learning rate may be wrong
- architecture may be unsuitable
- initialization may be poor
Practical Notes
1. Initialization matters
Bad initialization can slow or break learning.
2. Learning rate matters
If learning rate is:
- too high -> unstable training
- too low -> very slow training
3. Activation choice matters
- ReLU is common in hidden layers
- Sigmoid is common for binary output
4. More layers increase capacity
Deeper networks can learn more complex patterns, but are also harder to train.
Exam-Oriented Summary
Definition
An ANN is a layered network of neurons that learns by adjusting weights and biases using backpropagation and gradient descent.
Architecture Used
- 2 input neurons
- 1 hidden layer with 4 neurons
- 1 output neuron
Important Equations
- Hidden layer: $Z_1 = W_1 X + b_1$, $A_1 = \mathrm{ReLU}(Z_1)$
- Output layer: $Z_2 = W_2 A_1 + b_2$, $A_2 = \sigma(Z_2)$
- Sigmoid: $\sigma(z) = 1 / (1 + e^{-z})$
- ReLU: $\mathrm{ReLU}(z) = \max(0, z)$
- Loss: $J = -\frac{1}{m} \sum \big[ Y \log A_2 + (1 - Y) \log(1 - A_2) \big]$
- Gradient descent: $\theta \leftarrow \theta - \alpha \, d\theta$
Training Steps
- initialize parameters
- perform forward propagation
- compute loss
- perform backpropagation
- update parameters
- repeat for many epochs
Very Short Revision
- input passes through weights and bias
- ReLU activates hidden layer
- sigmoid gives output probability
- cross-entropy measures error
- backpropagation computes gradients
- gradient descent updates weights
- repeat until cost decreases and predictions improve
Final Takeaway
This ANN from scratch shows the complete neural-network learning process:
- forward propagation computes predictions
- loss measures error
- backpropagation computes gradients
- gradient descent updates parameters
For the AND gate dataset, the network learns the correct truth table, predicting [[0 0 0 1]], which shows that it has successfully learned the mapping from inputs to outputs.