📖 Review the Lecture notes

Artificial Neural Networks for Multi-Class Classification Problems

This lab accompanies the lecture on Artificial Neural Networks and implements an example to classify images of ‘fashion’ items from the fashion-MNIST dataset. To read the specifics about this dataset, visit https://www.kaggle.com/datasets/zalando-research/fashionmnist

This module demonstrates multiclass classification with an Artificial Neural Network (ANN).

Here, we use the Fashion-MNIST dataset, which contains images of clothing items from ten different categories: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot.

As before, each image is labeled 0 … 9 (representing the 10 fashion classes).
Each picture is 28×28 pixels, and every pixel intensity acts as one feature (independent variable $x_i$), while the label $y$ is the multiclass target to be predicted by the network.

Side note: I would love to have a more science-based dataset for you to experiment on, but it is hard to construct a meaningful dataset that is small enough to train and make predictions in just a 1:20hr lab. A lot of effort has gone into creating these toy datasets for such practice purposes. In the next weeks, we’ll have labs to use actual satellite images after you’ve learned about even more efficient convolutional neural networks (CNN).

Load libraries

Cleaned up a bit to only load necessary libraries.

Code

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import time
import pandas as pd
from scipy import optimize
from sklearn.metrics import mean_squared_error
from tabulate import tabulate
from progressbar import ProgressBar
from tensorflow.keras.datasets import fashion_mnist

Parameters to make figures look nicer:

Code

# Change some default figure properties to make them prettier:
plt.rcParams['figure.dpi'] = 100
plt.rcParams.update({'font.size': 18, 'font.family': 'STIXGeneral', 'mathtext.fontset': 'stix'})
plt.rcParams['figure.figsize'] = 7,5
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.linewidth'] = 2
plt.rcParams['lines.linewidth'] = 3

Load labeled data

Load data for the images and construct feature array $x$ and targets $y$.

I’ll provide you with an easy code to download this “fashion MNIST” practice dataset.

Code

num_labels = 10  # Fashion-MNIST has 10 classes (0..9)

# The dataset actually already has separate training and test data
(X_tr, y_tr), (X_te, y_te) = fashion_mnist.load_data()

# For now, we just combine all the data together, and let you split as you choose below
X = np.concatenate([X_tr, X_te], axis=0).reshape(-1, 28*28).astype(np.float32) / 255.0
y = np.concatenate([y_tr, y_te], axis=0).astype(np.int64)

How may samples does this dataset contain?

The dataset is too large to consider all at once within the time-frame of a short lab.

Task: figure out how to only consider the first 5000 samples and adjust X and y accordingly to only contain those samples. Check the shapes of both after you do this.

Code

m =  # choose amount of samples you want to consider

# Write your own code to only consider the first m samples out of the 70000 in total.

In interpreting and visualizing results below it will be useful to have a variable that contains the 10 different classes in this dataset:

Code

# Fashion-MNIST class names
class_names = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
]

What are the labels of the first 10 pictures?

Hint

You can use clever and concise python/numpy indexing: np.array(class_names)[y]

Code

# Extract the first 10 indices of target vector y and map them to the corresponding class names

Split into training and validation data

Split your feature array and associated labels into 80% training samples and 20% validation samples. You can try to use scikit-learn routines, but it may be easier to just use index notation for the first 80% and last 20% fo the samples, because they are already in random order.

Code

m_train =   # 4000
X_train =
X_val   =
y_train =
y_val   =

print(f"Shapes:")
print(f"  Training: X={X_train.shape}, y={y_train.shape}")
print(f"  Validation: X={X_val.shape}, y={y_val.shape}")

Visualize data

As a quick reminder, the same visualization of some of the data as last week. The code below will visualize the first 100 samples in our training data. You can play with this to show the second 100 samples, or perhaps 100 of the validation samples.

Code

# Show the first 100 training samples (10×10 grid)
x_sq = np.reshape(X_train[:100], [100, 28, 28])
y_sq = y_train[:100]

nrcols = 10
nrrows = 10
fig, axs = plt.subplots(nrrows, nrcols, figsize=(7, 7))

for i in range(nrrows):
    for j in range(nrcols):
        idx = i * nrcols + j
        axs[i, j].imshow(x_sq[idx], cmap='viridis', origin='upper')
        axs[i, j].axis('off')

fig.subplots_adjust(wspace=0, hspace=0)
plt.show()

# Optionally, reshape labels into a 10×10 grid for inspection
np.reshape(y_sq, [10, 10], order='C')

For this case, the mapping between the labels and the images is a bit less clear. For the handwritten samples in the NIST dataset, label 1 corresponded to an image of a 1 etc, so it was straightforward.

Below is another visualization that adds a title to each panel based on the value of class_names matched to the integer labels.

Code

# Reshape flattened images (selection_x) into 28×28 arrays
x_sq = np.reshape(X_train[:100], [100, 28, 28])

nrcols = 10
nrrows = 10
fig, axs = plt.subplots(nrrows, nrcols, figsize=(10, 10))

for i in range(nrrows):
    for j in range(nrcols):
        idx = i * nrcols + j
        axs[i, j].imshow(x_sq[idx], cmap='viridis', origin='upper')
        axs[i, j].set_title(class_names[y_train[idx]], fontsize=7)
        axs[i, j].axis('off')

fig.subplots_adjust(wspace=0.4, hspace=0.6)
plt.show()

Sigmoid and other Activation Functions - Information only

The sigmoid function ($\frac{1}{1+\mathrm{e}^{-z}}$) was already defined. We will also need its derivative, which can be written in terms of the sigmoid itself.

There are a number of other functions, such as the hyperbolic tangent, that basically turn a continuous variable into binary categories (sometimes labeled as 0 and 1, sometimes as -1 and 1, and sometimes as 0 or -1 for an argument $x<0$ and increasingly large for $x>0$) with some transition in between. In ML lingo, these are called activation functions. In neural networks, in particular, (presented below) sometimes activation functions other than the sigmoid have better performance.

Below we define functions sigmoid and their derivatives d_sigmoid that are already generalized to allow for any activation function that I could find in the literature on ML and neural networks in particular, for future use.

These different activation functions are provided to give you a throrough overview of different ANN approaches, but the familiar sigmoid function is a perfectly fine choice for anything discussed here.

Code

def sigmoid(z):
  global activation_function_choice
  if activation_function_choice==1:
    return 1/(1 + np.exp(-z))                     # sigmoid
  elif activation_function_choice==2:
    #return 2/(1 + np.exp(-2*z)) - 1              # tanh
    return np.tanh(z)
  elif activation_function_choice==3:
    return np.where(z<0, 0, z)                    # RELU
  elif activation_function_choice==4:
    return np.where(z<0, 0.1*z, z)                # Leaky RELU
  elif activation_function_choice==5:
    return np.where(z<0, np.exp(z) - 1, z)    # Exponential ELU
  elif activation_function_choice==6:
    return z/(1 + np.exp(-z))                     # Google's Swish

def d_sigmoid(z):
  global activation_function_choice
  if activation_function_choice==1:
    sigmoid = 1/(1 + np.exp(-z))
    dg = sigmoid*(1-sigmoid)                        # sigmoid'
  elif activation_function_choice==2:
    dg = 1 - np.tanh(z)**2                          # tanh'
  elif activation_function_choice==3:
    dg = np.where(z<0, 0, 1)                        # RELU
  elif activation_function_choice==4:
    dg = np.where(z<0, 0.1, 1)                      # Leaky RELU
  elif activation_function_choice==5:
    dg = np.where(z<0, (np.exp(z) - 1), 1)      # Exponential ELU
  elif activation_function_choice==6:
      sigmoid = 1/(1 + np.exp(-z))                  # Google's Swish
      swish = z*sigmoid
      dg = swish + sigmoid*(1-swish)
  return dg

Here’s a simple visualization of what all these activation functions look like, such as the various ‘rectified linear units (RELU link text)’ functions.

Code

x = np.linspace(-5,5, 100)
fig  = plt.figure(figsize=(10,5))
plt.subplot(2,3,1)
plt.title('sigmoid')
activation_function_choice = 1
plt.plot(x, sigmoid(x),label='sigmoid')
plt.subplot(2,3,2)
activation_function_choice = 2
plt.plot(x, sigmoid(x),label='tanh')
plt.title('tanh')

plt.subplot(2,3,3)
activation_function_choice = 3
plt.plot(x, sigmoid(x),label='RELU')
plt.title('RELU')

plt.subplot(2,3,4)
activation_function_choice = 4
plt.plot(x, sigmoid(x),label='leaky RELU')
plt.title('leaky RELU')

plt.subplot(2,3,5)
activation_function_choice = 5
plt.plot(x, sigmoid(x),label='exponential RELU')
plt.title('exponential RELU')

plt.subplot(2,3,6)
activation_function_choice = 6
plt.plot(x, sigmoid(x),label='google swish')
plt.title('google swish')

fig.tight_layout()

Artificial Neural Network

I will not discuss the philosophy behind articifial neural networks, inspired by the functioning of the human brain, or some of the technical aspects in full detail here. A number of videos are provided to help with that.

In practical terms, a (artificial) neural network is similar to the logistic regression discussed before. Basically, a ANN just does the same thing twice, or more. As an example, we will consider a shallow artifical neural network with one ‘hidden layer’, which in a sense just means we do logistic regression twice. A deep neural network has multiple hidden layers, which effectively means it does logistic regression multiple times.

Let me explain:

In the logistic regression you have seen before, we have, say, 401 features, we take a linear combination of those with fitting parameters $\theta$ ($z = X \cdot \theta$), and then apply a sigmoid function to the outcome ($y = g(z)$, the activation function) to determine the probability of a sample/picture/measurement belonging to category 0 or 1 (or any number of categories when doing one-versus-all).

In a shallow neural network (with 1 hidden layer), we try multiple linear combinations of the features with multiple sets of $\theta$ (this is the first layer); then we apply the sigmoid function to each of those linear combinations, and construct a second linear combination of those sigmoids with an independent set of fitting parameters $\theta$ to come up with a single final prediction for $y$.

The basic strength of this approach is that instead of a linear combination of the feature, we have a linear combination of sigmoids of linear combinations of features (… are you still with me?). By introducing those sigmoid in the intermediate step (the hidden layer), the final prediction is actually a highly non-linear model. As you have seen before, non-linear models like high-order polynomials can fit much more complex data, and the same is true for these neural networks (which are not polynomials, but linear combinations of sigmoids).

A different way of thinking about this is that the first layer, which takes many different combinations of your input features, can learn/engineer new features, and then the second layer just performs a regular logistic regression on those new features.

To be a more specific: * In the example below we use an ‘input layer’, which are simply our features, which in an ANN are also called (input) nodes. * The new thing the ANN adds is another layer, the hidden layer, which can be any size we choose; in the example below we choose hidden_layer_size = 25 nodes. But after trying that, come back here and maybe try a different number of nodes in the hidden layer! * The output from the hidden layer is mapped (by a linear combination) to the ‘output layer’, which is either a single node for a binary classification scheme, or 10 nodes for our multiclass problem here (num_labels = 10).

Task: how many features (without bias) do we have in this case? Set input_layer_size to that value.

hidden_layer_size is basically the number of new features that we will engineer from the raw pixel inputs. This is analogous to polynomial terms in non-linear numerical or logistic regression. I have tested this notebook for 25 such hidden notes = new features, but you can experiment with this.

activation_function_choice sets the activation function in the hidden layer. I’ve tested this with just a sigmoid (activation_function_choice=1) but you can try other options to see if it improves performance.

Code

input_layer_size  =
hidden_layer_size =
num_labels = 10

activation_function_choice = 1

Neural Networks — Conceptual Overview

In a neural network, there’s a bit more bookkeeping, but conceptually it’s not as complicated as it may first appear.

Logistic Regression Recap

Consider a single image (or, more generally, one data sample) with 784 features.
In a standard logistic regression, we would:

Add a bias feature to form a feature vector $X$.
For a single image, $X$ is just a row vector of length 785 (784 pixels + 1 bias).
Choose initial parameters $\theta$ (a vector of 785 values) and compute
\[ z = X \cdot \theta \] which is a single number — the dot product of $X$ and $\theta$.
Apply the sigmoid function to obtain the prediction
\[ h = g(z) \] For binary classification, $h\in[0,1]$ and we predict $y=1$ if $h\ge0.5$, otherwise $y=0$.
(You’ve already seen how this generalizes to multiclass “one-vs-all” classification.)
Define and minimize a cost function comparing predicted $h$ and actual $y$.
The optimized parameters $\theta$ that minimize this cost effectively define our logistic-regression model.

Extending to a Neural Network

Now we build an artificial neural network (ANN) — still starting from one input image:

Input layer: Begin with the same feature vector $X^{[1]}$ (of 785 values including bias).
Hidden layer: Instead of multiplying by a single parameter vector $\theta$,
we use a matrix of parameters $\Theta^{[1]}$ of size $25\times785$.
This means we perform 25 dot products — one per hidden-layer neuron — yielding
\[ \mathbf{z}^{[2]} = X^{[1]} \cdot \Theta^{[1]} \] a vector of 25 values.
Apply activations: Each hidden node applies the sigmoid (or another nonlinear function)
\[ \mathbf{h}^{[2]} = g(\mathbf{z}^{[2]}) \] producing 25 activations — the outputs of the hidden layer.
Output layer: Treat those activations $\mathbf{h}^{[2]}$ (plus a bias) as new features
$X^{[2]}$, then compute
\[ \mathbf{z}^{[3]} = X^{[2]} \cdot \Theta^{[2]}, \qquad \mathbf{h}^{[3]} = g(\mathbf{z}^{[3]}) \] which gives the final hypotheses for the output layer.
Binary vs. multiclass:
- For binary classification, $\Theta^{[2]}$ is a vector of length 26 (25 hidden + 1 bias).
- For multiclass tasks like Fashion-MNIST (10 classes), we use an array $\Theta^{[2]}$ of size $10\times26$.
  Modern implementations often replace the sigmoid with a softmax activation in the final layer to yield class probabilities that sum to 1.

Cost function

The cost/error of a binary prediction from our neural network is defined exactly the same as for logistical regression. For a single measurement:

$J(\theta) = - y \log(h(x,\theta)) -(1-y) \log(1-h(x,\theta)),$

except that it needs to be evaluated for the hypotheses $h^{[3]}$ from the output layer. For a multiclass problem, we don’t do one-versus-all, but automatically consider all classes at once. Specifically, the labels for each image –in our image classification problem– are defined as vectors $y = [0,0,1,0,0,0,0,0,0,0]$ when the image is of a ‘2’, etc. The hypothesis for each node in the output layer is a (pseudo-) probability, so the output would look something like $h(x,\theta) = [0.01, -0.02, 0.90, 0.4, ..., -0.01]$, and the errors/costs for the 10 categories would be summed to obtain the total error (and also summed over all images/‘measurements’).

As you should know by now, defining the cost function is the key to any machine learning algorithm.

Below we define it for any shallow network (i.e., you can also run it for more or fewer hidden nodes). The inputs are the size of the input layer (nr of features), hidden layer, and output layer (which is the number of labels), the actual features (pixels) $X$, labels $y$, and one giant vector of fitting parameters theta. The latter contains both $\Theta^{[1]}$ and $\Theta^{[2]}$ sequentially because a general-purpose minimization scheme, like gradient descent, conjugate gradient, Newton etc, don’t care that these are parameters for different layers in an ANN; for a minimization scheme, these are just a long list of numbers to minimize. Turning rank-2 (two-dimensional) arrays $\Theta^{[1]}$ and $\Theta^{[2]}$ into one long vector is called unrolling.

Finally, we also Ridge regularize the scheme by adding the summed squares of all fitting parameters $\Theta^{[1]}$ and $\Theta^{[2]}$.

First, we define a function that does the forward propagation, which means computing our final hypothesis $y$, given inputs $X$, and two arrays of fitting parameters / weights $\Theta^{[1]}$ and $\Theta^{[2]}$ for the first and second layer of our neural network. Before we even do that, codes will look more clean when we define a little helper function that pulls $\Theta^{[1]}$ and $\Theta^{[2]}$ out of a single long vector of $$ values and reshapes them into the appropriate array dimensions.

Code

def unwrap_theta(theta, input_layer_size, hidden_layer_size, num_labels):
    """
    Unroll flattened theta vector back into Theta1 and Theta2 using
    the new shape convention:
      Theta1: (input_layer_size + 1, hidden_layer_size)
      Theta2: (hidden_layer_size + 1, num_labels)
    """
    Theta1_dims = (input_layer_size + 1, hidden_layer_size)
    Theta1_len  = np.prod(Theta1_dims)
    Theta2_dims = (hidden_layer_size + 1, num_labels)

    Theta1 = theta[:Theta1_len].reshape(Theta1_dims)
    Theta2 = theta[Theta1_len:].reshape(Theta2_dims)

    return Theta1, Theta2

Now we can write the forward pass as a concise function. Lines 13-15 are for the first layer and lines 16-18 for the second layer of our ANN. The function returns our hypothesis, which is the final predictions as a probability for each class. For convenience, we also return the reshaped weights so they can be used more easily in the cost function, which is next.

Code

# --------------------------------------------------------
# Forward propagation
# --------------------------------------------------------
def nn_forward(theta, input_layer_size, hidden_layer_size, num_labels, X):
    """Forward pass through a 2-layer neural network.
       Shapes chosen so that no .T transposes are needed."""
    m, n = X.shape

    # Use the helper function to unpack theta
    Theta1, Theta2 = unwrap_theta(theta, input_layer_size, hidden_layer_size, num_labels)

    # Forward pass
    X1 = np.c_[np.ones((m, 1)), X]     # Add bias column to input layer
    Z2 = X1 @ Theta1                   # (m, hidden_layer_size)
    A2 = sigmoid(Z2)
    A2 = np.c_[np.ones((m, 1)), A2]    # Add bias column to hidden layer
    Z3 = A2 @ Theta2                   # (m, num_labels)
    H3 = sigmoid(Z3)                   # Output layer activations (hypotheses)

    return H3, Theta1, Theta2

Before we define our cost function, let me explain an elegant python trick to do the one-hot-encoding when we want to predict multiple classes.

In multiclass classification, we often need to represent each class label as a one-hot vector —
a row of zeros with a single 1 marking the true class.

For example, suppose our dataset has 4 samples and 3 possible classes (num_labels = 3):
if the labels are y = [2, 0, 1, 2], the one-hot encoding should be:

Sample	Label	One-Hot Vector
1	2	[0, 0, 1]
2	0	[1, 0, 0]
3	1	[0, 1, 0]
4	2	[0, 0, 1]

We can create this efficiently in NumPy without loops:

np.zeros((m, num_labels)) → make an all-zero matrix
np.arange(m) → make a list of row indices [0, 1, 2, ..., m-1]
Y[np.arange(m), y] = 1 → set the position (row i, column y[i]) to 1 for each sample

This results in a fully vectorized, fast one-hot encoding operation.

Code

# Example data
ys = np.array([2, 0, 1, 2])
m = len(ys)
numlabels = 3

# One-hot encoding
Y = np.zeros((m, numlabels))
Y[np.arange(m), ys] = 1

print("Labels (y):", ys)
print("One-hot encoded (Y):\n", Y)

Inspect the above code and its result and see if you understand how it works.

With our definition of our new ANN hypothesis, and the one-hot-encoding of our target labels, we can use the same logistic cost function + Ridge regularization as before:

Code

# --------------------------------------------------------
# Cost function with regularization
# --------------------------------------------------------
def nn_cost(theta, input_layer_size, hidden_layer_size, num_labels, X, y, lambda_r):
    """Compute the cost J(Θ) for a 2-layer neural network."""
    m = X.shape[0]
    # Do the forward pass to get a model prediction, and pull out the weights theta for the Ridge term
    H3, Theta1, Theta2 = nn_forward(theta, input_layer_size, hidden_layer_size, num_labels, X)

    # One-hot encode labels
    Y = np.zeros((m, num_labels))
    Y[np.arange(m), y] = 1

    # --- Core cost (cross-entropy) ---
    eps = 1e-12
    J = -np.sum(Y * np.log(H3 + eps) + (1 - Y) * np.log(1 - H3 + eps)) / m

    # --- Regularization (ignore bias columns) ---
    # Strip the first column (bias weights)
    reg = (np.sum(Theta1[1:, :]**2) + np.sum(Theta2[1:, :]**2)) * lambda_r / (2*m)
    J += reg

    return J

Cost gradient

Of course the other ingredient that we need for most solution schemes is the gradient of our cost function, which are the partial derivatives of the cost with respect to each fitting parameter. This may seem daunting when considering the beast of a cost function above, and many think it is, but -certainly in terms of implementation- it is actually not that hard. Without going into too much detail, it essentially builds up the derivatives by the familiar chain rule. This effectively means starting from the right (the output layer) and working back to the left, which is why this step is called the back propagation, or even more slangly, back prop.

In the long-looking code below, the first 26 lines are actually the same as in the cost function, while the back prop itself is only a few short lines between lines 29 and 36. In line 31, d3ij[:,k] simply takes the difference between our final/output prediction/hypothesis for each label $k$ versus the true label $y$. Line 33 shows the whole chain rule. Lines 42-45 just stick all the fitting parameters back into a single long vector/list in the right order.

Code

def nn_gradient(theta, input_layer_size, hidden_layer_size, num_labels, X, y, lambda_r):
    """
    Compute backpropagation gradients for a 2-layer neural network using
    the same conventions as nn_forward (no transposes in the forward pass).

    Shapes:
      Theta1: (input_layer_size + 1, hidden_layer_size)
      Theta2: (hidden_layer_size + 1, num_labels)
      X: (m, input_layer_size)
    """

    m = X.shape[0]

    # Forward pass
    H3, Theta1, Theta2 = nn_forward(theta, input_layer_size, hidden_layer_size, num_labels, X)

    # Recompute intermediates needed for backprop
    X1 = np.c_[np.ones((m, 1)), X]                # (m, input+1)
    Z2 = X1 @ Theta1                              # (m, hidden)
    A2_no_bias = sigmoid(Z2)                      # (m, hidden)
    A2 = np.c_[np.ones((m, 1)), A2_no_bias]       # (m, hidden+1)

    # One-hot encode labels
    Y = np.zeros((m, num_labels))
    Y[np.arange(m), y] = 1

    # Error in output layer
    d3 = H3 - Y                                   # (m, num_labels)

    # Error in hidden layer (excluding bias)
    d2 = (d3 @ Theta2[1:, :].T) * (A2_no_bias * (1 - A2_no_bias))  # (m, hidden)

    # Unregularized gradients
    Theta2_grad = (A2.T @ d3) / m                 # (hidden+1, num_labels)
    Theta1_grad = (X1.T @ d2) / m                 # (input+1, hidden)

    # Regularization (ignore bias row 0)
    Theta1_grad[1:, :] += (lambda_r / m) * Theta1[1:, :]
    Theta2_grad[1:, :] += (lambda_r / m) * Theta2[1:, :]

    # Unroll gradients
    grad = np.concatenate([Theta1_grad.ravel(), Theta2_grad.ravel()])
    return grad

Prediction

Once we have optimized our ANN by finding the best-fit $\Theta^{[1]}$ and $\Theta^{[2]}$ from some big set of training data, we want to make predictions for any new data (e.g. new pictures).

The little function below takes new features $X$, does the forward propagation through the ANN with the known fitting parameters, and spits out the classification of the input picture(s) (note that as before the last line pred = np.argmax(h3) picks the classification with the highest probability).

Code

def nn_predict(theta, input_layer_size, hidden_layer_size, num_labels, X):
    """
    Predict labels for input data X using trained network parameters.
    Uses nn_forward() for the forward pass.
    """
    # Forward pass (reuse your function)
    H3, _, _ = nn_forward(theta, input_layer_size, hidden_layer_size, num_labels, X)

    # Pick the class with highest activation for each example
    pred = np.argmax(H3, axis=1)
    return pred

Initialization

One subtlety with neural networks is how to initialize the fitting parameters. For the logistic regressions before, we could just set all initial $\theta = 0$ and that worked fine. However, for an ANN, if you set all the $\Theta^{[1]}$ the same, then all the hidden nodes will be the same, and there is no benefit to the approach (try it out!). Below, we simply assign random numbers between $-0.12< \Theta^{[1]}<0.12$ and similarly for $\Theta^{[2]}$. We can also try to solve the problem without regularization (lambda_r = 0).

Code

epsilon_init = 0.12

# Shape convention: (input+1, hidden)
Theta1 = np.random.rand(input_layer_size + 1, hidden_layer_size) * 2 * epsilon_init - epsilon_init

# (hidden+1, output)
Theta2 = np.random.rand(hidden_layer_size + 1, num_labels) * 2 * epsilon_init - epsilon_init

# Flatten and concatenate
initial_theta = np.concatenate([Theta1.ravel(), Theta2.ravel()])

# Regularization
lambda_r = 0

To define the initial values of all weights $\theta$, of course we don’t really need to first define those as matrices/arrays $\Theta^{[1]}$ and $\Theta^{[2]}$ and then roll them out into only long vector of all those values. Instead, we can also just compute the total number of elements in $\Theta^{[1]}$ and $\Theta^{[2]}$ combined and directly create a vector of that length, populated with random initial small values:

Code

# Total number of parameters (weights + biases)
n_params = ((input_layer_size + 1) * hidden_layer_size) + ((hidden_layer_size + 1) * num_labels)

# Initialize directly as a 1D flattened vector
initial_theta = np.random.rand(n_params) * 2 * epsilon_init - epsilon_init

Solve the ANN problem

Now that we have defined a cost function and its gradient and some suitable random initial guesses, solving the problem is exactly the same as every one before. We simply choose any minimization routine and find the fitting parameters that minimize the cost function.

Here is bascially exactly the same cell as for logistic regression, except that it calls nn_cost and nn_gradient.

A few notes, though. This problem is about twice as large as the handwriting classification and with default parameters, it takes about 10 mins to run. For reference, fmin_cg runs over all samples at once, which takes a lot of memory and computations. For the purpose of this lab, lets relax the tolerance and impose a maximum number of iterations. You can always come back later and decrease the tolerance or increase the number of iterations, which will actually achieve > 99% accuracy.

Code

%%time
all_theta_CG = optimize.fmin_cg(
    nn_cost,
    initial_theta,
    fprime=nn_gradient,
    args=(input_layer_size, hidden_layer_size, num_labels, X_train, y_train, lambda_r),
    gtol=1e-3,         # relaxed tolerance
    maxiter=100,       # optional safety cap
    disp=True          # show summary only
)

Gradient descent

Even here, we can use the gradient descent method from the very first lecture on univariate linear regression to mimimize the ANN cost function. However, similar to what I explained above, full gradient descent, which uses all samples at once, is likely a bit too slow for a short lab. So instead, this is a good opportunity to try out our hard-coded mini-batch gradient descent for better computational efficienty for large data. Feel free to experiment with the hyper-parameters, like adjusting the batch_size to see how it impacts results.

Code

%%time
# --- Hyperparameters ---
alpha      = 0.1         # learning rate
lambda_r   = 1         # regularization parameter
tolerance  = 1e-5        # convergence threshold (based on gradient norm)
batch_size = 200         # mini-batch size
max_epochs = 500         # safeguard upper limit

# --- Initialization ---
theta = initial_theta.copy()
converged = False
diverged  = False

m = X.shape[0]
epoch = 0
grad_norm = np.inf       # initialize to large number

# --- Main training loop ---
while not (converged or diverged):
    epoch += 1

    # Shuffle data at start of each epoch
    idx = np.random.permutation(m)
    X = X[idx]
    y = y[idx]

    # Mini-batch updates
    for start in range(0, m, batch_size):
        end = start + batch_size
        Xb = X[start:end]
        yb = y[start:end]

        grad = nn_gradient(theta, input_layer_size, hidden_layer_size, num_labels, Xb, yb, lambda_r)
        theta -= alpha * grad

    # --- Convergence check based on gradient norm ---
    grad_norm = np.linalg.norm(grad) / len(grad)
    if grad_norm < tolerance:
        converged = True
        print(f"Epoch {epoch}: converged (‖grad‖={grad_norm:.2e})")
        break

    if np.isnan(grad_norm) or grad_norm > 1e3:
        diverged = True
        print(f"Epoch {epoch}: diverging (‖grad‖={grad_norm:.2e})")
        break

    if epoch >= max_epochs:
        print(f"Reached {max_epochs} epochs without convergence.")
        break

    theta_GD = theta
    print(f"Stopped after {epoch} epochs, final ‖grad‖={grad_norm:.2e}")

Accuracy on training data

Let’s look at the predictions for the conjugate gradient method:

Code

# Evaluate trained model from conjugate gradient optimization (CG)
pred_GC = nn_predict(all_theta_CG, input_layer_size, hidden_layer_size, num_labels, X_train)
success_GC = np.mean(pred_GC == y_train) * 100

# Compute final cost
total_error = nn_cost(all_theta_CG, input_layer_size, hidden_layer_size, num_labels, X_train, y_train, lambda_r)

print(f"Success rate of neural network with CG acceleration: {success_GC:.2f}%")
print(f"Final cost: {total_error:.4f}")

Do the same for our mini-batch gradient descent (i.e. using those optimal fitting parameters):

Code

As you should see, the ANN is quite accurate on the training data. The reason that gradient descent may be slightly less accurate is because of the chosen tolerance. At the current tolerance, it will run the model pretty fast but is a bit less accurate. You can play with the parameters to see how much you can improve the accuracy, time permitting.

Validation

Now test the neural network on the validation data set that was not yet seen by the algorithm:

Code

# --- Validation Evaluation ---
# add code here
print(f"Success rate of neural network on validation set: {success_val:.2f}%")

Here’s some code to visualize the performance on validation data. Note that it assumes that you have defined pred_val as your predictions on validation data. Edit if you chose a different variable name.

Code

# --- Compare true vs predicted labels (first 100) ---
print("\nValidation (true) labels:\n", np.reshape(y_val[:100], [10, 10]))
print("\nPredicted labels:\n", np.reshape(pred_val[:100], [10, 10]))
print("\nComparison (1 = misclassified):\n", np.reshape(1 - 1*(pred_val[:100] == y_val[:100]), [10, 10]))

# --- Visualization of first 100 validation images ---
x_sq = np.reshape(X_val[:100], [100, 28, 28])
nrcols, nrrows = 10, 10
fig, axs = plt.subplots(nrrows, nrcols, figsize=(7, 7))

for i in range(nrrows):
    for j in range(nrcols):
        idx = i * nrcols + j
        axs[i, j].imshow(x_sq[idx], cmap='viridis', origin='upper')
        axs[i, j].axis('off')

fig.subplots_adjust(wspace=0, hspace=0)
plt.show()

Here’s a nicer visualization of misclassified samples. We run inference over all our validation data, see how many and what fraction of those samples were misclassified, and then show the first 25 of such examples. The titles give the true label and what your model thought they were. You’ll likely see that its not that surprising that those samples were misclassified. But tweaking the training parameters can give you better results.

Code

# --- Identify misclassified validation examples ---
pred_val = nn_predict(all_theta_CG, input_layer_size, hidden_layer_size, num_labels, X_val)
mis_idx = np.where(pred_val != y_val)[0]

print(f"{len(mis_idx)} misclassified out of {len(y_val)} ({len(mis_idx)/len(y_val)*100:.1f}%)")

# Select up to 25 misclassified samples to display
n_show = min(25, len(mis_idx))
mis_idx = mis_idx[:n_show]

# Reshape those samples into 28×28 images
x_mis = np.reshape(X_val[mis_idx], [n_show, 28, 28])
y_true = y_val[mis_idx]
y_pred = pred_val[mis_idx]

# --- Plot ---
nrcols = 5
nrrows = int(np.ceil(n_show / nrcols))
fig, axs = plt.subplots(nrrows, nrcols, figsize=(10, 10))

for k, ax in enumerate(axs.flat):
    if k < n_show:
        ax.imshow(x_mis[k], cmap='gray', origin='upper')
        true_lbl = class_names[y_true[k]]
        pred_lbl = class_names[y_pred[k]]
        ax.set_title(f"T: {true_lbl}\nP: {pred_lbl}", fontsize=8, color='crimson')
        ax.axis('off')
    else:
        ax.axis('off')

fig.suptitle("Misclassified Validation Samples", fontsize=14, y=0.93)
fig.subplots_adjust(wspace=0.5, hspace=0.7)
plt.show()

Task: Interpret your findings! How do you feel about the accuracy of your model? What hyperparamerters can you tune to try and get better results? Is your model underfitting or overfitting? If so, what can you do next to resolve any such issues?

You can spend most of the rest of your time to experiment and try and get the best results you can. Of course the goal should always be to get the best results on validation data, which represent your predictive power on future measurements, rather than just getting the highest accuracy on training data (where a model can just learn the training data but not necessarily generalize well to new data, i.e. overfitting).

Optional you could also try to copy over the cost function and its gradient for regular logistic regression and see how well that model works for this dataset to compare to the performance of the ANN. Given the size of the dataset, I’d suggest using the same fmin_cg and/or mini-batch gradient descent to minimize the logistic cost function.

The rest of this notebook, similar to the lecture notes jupyter notebook, tries to provide you with some illustrations of how the ANN learns new features in a way that is different from logistic regression.

How to interpret an ANN?

While ANN are very good at modeling/predicting complex problems, it is notoriously hard to interpret how they actually do it when all these different layers compound and convolve ever more complexity.

If you try to google how to interpret these types of ANN for image recognition you will not find much. Below, I attempt to give you at least some more intuition of how the final tuned ANN makes predictions.

Let’s say we start with a single picture and we want our ANN to predict what it is. In the first step, each hidden node will get a different linear combination of all the pixes in the picture, in other words a different weighting of the importance of each pixel. We can visualize these weights $\Theta^{[1]}$ directly as I do below. Each frame in this figure corresponds to one of the 25 nodes in the hidden layer, and the 28x28 pixels in each panel below correspond to the 28x28 weights that $\Theta^{[1]}$ will give to the corresponding original 28x28 pixels in the picture.

In some sense, this step acts as a filter that will enhance some pixels and suppress others. Actually, it is applying 25 different filters; one will ‘brighten’ the pixels in the center-ish, another filter will enhance diagonal pixels etc.

These are the new features that the hidden layers constructs. Again, this is analogous to adding, e.g., polynomial features.

In the end, 10 different weighted sums of these new features, or filters, will give 10 probabilities that the picture corresponds to each of the possible classes.

Code

Theta1_GC, Theta2_GC = unwrap_theta(all_theta_CG, input_layer_size, hidden_layer_size, num_labels)
print("Theta1:", Theta1_GC.shape)
print("Theta2:", Theta2_GC.shape)

# Remove bias weights
Theta1_no_bias = Theta1_GC[1:, :]     # shape (input_layer_size, hidden_layer_size)
n_input = Theta1_no_bias.shape[0]
n_hidden = Theta1_no_bias.shape[1]

# Infer square side length if possible
side = int(np.sqrt(n_input))
print(f"Detected input side length = {side}")

x_sq = Theta1_no_bias.reshape(side, side, n_hidden)

nrcols, nrrows = int(np.ceil(np.sqrt(n_hidden))), int(np.ceil(np.sqrt(n_hidden)))
fig, axs = plt.subplots(nrrows, nrcols, figsize=(7, 7))

for i in range(nrrows):
    for j in range(nrcols):
        idx = i * nrcols + j
        if idx >= n_hidden:
            axs[i, j].axis('off')
            continue
        axs[i, j].pcolor(x_sq[:, :, idx][::-1, :])
        axs[i, j].set_aspect('equal', 'box')
        axs[i, j].axis('off')

fig.subplots_adjust(wspace=0, hspace=0)
plt.show()

You might see that some of these filters indeed already look like the outlines of, e.g., a shirt.

To see what happens next, we select some arbitrary picture (you can change the selection of picknr to try another one):

Code

picknr = 9

xtry = X_train[picknr,:]
x_sq1 = np.reshape(xtry,[28,28])
fig, axs = plt.subplots(1,1,figsize=(7,7))
axs.pcolor(x_sq1[-1::-1,:])
axs.set_aspect('equal', 'box')
axs.axis('off');

In the hidden layer, each node ‘applies’ one of the 25 filters to this picture, if you will, and gives a ‘score’ ($ z = X $):

Code

X1 = np.append(1, xtry)
z1 = X1.dot(Theta1_GC)

print("The linear fit gives these scores (linear model, or logit) to the 25 hidden nodes:\n", np.reshape(np.round(z1,1),[5,5]))

Next, the sigmoid function is applied to these scores. So what does that mean? The nodes that got a (large) negative score will get a weight of (near) zero and the nodes with a (large) positive score will get a weight of (near) 1.

Below, in the left panel, you can see these exact probabilities of each node (which, again, is like weighting different filters applied to this particular picture). The right panel shows what the corresponding true binary sigmoid predictions would be like, i.e. 1 for $g(z)\ge 0.5$ and $0$ for $g(z)<0.5$.

Finally, the last plot visualizes the $\Theta^{[1]}$ weights/filters for each of the 25 nodes with $g(z)\ge 0.5$.

Code

X2 = sigmoid(z1)
print("The corresponding sigmoid gives probabilities to each of those combinations\n", np.reshape(np.round(X2,2),[5,5]))
plota2 = np.reshape(X2,[5,5])

fig, ax = plt.subplots(1,2, figsize=(14,7))

# Probabilities:
ax[0].pcolor(plota2)
ax[0].invert_yaxis()
ax[0].set_aspect('equal', 'box')
ax[0].axis('off');
ax[0].set_title('Probabilities');

# Binary 1/0 classification:
plota2 = np.where(plota2>=0.5, 1, plota2)
plota2 = np.where(plota2<0.5, 0, plota2)

ax[1].pcolor(plota2)
ax[1].set_aspect('equal', 'box')
ax[1].invert_yaxis()
ax[1].set_title('Binary classification (0/1)');
ax[1].axis('off');
print("Binary classification to each of those combinations\n", np.reshape(np.round(plota2,2),[5,5]))


nrrows = 5
nrcols = 5
fig, axs = plt.subplots(nrrows, nrcols, figsize=(7, 7))

for i in range(nrrows):
    for j in range(nrcols):
        idx = i * nrcols + j
        if idx >= hidden_layer_size:
            axs[i, j].axis('off')
            continue
        axs[i, j].pcolor(x_sq[:, :, idx][::-1, :] * plota2[i, j])
        axs[i, j].set_aspect('equal', 'box')
        axs[i, j].axis('off')

fig.subplots_adjust(wspace=0, hspace=0)
plt.show()

Note: a hidden node with output $g(z)=0$ for the sigmoid function does not contribute any more at all to anything behind that hidden layer (if $X^{[2]}=g(z)=0$ then $X^{[2]}\cdot \Theta^{[2]}=0$ for any fitting parameter $\Theta^{[2]}$ for that node).

For deep neural networks, in particular, where we can have multiple hidden layers, you may not want to completely kill off one of the neurons, i.e. the ones associated with nodes that have sigmoids of 0. This is where the alternative activation functions come in, such as the ‘leaky RELU’, which gives a probability close-to-but-not-exactly-zero to $z<0$ and an ever larger probability $g(z)$ for $z>0$. That way, certain nodes/neurons can be given a small weight without completely eliminating them (their overal importance to the full network can still be determined to be relevant by the fitting parameters of subsequent layers). You still want to use a sigmoid function for the output layer to only have probabilities between 0 and 1 for each classification.

When moving from the hidden layer to the output layer/predictions, we again take linear combinations of the 25 numbers ($g(z_2)$) between 0 and 1 (as printed above) that were assigned to the hidden nodes. The results, $\mathbf{z}^{[2]} = X^{[2]} \cdot \Theta^{[2]}$, are like a score for each possible prediction:

Code

z3 = np.append(1, X2).dot(Theta2_GC)

print("Linear fitted z3 for output nodes 0, 1, ..., 9:\n", np.round(z3,0))

This mapping from the 25 hidden nodes to the 10 output nodes is done by the 26$$10 (induding the hidden bias) fitting parameters $\Theta^{[2]}$. Visualizing all is maybe too much, but below I visualize the 25 fitting parameters (excluding the bias) in a 25$$25 color plot for 1. the output node that tests whether the picture is of class=0 (or pick another class with show_class), and 2. the output node that tests whether the picture is the correct number (which is different each time you run this notebook).

The numerical values for all the fitting parameters $\Theta^{[2]}$ are printed out as well. Note that negative values will reduce the importance of a given node to a given prediction (because of the sigmoid function that will be applied in the end) whereas positive numbers increase the importance of a given node for a given classification.

Code

Theta2_GCT = Theta2_GC

# Pick on class
show_class = 0

hidden_to_output_0 = Theta2_GCT[1:,show_class]
hidden_to_output_correct = Theta2_GCT[1:,np.argmax(z3)]

fig, ax = plt.subplots(1,2, figsize=(14,7))

# Weights for output node testing whether the picture is of class show_class:
plota2 = np.reshape(hidden_to_output_0,[5,5])

ax[0].pcolor(plota2)
ax[0].invert_yaxis()
ax[0].set_aspect('equal', 'box')
ax[0].axis('off');
ax[0].set_title(r'$\Theta^{[2]}$'+' for probability that picture is class 0.');

# Weights for output node picture is a [correct number] (changes each time):
plota2 = np.reshape(hidden_to_output_correct,[5,5])

ax[1].pcolor(plota2)
ax[1].set_aspect('equal', 'box')
ax[1].invert_yaxis()
ax[1].set_title(r'$\Theta^{[2]}$'+' for probability that picture is class '+str(np.argmax(z3))+'.');
ax[1].axis('off');

for i in range(num_labels):
  if i == np.argmax(z3):
    print("Weights Theta2 from 25-node hidden layer to probability that picture is class "+str(i)+': [correct number!]\n', np.reshape(np.round(Theta2_GCT[1:,i],0),[5,5]))
  else:
    print("Weights Theta2 from 25-node hidden layer to probability that picture is class "+str(i)+':\n', np.reshape(np.round(Theta2_GCT[1:,i],0),[5,5]))

In a sense, each of the figures above (or the numerical values printed) is again a mask or a filter applied to the result from the previous one in the hidden layer. The output from the hidden layer was visualized earlier has 25$$25 values of 1 or 0, i.e. yes or no. The 10 masks/filters above essentially sum up the yesses and nos with different weights corresponding to each outcome 0, 1, …, 9. Ideally, only one of those linear combinations is positive and all the others are negative. Here are those scores again:

Code

print("Scores z3 for output nodes 0, 1, ..., 9:\n", np.round(z3,0))

In that case, when we apply to final sigmoid function to the output layer, only one class will have a probability close to 1. The class with the highest probability (max$(g(z^{[3]}))$, or np.argmax(h3)) determines how the ANN classifies this example picture:

Code

h3 = sigmoid(z3)
correct_nr = np.argmax(h3)

print("The probabilities that this picture is a 0, 1, ..., 9 are:\n", \
      np.round(h3,3), '\n\nSo the most likely classification is:', correct_nr , "=", class_names[correct_nr], \
      "\n\nOr graphically, these are the values of the 10 output nodes:\n")


toplot = np.zeros([1,10])
toplot[0,:] = h3

plt.figure();
plt.pcolor(toplot);

I realize that all this probably still does not make this ANN super easy to interpret…

Unfortunately, that is a well-acknowledged weakness of all types of neural networks, their lack of interpretability.

📖 Next module: Module 8: Convolutional Neural Networks