Lab 12 — GPU jobs (or alternative: hyperparameter sweep)

Goal

Run your first GPU job on Unity — train a small PyTorch model and analyze GPU utilization to spot dataloader bottlenecks (Track A).

If you don’t have GPU access yet, run a CPU-side hyperparameter sweep as a job array and find the best config (Track B). The skills (resource management, sweep design, results aggregation) transfer directly.

Either track is fine — pick whichever matches your access and your research direction.


Reading

Budget ~30 minutes for the reading.


Learning objectives

Track A (GPU):

  1. Request a GPU in a Slurm job and verify your code can see it (nvidia-smi, torch.cuda.is_available()).
  2. Run a small PyTorch training (MNIST classifier) and capture GPU utilization over time.
  3. Read logs/gpu-<jobid>.log and decide whether you have a dataloader bottleneck.
  4. Tune num_workers and batch size based on what you see.

Track B (CPU sweep):

  1. Design a hyperparameter grid (e.g. lr × seed × batch_size) and map indices to combinations.
  2. Submit a job array that runs one combination per task.
  3. Aggregate results into a final summary table and pick the winning config.

Setup / prerequisites

  • Labs 01–11 complete.
  • For Track A: You have access to a partition that has GPUs. If you’re on batch only, check with sinfo -o "%P %G" — partitions with GPUs show non-(null) in the GRES column. If you don’t have GPU access, do Track B instead.
  • For Track A: PyTorch installed in your eslab env (or a separate env). Install with: mamba install -n eslab pytorch torchvision -c pytorch -c conda-forge. (This download is ~1.5 GB and takes a few minutes.) The Track B scripts do not import PyTorch; they need scikit-learn and pandas in the same env.
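
If you would rather script the sinfo check from the prerequisites than read the table by eye, here is a minimal sketch. It assumes the two-column output of sinfo -o "%P %G"; the gpu_partitions helper and the sample text are hypothetical, and real partition names on Unity will differ.

```python
def gpu_partitions(sinfo_text):
    """Return partition names whose GRES column mentions a GPU."""
    found = []
    for line in sinfo_text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) < 2:
            continue
        # sinfo marks the default partition with a trailing '*'
        partition, gres = parts[0].rstrip("*"), parts[1]
        if "gpu" in gres.lower() and partition not in found:
            found.append(partition)
    return found


# Hypothetical sample of `sinfo -o "%P %G"` output, for illustration only.
SAMPLE = """PARTITION GRES
batch* (null)
gpu gpu:a100:4
gpu gpu:v100:2
debug (null)
"""

print(gpu_partitions(SAMPLE))
```

An empty list means no partition visible to you advertises GPUs, which is the cue to pick Track B.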

Track A — GPU PyTorch training

A1. Verify GPU access on a compute node (10 min)

sinteractive -p batch --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=01:00:00

(Adjust -p if you have a lab GPU partition.) Once allocated:

nvidia-smi
mamba activate eslab
python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"

Self-check: nvidia-smi shows a GPU and torch.cuda.is_available() is True.

Exit the interactive session — we’ll run the actual training as a batch job.

A2. Write the training script (15 min)

Save ~/hpc_practicum/lab12/train_mnist.py:

"""
train_mnist.py — small MNIST CNN training, for Lab 12.
"""
import argparse, time, os, sys
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--batch-size", type=int, default=128)
    parser.add_argument("--workers", type=int, default=4)
    parser.add_argument("--data-dir", default=os.path.expanduser("~/hpc_practicum/lab12/data"))
    args = parser.parse_args()

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Device: {device}")
    if device.type == "cuda":
        print(f"GPU: {torch.cuda.get_device_name(0)}")

    tx = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
    ])
    train = datasets.MNIST(args.data_dir, train=True, download=True, transform=tx)
    loader = DataLoader(
        train,
        batch_size=args.batch_size,
        shuffle=True,
        num_workers=args.workers,
        pin_memory=(device.type == "cuda"),
    )

    model = TinyCNN().to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for epoch in range(args.epochs):
        t0 = time.time()
        for i, (x, y) in enumerate(loader):
            x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
            if i % 100 == 0:
                print(f"Epoch {epoch+1} step {i:4d}/{len(loader)}  loss={loss.item():.4f}", flush=True)
        print(f"Epoch {epoch+1} done in {time.time()-t0:.1f}s", flush=True)


if __name__ == "__main__":
    main()

A3. Write the Slurm script with GPU diagnostics (10 min)

Copy your ~/templates/diagnostic.slurm and adapt:

cp ~/templates/diagnostic.slurm ~/hpc_practicum/lab12/train.slurm

Edit it to:

#!/bin/bash
#SBATCH --job-name=lab12_train
#SBATCH --partition=batch                 # or your GPU partition
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8                 # 4 dataloader workers + slack
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=logs/%x-%j.out

set -euo pipefail
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

echo "=== JOB INFO ==="
echo "Job: $SLURM_JOB_ID on $(hostname)"
echo "CPUs: $SLURM_CPUS_PER_TASK   Mem (MB): $SLURM_MEM_PER_NODE"
echo "GPUs: ${SLURM_GPUS:-${SLURM_JOB_GPUS:-?}}"
echo "Start: $(date)"
echo "================"

source ~/miniforge3/etc/profile.d/conda.sh
mamba activate eslab

nvidia-smi
python -c "import torch; print('CUDA:', torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# Background GPU usage logger — captures samples every 30 sec
(
  while sleep 30; do
    echo "--- $(date '+%H:%M:%S') ---"
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader
  done
) > "logs/gpu-${SLURM_JOB_ID}.log" 2>&1 &
GPU_LOGGER_PID=$!

# Train
/usr/bin/time -v python train_mnist.py --epochs 3 --batch-size 128 --workers 4

# Stop the GPU logger
kill $GPU_LOGGER_PID 2>/dev/null || true

echo "End: $(date)"
echo "Inspect logs/gpu-${SLURM_JOB_ID}.log for GPU utilization over time."
echo "Run 'seff $SLURM_JOB_ID' for CPU/Mem efficiency."

A4. Submit and analyze (15 min)

cd ~/hpc_practicum/lab12
mkdir -p logs
sbatch train.slurm

Watch the queue. When it finishes:

cat logs/lab12_train-<jobid>.out          # training output
cat logs/gpu-<jobid>.log                  # GPU usage log
seff <jobid>

In the GPU log, look at the utilization.gpu column (first value on each line). Patterns:

  • Sitting near 95%: 🎉 GPU is saturated, you’re getting your money’s worth.
  • Oscillating 0% ↔︎ 90%: 🚫 Dataloader bottleneck — the GPU keeps finishing batches faster than the CPU can prepare new ones.
  • Stuck near 0%: 🤔 Either the model is too small, or there’s a different bottleneck.

For MNIST + TinyCNN on a modern GPU, you should see oscillation: the model is so small that the GPU finishes a batch in milliseconds and then waits for the next one. To mitigate, try --batch-size 512 or --workers 8 (and raise --cpus-per-task to 12 to match the extra workers).
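
To turn the three patterns above into something you can run on your own log, here is a rough sketch. It assumes the log format produced by the logger in train.slurm (a timestamp line followed by a CSV line such as "88 %, 812 MiB, 16384 MiB, 44"). The function names, the thresholds (85, 10, 50), and the sample log are assumptions for illustration, not part of the lab. A short job yields only a few samples at the 30-second interval, so treat the label as a hint rather than a verdict.

```python
import re


def parse_gpu_log(text):
    """Return (utilization %, memory used MiB, memory total MiB) per sample."""
    samples = []
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s*%,\s*(\d+)\s*MiB,\s*(\d+)\s*MiB", line)
        if m:  # timestamp lines such as '--- 12:00:30 ---' do not match
            samples.append(tuple(int(g) for g in m.groups()))
    return samples


def classify(utils, high=85, low=10, swing=50):
    """Rough label for a utilization series; all thresholds are assumptions."""
    if not utils:
        return "no samples"
    if min(utils) >= high:
        return "saturated"
    if max(utils) <= low:
        return "idle"
    if max(utils) - min(utils) >= swing:
        return "oscillating"
    return "steady but below saturation"


# Made-up log contents, for illustration only.
SAMPLE_LOG = """--- 12:00:30 ---
3 %, 812 MiB, 16384 MiB, 41
--- 12:01:00 ---
88 %, 812 MiB, 16384 MiB, 44
--- 12:01:30 ---
5 %, 812 MiB, 16384 MiB, 43
"""

samples = parse_gpu_log(SAMPLE_LOG)
utils = [s[0] for s in samples]
peak_mem = max(s[1] for s in samples)
print(classify(utils), "| peak memory:", peak_mem, "of", samples[0][2], "MiB")
```

To use it on a real run, replace SAMPLE_LOG with the contents of logs/gpu-<jobid>.log. The peak memory figure also answers the memory question in the Track A analysis.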


Track B — CPU hyperparameter sweep

B1. Write the trainable script (15 min)

Save ~/hpc_practicum/lab12/train_cpu.py:

"""
train_cpu.py — small sklearn experiment for hyperparameter sweep.
"""
import argparse, time, json, os
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, required=True)
    parser.add_argument("--n-estimators", type=int, required=True)
    parser.add_argument("--max-depth", type=int, required=True)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--out", required=True)
    args = parser.parse_args()

    t0 = time.time()
    X, y = make_classification(n_samples=50_000, n_features=30, random_state=args.seed)

    clf = GradientBoostingClassifier(
        learning_rate=args.lr,
        n_estimators=args.n_estimators,
        max_depth=args.max_depth,
        random_state=args.seed,
    )
    scores = cross_val_score(clf, X, y, cv=3, n_jobs=1)
    result = {
        "lr": args.lr,
        "n_estimators": args.n_estimators,
        "max_depth": args.max_depth,
        "seed": args.seed,
        "mean_accuracy": float(scores.mean()),
        "std_accuracy": float(scores.std()),
        "elapsed_s": time.time() - t0,
    }
    os.makedirs(os.path.dirname(args.out) or ".", exist_ok=True)  # "." if --out has no directory part
    with open(args.out, "w") as f:
        json.dump(result, f, indent=2)
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()

B2. Write the array sweep script (15 min)

Save as sweep.slurm:

#!/bin/bash
#SBATCH --job-name=lab12_sweep
#SBATCH --partition=batch
#SBATCH --time=00:15:00                   # per task
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --output=logs/%x-%A_%a.out
#SBATCH --array=0-11%6                    # 12 tasks, max 6 concurrent

set -euo pipefail
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

# Hyperparameter grid (12 combinations: 3 lr × 2 n_est × 2 depth)
LRS=(0.01 0.05 0.1)
N_ESTS=(50 200)
DEPTHS=(3 5)

IDX=$SLURM_ARRAY_TASK_ID
LR=${LRS[$(( IDX / 4 ))]}
N_EST=${N_ESTS[$(( (IDX / 2) % 2 ))]}
DEPTH=${DEPTHS[$(( IDX % 2 ))]}

echo "Task $IDX: lr=$LR, n_est=$N_EST, depth=$DEPTH on $(hostname)"

source ~/miniforge3/etc/profile.d/conda.sh
mamba activate eslab

python train_cpu.py \
    --lr $LR \
    --n-estimators $N_EST \
    --max-depth $DEPTH \
    --out "results/result_${IDX}_lr${LR}_n${N_EST}_d${DEPTH}.json"
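
Index arithmetic like the mapping in sweep.slurm is easy to get wrong, so it is worth checking off the cluster before you submit. The sketch below mirrors the bash expressions in Python and confirms that indices 0 to 11 cover every combination exactly once. The config_for helper is a hypothetical name; the grid values are copied from the script above.

```python
from itertools import product

LRS = [0.01, 0.05, 0.1]
N_ESTS = [50, 200]
DEPTHS = [3, 5]


def config_for(idx):
    """Mirror the bash arithmetic in sweep.slurm (depth varies fastest)."""
    lr = LRS[idx // 4]
    n_est = N_ESTS[(idx // 2) % 2]
    depth = DEPTHS[idx % 2]
    return lr, n_est, depth


# itertools.product also varies its last argument fastest, so the two
# orderings should agree index by index.
grid = list(product(LRS, N_ESTS, DEPTHS))
configs = [config_for(i) for i in range(len(grid))]

assert configs == grid          # same combination at every index
assert len(set(configs)) == 12  # no combination is repeated or missing

for i, cfg in enumerate(configs):
    print(i, cfg)
```

If you change the grid sizes, the divisors (4 and 2 here) must change with them: each divisor is the product of the sizes of the faster-varying dimensions.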

B3. Submit and analyze (15 min)

cd ~/hpc_practicum/lab12
mkdir -p logs results
sbatch sweep.slurm
squeue -u $USER

Wait for all 12 tasks to finish. Then aggregate. Save this as aggregate_sweep.py:

# aggregate_sweep.py
import json, glob
import pandas as pd

paths = sorted(glob.glob("results/result_*.json"))
rows = [json.load(open(p)) for p in paths]
df = pd.DataFrame(rows).sort_values("mean_accuracy", ascending=False)
print(df.to_string(index=False))
df.to_csv("results/sweep_summary.csv", index=False)

Run it and view the summary:

python aggregate_sweep.py
cat results/sweep_summary.csv

The top row is your winning config, but treat it with caution: with only 12 combinations and a single seed (every task runs with the default --seed 42), the winner is noisy. For real sweeps, run multiple seeds per config.
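
As a sketch of what "multiple seeds per config" looks like at aggregation time, the code below groups result records by configuration and reports the mean and spread across seeds. It uses only the standard library; the summarize_by_config helper and the accuracy values are made up for illustration and are not results from this lab. Keep in mind that in train_cpu.py the seed also regenerates the synthetic dataset, so the spread reflects both data and model randomness.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_config(rows):
    """Group result dicts by (lr, n_estimators, max_depth); average over seeds."""
    groups = defaultdict(list)
    for r in rows:
        key = (r["lr"], r["n_estimators"], r["max_depth"])
        groups[key].append(r["mean_accuracy"])
    summary = []
    for cfg, accs in groups.items():
        spread = stdev(accs) if len(accs) > 1 else 0.0  # stdev needs 2+ values
        summary.append(
            {"config": cfg, "n_seeds": len(accs), "mean": mean(accs), "std": spread}
        )
    return sorted(summary, key=lambda s: s["mean"], reverse=True)


# Made-up accuracies, for illustration only.
ROWS = [
    {"lr": 0.1, "n_estimators": 200, "max_depth": 3, "seed": 0, "mean_accuracy": 0.90},
    {"lr": 0.1, "n_estimators": 200, "max_depth": 3, "seed": 1, "mean_accuracy": 0.92},
    {"lr": 0.01, "n_estimators": 50, "max_depth": 3, "seed": 0, "mean_accuracy": 0.80},
    {"lr": 0.01, "n_estimators": 50, "max_depth": 3, "seed": 1, "mean_accuracy": 0.84},
]

for row in summarize_by_config(ROWS):
    print(row)
```

A config only "wins" convincingly when its mean beats the runner-up by more than the spread across seeds.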


Deliverables

Save to lab12/ in your personal repo. Submit either Track A or Track B (or both, if you’re ambitious).

Track A:

  1. lab12/train_mnist.py — the training script.
  2. lab12/train.slurm — the Slurm script.
  3. lab12/training_log.txt (copy of logs/lab12_train-<jobid>.out).
  4. lab12/gpu_log.txt (copy of logs/gpu-<jobid>.log).
  5. lab12/seff.txt (output of seff <jobid>).
  6. lab12/analysis.md — 5–7 sentences:
    • Is your GPU saturated, oscillating, or idle?
    • If oscillating, do you have a dataloader bottleneck? What would you change?
    • How much GPU memory did you use vs. the GPU’s total?

Track B:

  1. lab12/train_cpu.py — the per-config training script.
  2. lab12/sweep.slurm — the array script.
  3. lab12/aggregate_sweep.py — the aggregator.
  4. lab12/sweep_summary.csv (copy of results/sweep_summary.csv): all 12 results sorted by accuracy.
  5. lab12/analysis.md — 5–7 sentences:
    • Which hyperparameter combination won?
    • How confident are you (one seed × tiny dataset)?
    • What would a real, statistically-defensible sweep look like? (Hint: more seeds, more combinations, proper CV.)

Self-check

Track A:

Track B:


Common issues

❌ “no GPUs available” or partition rejection

You don’t have access to GPU partitions on Unity. Switch to Track B.

❌ CUDA out of memory

Reduce --batch-size. MNIST is small, but a large batch can still fill a small GPU (e.g. 4 GB). Try --batch-size 64.

❌ GPU utilization is 0% the whole time

Your code is running on CPU even though a GPU is allocated. Check:

  • torch.cuda.is_available() should be True. If it is False, your PyTorch build likely lacks CUDA support. Reinstall: mamba install -n eslab pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia
  • The model is moved with model.to(device), and tensors with x.to(device), before any compute.

❌ One sweep task fails, the others succeed

Look at the failing task’s log: cat logs/lab12_sweep-<arrayid>_<index>.out. Usually an array-index math error or a typo in the hyperparameter array. Re-run just that task with sbatch --array=<index> sweep.slurm.
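
To find which indices to re-run without reading twelve logs, you can compare the result files against the expected index range. This is a minimal sketch that assumes the result_<index>_... naming used in sweep.slurm; the missing_tasks helper is a hypothetical name.

```python
import glob
import re


def missing_tasks(filenames, n_tasks=12):
    """Return array indices in 0..n_tasks-1 that have no result file."""
    done = set()
    for name in filenames:
        m = re.search(r"result_(\d+)_", name)  # index follows 'result_'
        if m:
            done.add(int(m.group(1)))
    return [i for i in range(n_tasks) if i not in done]


if __name__ == "__main__":
    # Run from ~/hpc_practicum/lab12 after the array has finished.
    missing = missing_tasks(glob.glob("results/result_*.json"))
    print("Missing indices:", ",".join(str(i) for i in missing) or "none")
```

The printed comma-separated list can be passed straight to sbatch --array=<list> sweep.slurm, since Slurm accepts comma-separated indices.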

❌ MNIST download fails inside the Slurm job

Compute nodes sometimes have restricted internet access. Either:

  • Pre-download on the login node first. Run this from ~/hpc_practicum/lab12 so the files land in the script’s default data directory: python -c "from torchvision.datasets import MNIST; MNIST('data', train=True, download=True)"
  • Or pass --data-dir /fs/project/<group>/datasets/mnist if the data is already there.


Time estimate

  • Reading: ~30 min
  • Tasks: ~75 min (most of it queue wait + training time)
  • Deliverables: ~15 min

Total: ~2 hours


Extensions (optional)

Track A:

  • Try mixed precision with torch.autocast (torch.cuda.amp.autocast in older PyTorch releases). It often reduces activation memory and can speed up training, though the gain on a model this small may be modest. Compare GPU memory usage in the log before/after.
  • Push the model bigger (more conv layers / wider) until GPU utilization reaches ~95% and you stop seeing a dataloader bottleneck. That’s the model size your GPU “wants.”
  • Try --workers 8 with --cpus-per-task=12. Does GPU utilization improve?

Track B:

  • Add multiple seeds per config and average — gives a proper statistical comparison.
  • Replace the grid search with Optuna for smarter Bayesian sweeps over more dimensions.
  • Use --dependency=afterok to chain an aggregation job that auto-runs once the array completes.

What’s next?

You’ve now run real cluster jobs across CPU, GPU, and array workloads. The Capstone brings it all together with a 3-week project of your own.