Lab 12 — GPU jobs (or alternative: hyperparameter sweep)
Goal
Run your first GPU job on Unity — train a small PyTorch model and analyze GPU utilization to spot dataloader bottlenecks (Track A).
If you don’t have GPU access yet, run a CPU-side hyperparameter sweep as a job array and find the best config (Track B). The skills (resource management, sweep design, results aggregation) transfer directly.
Either track is fine — pick whichever matches your access and your research direction.
Reading
- Track A (GPU): Handbook: GPU Templates — sections 1, 2, 7. Also Slurm Best Practices §7 for CPU/RAM/GPU balancing.
- Track B (CPU sweep): Handbook: GPU Templates §4, adapted to CPU.
Budget ~30 minutes for the reading.
Learning objectives
Track A (GPU):
- Request a GPU in a Slurm job and verify your code can see it (nvidia-smi, torch.cuda.is_available()).
- Run a small PyTorch training (MNIST classifier) and capture GPU utilization over time.
- Read logs/gpu-<jobid>.log and decide whether you have a dataloader bottleneck.
- Tune num_workers and batch size based on what you see.
Track B (CPU sweep):
- Design a hyperparameter grid (e.g. lr × seed × batch_size) and map indices to combinations.
- Submit a job array that runs one combination per task.
- Aggregate results into a final summary table and pick the winning config.
Setup / prerequisites
- Labs 01–11 complete.
- For Track A: You have access to a partition that has GPUs. If you’re on batch only, check with sinfo -o "%P %G"; partitions with GPUs show a value other than (null) in the GRES column. If you don’t have GPU access, do Track B instead.
- For both tracks: PyTorch installed in your eslab env (or a separate env). Install with: mamba install -n eslab pytorch torchvision -c pytorch -c conda-forge. (This download is ~1.5 GB and takes a few minutes.)
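If you want to script the GRES check rather than read the sinfo table by eye, a minimal sketch follows. The function name gpu_partitions and the sample text are made up for illustration; real input would be the output of sinfo -o "%P %G" on your cluster, and partition names there will differ.

```python
def gpu_partitions(sinfo_output: str) -> list[str]:
    """Return partition names whose GRES column mentions a GPU."""
    found = []
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[0] == "PARTITION":
            continue  # skip the header row and anything malformed
        partition, gres = parts
        name = partition.rstrip("*")  # sinfo marks the default partition with '*'
        if "gpu" in gres.lower() and name not in found:
            found.append(name)
    return found


# Hypothetical sample; real text comes from: sinfo -o "%P %G"
sample = """PARTITION GRES
batch* (null)
gpu gpu:a100:4
gpu gpu:v100:2
"""

print(gpu_partitions(sample))
```

An empty list means no partition visible to you advertises GPUs, which is the cue to pick Track B.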
Track A — GPU PyTorch training
A1. Verify GPU access on a compute node (10 min)
sinteractive -p batch --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=01:00:00
(Adjust -p if you have a lab GPU partition.) Once allocated:
nvidia-smi
mamba activate eslab
python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"
✅ Self-check: nvidia-smi shows a GPU and torch.cuda.is_available() is True.
Exit the interactive session — we’ll run the actual training as a batch job.
A2. Write the training script (15 min)
Save ~/hpc_practicum/lab12/train_mnist.py:
"""
train_mnist.py: small MNIST CNN training, for Lab 12.
"""
import argparse, time, os, sys
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--batch-size", type=int, default=128)
    parser.add_argument("--workers", type=int, default=4)
    parser.add_argument("--data-dir", default=os.path.expanduser("~/hpc_practicum/lab12/data"))
    args = parser.parse_args()

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Device: {device}")
    if device.type == "cuda":
        print(f"GPU: {torch.cuda.get_device_name(0)}")

    tx = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
    ])
    train = datasets.MNIST(args.data_dir, train=True, download=True, transform=tx)
    loader = DataLoader(
        train,
        batch_size=args.batch_size,
        shuffle=True,
        num_workers=args.workers,
        pin_memory=(device.type == "cuda"),
    )

    model = TinyCNN().to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for epoch in range(args.epochs):
        t0 = time.time()
        for i, (x, y) in enumerate(loader):
            x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
            if i % 100 == 0:
                print(f"Epoch {epoch+1} step {i:4d}/{len(loader)} loss={loss.item():.4f}", flush=True)
        print(f"Epoch {epoch+1} done in {time.time()-t0:.1f}s", flush=True)


if __name__ == "__main__":
    main()
A3. Write the Slurm script with GPU diagnostics (10 min)
Copy your ~/templates/diagnostic.slurm and adapt:
cp ~/templates/diagnostic.slurm ~/hpc_practicum/lab12/train.slurm
Edit it to:
#!/bin/bash
#SBATCH --job-name=lab12_train
#SBATCH --partition=batch # or your GPU partition
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8 # 4 dataloader workers + slack
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=logs/%x-%j.out
set -euo pipefail
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo "=== JOB INFO ==="
echo "Job: $SLURM_JOB_ID on $(hostname)"
echo "CPUs: $SLURM_CPUS_PER_TASK Mem (MB): $SLURM_MEM_PER_NODE"
echo "GPUs: ${SLURM_GPUS:-${SLURM_JOB_GPUS:-?}}"
echo "Start: $(date)"
echo "================"
source ~/miniforge3/etc/profile.d/conda.sh
mamba activate eslab
nvidia-smi
python -c "import torch; print('CUDA:', torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Background GPU usage logger — captures samples every 30 sec
(
while sleep 30; do
echo "--- $(date '+%H:%M:%S') ---"
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader
done
) > "logs/gpu-${SLURM_JOB_ID}.log" 2>&1 &
GPU_LOGGER_PID=$!
# Train
/usr/bin/time -v python train_mnist.py --epochs 3 --batch-size 128 --workers 4
# Stop the GPU logger
kill $GPU_LOGGER_PID 2>/dev/null || true
echo "End: $(date)"
echo "Inspect logs/gpu-${SLURM_JOB_ID}.log for GPU utilization over time."
echo "Run 'seff $SLURM_JOB_ID' for CPU/Mem efficiency."
A4. Submit and analyze (15 min)
cd ~/hpc_practicum/lab12
mkdir -p logs
sbatch train.slurm
Watch the queue. When it finishes:
cat logs/lab12_train-<jobid>.out # training output
cat logs/gpu-<jobid>.log # GPU usage log
seff <jobid>
In the GPU log, look at the utilization.gpu column (first value on each line). Patterns:
- Sitting near 95%: 🎉 GPU is saturated, you’re getting your money’s worth.
- Oscillating 0% ↔︎ 90%: 🚫 Dataloader bottleneck — the GPU keeps finishing batches faster than the CPU can prepare new ones.
- Stuck near 0%: 🤔 Either the model is too small, or there’s a different bottleneck.
For MNIST + TinyCNN on a modern GPU, you should see oscillation: the model is so small that the GPU finishes each batch in milliseconds and then waits for the next one. To mitigate, edit the python line in train.slurm to use --batch-size 512 or --workers 8 (raising #SBATCH --cpus-per-task to 12 to match the extra workers), then resubmit.
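The three patterns above can also be checked programmatically. The sketch below assumes the log layout produced by the logger loop in train.slurm (a "--- HH:MM:SS ---" separator followed by one CSV line whose first field looks like "92 %"). The function names and the thresholds (85, 10, 50, 20) are illustrative guesses, not calibrated values, and the sample log is made up.

```python
import statistics


def read_utilization(log_text: str) -> list[int]:
    """Extract utilization.gpu (first CSV field) from each sample line."""
    values = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line or line.startswith("---"):
            continue  # timestamp separators written by the logger loop
        first = line.split(",")[0].replace("%", "").strip()
        if first.isdigit():  # ignores any nvidia-smi error text
            values.append(int(first))
    return values


def classify(values: list[int]) -> str:
    """Rough label for a utilization trace; thresholds are assumptions."""
    if not values:
        return "no samples"
    mean = statistics.mean(values)
    spread = max(values) - min(values)
    if mean >= 85 and spread < 20:
        return "saturated"
    if mean <= 10:
        return "idle"
    if spread >= 50:
        return "oscillating"
    return "partially utilized"


# Made-up sample in the format written by train.slurm
sample_log = """--- 12:00:30 ---
92 %, 812 MiB, 16384 MiB, 41
--- 12:01:00 ---
3 %, 812 MiB, 16384 MiB, 40
--- 12:01:30 ---
88 %, 812 MiB, 16384 MiB, 42
--- 12:02:00 ---
5 %, 812 MiB, 16384 MiB, 40
"""

values = read_utilization(sample_log)
print(values, "->", classify(values))
```

To use it on a real run, read logs/gpu-<jobid>.log into a string and pass it to read_utilization. Because the logger samples only every 30 seconds, a short training run yields just a handful of lines, so treat the label as a hint rather than a measurement.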
Track B — CPU hyperparameter sweep
B1. Write the trainable script (15 min)
Save ~/hpc_practicum/lab12/train_cpu.py:
"""
train_cpu.py: small sklearn experiment for hyperparameter sweep.
"""
import argparse, time, json, os
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, required=True)
    parser.add_argument("--n-estimators", type=int, required=True)
    parser.add_argument("--max-depth", type=int, required=True)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--out", required=True)
    args = parser.parse_args()

    t0 = time.time()
    # The seed controls both the synthetic dataset and the model's randomness.
    X, y = make_classification(n_samples=50_000, n_features=30, random_state=args.seed)
    clf = GradientBoostingClassifier(
        learning_rate=args.lr,
        n_estimators=args.n_estimators,
        max_depth=args.max_depth,
        random_state=args.seed,
    )
    # n_jobs=1 matches --cpus-per-task=1 in sweep.slurm.
    scores = cross_val_score(clf, X, y, cv=3, n_jobs=1)
    result = {
        "lr": args.lr,
        "n_estimators": args.n_estimators,
        "max_depth": args.max_depth,
        "seed": args.seed,
        "mean_accuracy": float(scores.mean()),
        "std_accuracy": float(scores.std()),
        "elapsed_s": time.time() - t0,
    }
    # Fall back to "." so a bare filename for --out does not crash makedirs.
    os.makedirs(os.path.dirname(args.out) or ".", exist_ok=True)
    with open(args.out, "w") as f:
        json.dump(result, f, indent=2)
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()
B2. Write the array sweep script (15 min)
Save as sweep.slurm:
#!/bin/bash
#SBATCH --job-name=lab12_sweep
#SBATCH --partition=batch
#SBATCH --time=00:15:00 # per task
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --output=logs/%x-%A_%a.out
#SBATCH --array=0-11%6 # 12 tasks, max 6 concurrent
set -euo pipefail
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
# Hyperparameter grid (12 combinations: 3 lr × 2 n_est × 2 depth)
LRS=(0.01 0.05 0.1)
N_ESTS=(50 200)
DEPTHS=(3 5)
IDX=$SLURM_ARRAY_TASK_ID
LR=${LRS[$(( IDX / 4 ))]}
N_EST=${N_ESTS[$(( (IDX / 2) % 2 ))]}
DEPTH=${DEPTHS[$(( IDX % 2 ))]}
echo "Task $IDX: lr=$LR, n_est=$N_EST, depth=$DEPTH on $(hostname)"
source ~/miniforge3/etc/profile.d/conda.sh
mamba activate eslab
python train_cpu.py \
    --lr $LR \
    --n-estimators $N_EST \
    --max-depth $DEPTH \
    --out "results/result_${IDX}_lr${LR}_n${N_EST}_d${DEPTH}.json"
B3. Submit and analyze (15 min)
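Before submitting, it can be worth checking the index arithmetic in sweep.slurm offline, since an off-by-one there silently skips or duplicates configs. The sketch below mirrors the shell expressions in Python and compares them against itertools.product over the same three lists. The function name config_for is made up for illustration.

```python
from itertools import product

# Same grid as in sweep.slurm
LRS = [0.01, 0.05, 0.1]
N_ESTS = [50, 200]
DEPTHS = [3, 5]


def config_for(idx: int) -> tuple[float, int, int]:
    """Mirror the shell arithmetic: IDX/4, (IDX/2)%2, IDX%2."""
    lr = LRS[idx // 4]
    n_est = N_ESTS[(idx // 2) % 2]
    depth = DEPTHS[idx % 2]
    return lr, n_est, depth


n_tasks = len(LRS) * len(N_ESTS) * len(DEPTHS)  # 12, matching --array=0-11
mapped = [config_for(i) for i in range(n_tasks)]

# Every combination must appear exactly once, in product() order.
assert mapped == list(product(LRS, N_ESTS, DEPTHS))

for i, cfg in enumerate(mapped):
    print(i, cfg)
```

If you change the length of any list, both the divisors in the arithmetic and the --array range have to change with it; rerunning this check catches a mismatch before any cluster time is spent.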
cd ~/hpc_practicum/lab12
mkdir -p logs results
sbatch sweep.slurm
squeue -u $USER
Wait for all 12 tasks to finish. Then aggregate:
# aggregate_sweep.py
import json, glob
import pandas as pd
paths = sorted(glob.glob("results/result_*.json"))
rows = [json.load(open(p)) for p in paths]
df = pd.DataFrame(rows).sort_values("mean_accuracy", ascending=False)
print(df.to_string(index=False))
df.to_csv("results/sweep_summary.csv", index=False)
python aggregate_sweep.py
cat results/sweep_summary.csv
The top row is your winning config, but note: with only 12 combinations and a single seed (sweep.slurm never passes --seed, so every task uses the default of 42), the winner is noisy. For real sweeps, run multiple seeds per config.
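A multi-seed comparison could be aggregated roughly as follows. This is a sketch using only the standard library: the function name summarize and the accuracy numbers are made up, and it assumes sweep.slurm has been extended to pass --seed so that several result files share the same (lr, n_estimators, max_depth).

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(rows: list[dict]) -> list[dict]:
    """Group per-run results by config and average accuracy across seeds."""
    by_config = defaultdict(list)
    for r in rows:
        key = (r["lr"], r["n_estimators"], r["max_depth"])
        by_config[key].append(r["mean_accuracy"])

    summary = []
    for (lr, n_est, depth), accs in by_config.items():
        summary.append({
            "lr": lr,
            "n_estimators": n_est,
            "max_depth": depth,
            "n_seeds": len(accs),
            "mean": mean(accs),
            # stdev needs at least two values
            "std": stdev(accs) if len(accs) > 1 else 0.0,
        })
    return sorted(summary, key=lambda s: s["mean"], reverse=True)


# Made-up rows in the shape written by train_cpu.py
rows = [
    {"lr": 0.1, "n_estimators": 200, "max_depth": 3, "seed": 0, "mean_accuracy": 0.90},
    {"lr": 0.1, "n_estimators": 200, "max_depth": 3, "seed": 1, "mean_accuracy": 0.92},
    {"lr": 0.05, "n_estimators": 50, "max_depth": 5, "seed": 0, "mean_accuracy": 0.88},
    {"lr": 0.05, "n_estimators": 50, "max_depth": 5, "seed": 1, "mean_accuracy": 0.86},
]

summary = summarize(rows)
for s in summary:
    print(s)
```

When the gap between two configs' means is smaller than their spread across seeds, the ranking between them should not be trusted.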
Deliverables
Save to lab12/ in your personal repo. Submit either Track A or Track B (or both, if you’re ambitious).
Track A:
- lab12/train_mnist.py: the training script.
- lab12/train.slurm: the Slurm script.
- lab12/training_log.txt: logs/lab12_train-<jobid>.out.
- lab12/gpu_log.txt: logs/gpu-<jobid>.log.
- lab12/seff.txt: seff <jobid> output.
- lab12/analysis.md: 5–7 sentences:
  - Is your GPU saturated, oscillating, or idle?
  - If oscillating, do you have a dataloader bottleneck? What would you change?
  - How much GPU memory did you use vs. the GPU’s total?
Track B:
- lab12/train_cpu.py: the per-config training script.
- lab12/sweep.slurm: the array script.
- lab12/aggregate_sweep.py: the aggregator.
- lab12/sweep_summary.csv: all 12 results sorted by accuracy.
- lab12/analysis.md: 5–7 sentences:
  - Which hyperparameter combination won?
  - How confident are you (one seed × tiny dataset)?
  - What would a real, statistically-defensible sweep look like? (Hint: more seeds, more combinations, proper CV.)
Self-check
Track A:
Track B:
Common issues
❌ “no GPUs available” or partition rejection
You don’t have access to GPU partitions on Unity. Switch to Track B.
❌ CUDA out of memory
Reduce --batch-size. MNIST is small, but a small or shared GPU (e.g. 4 GB) can still run out of memory at large batch sizes. Try --batch-size 64.
❌ GPU utilization is 0% the whole time
Your code is running on CPU even though a GPU is allocated. Check:
- torch.cuda.is_available() should be True. If it is False, your PyTorch wasn’t built with CUDA. Reinstall: mamba install -n eslab pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia
- The model is moved with model.to(device) and tensors with x.to(device) before any compute.
❌ One sweep task fails, the others succeed
Look at the failing task’s log: cat logs/lab12_sweep-<arrayid>_<index>.out. Usually an array-index math error or a typo in the hyperparameter array. Re-run just that task with sbatch --array=<index> sweep.slurm.
❌ MNIST download fails inside the Slurm job
Compute nodes sometimes have restricted internet. Either:
- Pre-download on the login node first, from ~/hpc_practicum/lab12 so the files land in the script’s default --data-dir (python -c "from torchvision.datasets import MNIST; MNIST('data', train=True, download=True)")
- Or use --data-dir /fs/project/<group>/datasets/mnist if the data is already there
Time estimate
- Reading: ~30 min
- Tasks: ~75 min (most of it queue wait + training time)
- Deliverables: ~15 min
Total: ~2 hours
Extensions (optional)
Track A:
- Try mixed-precision with torch.cuda.amp.autocast (spelled torch.amp.autocast("cuda") in recent PyTorch releases). It often reduces activation memory and modestly speeds training. Compare GPU memory usage in the log before/after.
- Push the model bigger (more conv layers / wider) until GPU utilization reaches ~95% and you stop seeing a dataloader bottleneck. That’s the model size your GPU “wants.”
- Try num_workers=8 and --cpus-per-task=12. Does GPU utilization improve?
Track B:
- Add multiple seeds per config and average — gives a proper statistical comparison.
- Replace the grid search with Optuna for smarter Bayesian sweeps over more dimensions.
- Use --dependency=afterok:<array_jobid> to chain an aggregation job that auto-runs once every task in the array has completed successfully.
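One way the dependency chain might be driven from Python is sketched below. It relies on sbatch --parsable, which prints the job ID (optionally followed by ";cluster"), and on --wrap to submit a one-line job. The function names, the job name lab12_aggregate, and the cluster name in the demo string are all hypothetical. The wrapped command also assumes python resolves to the eslab env inside the job; a small aggregate.slurm that activates the env would be more robust.

```python
import subprocess


def parse_job_id(parsable_output: str) -> str:
    """`sbatch --parsable` prints 'jobid' or 'jobid;cluster'; keep the ID."""
    return parsable_output.strip().split(";")[0]


def aggregate_command(array_job_id: str) -> list[str]:
    """Build the sbatch call that runs the aggregator after the array succeeds."""
    return [
        "sbatch",
        f"--dependency=afterok:{array_job_id}",
        "--job-name=lab12_aggregate",  # hypothetical name
        "--output=logs/%x-%j.out",
        "--wrap", "python aggregate_sweep.py",
    ]


def submit_chain() -> None:
    """Submit the sweep, then the dependent aggregation job (needs Slurm)."""
    out = subprocess.run(
        ["sbatch", "--parsable", "sweep.slurm"],
        check=True, capture_output=True, text=True,
    ).stdout
    subprocess.run(aggregate_command(parse_job_id(out)), check=True)


# Dry run with a made-up ID: shows the command without contacting Slurm.
print(" ".join(aggregate_command(parse_job_id("123456;unity\n"))))
```

Calling submit_chain() from ~/hpc_practicum/lab12 on a login node would submit both jobs. With afterok, the aggregation job starts only if all array tasks exit with code 0; if any task fails, the dependent job will not run.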
What’s next?
You’ve now run real cluster jobs across CPU, GPU, and array workloads. The Capstone brings it all together with a 3-week project of your own.