Lab 08 — Your first Slurm batch job
Goal
Write, submit, monitor, and post-mortem your first unattended Slurm batch job. By the end of this lab you’ll have run a short Python job under sbatch, found its output, cancelled a long-running one mid-flight, and used seff to inspect the job’s resource efficiency afterwards.
This is the foundation for everything that follows. Every research run on the cluster will use the same handful of commands you learn here.
Reading
- Handbook: Slurm Basics — end-to-end (~30 minutes).
Pay particular attention to:
- Section 2 (the three ways to use Slurm: sbatch, sinteractive, srun)
- Section 3 (the annotated minimal script)
- Section 5 (job states: PD, R, CD, F, OOM, TO)
- Section 9 (common newcomer gotchas)
Learning objectives
- Write a minimal myjob.slurm batch script with the right #SBATCH directives.
- Submit it with sbatch, monitor with squeue -u $USER, find its output in logs/.
- Submit a long-running job, then cancel it with scancel.
- Read seff <jobid> and explain each line of the efficiency report.
Setup / prerequisites
- Labs 01–05 complete: SSH working, mamba eslab env available
- VS Code Remote-SSH connected to Unity (or any terminal session on Unity)
Tasks
1. Set up a project directory (3 min)
cd ~
mkdir -p hpc_practicum/lab08
cd hpc_practicum/lab08
mkdir -p logs # CRITICAL: Slurm fails silently if logs/ doesn't exist
2. Write a Python script for Slurm to run (5 min)
Create hello_slurm.py:
import os, sys, socket, time
import numpy as np
print("=" * 50)
print("Python script starting")
print(f"Host: {socket.gethostname()}")
print(f"Python: {sys.version.split()[0]}")
print(f"Slurm job ID: {os.environ.get('SLURM_JOB_ID', 'not running under Slurm')}")
print(f"Slurm CPUs: {os.environ.get('SLURM_CPUS_PER_TASK', 'unknown')}")
print(f"Started at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 50)
# Do something that takes a few seconds and uses a bit of memory
print("Doing some matrix math...")
A = np.random.randn(2000, 2000)
B = np.random.randn(2000, 2000)
C = A @ B
print(f"Result shape: {C.shape}, mean: {C.mean():.4f}")
print("Sleeping 20 seconds so you can watch it in `squeue`...")
time.sleep(20)
print("=" * 50)
print(f"Finished at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 50)
3. Write the Slurm script (10 min)
Create myjob.slurm:
#!/bin/bash
#SBATCH --job-name=lab08
#SBATCH --partition=batch # use your lab partition if you have one
#SBATCH --time=00:05:00 # 5-minute walltime cap
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --output=logs/%x-%j.out # %x=jobname, %j=jobid
set -euo pipefail
# Vital signs — useful in every job log
echo "Job: $SLURM_JOB_ID ($SLURM_JOB_NAME)"
echo "Host: $(hostname)"
echo "Start: $(date)"
# Activate the mamba env (necessary; Slurm doesn't inherit your interactive shell)
source ~/miniforge3/etc/profile.d/conda.sh
mamba activate eslab
# Run the work
python hello_slurm.py
echo "End: $(date)"
echo "Run 'seff $SLURM_JOB_ID' for an efficiency report."
Read every line and make sure you understand it. The source ~/miniforge3/... + mamba activate eslab block is what makes your eslab env available; without it, the job’s python is the system Python, not the env’s.
4. Submit the job (5 min)
sbatch myjob.slurm
# Submitted batch job 12345
Note the job ID. Then immediately monitor:
squeue -u $USER
What you should see, in rough sequence:
- ST = PD with reason (Resources) or (Priority): pending, waiting for the scheduler
- ST = R: running, with TIME ticking up
- (After ~30 seconds): gone from the queue, completed
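If you would rather check the state from a script than by eye, the table that squeue prints can be parsed. The following is a minimal Python sketch, assuming squeue's default column layout (JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)) and job names without spaces. The sample text and the name parse_squeue are illustrative and were not captured from Unity.

```python
# Short state codes as they appear in squeue's ST column.
STATE_NAMES = {
    "PD": "pending",
    "R": "running",
    "CG": "completing",
    "CD": "completed",
    "CA": "cancelled",
    "F": "failed",
    "TO": "timed out",
    "OOM": "out of memory",
}


def parse_squeue(text):
    """Turn squeue's default table into a list of dicts, one per job."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # maxsplit keeps the last column (node list or reason) in one piece
        fields = line.split(None, len(header) - 1)
        job = dict(zip(header, fields))
        job["STATE_NAME"] = STATE_NAMES.get(job.get("ST"), "unknown")
        jobs.append(job)
    return jobs


# Made-up sample output, for illustration only.
SAMPLE = """\
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12345     batch    lab08  someone PD       0:00      1 (Resources)
  12346     batch    lab08  someone  R       0:07      1 node042
"""

for job in parse_squeue(SAMPLE):
    print(job["JOBID"], job["STATE_NAME"], job["NODELIST(REASON)"])
```

A job that has left the queue simply produces no row, which is why "gone" is the signal that it finished.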
Once finished:
ls -la logs/
cat logs/lab08-12345.out # use your real job ID
You should see the output from your echo lines and the Python script.
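The log name comes straight from the --output=logs/%x-%j.out pattern in myjob.slurm. As a rough Python sketch of that substitution (it handles only the two placeholders used in this lab, and the helper names expand_output_pattern and newest_log are hypothetical):

```python
from pathlib import Path


def expand_output_pattern(pattern, job_name, job_id):
    """Expand the two placeholders used in this lab: %x (job name), %j (job ID)."""
    return pattern.replace("%x", job_name).replace("%j", str(job_id))


def newest_log(log_dir="logs", job_name="lab08"):
    """Return the most recently modified log for job_name, or None if there is none."""
    candidates = sorted(
        Path(log_dir).glob(f"{job_name}-*.out"),
        key=lambda path: path.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


print(expand_output_pattern("logs/%x-%j.out", "lab08", 12345))
# prints: logs/lab08-12345.out
```

newest_log() is handy when you have forgotten the job ID and just want the latest run.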
5. Watch a job from start to finish (5 min)
To see all three states in real time, in one terminal window:
watch -n 1 squeue -u $USER
Submit a new job in another terminal:
sbatch myjob.slurm
Watch the state column transition PD → R → (gone). Press Ctrl+C to exit watch.
6. Use seff for the post-mortem (5 min)
seff 12345
Output looks like:
Job ID: 12345
Cluster: unity
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:23
CPU Efficiency: 95.83% of 00:00:24 core-walltime
Job Wall-clock time: 00:00:24
Memory Utilized: 142.21 MB
Memory Efficiency: 6.95% of 2.00 GB
Two efficiency numbers:
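- CPU Efficiency close to 100% = you kept all the CPUs you asked for busy. Lower = the cores sat idle for part of the run: you over-asked, or the job spent time waiting (for example on I/O or in a sleep).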
- Memory Efficiency close to 100% = you used all the memory you asked for. Much lower = you over-asked.
In this case, memory efficiency is 6.95% — you asked for 2 GB and used 142 MB. For a real job you’d dial --mem down. (You’ll do exactly this in Lab 9.)
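The two percentages are plain ratios, so you can check them by hand. Here is a small Python sketch of the arithmetic using the sample numbers above. It assumes 1 GB = 1024 MB and times in HH:MM:SS form (no day prefix). The 50% headroom in suggested_mem_mb is a hypothetical rule of thumb for illustration, not the course's recommendation.

```python
def hms_to_seconds(hms):
    """Convert 'HH:MM:SS' (as printed by seff) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized, wall_clock, cores):
    """CPU time used, as a percentage of cores * wall-clock time."""
    core_walltime = cores * hms_to_seconds(wall_clock)
    return 100.0 * hms_to_seconds(cpu_utilized) / core_walltime


def memory_efficiency(used_mb, requested_gb):
    """Peak memory as a percentage of the request (1 GB = 1024 MB here)."""
    return 100.0 * used_mb / (requested_gb * 1024)


def suggested_mem_mb(used_mb, headroom=1.5):
    """Hypothetical rule of thumb: peak usage plus 50% headroom, rounded up."""
    return int(used_mb * headroom) + 1


print(f"CPU efficiency:    {cpu_efficiency('00:00:23', '00:00:24', 1):.2f}%")
print(f"Memory efficiency: {memory_efficiency(142.21, 2.0):.2f}%")
print(f"Suggested --mem:   {suggested_mem_mb(142.21)}M")
```

The CPU figure reproduces seff's 95.83%. The memory figure lands within rounding of seff's 6.95%, because seff works from the unrounded byte count rather than the displayed 142.21 MB.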
7. Cancel a running job (10 min)
Now intentionally submit a long job so you can cancel it. Edit hello_slurm.py to time.sleep(600) (10 minutes) instead of 20 seconds. Then:
sbatch myjob.slurm
# Note the job ID
squeue -u $USER # confirm it's R (or PD, then will be R)
Wait for it to enter R state, then:
scancel <jobid>
squeue -u $USER # confirm it's gone (may briefly show CG = completing)
Look at the log to see how it was killed:
cat logs/lab08-<jobid>.out
You should see the start-of-job lines, then a partial output as the process was interrupted.
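How much of the Python output reaches the log depends on buffering: when stdout is redirected to a file, Python buffers it in blocks, so lines printed just before the kill may be missing. The sketch below shows one way a script could react to a cancel, assuming the cancel reaches the Python process as SIGTERM before the final SIGKILL (typical for scancel, but dependent on site configuration). The names state, request_stop, install_handler, and interruptible_sleep are illustrative.

```python
import signal
import time

state = {"stop": False}


def request_stop(signum, frame):
    """Signal handler: record the request and flush a note to the log."""
    state["stop"] = True
    print(f"Received signal {signum}, stopping early", flush=True)


def install_handler():
    """Call once near the top of the script (must run in the main thread)."""
    signal.signal(signal.SIGTERM, request_stop)


def interruptible_sleep(total_seconds, step=1.0):
    """Sleep in short steps so a stop request is noticed within `step` seconds.

    Returns the number of seconds actually slept.
    """
    slept = 0.0
    while slept < total_seconds and not state["stop"]:
        chunk = min(step, total_seconds - slept)
        time.sleep(chunk)
        slept += chunk
    return slept
```

In hello_slurm.py this would mean calling install_handler() at the top and interruptible_sleep(600) in place of time.sleep(600). Passing flush=True to print (or running python -u) makes each line appear in the log as soon as it is printed.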
8. (Optional) Set up --mail-type for completion emails (3 min)
Add to myjob.slurm:
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=yourname@osu.edu
Submit again. When the job finishes (or fails) you’ll get an email. Useful for long-running jobs you don’t want to watch.
Deliverables
Save to lab08/ in your personal repo:
- lab08/hello_slurm.py: the Python script (the time.sleep(20) version, not the 600).
- lab08/myjob.slurm: your batch script. Redact any real OSU email if you added --mail-user.
- lab08/job_log.txt: the contents of logs/lab08-<jobid>.out from a successful run.
- lab08/seff.txt: the output of seff <jobid> for that successful run.
- lab08/cancelled_log.txt: the partial log from your scancel-ed job in Task 7.
- lab08/reflection.md: 5–7 sentences answering:
  - Looking at your seff output, was your --mem=2G request right-sized? What would you change?
  - What’s the difference between --time (walltime) and seff’s “CPU Utilized” time?
  - If squeue shows your job as PD with reason (Resources), what does that mean, and what could you change to start it sooner?
Common issues
❌ “Submitted batch job 12345” but no output file appears
Most likely cause: logs/ doesn’t exist, so --output=logs/%x-%j.out couldn’t write. Run mkdir -p logs and resubmit. (Slurm fails silently on this — there’s no error in squeue.)
❌ The job ran but python hello_slurm.py failed with ModuleNotFoundError: No module named 'numpy'
The mamba env isn’t being activated correctly. Check your script:
1. Is the source ~/miniforge3/etc/profile.d/conda.sh line present?
2. Is mamba activate eslab after it?
3. Is eslab the env you actually want?
❌ Job dies with state OOM (out of memory)
You asked for less memory than the job actually needs. Bump --mem= (e.g. from 2G to 8G for this lab, though Lab 09 will teach you how to measure rather than guess).
❌ Job dies with state TO (timed out)
You asked for less walltime than the job needs. Bump --time=. For this lab, --time=00:05:00 should be plenty.
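When choosing a new --time value it helps to know how Slurm reads it. sbatch accepts minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, and days-hours:minutes:seconds. The following Python sketch converts any of those to seconds; walltime_to_seconds is a hypothetical helper name, not part of Slurm.

```python
def walltime_to_seconds(spec):
    """Convert a Slurm --time value to seconds."""
    days = 0
    if "-" in spec:
        # days-hours[:minutes[:seconds]]
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        if len(parts) > 3:
            raise ValueError(f"unrecognised time spec: {spec!r}")
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) == 1:        # minutes
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:      # minutes:seconds
            hours, minutes, seconds = 0, parts[0], parts[1]
        elif len(parts) == 3:      # hours:minutes:seconds
            hours, minutes, seconds = parts
        else:
            raise ValueError(f"unrecognised time spec: {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(walltime_to_seconds("00:05:00"))  # the cap used in this lab
# prints: 300
```

Note that a bare number is read as minutes, so --time=5 and --time=00:05:00 ask for the same cap.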
❌ squeue shows my job as PD for a long time
The cluster is busy. Read the NODELIST(REASON) column:
- (Resources): waiting for someone else to finish so resources free up
- (Priority): there are higher-priority jobs ahead of you
- (QOSMaxJobsPerUserLimit): you have too many jobs in flight
- (AssocGrpJobsLimit): your account hit a limit
Most often the answer is “be patient.” Sometimes the answer is “reduce your request so the job fits in more gaps” — that’s the lesson of Lab 9.
❌ seff says “Job not found in db”
The job is too recent (accounting hasn’t caught up yet — wait a minute) or too old (Slurm purged it). For currently-running jobs, use sstat -j <jobid> instead.
Time estimate
- Reading: ~30 min
- Tasks: ~45 min (including waiting for jobs to schedule)
- Deliverables: ~10 min
Total: ~1.5 hours
Extensions (optional)
Add the diagnostic vital-signs block
Take the longer “vital signs” block from Handbook Best Practices §8 and add it to your myjob.slurm. This logs every Slurm env var at the top of every output file — invaluable when you have dozens of past jobs and need to figure out which one used which resources.
Try sacct for richer accounting info
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS,ReqMem,CPUTime
This is what seff is built on top of. Useful for batch-querying many jobs at once.
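For batch queries it is easier to work with delimited output than with the aligned table. The Python sketch below assumes you add sacct's --parsable2 flag, which prints fields separated by | characters. The sample rows are made up for illustration, and peak_rss_kb assumes MaxRSS values carry a K suffix, which will not hold for every job.

```python
import csv
import io

# Made-up sample of `sacct --parsable2` output, for illustration only.
SAMPLE = """\
JobID|JobName|State|Elapsed|MaxRSS|ReqMem|CPUTime
12345|lab08|COMPLETED|00:00:24||2G|00:00:24
12345.batch|batch|COMPLETED|00:00:24|145624K||00:00:24
"""


def parse_sacct(text):
    """Parse pipe-delimited sacct output into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))


def peak_rss_kb(rows):
    """Largest MaxRSS across all rows, in KB; rows without a 'K' value are skipped."""
    values = [int(row["MaxRSS"][:-1]) for row in rows if row["MaxRSS"].endswith("K")]
    return max(values) if values else None


rows = parse_sacct(SAMPLE)
for row in rows:
    print(row["JobID"], row["State"], row["MaxRSS"] or "-")
print("Peak RSS (KB):", peak_rss_kb(rows))
```

In this sample the memory figure sits on the .batch step row rather than on the job row, so a script should look at every row for a job.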
Submit a job that depends on another
Submit job A, get its ID, then submit job B that won’t start until A finishes successfully:
sbatch myjob.slurm # job 12345
sbatch --dependency=afterok:12345 followup.slurm
Useful for multi-stage pipelines (preprocess → train → analyze).
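Copying job IDs by hand gets tedious once a pipeline has more than two stages. Below is a minimal Python sketch of chaining submissions, assuming sbatch's --parsable flag (which prints just the job ID, optionally followed by ;cluster). The helper names sbatch_command and submit are hypothetical, and submit only works on a machine where sbatch is available.

```python
import subprocess


def sbatch_command(script, after_ok=()):
    """Build the sbatch argument list, optionally waiting on earlier jobs."""
    cmd = ["sbatch", "--parsable"]
    if after_ok:
        ids = ":".join(str(job_id) for job_id in after_ok)
        cmd.append(f"--dependency=afterok:{ids}")
    cmd.append(script)
    return cmd


def submit(script, after_ok=()):
    """Run sbatch and return the new job ID (requires a Slurm login node)."""
    result = subprocess.run(
        sbatch_command(script, after_ok),
        check=True,
        capture_output=True,
        text=True,
    )
    # --parsable prints "jobid" or "jobid;cluster"
    return int(result.stdout.strip().split(";")[0])


print(sbatch_command("followup.slurm", after_ok=[12345]))
# prints: ['sbatch', '--parsable', '--dependency=afterok:12345', 'followup.slurm']
```

On the cluster, a two-stage chain would then look like first = submit("myjob.slurm") followed by submit("followup.slurm", after_ok=[first]).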
What’s next?
With basic batch submission under your belt, Lab 09 — Right-sizing: the headline lab drives home the most important Slurm skill: asking for the resources you actually need.