Lab 08 — Your first Slurm batch job

Goal

Write, submit, monitor, and post-mortem your first unattended Slurm batch job. By the end of this lab you’ll have run a short Python job under sbatch, found its output, cancelled a long-running one mid-flight, and used seff to inspect the job’s resource efficiency afterwards.

This is the foundation for everything that follows. Every research run on the cluster will use the same handful of commands you learn here.


Reading

Pay particular attention to:

  • Section 2 (the three ways to use Slurm: sbatch, sinteractive, srun)
  • Section 3 (the annotated minimal script)
  • Section 5 (job states: PD, R, CD, F, OOM, TO)
  • Section 9 (common newcomer gotchas)


Learning objectives

  1. Write a minimal myjob.slurm batch script with the right #SBATCH directives.
  2. Submit it with sbatch, monitor with squeue -u $USER, find its output in logs/.
  3. Submit a long-running job, then cancel it with scancel.
  4. Read seff <jobid> and explain each line of the efficiency report.

Setup / prerequisites

  • Labs 01–05 complete — SSH working, mamba eslab env available
  • VS Code Remote-SSH connected to Unity (or any terminal session on Unity)

Tasks

1. Set up a project directory (3 min)

cd ~
mkdir -p hpc_practicum/lab08
cd hpc_practicum/lab08
mkdir -p logs                      # CRITICAL — Slurm fails silently if logs/ doesn't exist

2. Write a Python script for Slurm to run (5 min)

Create hello_slurm.py:

import os, sys, socket, time
import numpy as np

print("=" * 50)
print("Python script starting")
print(f"Host: {socket.gethostname()}")
print(f"Python: {sys.version.split()[0]}")
print(f"Slurm job ID: {os.environ.get('SLURM_JOB_ID', 'not running under Slurm')}")
print(f"Slurm CPUs:   {os.environ.get('SLURM_CPUS_PER_TASK', 'unknown')}")
print(f"Started at:   {time.strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 50)

# Do something that takes a few seconds and uses a bit of memory
print("Doing some matrix math...")
A = np.random.randn(2000, 2000)
B = np.random.randn(2000, 2000)
C = A @ B
print(f"Result shape: {C.shape}, mean: {C.mean():.4f}")

print("Sleeping 20 seconds so you can watch it in `squeue`...")
time.sleep(20)

print("=" * 50)
print(f"Finished at:  {time.strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 50)

3. Write the Slurm script (10 min)

Create myjob.slurm:

#!/bin/bash
#SBATCH --job-name=lab08
#SBATCH --partition=batch                 # use your lab partition if you have one
#SBATCH --time=00:05:00                   # 5-minute walltime cap
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --output=logs/%x-%j.out           # %x=jobname, %j=jobid

set -euo pipefail

# Vital signs — useful in every job log
echo "Job:    $SLURM_JOB_ID ($SLURM_JOB_NAME)"
echo "Host:   $(hostname)"
echo "Start:  $(date)"

# Activate the mamba env (necessary; the batch shell does not load your interactive shell's setup)
source ~/miniforge3/etc/profile.d/conda.sh
mamba activate eslab

# Run the work
python hello_slurm.py

echo "End:    $(date)"
echo "Run 'seff $SLURM_JOB_ID' for an efficiency report."

Read every line and make sure you understand it. The source ~/miniforge3/... + mamba activate eslab block is what makes your eslab env available — without it, the job’s python is the system Python, not the env’s.
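To confirm which interpreter a job actually picked up, a few extra lines at the top of hello_slurm.py can report it. This is a minimal sketch, not part of the required script: describe_interpreter is a hypothetical helper name, and it assumes conda/mamba sets CONDA_DEFAULT_ENV on activation (the usual behaviour, but worth checking on your install).

```python
import os
import sys


def describe_interpreter():
    """Collect facts that show which Python a job is actually running."""
    return {
        "executable": sys.executable,   # full path of the running interpreter
        "prefix": sys.prefix,           # root of the active installation or env
        # Assumed to be set by conda/mamba on activation; "none" if it is absent
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "none"),
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key:>10}: {value}")
```

If the executable path does not sit under your eslab env directory, the activation block did not take effect.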

4. Submit the job (5 min)

sbatch myjob.slurm
# Submitted batch job 12345

Note the job ID. Then immediately monitor:

squeue -u $USER

What you should see, in rough sequence:

  • ST = PD with reason (Resources) or (Priority) — pending, waiting for scheduler
  • ST = R — running, with TIME ticking up
  • (After ~30 seconds): gone from the queue — completed

Once finished:

ls -la logs/
cat logs/lab08-12345.out                  # use your real job ID

You should see the output from your echo lines and the Python script.
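The log file name comes from the --output pattern in myjob.slurm. As a rough illustration of how that pattern maps to a path, the sketch below expands only the two placeholders this lab uses (%x and %j); Slurm supports more, and expand_output_pattern is a hypothetical helper, not a Slurm API.

```python
def expand_output_pattern(pattern, job_name, job_id):
    """Expand only the %x (job name) and %j (job ID) placeholders."""
    return pattern.replace("%x", job_name).replace("%j", str(job_id))


log_path = expand_output_pattern("logs/%x-%j.out", "lab08", 12345)
print(log_path)  # logs/lab08-12345.out
```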

5. Watch a job from start to finish (5 min)

To see all three states in real time, in one terminal window:

watch -n 1 squeue -u $USER

Submit a new job in another terminal:

sbatch myjob.slurm

Watch the state column transition PD → R → (gone). Press Ctrl+C to exit watch.
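The same watching can be scripted. The sketch below polls squeue for one job and prints each state change until the job leaves the queue. It assumes squeue is on your PATH and relies on the -h (no header), -j (job ID) and -o "%t" (compact state) options; the function names are hypothetical, and the treatment of a non-zero exit as "gone" is an assumption about how squeue reports a job it no longer knows.

```python
import subprocess
import time


def parse_state(squeue_output):
    """Return the compact state code (e.g. 'PD', 'R'), or None if no job is listed."""
    lines = [line.strip() for line in squeue_output.splitlines() if line.strip()]
    return lines[0] if lines else None


def current_state(job_id):
    """Ask squeue for one job's state; None once it has left the queue."""
    result = subprocess.run(
        ["squeue", "-h", "-j", str(job_id), "-o", "%t"],
        capture_output=True, text=True,
    )
    # Assumption: a non-zero exit means squeue no longer knows this job ID
    if result.returncode != 0:
        return None
    return parse_state(result.stdout)


def wait_for_job(job_id, poll_seconds=5):
    """Print each state change until the job disappears from squeue."""
    last = object()  # sentinel that never equals a real state
    while True:
        state = current_state(job_id)
        if state != last:
            print(f"job {job_id}: {state or 'gone'}")
            last = state
        if state is None:
            return
        time.sleep(poll_seconds)
```

On a login node you would call wait_for_job(12345) with your real job ID. Keep the polling interval at a few seconds or more so you don't hammer the scheduler.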

6. Use seff for the post-mortem (5 min)

seff 12345

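Output looks something like this (your numbers will differ; because hello_slurm.py spends most of its run in time.sleep, expect a much lower CPU Efficiency than this sample shows):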

Job ID: 12345
Cluster: unity
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:23
CPU Efficiency: 95.83% of 00:00:24 core-walltime
Job Wall-clock time: 00:00:24
Memory Utilized: 142.21 MB
Memory Efficiency: 6.95% of 2.00 GB

Two efficiency numbers:

  • CPU Efficiency close to 100% = your job kept every CPU you asked for busy. Lower = you over-asked, or the job spent time idle (sleeping, waiting on I/O).
  • Memory Efficiency close to 100% = you used all the memory you asked for. Much lower = you over-asked.

In this case, memory efficiency is 6.95% — you asked for 2 GB and used 142 MB. For a real job you’d dial --mem down. (You’ll do exactly this in Lab 9.)
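Both percentages are simple ratios, and it helps to redo the arithmetic by hand once. The sketch below recomputes them from the sample report's numbers. It assumes times in plain HH:MM:SS form (no day prefix) and 1 GB = 1024 MB; the function names are hypothetical.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized, wall_clock, cores):
    """CPU time used, as a percentage of cores x wall-clock time."""
    return 100 * hms_to_seconds(cpu_utilized) / (cores * hms_to_seconds(wall_clock))


def memory_efficiency(used_mb, requested_gb):
    """Peak memory as a percentage of the request (assuming 1 GB = 1024 MB)."""
    return 100 * used_mb / (requested_gb * 1024)


cpu = cpu_efficiency("00:00:23", "00:00:24", cores=1)
mem = memory_efficiency(142.21, 2.0)
print(f"CPU efficiency:    {cpu:.2f}%")
print(f"Memory efficiency: {mem:.2f}%")
```

The results land close to the figures in the sample report. Note the cores factor in the denominator: with --cpus-per-task=4, the same 23 seconds of CPU time would score roughly a quarter as high.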

7. Cancel a running job (10 min)

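Now intentionally submit a long job so you can cancel it. Edit hello_slurm.py to call time.sleep(600) (10 minutes) instead of time.sleep(20). The 5-minute --time cap in myjob.slurm would end this job on its own eventually, so cancel it well before then. Then: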

sbatch myjob.slurm
# Note the job ID
squeue -u $USER                          # confirm it's R (or PD, then will be R)

Wait for it to enter R state, then:

scancel <jobid>
squeue -u $USER                          # confirm it's gone (may briefly show CG = completing)

Look at the log to see how it was killed:

cat logs/lab08-<jobid>.out

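You should see the start-of-job lines, then the output stops partway through because the process was interrupted. Slurm usually appends a line of the form slurmstepd: error: *** JOB <jobid> ON <node> CANCELLED AT <time> ***. If the Python print lines are missing, that is because Python buffers stdout when it is redirected to a file, and the buffer is lost when the process is killed; running python -u hello_slurm.py turns that buffering off.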

8. (Optional) Set up --mail-type for completion emails (3 min)

Add to myjob.slurm:

#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=yourname@osu.edu

Submit again — when the job finishes (or fails) you’ll get an email. Useful for long-running jobs you don’t want to watch.


Deliverables

Save to lab08/ in your personal repo:

  1. lab08/hello_slurm.py — the Python script (with time.sleep(20) version, not the 600).

  2. lab08/myjob.slurm — your batch script. Redact any real OSU email if you added --mail-user.

  3. lab08/job_log.txt — the contents of logs/lab08-<jobid>.out from a successful run.

  4. lab08/seff.txt — the output of seff <jobid> for that successful run.

  5. lab08/cancelled_log.txt — the partial log from your scanceled job in Task 7.

  6. lab08/reflection.md — 5–7 sentences answering:

    • Looking at your seff output, was your --mem=2G request right-sized? What would you change?
    • What’s the difference between --time (walltime) and seff’s “CPU Utilized” time?
    • If squeue shows your job as PD with reason (Resources), what does that mean — and what could you change to start it sooner?

Self-check


Common issues

❌ “Submitted batch job 12345” but no output file appears

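Most likely cause: logs/ doesn’t exist, so Slurm couldn’t open --output=logs/%x-%j.out and the job failed as soon as it started. Run mkdir -p logs and resubmit. (Slurm fails silently on this: nothing is reported at submit time and the job simply vanishes from squeue. sacct -j <jobid> should show it as FAILED.)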
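A small pre-submit check can catch this before the job is queued. The sketch below reads the --output directive out of a batch script and reports whether its directory exists. It handles only the --output=PATH spelling used in this lab (not -o or a space-separated form), and both function names are hypothetical.

```python
import re
from pathlib import Path


def output_dir_from_script(script_text):
    """Return the directory of the --output path in a batch script, or None."""
    match = re.search(r"^#SBATCH\s+--output=(\S+)", script_text, flags=re.MULTILINE)
    if match is None:
        return None
    return Path(match.group(1)).parent


def missing_output_dir(script_text, base="."):
    """Return the output directory if it does not exist yet, else None."""
    out_dir = output_dir_from_script(script_text)
    if out_dir is None:
        return None
    full = Path(base) / out_dir
    return None if full.is_dir() else full


example = "#!/bin/bash\n#SBATCH --output=logs/%x-%j.out   # log file\n"
print(output_dir_from_script(example))  # logs
```

In practice you would pass Path("myjob.slurm").read_text() and run mkdir -p on whatever missing_output_dir returns.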

❌ The job ran but python hello_slurm.py failed with ModuleNotFoundError: No module named 'numpy'

The mamba env isn’t being activated correctly. Check your script:

  1. Is the source ~/miniforge3/etc/profile.d/conda.sh line present?
  2. Is mamba activate eslab after it?
  3. Is eslab the env you actually want?

❌ Job dies with state OOM (out of memory)

You asked for less memory than the job actually needs. Bump --mem= (e.g. from 2G to 8G for this lab; Lab 9 will teach you how to measure rather than guess).

❌ Job dies with state TO (timed out)

You asked for less walltime than the job needs. Bump --time=. For this lab, --time=00:05:00 should be plenty.

squeue shows my job as PD for a long time

The cluster is busy. Read the NODELIST(REASON) column:

  • (Resources): waiting for someone else to finish so resources free up
  • (Priority): there are higher-priority jobs ahead of you
  • (QOSMaxJobsPerUserLimit): you have too many jobs in flight
  • (AssocGrpJobsLimit): your account hit a limit

Most often the answer is “be patient.” Sometimes the answer is “reduce your request so the job fits in more gaps” — that’s the lesson of Lab 9.

seff says “Job not found in db”

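The job is too recent (accounting hasn’t caught up yet, so wait a minute) or too old (Slurm purged it). For currently-running jobs, use sstat -j <jobid> instead; for a plain batch job like this one you may need to name the batch step explicitly, as in sstat -j <jobid>.batch.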


Time estimate

  • Reading: ~30 min
  • Tasks: ~45 min (including waiting for jobs to schedule)
  • Deliverables: ~10 min

Total: ~1.5 hours


Extensions (optional)

Add the diagnostic vital-signs block

Take the longer “vital signs” block from Handbook Best Practices §8 and add it to your myjob.slurm. This logs every Slurm env var at the top of every output file — invaluable when you have dozens of past jobs and need to figure out which one used which resources.

Try sacct for richer accounting info

sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS,ReqMem,CPUTime

This is what seff is built on top of. Useful for batch-querying many jobs at once.
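For batch-querying, sacct's --parsable2 option prints pipe-delimited rows that are easy to load into Python. The sketch below parses that shape. The sample text is hypothetical (made up to match the format, not captured from a real run), and parse_sacct is a hypothetical helper name.

```python
import csv
import io


def parse_sacct(text):
    """Turn pipe-delimited `sacct --parsable2` output into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return list(reader)


# Hypothetical sample, shaped like --parsable2 output for the fields above
sample = (
    "JobID|JobName|State|Elapsed|MaxRSS|ReqMem|CPUTime\n"
    "12345|lab08|COMPLETED|00:00:24||2G|00:00:24\n"
    "12345.batch|batch|COMPLETED|00:00:24|145624K||00:00:24\n"
)

rows = parse_sacct(sample)
for row in rows:
    print(row["JobID"], row["State"], row["MaxRSS"] or "-")
```

Note that memory figures such as MaxRSS typically appear on the step rows (here 12345.batch) rather than on the job's top-level row, so filter on JobID accordingly.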

Submit a job that depends on another

Submit job A, get its ID, then submit job B that won’t start until A finishes successfully:

sbatch myjob.slurm                 # job 12345
sbatch --dependency=afterok:12345 followup.slurm

Useful for multi-stage pipelines (preprocess → train → analyze).
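When you chain jobs from a script, copying the job ID by hand gets old quickly. The sketch below pulls the ID out of sbatch's confirmation line and builds the dependent submission as an argument list. It only constructs the command; passing it to subprocess.run on the cluster is left to you. Both function names are hypothetical.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's confirmation line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return int(match.group(1))


def dependent_submit_command(parent_job_id, script):
    """Build the argv for a job that starts only if the parent succeeds."""
    return ["sbatch", f"--dependency=afterok:{parent_job_id}", script]


job_id = parse_job_id("Submitted batch job 12345\n")
print(dependent_submit_command(job_id, "followup.slurm"))
```

sbatch also has a --parsable option that prints the job ID without the surrounding sentence, which can replace the regular expression here.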


What’s next?

With basic batch submission under your belt, Lab 09 — Right-sizing: the headline lab drives home the most important Slurm skill: asking for the resources you actually need.