Slurm Best Practices: right-sizing requests and measuring what you need
Introduction
The single most common mistake new HPC users make is asking for too much — too many CPUs, too much memory, too long a walltime. The intuition feels right (“better safe than sorry”), but on a shared cluster it backfires three ways:
- Your own job waits longer. Slurm’s backfill scheduler slots small, short jobs into gaps in the schedule, so a job asking for the whole node usually sits in the queue much longer than one asking for a fraction.
- You block other users. Every CPU and gigabyte you reserve is unavailable to the rest of the lab — even if your job is only using 5% of it.
- It teaches you nothing. You don’t learn what your code actually needs, so the over-requesting just gets worse over time.
This page is the antidote. It covers:
✔ How Slurm scheduling decides what runs when (intuition for backfill)
✔ A worked example: a single-threaded Python job that needs ~30 GB on a 96 GB node
✔ How to measure how much CPU and RAM your code actually uses (the hard part on HPC)
✔ Right-sizing CPUs for parallel vs serial code
✔ Balancing CPU/RAM/GPU for deep-learning training
✔ A drop-in diagnostics wrapper to put around every batch script
✔ Reading seff reports for post-mortem right-sizing
Prerequisites: Slurm Basics for the mechanics.
1. Why Over-Requesting Hurts You (Not Just Others)
Slurm doesn’t just “run jobs first-come-first-served.” It uses a backfill scheduler: when there’s a gap in the schedule (e.g. 4 hours and 32 GB free on a node before a big reservation kicks in), Slurm scans the queue looking for jobs that fit in that gap.
If your job asks for 24 hours and 96 GB, it doesn’t fit in many gaps. So it waits for a big window to open up.
If your job asks for 4 hours and 16 GB, it fits in most gaps. So it slips in much sooner.
Concrete example: two identical jobs that actually take 2 hours and use 8 GB. One is submitted with --time=24:00:00 --mem=64G, the other with --time=04:00:00 --mem=16G. On a busy cluster, the second one typically starts running hours before the first one — even though the work is identical.
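To make the gap-fitting intuition concrete, here is a toy sketch of the check a backfill pass performs. It is a deliberate simplification: the real scheduler also weighs priority, fairshare, and reservations, and the `Request` class, the `fits` function, and the gap size are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    hours: float   # requested walltime (--time)
    mem_gb: float  # requested memory (--mem)
    cpus: int      # requested cores (--cpus-per-task)


def fits(job: Request, gap: Request) -> bool:
    """True if the job's *requested* resources fit entirely inside the gap."""
    return (
        job.hours <= gap.hours
        and job.mem_gb <= gap.mem_gb
        and job.cpus <= gap.cpus
    )


# Hypothetical gap: 4 hours, 32 GB and 8 cores free before a reservation starts.
gap = Request(hours=4, mem_gb=32, cpus=8)

# The two submissions from the example above: same work, different requests.
over_asked = Request(hours=24, mem_gb=64, cpus=1)   # --time=24:00:00 --mem=64G
right_sized = Request(hours=4, mem_gb=16, cpus=1)   # --time=04:00:00 --mem=16G

print(fits(over_asked, gap))   # False
print(fits(right_sized, gap))  # True
```

Note that the check only ever sees what you requested, never what the job will really use, which is why the over-asked job is passed over even though it would have finished in 2 hours.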
The rule: ask for what you actually need, plus a small safety margin (~20–30%).
2. The Resource Axes You Control
Every batch script declares some combination of:
| Resource | Slurm flag | What it limits |
|---|---|---|
| Walltime | --time= | Max wall-clock time; job is killed when it expires |
| CPUs | --cpus-per-task= | Number of CPU cores your task can use |
| Memory | --mem= or --mem-per-cpu= | Maximum RAM; exceeding it → instant OOM kill |
| GPUs | --gres=gpu:N or --gpus=N | Number of GPUs |
| Nodes | --nodes= | (Multi-node MPI only; usually 1 for Python/ML) |
The first three are the ones you tune. GPUs are binary (you need one or you don’t).
3. Worked Example — A Single-Threaded scikit-learn Job
Suppose you have this script, fit_rf.py:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df = pd.read_parquet("/fs/project/<group>/data/big_features.parquet") # ~10 GB in memory
X, y = df.drop("label", axis=1), df["label"]
# n_jobs is left at its default (None, which scikit-learn treats as 1): single-threaded
clf = RandomForestClassifier(n_estimators=500, max_depth=20)
clf.fit(X, y)
import joblib
joblib.dump(clf, "model.joblib")
Suppose this code uses about 30 GB of peak RAM (mostly the loaded DataFrame plus the trained model) and runs single-threaded because n_jobs is left at its scikit-learn default (None, which is treated as 1).
A typical Unity compute node has 48 CPUs and 96 GB RAM. What should you request?
❌ The wrong way: “I’ll just grab the whole node”
#SBATCH --cpus-per-task=48
#SBATCH --mem=96G
#SBATCH --time=24:00:00
This:
- Blocks all 48 CPUs and 96 GB of RAM — preventing 47 other people’s small jobs from running on that node
- Sits in the queue much longer waiting for a full node to free up
- Provides zero benefit to your job, since your Python uses 1 CPU and 30 GB no matter what you “reserve”
✅ The right way: ask for what you need, plus headroom
#SBATCH --cpus-per-task=1 # the code is single-threaded
#SBATCH --mem=36G # 30 GB measured + ~20% headroom
#SBATCH --time=02:00:00 # measured wall time + safety
This:
- Fits in most queue gaps → starts running sooner
- Lets 47 other jobs run alongside yours on the same node
- Costs your lab less SU (service unit) charge
- Does exactly the same work at exactly the same speed
The catch: you have to measure to know that “30 GB” and “2 hours” are accurate. The next sections cover how.
4. How Much Memory Does My Code Actually Use?
On a Mac you’d open Activity Monitor and watch the number climb. On a headless HPC node you don’t have that — but there are five solid ways to find out.
4.1 ✔ Best for new code: /usr/bin/time -v
Wrap your command in GNU time (with the -v flag, not the bash builtin) to get a full resource report including peak memory:
/usr/bin/time -v python fit_rf.py
After the script finishes, you get:
Command being timed: "python fit_rf.py"
User time (seconds): 4823.12
System time (seconds): 87.45
Percent of CPU this job got: 98%
Elapsed (wall clock) time: 01:22:17
Maximum resident set size (kbytes): 31245312 ← THIS LINE
Major (requiring I/O) page faults: 23
...
The “Maximum resident set size” is the peak RAM the process held. Divide by 1024² to get GB: 31245312 / 1024 / 1024 ≈ 29.8 GB. Add ~20% headroom → request 36 GB.
(Note the explicit /usr/bin/time path — the bash builtin time is different and doesn’t have -v.)
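The conversion from the "Maximum resident set size" line to a --mem value is simple enough to script. The sketch below repeats the arithmetic above; the function name `mem_request_gb` and the 20% default are illustrative choices, not part of any Slurm tool.

```python
import math


def mem_request_gb(max_rss_kb: int, headroom: float = 0.20) -> int:
    """Turn GNU time's peak RSS (in kbytes) into a whole-GB --mem request."""
    peak_gb = max_rss_kb / 1024 / 1024           # kbytes -> GB
    return math.ceil(peak_gb * (1 + headroom))   # add headroom, round up


peak_kb = 31245312  # value from the report above
print(f"{peak_kb / 1024**2:.1f} GB peak")    # 29.8 GB peak
print(f"--mem={mem_request_gb(peak_kb)}G")   # --mem=36G
```

Rounding up rather than to the nearest GB is deliberate: an under-request ends in an OOM kill, while one extra gigabyte costs almost nothing.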
4.2 ✔ Best inside Python: psutil
For monitoring memory at specific moments (after data load, mid-loop, etc.):
import psutil, os
proc = psutil.Process(os.getpid())
# ... do work ...
print(f"RSS: {proc.memory_info().rss / 1e9:.2f} GB")
You can print this around the suspicious parts of your code. Especially useful for finding which step causes a memory spike.
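If psutil isn't installed in your environment, the standard library can report the peak (not current) RSS of the running process. This is a minimal sketch assuming a Unix-like system; the helper name `peak_rss_gb` is made up, and the units of `ru_maxrss` differ between Linux and macOS, as noted in the comments.

```python
import resource
import sys


def peak_rss_gb() -> float:
    """Peak resident set size of this process so far, in GB (1024**3 bytes)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    return peak * bytes_per_unit / 1024**3


squares = [i * i for i in range(1_000_000)]  # stand-in for real work
print(f"Peak RSS so far: {peak_rss_gb():.2f} GB")
```

Because this is a high-water mark, it never goes down; use the psutil snippet above when you want the current value at a given point in the script.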
4.3 ✔ Interactively: htop on an sinteractive node
If you’re developing on a compute node via sinteractive:
sinteractive -p <group> --cpus-per-task=4 --mem=64G --time=02:00:00
# ... once allocated:
htop -u $USER # interactive process viewer
Then in another terminal pane (use tmux or livenode), run your code and watch its RES column climb.
4.4 ✔ During a running batch job: sstat
While your sbatch job is running:
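sstat -j <jobid> --format=JobID,MaxRSS,AveRSS,MaxVMSize
Returns the live maximum resident set size for the job so far. If your script runs its commands directly rather than through srun, you may need to query the batch step explicitly with <jobid>.batch.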
4.5 ✔ Post-mortem on a finished job: seff
After the job ends:
seff 12345
Returns a one-page summary:
Job ID: 12345
Cluster: unity
User/Group: yourname.##/yourname.##
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 01:21:43
CPU Efficiency: 99.30% of 01:22:17 core-walltime
Job Wall-clock time: 01:22:17
Memory Utilized: 29.82 GB
Memory Efficiency: 31.06% of 96.00 GB
Two efficiency numbers to read:
- CPU Efficiency — close to 100% means you used all the cores you asked for. 25% means you used 1 of 4. Use fewer cores next time.
- Memory Efficiency — close to 100% (with a small margin) is ideal. 31% means you asked for 3× too much memory.
seff is the single most useful tool for right-sizing future jobs. Get in the habit of running seff <jobid> after every long job.
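Both efficiency figures are plain ratios, and recomputing them once by hand makes the report easier to read. The sketch below uses the numbers from the sample report; the function names are illustrative and this is not how seff itself is implemented.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized: str, wall_clock: str, cores: int) -> float:
    """CPU time actually used as a percentage of cores x wall-clock time."""
    return 100 * hms_to_seconds(cpu_utilized) / (cores * hms_to_seconds(wall_clock))


def memory_efficiency(used_gb: float, requested_gb: float) -> float:
    """Peak memory used as a percentage of the memory requested."""
    return 100 * used_gb / requested_gb


# Numbers from the sample report above.
print(f"CPU: {cpu_efficiency('01:21:43', '01:22:17', cores=1):.1f}%")  # about 99.3%
print(f"Mem: {memory_efficiency(29.82, 96.0):.2f}%")                  # 31.06%

# The same CPU time spread over 4 requested cores looks very different.
print(f"CPU on 4 cores: {cpu_efficiency('01:21:43', '01:22:17', cores=4):.1f}%")
```

The last line shows why a single-threaded job on --cpus-per-task=4 reports roughly 25%: the denominator grows with every core you reserve, whether or not your code touches it.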
4.6 What about sacct?
sacct is the raw Slurm accounting interface — more verbose but works for any job, finished or running:
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS,ReqMem,CPUTime,ExitCode
Useful when seff isn’t available or for batch-querying many jobs at once.
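For batch-querying, it helps to ask sacct for pipe-delimited output (the -P / --parsable2 flag) and parse it in Python. The sketch below works on a made-up sample string rather than calling sacct, so the job ID and values are purely illustrative; it also assumes MaxRSS carries a K/M/G/T suffix and raises an error otherwise.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def mem_to_gb(value: str) -> float:
    """Convert a suffixed Slurm memory string such as '31245312K' to GB."""
    value = value.strip()
    suffix = value[-1:].upper()
    if suffix not in UNITS:
        raise ValueError(f"unrecognised memory value: {value!r}")
    return float(value[:-1]) * UNITS[suffix] / 1024**3


def peak_rss_by_job(sacct_text: str) -> dict:
    """Map each job ID to the largest MaxRSS (in GB) over all of its steps."""
    lines = sacct_text.strip().splitlines()
    header = lines[0].split("|")
    peaks = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        if not row["MaxRSS"]:          # the job-level row usually has no MaxRSS
            continue
        job_id = row["JobID"].split(".")[0]   # '12345.batch' -> '12345'
        peaks[job_id] = max(peaks.get(job_id, 0.0), mem_to_gb(row["MaxRSS"]))
    return peaks


# Made-up sample in the shape of: sacct -P --format=JobID,State,Elapsed,MaxRSS
SAMPLE = """JobID|State|Elapsed|MaxRSS
12345|COMPLETED|01:22:17|
12345.batch|COMPLETED|01:22:17|31245312K
"""

for job_id, gb in peak_rss_by_job(SAMPLE).items():
    print(f"{job_id}: {gb:.1f} GB peak")   # 12345: 29.8 GB peak
```

In real use you would feed the function the captured output of the sacct command instead of `SAMPLE`.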
5. How Many CPUs Should I Request?
A confusing thing about Python on HPC: many libraries silently default to using every CPU they can see, which means a 1-CPU allocation can become a “let me try to use all 48 CPUs” mess that thrashes and underperforms.
The rule of thumb:
| Workload | What to ask for |
|---|---|
| Pure single-threaded Python (sklearn defaults, plain loops, most pandas operations) | --cpus-per-task=1 |
| NumPy/SciPy with BLAS (matrix multiplies, linear algebra) | --cpus-per-task=4–8 AND set OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK |
| sklearn with n_jobs=N (Random Forest, cross-validation, etc.) | --cpus-per-task=N with N matching what you’ll pass to sklearn |
| joblib.Parallel(n_jobs=N) | --cpus-per-task=N matching |
| PyTorch DataLoader with num_workers=N | Request N + 1 or N + 2 CPUs |
| MPI / multi-process (rare for ML) | --ntasks=N --cpus-per-task=1 |
Always tell your libraries about Slurm
Inside your batch script, before launching Python, export the thread-count env vars so libraries respect your allocation instead of grabbing everything:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export NUMEXPR_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
Without these, a NumPy @ operator may try to use all 48 cores on the node, even though you only have 4, and slow down dramatically due to thread thrashing while also being a bad citizen.
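The same caps can be applied from inside Python as a fallback for scripts that are sometimes launched without the batch wrapper. This is a sketch, not a replacement for the exports: it only helps if it runs before NumPy or any other BLAS-backed library is imported, and the helper name `cap_threads` is made up.

```python
import os

THREAD_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def cap_threads() -> int:
    """Cap library thread pools at the Slurm allocation (1 outside a job)."""
    n = os.environ.get("SLURM_CPUS_PER_TASK", "1")
    for var in THREAD_VARS:
        # setdefault keeps any value the batch script already exported.
        os.environ.setdefault(var, n)
    return int(n)


n_threads = cap_threads()
# import numpy as np   # import numerical libraries only after this point

# On Linux, the affinity mask shows how many cores this process may run on.
if hasattr(os, "sched_getaffinity"):
    usable = len(os.sched_getaffinity(0))
else:
    usable = os.cpu_count()
print(f"thread cap: {n_threads}, usable cores: {usable}")
```

The returned value is also a reasonable thing to pass as n_jobs to scikit-learn or joblib, so the Python side and the #SBATCH line cannot drift apart.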
6. How Much Walltime?
The strategy:
- Run a tiny version first. Subset your data to 1% or 10%, measure how long it takes.
- Multiply linearly (most ML training, batch processing, scientific simulation scales roughly linearly with input size).
- Add a 30–50% safety margin for the full run.
- Round up to a sensible chunk (1h, 4h, 12h, 24h).
If you genuinely don’t know and the job MUST finish, ask for more, but plan to use seff afterwards to right-size for next time. The Job Wall-clock time line in the seff report tells you what the run actually took.
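The four-step strategy above can be written down as a small calculator. It assumes runtime scales linearly with input size, which you should sanity-check by timing two different subset sizes; the function name, the 40% default margin, and the fallback of rounding up to whole days are illustrative choices.

```python
import math

CHUNKS_HOURS = (1, 4, 12, 24)  # the "sensible chunks" from the list above


def estimate_walltime_hours(sample_minutes: float,
                            sample_fraction: float,
                            margin: float = 0.40) -> int:
    """Scale a timed subset run up to the full data, add margin, round up."""
    full_hours = sample_minutes / sample_fraction / 60   # linear scaling
    padded = full_hours * (1 + margin)                    # safety margin
    for chunk in CHUNKS_HOURS:
        if padded <= chunk:
            return chunk
    return math.ceil(padded / 24) * 24                    # whole days beyond 24h


# A 10% subset that took 3 minutes: 30 min full run, 42 min padded.
print(estimate_walltime_hours(3, 0.10))   # 1
# A 1% subset that took 6 minutes: 10 h full run, 14 h padded.
print(estimate_walltime_hours(6, 0.01))   # 24
```

Anything that lands in the multi-day range is a signal to add checkpointing rather than to keep raising --time.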
For very-long-running things (multi-day training), use checkpointing so a wall-time-exceeded kill doesn’t lose progress. See CPU Templates §6.
7. GPU Jobs: Balancing CPU, RAM, and GPU
A GPU job has two memory budgets:
| Memory | What lives there | How to size |
|---|---|---|
| System RAM (--mem) | The Python interpreter, NumPy arrays, dataset on CPU, batches being prepared by DataLoader workers | Measure with seff, like CPU jobs |
| GPU memory | Your model weights, activations, gradients, current batch on GPU | Watch with nvidia-smi while training |
And you still need CPUs, primarily for:
- The PyTorch/TF DataLoader workers preparing batches in parallel (num_workers=N → request N+1 or N+2 CPUs)
- Any preprocessing that happens in Python before tensors are moved to GPU
Sensible defaults for a single-GPU deep-learning job
#SBATCH --gres=gpu:1 # one GPU
#SBATCH --cpus-per-task=8 # 4 dataloader workers + slack
#SBATCH --mem=48G # enough for typical image/audio batches
#SBATCH --time=12:00:00
Then set DataLoader appropriately in the Python:
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)
Tuning:
- GPU OOM (CUDA out of memory): reduce batch_size, switch to mixed precision (autocast/bfloat16), use gradient accumulation, or use a bigger GPU
- GPU underutilized (nvidia-smi shows <50% GPU util): usually a data-loading bottleneck; increase num_workers (and bump --cpus-per-task)
- System RAM OOM: reduce num_workers, or batch-prep less data on CPU
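Rather than hard-coding num_workers, you can derive it from the allocation so that bumping --cpus-per-task automatically gives the DataLoader more workers. A minimal sketch follows; the helper name `dataloader_workers` and the default of two reserved cores (matching the N+2 rule above) are assumptions, and the DataLoader call is shown only as a comment so the snippet runs without PyTorch.

```python
import os


def dataloader_workers(reserve: int = 2) -> int:
    """Workers that fit in the allocation, keeping `reserve` cores for the
    main training process and any other Python-side work."""
    cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
    # 0 is valid for PyTorch: batches are then loaded in the main process.
    return max(0, cpus - reserve)


workers = dataloader_workers()
print(f"num_workers={workers}")
# loader = DataLoader(dataset, batch_size=64,
#                     num_workers=workers, pin_memory=True)
```

With --cpus-per-task=8 this yields 6 workers; remember that each extra worker also holds batches in system RAM, so watch Memory Utilized in seff after raising it.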
Watching GPU usage while training
In a second SSH session to the same compute node:
nvidia-smi -l 1 # update every 1 second
# or, more compact:
watch -n 1 nvidia-smi
Look at:
- Memory-Usage column: how much GPU VRAM is in use (vs. total available on the card)
- Volatile GPU-Util column: % of GPU compute time being used. 90%+ is great. <50% = data-loading bottleneck
After training, seff <jobid> gives system-RAM and CPU efficiency. For GPU-side metrics during the run, you have to capture nvidia-smi output yourself (see Section 8 below).
8. The Diagnostics Wrapper Every Batch Script Should Have
The script below is a drop-in template. Replace the contents of the “ACTUAL WORK” block with your python invocation; everything else helps you debug and right-size.
#!/bin/bash
#SBATCH --job-name=fit_rf
#SBATCH --partition=batch
#SBATCH --time=02:00:00 # see best-practices §6
#SBATCH --cpus-per-task=1 # see best-practices §5
#SBATCH --mem=36G # see best-practices §4
#SBATCH --output=logs/%x-%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=yourname@osu.edu
set -euo pipefail # fail loudly on any error
# ───────────────────────────────────────────────────────────
# Tell numerical libraries how many threads they may use.
# Without this, NumPy/BLAS/MKL will try to grab every CPU
# visible on the node — even ones not allocated to you.
# ───────────────────────────────────────────────────────────
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export NUMEXPR_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
# ───────────────────────────────────────────────────────────
# Vital signs — recorded at the top of the log for every job.
# ───────────────────────────────────────────────────────────
echo "=================== JOB INFO ===================="
echo "Job ID: $SLURM_JOB_ID"
echo "Job name: $SLURM_JOB_NAME"
echo "Partition: $SLURM_JOB_PARTITION"
echo "Node: $(hostname)"
echo "CPUs: $SLURM_CPUS_PER_TASK"
echo "Memory (MB): ${SLURM_MEM_PER_NODE:-${SLURM_MEM_PER_CPU:-unset}}"
echo "GPUs: ${SLURM_GPUS:-${SLURM_JOB_GPUS:-none}}"
echo "Working dir: $(pwd)"
echo "Started: $(date)"
echo "================================================="
# ───────────────────────────────────────────────────────────
# Environment activation
# ───────────────────────────────────────────────────────────
source ~/miniforge3/etc/profile.d/conda.sh
mamba activate myproject
echo "Python: $(which python)"
python -c "import sys; print(f'Python version: {sys.version.split()[0]}')"
# Optional but very useful for ML: log GPU specs and current versions
if command -v nvidia-smi &> /dev/null; then
echo "--- nvidia-smi ---"
nvidia-smi
fi
# ───────────────────────────────────────────────────────────
# Optional: snapshot GPU usage every 30s in the background
# so you can review utilization after the job ends.
# ───────────────────────────────────────────────────────────
if command -v nvidia-smi &> /dev/null; then
(
while sleep 30; do
echo "--- $(date '+%H:%M:%S') ---"
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader
done
) > "logs/gpu-${SLURM_JOB_ID}.log" 2>&1 &
GPU_LOGGER_PID=$!
fi
# ───────────────────────────────────────────────────────────
# ACTUAL WORK — wrap in /usr/bin/time -v for a peak-memory readout
# ───────────────────────────────────────────────────────────
echo "===================== WORK ======================"
/usr/bin/time -v python fit_rf.py
echo "================================================="
# Stop the GPU logger if we started one
if [ -n "${GPU_LOGGER_PID:-}" ]; then
kill "$GPU_LOGGER_PID" 2>/dev/null || true
fi
echo "Finished: $(date)"
echo ""
echo "Run 'seff $SLURM_JOB_ID' after the job's epilog completes"
echo "for a one-page efficiency report (CPU% and Memory%)."
Why this matters:
- The vital-signs block tells you instantly which node + how much was requested for any past job — no detective work
- The thread-count exports stop libraries from oversubscribing the node
- /usr/bin/time -v captures peak memory so you can right-size next time
- The optional GPU-usage logger gives you per-30s GPU stats so you can spot dataloader bottlenecks
Save it as templates/diagnostic.slurm and copy it for every new project.
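Once a job has finished, the logs/gpu-<jobid>.log file written by the wrapper can be boiled down to a few numbers. The sketch below assumes the format produced by the logger above (a timestamp line followed by one "util %, used MiB, total MiB" line per GPU); the sample text and the function name `summarize_gpu_log` are made up for illustration.

```python
import re

# Matches lines such as "87 %, 9012 MiB, 16384 MiB".
ROW = re.compile(r"^\s*(\d+)\s*%,\s*(\d+)\s*MiB,\s*(\d+)\s*MiB\s*$")


def summarize_gpu_log(text: str):
    """Summarize utilization and memory over all samples; None if no data."""
    utils, mem_used = [], []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:                      # timestamp lines simply don't match
            utils.append(int(match.group(1)))
            mem_used.append(int(match.group(2)))
    if not utils:
        return None
    return {
        "samples": len(utils),
        "mean_util_pct": sum(utils) / len(utils),
        "share_below_50_pct": 100 * sum(u < 50 for u in utils) / len(utils),
        "peak_mem_mib": max(mem_used),
    }


# Made-up sample in the logger's format.
SAMPLE_LOG = """--- 12:00:30 ---
35 %, 8120 MiB, 16384 MiB
--- 12:01:00 ---
42 %, 8344 MiB, 16384 MiB
--- 12:01:30 ---
91 %, 9012 MiB, 16384 MiB
--- 12:02:00 ---
88 %, 9012 MiB, 16384 MiB
"""

print(summarize_gpu_log(SAMPLE_LOG))
```

A high share of samples below 50% points at the data-loading bottleneck described in Section 7; a peak memory figure far below the card's total suggests room for a larger batch size.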
9. The Post-Mortem Loop: Read seff, Adjust, Re-submit
After every meaningful job, run:
seff 12345
Look at the two efficiency numbers and adjust:
| What you see | What it means | What to change |
|---|---|---|
| CPU Efficiency 98% on --cpus-per-task=4 | All 4 cores fully used | Allocation looks right; don’t change |
| CPU Efficiency 25% on --cpus-per-task=4 | You used 1 of 4 cores | Drop to --cpus-per-task=1 next time |
| Memory Efficiency 31% on --mem=96G | Used ~30 GB of 96 | Drop to --mem=36G next time |
| Memory Efficiency 99% with OOM exit | You hit the cap | Bump --mem by 50% next time |
| Elapsed time 0:30 on --time=24:00:00 | Massive over-ask | Drop --time to 1–2h next time |
Iteration: first submission is a rough guess (always over-ask a bit if unsure); second submission uses the seff numbers to right-size; from then on you usually hit it right.
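The CPU and memory rows of the table translate into a couple of lines of arithmetic. The sketch below encodes those rules as written (20% headroom on measured memory, +50% after an OOM kill); the function name `next_request` and its parameters are illustrative, and the result is a starting point rather than a guarantee.

```python
import math


def next_request(cores: int, cpu_eff_pct: float,
                 mem_req_gb: float, mem_used_gb: float,
                 oom: bool = False, headroom: float = 0.20) -> dict:
    """Suggest --cpus-per-task and --mem for the next run from seff numbers."""
    if oom:
        # The job hit the cap, so the true peak is unknown: grow by 50%.
        mem_gb = math.ceil(mem_req_gb * 1.5)
    else:
        mem_gb = math.ceil(mem_used_gb * (1 + headroom))
    # Cores that were actually busy, never fewer than one.
    busy_cores = max(1, round(cores * cpu_eff_pct / 100))
    return {"cpus_per_task": busy_cores, "mem_gb": mem_gb}


# 25% CPU efficiency on 4 cores, 29.82 GB used of 96 GB requested.
print(next_request(4, 25, 96, 29.82))
# → {'cpus_per_task': 1, 'mem_gb': 36}

# An OOM kill at --mem=36G.
print(next_request(1, 99, 36, 35.9, oom=True))
# → {'cpus_per_task': 1, 'mem_gb': 54}
```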
10. Other Practical Tips
Use sinteractive for development, sbatch for production
Editing-and-running cycles belong in an interactive session (Persistent Sessions). Once code works, write a batch script and submit it unattended.
Save Slurm logs to a dedicated logs/ directory
mkdir -p logs once per project. Use --output=logs/%x-%j.out so logs don’t clutter the project root.
Use array jobs for “many similar tasks”
If you have 100 input files to process the same way, don’t submit 100 separate jobs. Use a job array — see CPU Templates §4.
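On the Python side, each array task only needs to map its index to one input. Slurm exposes that index as the SLURM_ARRAY_TASK_ID environment variable; the sketch below assumes a submission like --array=0-99 with zero-based indices, and the file names and the helper `pick_input` are hypothetical.

```python
import os


def pick_input(files: list, task_id=None) -> str:
    """Return the input assigned to this array task."""
    if task_id is None:
        # Falls back to 0 so the script also runs outside an array job.
        task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    # Sort so every task sees the same index -> file mapping.
    return sorted(files)[task_id]


# Hypothetical inputs; in practice something like
# [str(p) for p in Path("inputs").glob("*.parquet")].
inputs = ["part-002.parquet", "part-000.parquet", "part-001.parquet"]
print(f"This task processes: {pick_input(inputs)}")
```

Sorting matters because directory listing order is not guaranteed to be stable across nodes, and an unstable order would let two tasks pick the same file.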
Don’t run heavy work on login nodes
Login nodes are for editing, submitting, and light data inspection. Heavy python or R runs on the login node will be killed by sysadmins and annoy everyone. Use sinteractive or sbatch.
Email yourself on completion
Add to your script:
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=yourname@osu.edu
You’ll get an email when the job finishes (or fails). Great for long-running training jobs.
Checkpoint anything long
If a 24-hour job is killed at its walltime limit without a recent checkpoint, you’ve wasted a day. Save state every N epochs / iterations and write code that can resume from the latest checkpoint.
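The save-and-resume pattern is framework-independent. Below is a minimal sketch using only the standard library, with a JSON file standing in for real model state; the function names, the "epoch" key, and the save interval are all illustrative, and a real training loop would call its framework's own save/load routines in the same places.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file, then rename: a kill mid-write can't corrupt
    the previous checkpoint."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"epoch": 0}


def train(path: Path, total_epochs: int, save_every: int = 5) -> dict:
    state = load_checkpoint(path)               # resume where we left off
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of real work would go here ...
        state["epoch"] = epoch + 1
        if state["epoch"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)                # final state
    return state


with tempfile.TemporaryDirectory() as tmpdir:
    ckpt = Path(tmpdir) / "ckpt.json"
    print(train(ckpt, total_epochs=7))    # first run stops at epoch 7
    print(train(ckpt, total_epochs=10))   # second run resumes from epoch 7
```

Resubmitting the same batch script after a walltime kill then simply continues from the last saved epoch instead of starting over.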
11. Summary Checklist
Before submitting a job, ask yourself:
After it runs:
12. See Also
- Slurm Basics for #SBATCH syntax and the basic monitoring/cancel commands
- CPU Templates for concrete drop-in scripts for common CPU workloads
- GPU Templates for the same on the GPU side
- Python Environments §4.3 for the standard mamba-activation block inside batch scripts
- Persistent Sessions for sinteractive patterns during development