CPU Job Templates
Introduction
This page collects ready-to-adapt Slurm scripts for the most common CPU-only workloads on Unity. Each template includes the diagnostic wrapper from Best Practices §8 so you can right-size after the first run.
Replace placeholders:
- <group>: your Slurm partition (often batch; see Shell Environment §4)
- yourname@osu.edu: your email for completion notifications
- myproject: your mamba env name
- ~/miniforge3/...: your mamba install path (might be ~/mambaforge/ on older setups)
And always:
- mkdir -p logs before submitting (so --output=logs/... can write)
- Run on a tiny test first; let seff <jobid> tell you what to tighten for the real run
Templates on this page:
- Single-threaded Python (the scikit-learn / pandas / plain-Python case)
- Multi-threaded NumPy/BLAS (linear algebra, FFTs)
- joblib.Parallel / sklearn(n_jobs=N) (embarrassingly parallel within one job)
- Job arrays (many independent tasks, e.g. processing 100 files)
- Long-running with checkpoints (multi-day jobs that survive wall-time kills)
- MPI (for the few who need it)
1. Single-Threaded Python
The most common case: plain Python, pandas, sklearn with defaults, scientific scripts that don’t internally parallelize.
#!/bin/bash
#SBATCH --job-name=fit_rf
#SBATCH --partition=<group>
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=1 # single-threaded
#SBATCH --mem=36G # measured ~30 GB + headroom
#SBATCH --output=logs/%x-%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=yourname@osu.edu
set -euo pipefail
# Constrain numerical libraries to 1 thread (matches --cpus-per-task)
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
# ─── Vital signs ───────────────────────────────────────────
echo "Job: $SLURM_JOB_ID ($SLURM_JOB_NAME) on $(hostname)"
echo "CPUs: $SLURM_CPUS_PER_TASK | Mem (MB): ${SLURM_MEM_PER_NODE:-?}"
echo "Start: $(date)"
# ─── Environment ───────────────────────────────────────────
source ~/miniforge3/etc/profile.d/conda.sh
mamba activate myproject
# ─── Work ─────────────────────────────────────────────────
/usr/bin/time -v python fit_rf.py
echo "End: $(date)"
echo "Run 'seff $SLURM_JOB_ID' for an efficiency report."
Tips:
- ✔ OMP_NUM_THREADS=1 is critical here. Without it, NumPy operations called by sklearn can spawn 48 threads on a 48-core node despite your single-CPU allocation, causing thread-thrashing slowdowns.
- ✔ /usr/bin/time -v gives you peak memory; use the result to tighten --mem next time.
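To catch a missing or oversized thread cap before the real work starts, a small check can be run at the top of the Python script. This is a minimal sketch, not part of the template: the function name check_thread_caps is hypothetical, and it only inspects the four environment variables exported in the script above.

```python
import os

# The four thread-count variables exported in the batch script.
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def check_thread_caps(env=None):
    """Return warnings for thread variables that are unset or exceed the Slurm CPU allocation.

    Hypothetical helper: `env` defaults to os.environ and can be replaced by a plain dict.
    """
    env = os.environ if env is None else env
    cpus = int(env.get("SLURM_CPUS_PER_TASK", "1"))
    warnings = []
    for name in THREAD_VARS:
        value = env.get(name)
        if value is None:
            warnings.append(f"{name} is unset (the library may pick its own thread count)")
        elif int(value) > cpus:
            warnings.append(f"{name}={value} exceeds the {cpus} allocated CPU(s)")
    return warnings


if __name__ == "__main__":
    for line in check_thread_caps():
        print("WARNING:", line)
```

An empty list means every cap is set and fits inside the allocation.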
2. Multi-Threaded NumPy / BLAS
For code dominated by large matrix operations (linear algebra, FFTs, eigendecompositions), NumPy / SciPy can scale to multiple threads via the BLAS backend.
#!/bin/bash
#SBATCH --job-name=svd_big
#SBATCH --partition=<group>
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=8 # BLAS will use 8 threads
#SBATCH --mem=64G
#SBATCH --output=logs/%x-%j.out
set -euo pipefail
# Tell BLAS / OpenMP / MKL how many threads to use — match SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OPENBLAS_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NUMEXPR_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo "Job: $SLURM_JOB_ID on $(hostname)"
echo "CPUs: $SLURM_CPUS_PER_TASK (threads exported to libraries)"
echo "Start: $(date)"
source ~/miniforge3/etc/profile.d/conda.sh
mamba activate myproject
/usr/bin/time -v python linalg_heavy.py
echo "End: $(date)"
Tuning:
- Run the same input with 1, 2, 4, 8, and 16 threads and compare elapsed time and seff CPU efficiency to find the point of diminishing returns
- BLAS scaling is sub-linear past ~8 threads on most operations; going to 48 rarely helps
- If you don’t see speedup, the bottleneck isn’t matrix math; check with profiling
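The thread sweep above can be turned into numbers with a few lines of Python. The sketch below assumes you have recorded the wall time of each run by hand; the function names, the example timings, and the 70% efficiency cut-off are all illustrative assumptions, not measurements from Unity.

```python
def scaling_table(timings):
    """timings maps thread count -> wall seconds and must include a 1-thread run.

    Returns (threads, speedup, efficiency) rows, where efficiency = speedup / threads.
    """
    t1 = timings[1]
    rows = []
    for n in sorted(timings):
        speedup = t1 / timings[n]
        rows.append((n, speedup, speedup / n))
    return rows


def largest_efficient(timings, min_efficiency=0.7):
    """Largest thread count whose parallel efficiency stays at or above the cut-off."""
    best = 1
    for n, _, efficiency in scaling_table(timings):
        if efficiency >= min_efficiency:
            best = n
    return best


if __name__ == "__main__":
    # Hypothetical wall times in seconds for 1, 2, 4, 8, 16 threads.
    example = {1: 800.0, 2: 420.0, 4: 230.0, 8: 140.0, 16: 120.0}
    for n, speedup, efficiency in scaling_table(example):
        print(f"{n:>2} threads: speedup {speedup:5.2f}, efficiency {efficiency:5.1%}")
    print("Suggested --cpus-per-task:", largest_efficient(example))
```

With these made-up timings the 16-thread run is barely faster than the 8-thread run, so the helper suggests requesting 8 CPUs.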
3. joblib.Parallel or sklearn n_jobs=N
For embarrassingly parallel work — fitting many small models, cross-validation folds, hyperparameter search — sklearn and joblib parallelize across CPUs in one Python process.
#!/bin/bash
#SBATCH --job-name=cv_search
#SBATCH --partition=<group>
#SBATCH --time=06:00:00
#SBATCH --cpus-per-task=16 # n_jobs will match this
#SBATCH --mem=64G
#SBATCH --output=logs/%x-%j.out
set -euo pipefail
# BLAS threads per worker should NOT also be 16: that would give 16×16 = 256-way
# nested parallelism on a 48-core node. Force inner-loop BLAS to a single thread.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
echo "Job: $SLURM_JOB_ID CPUs: $SLURM_CPUS_PER_TASK on $(hostname)"
echo "Start: $(date)"
source ~/miniforge3/etc/profile.d/conda.sh
mamba activate myproject
/usr/bin/time -v python cv_search.py
echo "End: $(date)"
Inside cv_search.py:
import os
N = int(os.environ.get("SLURM_CPUS_PER_TASK", 1)) # respect Slurm allocation
from sklearn.model_selection import GridSearchCV
search = GridSearchCV(estimator, param_grid, cv=5, n_jobs=N)
search.fit(X, y)
Key insight: when outer parallelism (joblib / n_jobs) uses N workers, the inner BLAS threading should be 1; otherwise you get nested parallelism that thrashes. The OMP_NUM_THREADS=1 lines above enforce this.
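The arithmetic behind that rule is easy to check. The sketch below uses two hypothetical helpers (plan_threads and total_threads are illustrative names, not joblib or sklearn functions) to show how many threads a configuration creates and how an allocation could be split so the product never exceeds it.

```python
def total_threads(n_workers, threads_per_worker):
    """Threads running at once when every worker uses its own BLAS thread pool."""
    return n_workers * threads_per_worker


def plan_threads(allocated_cpus, n_workers):
    """Split an allocation into (workers, threads_per_worker) without oversubscribing.

    Workers are capped at the allocation; leftover CPUs are shared evenly as inner threads.
    """
    workers = max(1, min(n_workers, allocated_cpus))
    threads_per_worker = max(1, allocated_cpus // workers)
    return workers, threads_per_worker


if __name__ == "__main__":
    print(total_threads(16, 16))   # the nested case the template avoids
    print(plan_threads(16, 16))    # one BLAS thread per worker
    print(plan_threads(16, 4))     # fewer workers leave room for inner threads
```

The second value returned by plan_threads is what you would export as OMP_NUM_THREADS if you chose fewer workers than allocated CPUs.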
4. Job Arrays — Many Similar Tasks
If you need to run the same script on 100 different inputs, don’t submit 100 separate jobs. Use a job array: one submission that Slurm fans out into many tasks, each with its own $SLURM_ARRAY_TASK_ID.
#!/bin/bash
#SBATCH --job-name=process_files
#SBATCH --partition=<group>
#SBATCH --time=00:30:00 # per task, not total
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --output=logs/%x-%A_%a.out # %A = array master, %a = task ID
#SBATCH --array=0-99 # 100 tasks, IDs 0–99
##SBATCH --array=0-99%20 # alternative (use instead of the line above): cap concurrent tasks at 20
set -euo pipefail
echo "Array master: $SLURM_ARRAY_JOB_ID"
echo "This task: $SLURM_ARRAY_TASK_ID / $SLURM_ARRAY_TASK_COUNT"
echo "Node: $(hostname)"
source ~/miniforge3/etc/profile.d/conda.sh
mamba activate myproject
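# Map the array index to an input file (glob results come back sorted)
INPUT_DIR="/fs/project/<group>/raw"
FILES=("$INPUT_DIR"/*.csv)
# Stop early if this task ID has no matching file
if (( SLURM_ARRAY_TASK_ID >= ${#FILES[@]} )); then
echo "No input file for task $SLURM_ARRAY_TASK_ID" >&2
exit 1
fi
INPUT_FILE="${FILES[$SLURM_ARRAY_TASK_ID]}"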
echo "Processing: $INPUT_FILE"
/usr/bin/time -v python process_one.py "$INPUT_FILE"
echo "Done at $(date)"
Notes:
- --array=0-99 runs 100 tasks. Use --array=0-99%20 to cap concurrent tasks at 20; this matters if each task uses non-trivial memory and you don’t want to hog the cluster.
- Inside the script, $SLURM_ARRAY_TASK_ID selects which input to process.
- --output=logs/%x-%A_%a.out gives each task its own log file, e.g. logs/process_files-12345_42.out.
- Monitor with squeue -u $USER; each task appears as 12345_42, 12345_43, etc.
- Cancel the whole array with scancel 12345; cancel one task with scancel 12345_42.
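The index-to-file mapping can also live in Python, which makes the out-of-range case explicit. This is a minimal sketch under assumptions: pick_input is a hypothetical helper, and it takes the input directory as the first command-line argument instead of a file path.

```python
import os
import sys
from pathlib import Path


def pick_input(input_dir, task_id, pattern="*.csv"):
    """Return the file for this array task, using a sorted listing so every task agrees on the order."""
    files = sorted(Path(input_dir).glob(pattern))
    if not 0 <= task_id < len(files):
        raise IndexError(
            f"task {task_id} has no input: found {len(files)} file(s) in {input_dir}"
        )
    return files[task_id]


if __name__ == "__main__":
    # SLURM_ARRAY_TASK_ID is set by Slurm for each array task; default to 0 for local tests.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print("Processing:", pick_input(sys.argv[1], task_id))
```

If the array range is larger than the number of files, the extra tasks fail with a clear message instead of processing nothing.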
5. Long-Running Job With Checkpoints
For multi-day work where a wall-time kill or node reboot would be catastrophic, always checkpoint.
#!/bin/bash
#SBATCH --job-name=long_sim
#SBATCH --partition=<group>
#SBATCH --time=24:00:00 # max walltime per submission
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --output=logs/%x-%j.out
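#SBATCH --signal=USR1@300 # send SIGUSR1 about 5 min before --time expires
# Note: by default Slurm delivers this signal to job steps (commands started with
# srun), not to the batch shell. If the handler never fires, launch Python via srun.
#SBATCH --requeue # allow Slurm to requeue the job if it is pre-empted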
set -euo pipefail
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo "Job: $SLURM_JOB_ID on $(hostname) at $(date)"
echo "Restart count: ${SLURM_RESTART_COUNT:-0}"
source ~/miniforge3/etc/profile.d/conda.sh
mamba activate myproject
# Pass the checkpoint directory and resume flag to Python.
# The script should look for the latest checkpoint and resume from it.
CHECKPOINT_DIR="/fs/project/<group>/<username>/checkpoints/$SLURM_JOB_NAME"
mkdir -p "$CHECKPOINT_DIR"
/usr/bin/time -v python long_sim.py \
--checkpoint-dir "$CHECKPOINT_DIR" \
--checkpoint-every 1000
echo "Finished at $(date)"
Inside long_sim.py:
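import argparse, signal, sys
from pathlib import Path
# save_checkpoint, load_checkpoint, initial_state and step are your own functions.
# Read the flags passed by the batch script above
parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint-dir", type=Path, default=Path("ckpt"))
parser.add_argument("--checkpoint-every", type=int, default=1000)
args = parser.parse_args()
ckpt_dir = args.checkpoint_dir
ckpt_dir.mkdir(exist_ok=True, parents=True)
# Resume from latest checkpoint if any
latest = max(ckpt_dir.glob("step_*.pt"), default=None, key=lambda p: int(p.stem.split("_")[1]))
if latest:
    state = load_checkpoint(latest)
    print(f"Resumed from {latest}")
else:
    state = initial_state()
# Handle SIGUSR1 (sent by Slurm 5 min before wall-time): save and exit cleanly.
# Registered only after state exists, so the handler never sees an undefined name.
def save_and_exit(signum, frame):
    save_checkpoint(state, ckpt_dir / f"step_{state.step}.pt")
    print(f"Got SIGUSR1; saved checkpoint at step {state.step}. Exiting.")
    sys.exit(0)
signal.signal(signal.SIGUSR1, save_and_exit)
while state.step < state.max_steps:
    state = step(state)
    if state.step % args.checkpoint_every == 0:
        save_checkpoint(state, ckpt_dir / f"step_{state.step}.pt")
With --requeue set, Slurm puts the job back in the queue automatically if it is pre-empted or its node fails. A job that stops at its wall-time limit is not requeued, so resubmit it with sbatch. Either way, your script loads the latest checkpoint and picks up from there.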
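The snippet above calls save_checkpoint and load_checkpoint without defining them. One possible shape is sketched below, assuming the state can be pickled; the .pt suffix suggests PyTorch, where torch.save and torch.load would be the natural swap. The point of the sketch is the write-then-rename pattern, so a job killed mid-write never leaves a truncated file matching step_*.pt.

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(state, path):
    """Write the checkpoint to a temp file in the same directory, then rename it into place."""
    path = Path(path)
    # The temp name ends in .tmp, so the step_*.pt glob used for resuming never picks it up.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_checkpoint(path):
    """Read back whatever save_checkpoint stored."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Because the rename happens only after the data is fully written, the "latest checkpoint" found on restart is always a complete one.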
6. MPI (multi-process, possibly multi-node)
Most Python workloads don’t need MPI — joblib / multiprocessing handles the embarrassingly-parallel cases (Section 3) and threading handles the linear algebra (Section 2). But for tightly-coupled scientific computing or large simulations, MPI shines.
#!/bin/bash
#SBATCH --job-name=mpi_run
#SBATCH --partition=<group>
#SBATCH --time=02:00:00
#SBATCH --nodes=2 # spread across 2 nodes
#SBATCH --ntasks=96 # 96 total MPI ranks
#SBATCH --ntasks-per-node=48 # 48 per node
#SBATCH --cpus-per-task=1 # 1 CPU per rank
#SBATCH --mem=0 # 0 = give me all RAM on the node
#SBATCH --output=logs/%x-%j.out
set -euo pipefail
echo "Job: $SLURM_JOB_ID"
echo "Nodes: $SLURM_NNODES"
echo "Tasks: $SLURM_NTASKS"
echo "Start: $(date)"
module load intel/2024.2.0 # or your favorite MPI-enabled compiler
# (or activate a mamba env that contains mpi4py + openmpi/mpich)
srun ./my_mpi_program # `srun` is MPI-aware on Slurm clusters
echo "End: $(date)"
Notes:
- Use srun, not mpirun, on a Slurm cluster: it picks up the allocation automatically
- --mem=0 is a Slurm idiom for “all the memory on each allocated node”
- For Python MPI (mpi4py), install it via mamba (mamba install mpi4py openmpi) so it’s linked against a Slurm-aware MPI
If you don’t know whether you need MPI, you don’t.
7. After the Job: Right-Sizing
For every template above, after the first successful run:
seff <jobid>
Look at CPU Efficiency and Memory Efficiency. If either is much below 80–90%, your next submission should request less. See Best Practices §9.
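Turning a measured peak into the next --mem request is simple arithmetic, sketched below. Assumptions: suggest_mem_gb is a hypothetical helper, the input is a peak-memory string with a K, M, G, or T suffix (the form sacct typically prints for MaxRSS), and the 20% headroom mirrors the "~30 GB + headroom" comment behind --mem=36G in the first template.

```python
import math

# Multipliers that convert a suffixed value into gigabytes (binary units).
_TO_GB = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def suggest_mem_gb(max_rss, headroom=0.2):
    """Convert a peak-memory string such as '31457280K' into a whole-GB --mem request."""
    value = float(max_rss[:-1])
    unit = max_rss[-1].upper()
    peak_gb = value * _TO_GB[unit]
    # Round first so floating-point noise cannot push an exact value up by a whole GB.
    return math.ceil(round(peak_gb * (1 + headroom), 6))


if __name__ == "__main__":
    peak = "31457280K"  # hypothetical peak: 30 GB expressed in KB
    print(f"#SBATCH --mem={suggest_mem_gb(peak)}G")
```

A larger headroom value is a reasonable choice when memory use varies a lot between inputs.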
For arrays, seff only shows one task; use sacct for the whole array:
sacct -j 12345 --format=JobID,State,Elapsed,MaxRSS,CPUTime
8. See Also
- Slurm Basics: #SBATCH directive reference, monitoring, cancellation
- Slurm Best Practices: the right-sizing methodology these templates lean on
- GPU Templates: same idea, for GPU workloads
- Python Environments: for the mamba activate setup inside batch scripts