Lab 11 — Job arrays for many independent tasks
Goal
Process many similar inputs in parallel through a single Slurm submission, using a job array — instead of writing a bash loop that submits 20 separate jobs (which is bad-citizen behavior on the cluster, and a hassle to debug).
By the end of this lab you’ll have processed 20 input files in parallel from one sbatch call, with a sensible concurrent-tasks cap so you don’t monopolize a queue.
Reading
- Handbook: CPU Templates §4 — Job Arrays — Many Similar Tasks — the template and explanation.
Budget ~15 minutes for the reading.
Learning objectives
- Write a Slurm array job using --array=A-B and the $SLURM_ARRAY_TASK_ID variable.
- Use the %A_%a output-file naming so each task gets its own log.
- Cap the number of concurrent tasks with --array=A-B%K and explain why this matters.
- Aggregate results across many array tasks into one summary.
Setup / prerequisites
- Labs 01–10 complete. In particular, you should have the diagnostic.slurm template from Lab 10; we’ll start from that.
Tasks
1. Set up the lab directory and create input data (10 min)
cd ~/hpc_practicum
mkdir -p lab11/{inputs,outputs,logs}
cd lab11
Create 20 fake input CSV files to process. Save as make_inputs.py:
"""Generate 20 fake input CSVs of varying size."""
import os
import numpy as np
import pandas as pd
os.makedirs("inputs", exist_ok=True)
rng = np.random.default_rng(42)
for i in range(20):
    n_rows = rng.integers(low=5_000, high=50_000)
    df = pd.DataFrame({
        "x": rng.normal(loc=i, scale=2.0, size=n_rows),
        "y": rng.normal(loc=0, scale=1.0, size=n_rows),
        "category": rng.choice(["A", "B", "C"], size=n_rows),
    })
    df.to_csv(f"inputs/input_{i:02d}.csv", index=False)
    print(f"input_{i:02d}.csv: {n_rows} rows")
Run it:
mamba activate eslab
python make_inputs.py
ls inputs/ | head
You should see input_00.csv through input_09.csv (head shows the first 10 of the 20 files, input_00.csv through input_19.csv).
2. Write the per-file processing script (10 min)
Save as process_one.py:
"""
process_one.py — process one input file. Run as: python process_one.py inputs/input_XX.csv
"""
import sys, os, time
import pandas as pd
import numpy as np
def main(input_path):
    t0 = time.time()
    name = os.path.basename(input_path).replace(".csv", "")
    print(f"[{name}] starting at {time.strftime('%H:%M:%S')}")
    df = pd.read_csv(input_path)
    print(f"[{name}] loaded {len(df)} rows")
    summary = df.groupby("category").agg(
        x_mean=("x", "mean"),
        x_std=("x", "std"),
        y_mean=("y", "mean"),
        y_std=("y", "std"),
        n=("x", "size"),
    ).reset_index()
    out_path = f"outputs/{name}_summary.csv"
    summary.to_csv(out_path, index=False)
    print(f"[{name}] wrote {out_path}")
    print(f"[{name}] finished in {time.time()-t0:.2f}s")
if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("Usage: python process_one.py <input.csv>")
    main(sys.argv[1])
Test it on one file before scaling up:
python process_one.py inputs/input_00.csv
ls outputs/
You should see outputs/input_00_summary.csv with 3 rows (one per category).
3. Write the Slurm array script (15 min)
Save as process_array.slurm:
#!/bin/bash
#SBATCH --job-name=lab11_array
#SBATCH --partition=batch
#SBATCH --time=00:10:00 # per task
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --output=logs/%x-%A_%a.out # %A=array master, %a=task index
#SBATCH --array=0-19%5 # 20 tasks (IDs 0..19), max 5 running at once
set -euo pipefail
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo "=== ARRAY TASK INFO ==="
echo "Array master: $SLURM_ARRAY_JOB_ID"
echo "This task: $SLURM_ARRAY_TASK_ID of $SLURM_ARRAY_TASK_COUNT"
echo "Node: $(hostname)"
echo "Start: $(date)"
echo "======================="
source ~/miniforge3/etc/profile.d/conda.sh
mamba activate eslab
# Map the array index to one of our 20 input files.
# `printf` zero-pads to two digits (00, 01, ..., 19).
INPUT_FILE=$(printf "inputs/input_%02d.csv" $SLURM_ARRAY_TASK_ID)
echo "Processing: $INPUT_FILE"
/usr/bin/time -v python process_one.py "$INPUT_FILE"
echo "End: $(date)"
A few things worth understanding before you submit:
- --array=0-19 runs 20 tasks total, with $SLURM_ARRAY_TASK_ID set to 0, 1, 2, …, 19 inside each.
- %5 caps concurrency at 5: at any moment, at most 5 of the 20 tasks are running and the rest stay pending. Without the cap, Slurm may try to run all 20 simultaneously, which:
  - Is rude on a busy cluster (you grab 20 slots; everyone else waits)
  - Can hit per-user job limits and fail
  - Doesn’t actually speed up your work much if each task only uses 1 CPU anyway
- --output=logs/%x-%A_%a.out produces one log file per task: logs/lab11_array-12345_0.out, _1, …, _19. Without %a in the pattern, all tasks would write to the same log file.
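The mapping from an array spec to per-task inputs and logs can be sketched in plain Python. This is only an illustration of the naming logic, not something Slurm runs: parse_array_spec, input_for and log_for are hypothetical helper names, the parser handles only the simple A-B and A-B%K forms used in this lab (not comma lists or step sizes), and 12345 is a placeholder job ID.

```python
from typing import List, Optional, Tuple


def parse_array_spec(spec: str) -> Tuple[List[int], Optional[int]]:
    """Split a simple 'A-B' or 'A-B%K' spec into task IDs and the concurrency cap."""
    range_part, _, cap_part = spec.partition("%")
    first, _, last = range_part.partition("-")
    start = int(first)
    stop = int(last) if last else start
    cap = int(cap_part) if cap_part else None
    return list(range(start, stop + 1)), cap


def input_for(task_id: int) -> str:
    # Same zero-padding as the printf call in process_array.slurm.
    return f"inputs/input_{task_id:02d}.csv"


def log_for(job_name: str, array_job_id: int, task_id: int) -> str:
    # Mirrors --output=logs/%x-%A_%a.out (%x=job name, %A=array job ID, %a=task index).
    return f"logs/{job_name}-{array_job_id}_{task_id}.out"


task_ids, cap = parse_array_spec("0-19%5")
print(len(task_ids), "tasks, cap", cap)
for tid in (0, 7, 19):
    print(tid, input_for(tid), log_for("lab11_array", 12345, tid))
```

Running it before you submit is a cheap way to confirm that every task index lands on a file that actually exists in inputs/.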
4. Submit and monitor (15 min)
sbatch process_array.slurm
# Submitted batch job 12345
Watch the queue:
squeue -u $USER
You should see entries like:
JOBID PARTITION NAME USER ST TIME ...
12345_[5-19] batch lab11_array yourname.## PD 0:00
12345_0 batch lab11_array yourname.## R 0:23
12345_1 batch lab11_array yourname.## R 0:23
12345_2 batch lab11_array yourname.## R 0:23
12345_3 batch lab11_array yourname.## R 0:23
12345_4 batch lab11_array yourname.## R 0:23
5 running, 15 pending. As each running task finishes, a pending one takes its slot, so up to 5 stay in the R state until the queue drains. Wait until all 20 are done.
Check the output:
ls outputs/ # should see 20 *_summary.csv files
ls logs/ # should see 20 *.out files
5. Aggregate the results (5 min)
Save as aggregate.py:
"""Concatenate all per-input summaries into one CSV."""
import glob, pandas as pd
paths = sorted(glob.glob("outputs/input_*_summary.csv"))
print(f"Found {len(paths)} summary files")
dfs = []
for p in paths:
    df = pd.read_csv(p)
    df["source_file"] = p.replace("outputs/", "").replace("_summary.csv", "")
    dfs.append(df)
combined = pd.concat(dfs, ignore_index=True)
combined.to_csv("outputs/combined_summary.csv", index=False)
print(f"Wrote outputs/combined_summary.csv with {len(combined)} rows")
Run it:
python aggregate.py
head outputs/combined_summary.csv
6. Inspect efficiency for one task (3 min)
seff works on individual array tasks too:
seff 12345_0
This should show CPU Efficiency near 100% (since each task is single-threaded and you asked for 1 CPU) and a Memory Efficiency that depends on how much memory your input file needed.
For the whole array:
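sacct -j 12345 --format=JobID,State,Elapsed,MaxRSS,CPUTime | head
This lists the array tasks along with their job steps (such as .batch, where MaxRSS is reported), which is useful for spotting outliers. Drop the | head to see all 20 tasks.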
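Spotting outliers by eye gets harder as the array grows, so here is a minimal sketch of one way to flag slow tasks from their Elapsed values. It assumes you have already collected the elapsed strings per task; the helper names, the 2x-median threshold and the timings in the example are all made up for illustration and are not real sacct output.

```python
from statistics import median


def elapsed_seconds(text: str) -> int:
    """Convert a Slurm elapsed string ([DD-][HH:]MM:SS) to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 3:  # pad MM:SS up to HH:MM:SS
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def slow_tasks(elapsed_by_task: dict, factor: float = 2.0) -> list:
    """Return task IDs whose elapsed time exceeds factor x the median."""
    secs = {task: elapsed_seconds(e) for task, e in elapsed_by_task.items()}
    cutoff = factor * median(secs.values())
    return sorted(task for task, s in secs.items() if s > cutoff)


# Made-up timings for illustration only; not real sacct output.
example = {
    "12345_0": "00:00:21",
    "12345_1": "00:00:24",
    "12345_2": "00:01:30",
    "12345_3": "00:00:19",
}
print(slow_tasks(example))
```

A task flagged this way is a prompt to open its log file, not proof of a problem: in this lab the inputs vary in size, so some spread is expected.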
Deliverables
Save to lab11/ in your personal repo:
- lab11/make_inputs.py: the input generator.
- lab11/process_one.py: the per-file script.
- lab11/process_array.slurm: the array submission script.
- lab11/aggregate.py: the post-processing aggregator.
- lab11/output_listing.txt: ls -la outputs/ showing all 20 summary files + the combined CSV.
- lab11/seff_one_task.txt: seff 12345_0 output.
- lab11/reflection.md: 5–7 sentences:
  - Why is %5 (the concurrency cap) important, and what would go wrong without it?
  - How is this approach more cluster-friendly than a bash loop submitting 20 separate jobs?
  - For a real-world version of this (say, processing 1000 satellite images), what would change?
Self-check
Common issues
❌ Some output files are missing
Check the corresponding log file (logs/lab11_array-12345_<index>.out). Most likely the input file didn’t exist for that index (check your zero-padding), or the Python script crashed. Fix and resubmit just that task with:
sbatch --array=<missing_index> process_array.slurm
❌ All 20 tasks ran simultaneously (no concurrency cap honored)
Check the --array=0-19%5 line — make sure the %5 is there. The cap is a per-array setting; without it you can have all tasks running at once.
❌ “Maximum array size exceeded”
Clusters often cap the array size (e.g. --array=0-9999 may be too many). Run scontrol show config | grep -i array to see your cluster’s MaxArraySize. For >1000 tasks, use multiple smaller submissions or move to a different parallelism strategy.
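One such alternative strategy is to keep the array small and have each task process a batch of files. The sketch below assumes a hypothetical workload of 1000 images and a 100-task array; files_for_task and the images/img_NNNN.tif names are invented for the example. In a real job, task_id and n_tasks would come from $SLURM_ARRAY_TASK_ID and $SLURM_ARRAY_TASK_COUNT.

```python
def files_for_task(task_id: int, all_files: list, n_tasks: int) -> list:
    """Return the subset of all_files that array task task_id should process."""
    # Round-robin split: task k takes files k, k+n_tasks, k+2*n_tasks, ...
    return all_files[task_id::n_tasks]


# Hypothetical workload: 1000 images spread over a 100-task array.
images = [f"images/img_{i:04d}.tif" for i in range(1000)]
n_tasks = 100

mine = files_for_task(7, images, n_tasks)
print(len(mine), mine[:3])

# Sanity check: every file is assigned to exactly one task.
assigned = [f for t in range(n_tasks) for f in files_for_task(t, images, n_tasks)]
assert sorted(assigned) == sorted(images)
```

The round-robin split keeps batch sizes within one file of each other even when the file count is not a multiple of the task count.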
❌ squeue is unreadable with many array tasks
Use:
squeue -u $USER --array # one line per array task (the default view collapses pending tasks onto one line)
squeue -u $USER -t R # only running tasks
❌ One task fails and I want to rerun just it
sbatch --array=7 process_array.slurm
This submits only task index 7. Slurm gives it a new array-master ID.
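When more than one task needs a rerun, the list of indices can be built from what is missing in outputs/. This is a sketch under the lab's naming scheme (outputs/input_XX_summary.csv); missing_indices and to_array_spec are hypothetical helpers, and the script only prints the sbatch command rather than running it.

```python
import os


def missing_indices(n_tasks: int, out_dir: str = "outputs") -> list:
    """Task indices whose expected summary file is absent."""
    return [
        i for i in range(n_tasks)
        if not os.path.exists(os.path.join(out_dir, f"input_{i:02d}_summary.csv"))
    ]


def to_array_spec(indices: list) -> str:
    """Compress indices into an --array value, e.g. [3, 7, 8, 9] -> '3,7-9'."""
    parts = []
    run_start = prev = None
    for i in sorted(set(indices)):
        if prev is not None and i == prev + 1:
            prev = i  # extend the current run
            continue
        if run_start is not None:
            parts.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
        run_start = prev = i
    if run_start is not None:
        parts.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
    return ",".join(parts)


missing = missing_indices(20)
if missing:
    print(f"sbatch --array={to_array_spec(missing)} process_array.slurm")
else:
    print("All 20 summaries present")
```

Check the logs for the failed indices before resubmitting; rerunning without fixing the cause just reproduces the failure under a new job ID.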
Time estimate
- Reading: ~15 min
- Tasks: ~60 min (a chunk of which is waiting for the array to finish — could be faster or slower depending on cluster load)
- Deliverables: ~10 min
Total: ~1.5 hours
Extensions (optional)
Add a “follow-up” job that aggregates after the array completes
Slurm --dependency lets you chain jobs. Save aggregate.slurm:
#!/bin/bash
#SBATCH --job-name=lab11_aggregate
#SBATCH --partition=batch
#SBATCH --time=00:05:00
#SBATCH --mem=2G
#SBATCH --output=logs/%x-%j.out
source ~/miniforge3/etc/profile.d/conda.sh
mamba activate eslab
python aggregate.py
Submit it to wait for the array:
ARRAY_ID=$(sbatch --parsable process_array.slurm)
sbatch --dependency=afterok:$ARRAY_ID aggregate.slurm
Now aggregate.slurm starts automatically when all 20 array tasks succeed.
Compute the speedup
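Without the array (one job per file, run sequentially), how long would 20 separate jobs take? Compare to your actual end-to-end array runtime. Note that 20 equal-length tasks at 5 concurrent run in about 4 waves, so the best case is roughly a 5× speedup over serial; queue wait and uneven task lengths will bring the real figure down.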
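The comparison can be estimated before you measure it with a small simulation of the concurrency cap. This is an idealized sketch: makespan is a hypothetical helper, the 30-second task duration is made up, and queue wait and scheduler overhead are ignored, so treat the result as an upper bound on the speedup.

```python
import heapq


def makespan(durations: list, max_concurrent: int) -> float:
    """Wall time to finish all tasks when at most max_concurrent run at once.

    Tasks start in list order; each new task takes the slot that frees up first.
    """
    slots = [0.0] * max_concurrent  # time at which each slot becomes free
    heapq.heapify(slots)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(slots)  # earliest free slot
        end = start + d
        finish = max(finish, end)
        heapq.heappush(slots, end)
    return finish


# Idealized case: 20 equal tasks of 30 s each (made-up duration).
durations = [30.0] * 20
serial = makespan(durations, 1)
capped = makespan(durations, 5)
print(f"serial={serial:.0f}s capped={capped:.0f}s speedup={serial / capped:.1f}x")
```

Replacing the made-up durations with the Elapsed values from your own run shows how uneven task lengths eat into the ideal figure.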
Use srun inside an array task
If each array task itself needs multiple CPUs (e.g. processing a single big image with parallel code), use srun inside:
srun python process_one.py "$INPUT_FILE"
This launches the Python process as a Slurm “step”, which is useful for nested parallelism.
What’s next?
You’ve now scaled to many tasks. Lab 12 — GPU jobs (or alternative: hyperparameter sweep) moves into GPU work (if you have access) or applies the array pattern to a hyperparameter sweep.