Lab 11 — Job arrays for many independent tasks

Goal

Process many similar inputs in parallel through a single Slurm submission, using a job array — instead of writing a bash loop that submits 20 separate jobs (which is bad-citizen behavior on the cluster, and a hassle to debug).

By the end of this lab you’ll have processed 20 input files in parallel from one sbatch call, with a sensible concurrent-tasks cap so you don’t monopolize a queue.


Reading

Budget ~15 minutes for the reading.


Learning objectives

  1. Write a Slurm array job using --array=A-B and the $SLURM_ARRAY_TASK_ID variable.
  2. Use the %A_%a output-file naming so each task gets its own log.
  3. Cap the number of concurrent tasks with --array=A-B%K and explain why this matters.
  4. Aggregate results across many array tasks into one summary.

Setup / prerequisites

  • Labs 01–10 complete. In particular, you should have the diagnostic.slurm template from Lab 10 — we’ll start from that.

Tasks

1. Set up the lab directory and create input data (10 min)

cd ~/hpc_practicum
mkdir -p lab11/{inputs,outputs,logs}
cd lab11

Create 20 fake input CSV files to process. Save as make_inputs.py:

"""Generate 20 fake input CSVs of varying size."""
import os
import numpy as np
import pandas as pd

os.makedirs("inputs", exist_ok=True)
rng = np.random.default_rng(42)

for i in range(20):
    n_rows = rng.integers(low=5_000, high=50_000)
    df = pd.DataFrame({
        "x": rng.normal(loc=i, scale=2.0, size=n_rows),
        "y": rng.normal(loc=0, scale=1.0, size=n_rows),
        "category": rng.choice(["A", "B", "C"], size=n_rows),
    })
    df.to_csv(f"inputs/input_{i:02d}.csv", index=False)
    print(f"input_{i:02d}.csv: {n_rows} rows")

Run it:

mamba activate eslab
python make_inputs.py
ls inputs/ | head

You should see input_00.csv through input_19.csv.

2. Write the per-file processing script (10 min)

Save as process_one.py:

"""
process_one.py — process one input file. Run as: python process_one.py inputs/input_XX.csv
"""
import sys, os, time
import pandas as pd
import numpy as np


def main(input_path):
    t0 = time.time()
    name = os.path.basename(input_path).replace(".csv", "")
    print(f"[{name}] starting at {time.strftime('%H:%M:%S')}")

    df = pd.read_csv(input_path)
    print(f"[{name}] loaded {len(df)} rows")

    summary = df.groupby("category").agg(
        x_mean=("x", "mean"),
        x_std=("x", "std"),
        y_mean=("y", "mean"),
        y_std=("y", "std"),
        n=("x", "size"),
    ).reset_index()

    out_path = f"outputs/{name}_summary.csv"
    summary.to_csv(out_path, index=False)
    print(f"[{name}] wrote {out_path}")
    print(f"[{name}] finished in {time.time()-t0:.2f}s")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("Usage: python process_one.py <input.csv>")
    main(sys.argv[1])

Test it on one file before scaling up:

python process_one.py inputs/input_00.csv
ls outputs/

You should see outputs/input_00_summary.csv with 3 rows (one per category).

3. Write the Slurm array script (15 min)

Save as process_array.slurm:

#!/bin/bash
#SBATCH --job-name=lab11_array
#SBATCH --partition=batch
#SBATCH --time=00:10:00                   # per task
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --output=logs/%x-%A_%a.out        # %A=array master, %a=task index
#SBATCH --array=0-19%5                    # 20 tasks (IDs 0..19), max 5 running at once

set -euo pipefail

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

echo "=== ARRAY TASK INFO ==="
echo "Array master: $SLURM_ARRAY_JOB_ID"
echo "This task:    $SLURM_ARRAY_TASK_ID of $SLURM_ARRAY_TASK_COUNT"
echo "Node:         $(hostname)"
echo "Start:        $(date)"
echo "======================="

source ~/miniforge3/etc/profile.d/conda.sh
mamba activate eslab

# Map the array index to one of our 20 input files.
# `printf` zero-pads to two digits (00, 01, ..., 19).
INPUT_FILE=$(printf "inputs/input_%02d.csv" $SLURM_ARRAY_TASK_ID)
echo "Processing:   $INPUT_FILE"

/usr/bin/time -v python process_one.py "$INPUT_FILE"

echo "End:          $(date)"

A few things worth understanding before you submit:

  • --array=0-19 runs 20 tasks total, with $SLURM_ARRAY_TASK_ID set to 0, 1, 2, …, 19 inside each.
  • %5 caps concurrency at 5: at any moment, only 5 of the 20 tasks are running. The other 15 are queued. This is critical behavior — without it, you’d potentially try to run all 20 simultaneously, which:
    • Is rude on a busy cluster (you grab 20 slots; everyone else waits)
    • Can hit per-user job limits and fail
    • Doesn’t actually speed up your work much if each task only uses 1 CPU anyway
  • --output=logs/%x-%A_%a.out produces one log file per task: logs/lab11_array-12345_0.out, _1, …, _19. Without this, all tasks would clobber the same log file.

4. Submit and monitor (15 min)

sbatch process_array.slurm
# Submitted batch job 12345

Watch the queue:

squeue -u $USER

You should see entries like:

JOBID         PARTITION  NAME          USER         ST  TIME  ...
12345_[5-19]  batch      lab11_array   yourname.##  PD  0:00
12345_0       batch      lab11_array   yourname.##  R   0:23
12345_1       batch      lab11_array   yourname.##  R   0:23
12345_2       batch      lab11_array   yourname.##  R   0:23
12345_3       batch      lab11_array   yourname.##  R   0:23
12345_4       batch      lab11_array   yourname.##  R   0:23

5 running, 15 pending. As the running ones finish, the next 5 enter R state. Wait until all 20 are done.

Check the output:

ls outputs/             # should see 20 *_summary.csv files
ls logs/                # should see 20 *.out files

5. Aggregate the results (5 min)

Save as aggregate.py:

"""Concatenate all per-input summaries into one CSV."""
import glob, pandas as pd

paths = sorted(glob.glob("outputs/input_*_summary.csv"))
print(f"Found {len(paths)} summary files")

dfs = []
for p in paths:
    df = pd.read_csv(p)
    df["source_file"] = p.replace("outputs/", "").replace("_summary.csv", "")
    dfs.append(df)

combined = pd.concat(dfs, ignore_index=True)
combined.to_csv("outputs/combined_summary.csv", index=False)
print(f"Wrote outputs/combined_summary.csv with {len(combined)} rows")
python aggregate.py
head outputs/combined_summary.csv

6. Inspect efficiency for one task (3 min)

seff works on individual array tasks too:

seff 12345_0

Should show CPU Efficiency near 100% (since each task is single-threaded and you asked for 1 CPU) and Memory Efficiency dependent on how much your input file used.

For the whole array:

sacct -j 12345 --format=JobID,State,Elapsed,MaxRSS,CPUTime | head

This shows one line per array task — useful for spotting outliers.


Deliverables

Save to lab11/ in your personal repo:

  1. lab11/make_inputs.py — the input generator.
  2. lab11/process_one.py — the per-file script.
  3. lab11/process_array.slurm — the array submission script.
  4. lab11/aggregate.py — the post-processing aggregator.
  5. lab11/output_listing.txtls -la outputs/ showing all 20 summary files + the combined CSV.
  6. lab11/seff_one_task.txtseff 12345_0 output.
  7. lab11/reflection.md — 5–7 sentences:
    • Why is %5 (the concurrency cap) important — what would go wrong without it?
    • How is this approach more cluster-friendly than a bash loop submitting 20 separate jobs?
    • For a real-world version of this — say, processing 1000 satellite images — what would change?

Self-check


Common issues

❌ Some output files are missing

Check the corresponding log file (logs/lab11_array-12345_<index>.out). Most likely the input file didn’t exist for that index (check your zero-padding), or the Python script crashed. Fix and resubmit just that task with:

sbatch --array=<missing_index> process_array.slurm

❌ All 20 tasks ran simultaneously (no concurrency cap honored)

Check the --array=0-19%5 line — make sure the %5 is there. The cap is a per-array setting; without it you can have all tasks running at once.

❌ “Maximum array size exceeded”

Clusters often cap the array size (e.g. --array=0-9999 may be too many). Run scontrol show config | grep -i array to see your cluster’s MaxArraySize. For >1000 tasks, use multiple smaller submissions or move to a different parallelism strategy.

squeue is unreadable with many array tasks

Use:

squeue -u $USER --array      # shows array tasks rolled up
squeue -u $USER -t R         # only running tasks

❌ One task fails and I want to rerun just it

sbatch --array=7 process_array.slurm

submits only task index 7. Slurm gives it a fresh new array-master ID.


Time estimate

  • Reading: ~15 min
  • Tasks: ~60 min (a chunk of which is waiting for the array to finish — could be faster or slower depending on cluster load)
  • Deliverables: ~10 min

Total: ~1.5 hours


Extensions (optional)

Add a “follow-up” job that aggregates after the array completes

Slurm --dependency lets you chain jobs. Save aggregate.slurm:

#!/bin/bash
#SBATCH --job-name=lab11_aggregate
#SBATCH --partition=batch
#SBATCH --time=00:05:00
#SBATCH --mem=2G
#SBATCH --output=logs/%x-%j.out

source ~/miniforge3/etc/profile.d/conda.sh
mamba activate eslab
python aggregate.py

Submit it to wait for the array:

ARRAY_ID=$(sbatch --parsable process_array.slurm)
sbatch --dependency=afterok:$ARRAY_ID aggregate.slurm

Now aggregate.slurm starts automatically when all 20 array tasks succeed.

Compute the speedup

Without the array (one job per file, run sequentially), how long would 20 separate jobs take? Compare to your actual end-to-end array runtime. Note that 20 jobs at 5 concurrent ≈ 4× speedup over serial.

Use srun inside an array task

If each array task itself needs multiple CPUs (e.g. processing a single big image with parallel code), use srun inside:

srun python process_one.py "$INPUT_FILE"

This launches the Python process as a Slurm “step” — useful for nested parallelism.


What’s next?

You’ve now scaled to many tasks. Lab 12 — GPU jobs (or alternative: hyperparameter sweep) moves into GPU work (if you have access) or applies the array pattern to a hyperparameter sweep.