Lab 09 — Right-sizing: the headline lab

Goal

Experience first-hand why over-requesting Slurm resources hurts you (not just your labmates). You’ll take a single-threaded scikit-learn training job and submit it twice:

  1. With a deliberately greedy resource request — almost a whole node.
  2. With a right-sized request based on what seff told you.

Then compare queue wait time, runtime, and efficiency between the two. The numbers tell the story.

This is the most important Slurm skill you’ll learn in this course.


Reading

Budget ~25 minutes for the reading. Pay particular attention to:

  • Section 1 (the backfill intuition: why your jobs run sooner when you ask for less)
  • Section 3 (the worked example, same setup as this lab)
  • Section 9 (post-mortem with seff)


Learning objectives

  1. Predict, given two #SBATCH blocks, which job will start sooner in the queue.
  2. Read a seff efficiency report and identify whether CPU and memory were over- or under-requested.
  3. Convert a seff report into specific changes for the next submission.
  4. Articulate (in writing) why “I’ll just grab the whole node so I’m safe” hurts both the cluster and yourself.

Setup / prerequisites

  • Labs 01–08 complete — SSH working, eslab env exists, you’ve submitted a basic batch job and used seff.

Tasks

1. Set up the lab directory and starter code (10 min)

cd ~/hpc_practicum
mkdir -p lab09 lab09/logs
cd lab09

Save the following as fit_rf.py. It’s a single-threaded scikit-learn job that generates synthetic data, trains a RandomForestClassifier, saves the model, and reports peak memory along the way.

"""
fit_rf.py — single-threaded scikit-learn benchmark for Lab 09.

Generates a synthetic dataset, trains a RandomForestClassifier with n_jobs=1,
and reports memory usage at each phase.
"""
import os
import time
import psutil
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import joblib


def memreport(label, t0):
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(f"[{label:20s}] RSS: {rss_gb:5.2f} GB  at t={time.time()-t0:6.1f}s")


def main():
    t0 = time.time()
    print(f"Started at {time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Slurm CPUs allocated: {os.environ.get('SLURM_CPUS_PER_TASK', 'n/a')}")
    print(f"Slurm mem allocated:  {os.environ.get('SLURM_MEM_PER_NODE', 'n/a')} MB")
    memreport("Startup", t0)

    print("\nGenerating synthetic dataset (500k samples, 200 features)...")
    X, y = make_classification(
        n_samples=500_000,
        n_features=200,
        n_informative=50,
        n_redundant=20,
        random_state=42,
    )
    df = pd.DataFrame(X)
    df['y'] = y
    memreport("After data load", t0)

    print("\nTraining RandomForestClassifier (n_estimators=200, n_jobs=1)...")
    clf = RandomForestClassifier(
        n_estimators=200,
        max_depth=15,
        n_jobs=1,            # SINGLE-THREADED on purpose
        random_state=42,
    )
    clf.fit(X, y)
    memreport("After fit", t0)

    print("\nSaving model...")
    joblib.dump(clf, "rf_model.joblib")
    memreport("After save", t0)

    print(f"\nFinished at {time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Total walltime: {time.time()-t0:.1f}s")


if __name__ == "__main__":
    main()

This script will use roughly 3–8 GB of memory at peak (the exact number varies per machine) and take 2–5 minutes to run on a single CPU. We’ll measure exactly later.

2. Round 1 — submit with a deliberate over-ask (15 min)

Write fit_rf_overasked.slurm:

#!/bin/bash
#SBATCH --job-name=fit_rf_BIG
#SBATCH --partition=batch
#SBATCH --time=12:00:00                   # 12 hours — absurdly long for this job
#SBATCH --cpus-per-task=16                # 16 CPUs — even though n_jobs=1
#SBATCH --mem=64G                         # 64 GB — even though we'll use ~5
#SBATCH --output=logs/%x-%j.out

set -euo pipefail

# Tell numerical libraries to respect Slurm's CPU allocation.
# (We're asking for 16, even though this script only uses 1. This export is here
# for completeness — it would matter for genuinely multi-threaded code.)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

echo "=== JOB INFO ==="
echo "Job: $SLURM_JOB_ID ($SLURM_JOB_NAME) on $(hostname)"
echo "CPUs requested: $SLURM_CPUS_PER_TASK"
echo "Memory requested: $SLURM_MEM_PER_NODE MB"
echo "Started: $(date)"
echo "================"

source ~/miniforge3/etc/profile.d/conda.sh
mamba activate eslab

/usr/bin/time -v python fit_rf.py

echo "Finished: $(date)"
echo "Now run 'seff $SLURM_JOB_ID' for the efficiency report."

Submit and note the time:

date                                    # for queue-wait reference
sbatch fit_rf_overasked.slurm
# Submitted batch job 12345

Watch it progress:

squeue -u $USER

Record:

  • When did it transition from PD (pending) to R (running)? The gap since your date stamp is roughly the queue-wait time.
  • When did it finish?
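If watching squeue by hand is too coarse, the wait can also be computed from timestamps. The sketch below assumes you have copied the Submit and Start values for the job, for example from `sacct -j 12345 -X --format=JobID,Submit,Start,End`, which prints them in ISO format. The function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime


def queue_wait_seconds(submit: str, start: str) -> float:
    """Seconds between submission and start, given ISO timestamps
    such as 2025-03-04T10:02:11."""
    delta = datetime.fromisoformat(start) - datetime.fromisoformat(submit)
    return delta.total_seconds()


# Hypothetical timestamps, not real measurements.
wait = queue_wait_seconds("2025-03-04T10:02:11", "2025-03-04T10:14:34")
print(f"Queue wait: {wait / 60:.1f} min")
```

Running the same calculation for both rounds gives you the two "Queue wait time" cells of the comparison table.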

When done, examine the log:

cat logs/fit_rf_BIG-12345.out
seff 12345 | tee seff_overasked.txt     # keep a copy for the deliverables

Take note of:

  • CPU Efficiency percentage
  • Memory Efficiency percentage
  • Actual elapsed walltime
  • Peak RSS reported by /usr/bin/time -v
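GNU time reports peak RSS on a line reading "Maximum resident set size (kbytes)". As a convenience, a minimal sketch for pulling that value out of the job log and converting it to GB is shown here. It uses 1 GB = 1e9 bytes to match memreport in fit_rf.py; seff may use a different convention, so the two figures can differ by a few percent. The helper name and the sample line are illustrative assumptions.

```python
import re


def peak_rss_gb(log_text: str) -> float:
    """Return peak RSS in GB (1e9 bytes) from `/usr/bin/time -v` output."""
    match = re.search(r"Maximum resident set size \(kbytes\):\s*(\d+)", log_text)
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    return int(match.group(1)) * 1024 / 1e9


# Hypothetical log excerpt; your numbers will differ.
sample = "\tMaximum resident set size (kbytes): 4697266\n"
print(f"Peak RSS: {peak_rss_gb(sample):.2f} GB")
```

To use it on a real run, pass in the log contents, e.g. `peak_rss_gb(open("logs/fit_rf_BIG-12345.out").read())`.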

3. Diagnose the waste (5 min)

Look at the seff output. You’ll see something like:

Cores: 16
CPU Utilized: 00:02:23                    # the work itself
CPU Efficiency: 5.96% of 00:40:00 core-walltime
Job Wall-clock time: 00:02:30             # start to end of the run; queue wait is not counted
Memory Utilized: 4.81 GB
Memory Efficiency: 7.5% of 64.00 GB

In plain English:

  • You used 1 CPU worth of compute, but reserved 16 CPUs. → 15 CPUs were locked away from other users for nothing.
  • You used ~5 GB, but reserved 64 GB. → 59 GB locked away.
  • You asked for 12 hours, used a few minutes. → That huge time limit makes the job much harder for the backfill scheduler to slot into gaps, so it tends to wait longer in the queue.
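The two efficiency figures are simple ratios, and it can help to reproduce them by hand. The sketch below assumes seff computes CPU efficiency as CPU time divided by (cores × wall-clock time) and memory efficiency as used divided by requested. The helper names are made up, and the 00:02:30 walltime is an assumed value for a run with little idle time.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert HH:MM:SS to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized: str, wall_clock: str, cores: int) -> float:
    """CPU time used, as a percentage of cores * wall-clock time."""
    return 100 * hms_to_seconds(cpu_utilized) / (cores * hms_to_seconds(wall_clock))


def memory_efficiency(used_gb: float, requested_gb: float) -> float:
    """Memory used, as a percentage of memory requested."""
    return 100 * used_gb / requested_gb


# Same work, two different requests (illustrative numbers).
for cores, mem_gb in ((16, 64), (1, 8)):
    cpu = cpu_efficiency("00:02:23", "00:02:30", cores)
    mem = memory_efficiency(4.81, mem_gb)
    print(f"{cores:2d} CPUs, {mem_gb:2d} GB -> CPU {cpu:.2f}%, memory {mem:.2f}%")
```

A single busy thread can never push CPU efficiency above 1/cores, which is why the 16-CPU request is capped at about 6% no matter how well the script runs.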

4. Round 2 — right-size based on seff (10 min)

Write fit_rf_rightsized.slurm:

#!/bin/bash
#SBATCH --job-name=fit_rf_RIGHT
#SBATCH --partition=batch
#SBATCH --time=00:15:00                   # 15 minutes — measured + safety margin
#SBATCH --cpus-per-task=1                 # 1 CPU — the script is single-threaded
#SBATCH --mem=8G                          # 8 GB — measured ~5 GB peak + headroom
#SBATCH --output=logs/%x-%j.out

set -euo pipefail

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

echo "=== JOB INFO ==="
echo "Job: $SLURM_JOB_ID ($SLURM_JOB_NAME) on $(hostname)"
echo "CPUs requested: $SLURM_CPUS_PER_TASK"
echo "Memory requested: $SLURM_MEM_PER_NODE MB"
echo "Started: $(date)"
echo "================"

source ~/miniforge3/etc/profile.d/conda.sh
mamba activate eslab

/usr/bin/time -v python fit_rf.py

echo "Finished: $(date)"
echo "Now run 'seff $SLURM_JOB_ID' for the efficiency report."

Tune --mem to your actual measured peak plus ~30% safety, rounded up to a convenient size. If your over-asked run hit Memory Utilized: 4.81 GB, request --mem=8G. If it hit 6.5 GB, request --mem=10G. Avoid the temptation to trim the request to the bare minimum: Slurm OOM-kills your job if it exceeds the cap by even a few MB.
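The rule of thumb can be written down as a small calculation. This is only a sketch: the 30% margin comes from the text, while rounding up to the next multiple of 2 GB is an assumption chosen because it reproduces the two examples (8G and 10G). The function name is hypothetical.

```python
import math


def suggest_mem_gb(peak_gb: float, margin: float = 0.30, step_gb: int = 2) -> int:
    """Measured peak plus a safety margin, rounded up to a multiple of step_gb."""
    return math.ceil(peak_gb * (1 + margin) / step_gb) * step_gb


for peak in (4.81, 6.5):
    print(f"peak {peak} GB -> --mem={suggest_mem_gb(peak)}G")
```

If your cluster bills or schedules memory in different increments, change step_gb accordingly; the point is only to round up, never down.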

Submit:

date
sbatch fit_rf_rightsized.slurm
squeue -u $USER

When done:

seff <new_jobid> | tee seff_rightsized.txt
cat logs/fit_rf_RIGHT-<new_jobid>.out

5. Compare the two runs (10 min)

Make a side-by-side comparison table in lab09/comparison.md:

| Metric                  | Over-asked   | Right-sized  |
|-------------------------|--------------|--------------|
| `--cpus-per-task`       | 16           | 1            |
| `--mem`                 | 64G          | 8G           |
| `--time`                | 12:00:00     | 00:15:00     |
| Queue wait time         | <fill in>    | <fill in>    |
| Actual walltime         | <fill in>    | <fill in>    |
| CPU Efficiency (seff)   | ~6%          | <fill in>    |
| Memory Efficiency (seff)| ~7%          | <fill in>    |
| Was the work different? | No           | No           |

The “Was the work different?” row is the punchline: identical work, dramatically different cluster impact.

6. Reflect (5 min)

In lab09/reflection.md, answer:

  • How did the queue wait times compare? Did the over-asked job really take longer to start? (For lightly-loaded clusters the difference may be small; for busy ones it’s dramatic — note what you observed.)
  • The over-asked job locked up 15 unused CPUs and 59 GB of unused RAM for the duration. How many other small jobs from your labmates could that have run?
  • Are there situations where it actually would make sense to ask for the whole node? (Hint: think about jobs that genuinely use that much. Or jobs where shared-node neighbors would interfere — those are rare though.)
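For the second question, one way to get a starting estimate is to ask how many right-sized jobs fit into the idle resources, with the scarcer resource deciding. The sketch below assumes each small job needs 1 CPU and 8 GB (the right-sized request from Task 4); the function name is made up, and real scheduling also depends on what else is on the node.

```python
def extra_jobs_that_fit(idle_cpus: int, idle_mem_gb: int,
                        job_cpus: int = 1, job_mem_gb: int = 8) -> int:
    """Number of jobs of the given size that fit into the idle CPUs and memory."""
    return min(idle_cpus // job_cpus, idle_mem_gb // job_mem_gb)


# 15 idle CPUs and 59 GB of idle RAM, as in the over-asked run.
print(extra_jobs_that_fit(15, 59))
print(extra_jobs_that_fit(15, 59, job_mem_gb=2))  # smaller-memory jobs
```

Notice which resource is the limit in each case; that is worth a sentence in your reflection.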

Deliverables

Save to lab09/ in your personal repo:

  1. lab09/fit_rf.py — the Python script.

  2. lab09/fit_rf_overasked.slurm and lab09/seff_overasked.txt — the bad version’s script and seff output.

  3. lab09/fit_rf_rightsized.slurm and lab09/seff_rightsized.txt — the good version’s script and seff output.

  4. lab09/comparison.md — the side-by-side table from Task 5.

  5. lab09/reflection.md — your reflection from Task 6.


Common issues

❌ My over-asked job actually started right away (no queue wait)

The cluster was lightly loaded when you submitted. The lesson is still real; it just doesn’t show up dramatically in your numbers this time. Try submitting both jobs during a busier period (often mid-morning on weekdays) to see the queue-wait effect more clearly.

❌ My right-sized job died with OOM

You cut --mem too tight. Bump it up by 50% and resubmit. The point isn’t to find the absolute minimum — it’s to ask for something close to your actual usage plus a safety margin. Run #2 only failed because you didn’t leave enough headroom.

❌ My CPU Efficiency on the over-asked job is something other than ~1/16 = 6.25%

The number depends on how many CPUs the script actually kept busy. If, say, numpy started more threads than you intended, seff will show higher CPU utilization; if the job spent part of its walltime idle (environment activation, disk I/O), it will show lower. The OMP_NUM_THREADS exports help control the thread count; see Handbook §5.

❌ My job pends forever

You may have hit a usage limit (too many jobs already running, or a fair-share limit). Run squeue -u $USER to see how many of your jobs are in flight and the reason Slurm gives for the pending ones, and sshare -u $USER to check your fair-share.

❌ The Python script crashes with MemoryError

Either generate a smaller dataset (reduce n_samples from 500k to e.g. 200k) or run on a node with more RAM. The make_classification data isn’t huge but it can spike during DataFrame conversion.
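To judge how far to shrink the dataset, a back-of-the-envelope size estimate is usually enough. The sketch below assumes X is a dense array of 8-byte floats, which is what make_classification returns; the helper name is hypothetical. Several copies can be alive at once (X itself, the DataFrame, and any working copies the library makes), so peak usage is a multiple of this figure.

```python
def dense_array_gb(n_samples: int, n_features: int, bytes_per_value: int = 8) -> float:
    """Size in GB (1e9 bytes) of a dense n_samples x n_features array."""
    return n_samples * n_features * bytes_per_value / 1e9


for n in (500_000, 200_000):
    print(f"{n:>7} samples: {dense_array_gb(n, 200):.2f} GB per copy of X")
```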


Time estimate

  • Reading: ~25 min
  • Tasks: ~55 min (the queue waits and runtime are most of this)
  • Deliverables: ~15 min

Total: ~1.5 hours


Extensions (optional)

Submit a third version — deliberately under-asked

Copy fit_rf_rightsized.slurm, change the request to --mem=1G, and submit. Watch it die with an out-of-memory error (Slurm typically reports the job state as OUT_OF_MEMORY).

Inspect the log and compare to seff. This is the failure mode of being too aggressive on right-sizing — and why the “+30% safety margin” matters.

Submit during peak vs. off-peak hours

If you can swing it: submit both fit_rf_overasked.slurm and fit_rf_rightsized.slurm once at a peak time (Tuesday 10am-3pm during the semester) and once at an off-peak time (Friday 8pm, or Sunday morning). Document the queue-wait differences.

Inspect sinfo to see the actual node sizes

sinfo -o "%N %P %c %m %f"

This shows the CPUs and memory per node on each partition. You can see exactly what fraction of a node your over-ask was locking up.
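Once you have the node size, the locked-up fraction is a one-line calculation. In the sketch below the node figures (24 CPUs, 128000 MB) are purely hypothetical; substitute the %c and %m values that sinfo prints for your partition. sinfo reports memory in MB, and the sketch assumes Slurm counts --mem=64G as 64 × 1024 MB. The function name is made up.

```python
def fraction_of_node(req_cpus: int, req_mem_gb: int,
                     node_cpus: int, node_mem_mb: int) -> tuple[float, float]:
    """Return (cpu_fraction, mem_fraction) of one node taken by a request."""
    return req_cpus / node_cpus, req_mem_gb * 1024 / node_mem_mb


# Hypothetical node: 24 CPUs and 128000 MB of memory.
cpu_frac, mem_frac = fraction_of_node(16, 64, node_cpus=24, node_mem_mb=128000)
print(f"Over-ask holds {cpu_frac:.0%} of the CPUs and {mem_frac:.0%} of the memory")
```

Running it again with the right-sized request (1 CPU, 8 GB) shows how little of the node the job really needs.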


What’s next?

You’ve now seen the consequences of over- and under-requesting. Lab 10 — Measuring memory + the diagnostic wrapper teaches you how to measure what your code actually uses, so you stop guessing.