Lab 10 — Measuring memory + the diagnostic wrapper

Goal

Stop guessing how much memory and CPU your code needs — measure it. By the end of this lab you’ll have:

A Python script instrumented with psutil so it self-reports memory at every phase
A Slurm script wrapped in /usr/bin/time -v that reports peak resident memory
A personal diagnostic.slurm template you’ll reuse for every batch job for the rest of your research career

This lab puts a measurement loop around what you did in Lab 09 — instead of guessing-then-correcting, you’ll measure once and right-size on the first try.

Reading

Handbook: Slurm Best Practices §5–9 — the five measurement techniques, the diagnostic wrapper template, and the seff post-mortem loop.

Budget ~25 minutes for the reading.

Learning objectives

Find the “Maximum resident set size” line in /usr/bin/time -v output and convert it to GB.
Instrument Python code with psutil to log memory usage at specific phases.
Build a reusable diagnostic.slurm template that includes vital-signs logging, thread-count exports, and a /usr/bin/time wrapper.
Use the resulting log to right-size future submissions on the first try.

Setup / prerequisites

Labs 01–09 complete. In particular, the eslab environment must have psutil installed (it was in the Lab 05 install list).

Tasks

1. Set up the lab directory (3 min)

cd ~/hpc_practicum
mkdir -p lab10 lab10/logs ~/templates
cd lab10

The ~/templates/ directory will hold your reusable diagnostic.slurm — a template you’ll copy for new projects from now on.

2. Write a memory-instrumented Python script (15 min)

Save as mem_intensive.py:

"""
mem_intensive.py — a script with three phases of distinctly different memory footprints.

Phase 1: load data (medium memory)
Phase 2: do something memory-hungry (peak memory)
Phase 3: write output (back to small memory)
"""
import os
import time
import psutil
import numpy as np

def memreport(label, t0):
    """Print current RSS for the running process."""
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    elapsed = time.time() - t0
    print(f"[{label:25s}]  RSS: {rss_gb:6.2f} GB   at t={elapsed:6.1f}s", flush=True)


def main():
    t0 = time.time()
    print(f"Started at {time.strftime('%Y-%m-%d %H:%M:%S')}")
    memreport("startup", t0)

    # ─── Phase 1: load some "data" ────────────────────────────
    print("\nPhase 1: allocating a few moderately large arrays...")
    A = np.random.randn(5_000_000, 32).astype(np.float32)       # ~640 MB
    B = np.random.randn(5_000_000, 32).astype(np.float32)       # ~640 MB
    memreport("after Phase 1 load", t0)

    # ─── Phase 2: deliberately memory-hungry ──────────────────
    print("\nPhase 2: computing a large pairwise distance-like matrix...")
    # Builds a 5000 x 5000 float32 result matrix (~100 MB) on top of A and B
    chunk = 5000
    distances = np.empty((chunk, chunk), dtype=np.float32)
    for i in range(0, chunk):
        diff = A[i:i+1, :] - B[:chunk, :]                       # (chunk, 32)
        distances[i, :] = np.linalg.norm(diff, axis=1)
        if i == chunk // 4:
            memreport("Phase 2 mid", t0)
    memreport("after Phase 2 compute", t0)

    # ─── Phase 3: write output (release big arrays) ───────────
    print("\nPhase 3: writing output, releasing memory...")
    np.save("distances.npy", distances)
    del A, B, distances
    import gc
    gc.collect()
    memreport("after Phase 3 cleanup", t0)

    print(f"\nFinished at {time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Total walltime: {time.time()-t0:.1f}s")


if __name__ == "__main__":
    main()

This script’s memory profile is intentionally uneven: the Phase 2 checkpoints report the highest RSS, and Phase 3 drops back down. Most real research scripts have a similar profile, with a peak somewhere in the middle, not at the end. That’s why a single memory reading taken after the script finishes doesn’t tell you the peak.
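The footprints can be checked with back-of-the-envelope arithmetic before running anything. The sketch below is illustrative only: array_bytes is a hypothetical helper (not a NumPy function), and the shapes and dtypes are the ones used in mem_intensive.py. It also shows a transient the checkpoints cannot see: np.random.randn returns float64, so each Phase 1 line briefly holds a float64 array next to its float32 copy.

```python
# Rough footprint arithmetic for the arrays in mem_intensive.py.
# array_bytes is a hypothetical helper, not part of NumPy.
from math import prod

ITEMSIZE = {"float32": 4, "float64": 8}  # bytes per element


def array_bytes(shape, dtype):
    """Bytes needed for a dense array of the given shape and dtype."""
    return prod(shape) * ITEMSIZE[dtype]


a_f32 = array_bytes((5_000_000, 32), "float32")    # A (and B) as stored
tmp_f64 = array_bytes((5_000_000, 32), "float64")  # randn() output before .astype()
dist = array_bytes((5_000, 5_000), "float32")      # Phase 2 result matrix

print(f"A or B (float32):        {a_f32 / 1e9:.2f} GB")
print(f"float64 temporary:       {tmp_f64 / 1e9:.2f} GB")
print(f"distances (float32):     {dist / 1e9:.2f} GB")
print(f"A + B + distances:       {(2 * a_f32 + dist) / 1e9:.2f} GB")
print(f"while B is being built:  {(a_f32 + tmp_f64 + a_f32) / 1e9:.2f} GB")
```

The last line is an estimate of the short-lived high-water mark (A, plus B as float64, plus B as float32). Interpreter and library overhead come on top of every figure.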

3. Run it interactively first (10 min)

Use an sinteractive (or livenode) session for this — you don’t want to be debugging on a login node, but you also don’t need a batch job yet.

# Start an interactive session (skip this line if you are already in a livenode session):
sinteractive -p batch --cpus-per-task=2 --mem=8G --time=01:00:00
mamba activate eslab
cd ~/hpc_practicum/lab10
python mem_intensive.py

Read the output. Note:

Approximate peak RSS (the highest of the checkpoint readouts, normally [after Phase 2 compute] or [Phase 2 mid])
Total walltime
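If you saved the script's output to a file, picking out the highest checkpoint can be automated. This is a minimal sketch: peak_checkpoint is a hypothetical helper, and the sample text is made up in the format memreport() prints, so your numbers will differ.

```python
import re

# Made-up sample in the format memreport() prints; your numbers will differ.
sample_output = """\
[startup                  ]  RSS:   0.03 GB   at t=   0.0s
[after Phase 1 load       ]  RSS:   1.31 GB   at t=   9.8s
[Phase 2 mid              ]  RSS:   1.34 GB   at t=  10.1s
[after Phase 2 compute    ]  RSS:   1.41 GB   at t=  10.9s
[after Phase 3 cleanup    ]  RSS:   0.04 GB   at t=  11.2s
"""

CHECKPOINT = re.compile(r"^\[(?P<label>[^\]]+)\]\s+RSS:\s+(?P<gb>[\d.]+) GB")


def peak_checkpoint(text):
    """Return (label, rss_gb) for the checkpoint with the highest RSS."""
    readings = [
        (m["label"].strip(), float(m["gb"]))
        for line in text.splitlines()
        if (m := CHECKPOINT.match(line))
    ]
    return max(readings, key=lambda r: r[1])


label, gb = peak_checkpoint(sample_output)
print(f"Highest checkpoint: {label} at {gb:.2f} GB")
```

To use it on a real run, pass the contents of your saved output instead of sample_output.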

Now run it again wrapped in /usr/bin/time -v:

/usr/bin/time -v python mem_intensive.py 2> time_output.txt
cat time_output.txt | head -25

Scroll to find the line that says “Maximum resident set size (kbytes):”. Convert to GB:

Maximum resident set size (kbytes): 1234567
→ 1234567 / 1024 / 1024 ≈ 1.18 GB

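Strictly speaking, dividing by 1024 twice gives GiB, the 1024-based unit that Slurm’s --mem=4G also uses. The psutil checkpoints divide by 1e9, so they read about 7% higher for the same number of bytes.

This figure should be at least as large as the highest psutil checkpoint. For this script, expect it to be noticeably larger: np.random.randn returns float64, so each Phase 1 line briefly holds a 1.28 GB float64 array alongside its float32 copy, and that transient is gone again before the next checkpoint runs.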
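The lookup and conversion can also be scripted. A minimal sketch, assuming the output was saved to a text file as in the command above; max_rss_gib is a hypothetical helper name, and the sample line reuses the example number from this task.

```python
import re

# Sample line in the format GNU time -v writes; the number is the example above.
sample = "\tMaximum resident set size (kbytes): 1234567\n"


def max_rss_gib(time_v_text):
    """Extract peak RSS from /usr/bin/time -v output and return it in GiB."""
    m = re.search(r"Maximum resident set size \(kbytes\):\s*(\d+)", time_v_text)
    if m is None:
        raise ValueError("no 'Maximum resident set size' line found")
    return int(m.group(1)) / 1024 / 1024


print(f"{max_rss_gib(sample):.2f} GiB")  # 1.18 GiB
```

For a real run, replace sample with the contents of time_output.txt, for example open("time_output.txt").read().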

4. Build your `diagnostic.slurm` template (15 min)

Save as ~/templates/diagnostic.slurm (a reusable template you’ll copy for future projects). This is your “every batch script” wrapper:

#!/bin/bash
#SBATCH --job-name=CHANGE_ME
#SBATCH --partition=batch
#SBATCH --time=CHANGE_ME                  # e.g. 02:00:00
#SBATCH --cpus-per-task=CHANGE_ME         # e.g. 1, 4, 8
#SBATCH --mem=CHANGE_ME                   # e.g. 8G, 32G — measured + ~30% headroom
#SBATCH --output=logs/%x-%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=CHANGE_ME@osu.edu

set -euo pipefail

# ───────────────────────────────────────────────────────────
# Tell numerical libraries how many threads they may use.
# Without this, NumPy/BLAS/MKL may start one thread per core on the
# node and oversubscribe the CPUs you were actually allocated.
# ───────────────────────────────────────────────────────────
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export NUMEXPR_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# ───────────────────────────────────────────────────────────
# Vital signs — recorded at the top of every job's log.
# ───────────────────────────────────────────────────────────
echo "==================== JOB INFO ===================="
echo "Job ID:        $SLURM_JOB_ID"
echo "Job name:      $SLURM_JOB_NAME"
echo "Partition:     $SLURM_JOB_PARTITION"
echo "Node:          $(hostname)"
echo "CPUs:          $SLURM_CPUS_PER_TASK"
echo "Memory (MB):   ${SLURM_MEM_PER_NODE:-${SLURM_MEM_PER_CPU:-unset}}"
echo "GPUs:          ${SLURM_GPUS:-${SLURM_JOB_GPUS:-none}}"
echo "Working dir:   $(pwd)"
echo "Started:       $(date)"
echo "=================================================="

# ───────────────────────────────────────────────────────────
# Environment activation
# ───────────────────────────────────────────────────────────
source ~/miniforge3/etc/profile.d/conda.sh
conda activate eslab    # conda.sh defines the conda shell function; mamba activate needs its own shell hook

echo "Python:        $(which python)"
python -c "import sys; print(f'Python version: {sys.version.split()[0]}')"

# ───────────────────────────────────────────────────────────
# (Optional) GPU usage logger in the background
# ───────────────────────────────────────────────────────────
if command -v nvidia-smi &> /dev/null; then
    (
      while sleep 30; do
        echo "--- $(date '+%H:%M:%S') ---"
        nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader
      done
    ) > "logs/gpu-${SLURM_JOB_ID}.log" 2>&1 &
    GPU_LOGGER_PID=$!
fi

# ───────────────────────────────────────────────────────────
# ACTUAL WORK — replace this line for each new project.
# /usr/bin/time -v captures the peak RSS for right-sizing later.
# ───────────────────────────────────────────────────────────
echo "===================== WORK ======================="
/usr/bin/time -v python mem_intensive.py
echo "=================================================="

# Clean up background logger
if [ -n "${GPU_LOGGER_PID:-}" ]; then
    kill $GPU_LOGGER_PID 2>/dev/null || true
fi

echo "Finished:      $(date)"
echo ""
echo "Run 'seff $SLURM_JOB_ID' after the job's epilog completes"
echo "for a one-page efficiency report (CPU% and Memory%)."

This is your drop-in template. For each new project, copy it to that project’s directory and replace the five CHANGE_ME placeholders and the python line in the WORK block.
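To confirm from inside a job that the thread-count exports took effect, a few lines of Python are enough. This is a sketch under assumptions: thread_report is a hypothetical helper, os.sched_getaffinity is Linux-only, and whether the affinity mask reflects your Slurm allocation depends on how the cluster is configured.

```python
import os

THREAD_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def thread_report():
    """Collect the CPU count this process may use and the thread-limit variables."""
    # sched_getaffinity is Linux-only; fall back to cpu_count elsewhere.
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = os.cpu_count() or 1
    report = {"usable_cpus": usable}
    for var in THREAD_VARS:
        report[var] = os.environ.get(var, "unset")
    return report


for key, value in thread_report().items():
    print(f"{key:22s} {value}")
```

Inside a job submitted with the template, the four variables should all equal the --cpus-per-task value. On a login node they will typically print as unset.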

5. Use the template for a real submission (10 min)

cp ~/templates/diagnostic.slurm ~/hpc_practicum/lab10/mem.slurm
cd ~/hpc_practicum/lab10

Edit mem.slurm:
- --job-name=lab10_mem
- --time=00:30:00 (more than enough for this script)
- --cpus-per-task=1 (the script is mostly single-threaded NumPy)
- --mem=4G (take the peak you measured in Task 3 and add headroom; 4G leaves a comfortable margin for this script)
- --mail-user=yourname@osu.edu
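The "measured + ~30% headroom" rule from the template comment can be written down as arithmetic. A minimal sketch: recommend_mem_gib is a hypothetical helper, rounding up to whole GiB is an assumption made for readable --mem values, and the input is the example number from Task 3, not a real measurement.

```python
import math


def recommend_mem_gib(peak_kbytes, headroom=0.30):
    """Measured peak (kbytes from /usr/bin/time -v) plus headroom, rounded up to whole GiB."""
    peak_gib = peak_kbytes / 1024 / 1024
    return max(1, math.ceil(peak_gib * (1 + headroom)))


peak = 1234567  # example value from Task 3, not a real measurement
print(f"--mem={recommend_mem_gib(peak)}G")                # --mem=2G
print(f"--mem={recommend_mem_gib(peak, headroom=1.0)}G")  # --mem=3G
```

The 4G used in this task is deliberately more generous than the 30% rule. For a first submission that is a reasonable trade: a slightly low Memory Efficiency costs less than a job killed for running out of memory.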

Submit:

sbatch mem.slurm

When it finishes, examine the log:

cat logs/lab10_mem-<jobid>.out | head -40            # vital-signs block
grep -A 30 "Command being timed" logs/lab10_mem-<jobid>.out | head -35
seff <jobid>

In the /usr/bin/time -v output, find:
- Maximum resident set size (kbytes): convert to GB
- Elapsed (wall clock) time: actual runtime

In seff:
- Memory Efficiency: should be reasonable (>25%, ideally >50%) if you chose --mem=4G correctly

6. (Optional) Add real-time GPU tracking (5 min — skip if you don’t have GPU access)

The template already includes the GPU-usage logger (Task 4). If you ran a GPU job, you’d find logs/gpu-<jobid>.log with nvidia-smi samples every 30 seconds. You’ll use this in Lab 12.

Deliverables

Save to lab10/ in your personal repo:

lab10/mem_intensive.py — the instrumented script from Task 2.
lab10/diagnostic.slurm — a copy of your template (the one in ~/templates/diagnostic.slurm). Redact any real email if you have --mail-user.
lab10/mem.slurm — the project-specific version you used in Task 5.
lab10/job_log.txt — the full log from your real Slurm submission. Should include:
- Vital-signs block at the top
- All five psutil checkpoints
- The /usr/bin/time -v output block (with peak RSS in kbytes)
lab10/right_sized.md — a short writeup:
- What was the peak RSS reported by /usr/bin/time -v? (Convert to GB.)
- What was Memory Efficiency from seff?
- If you ran this 100 times in production, what --mem would you settle on, and why?

Self-check

You can find “Maximum resident set size” in /usr/bin/time -v output and convert kbytes to GB without help
Your diagnostic.slurm template exports OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK etc.
You used the template for a real job and got useful diagnostic output
Your psutil checkpoints’ peak roughly agrees with /usr/bin/time -v’s peak RSS

Common issues

❌ `/usr/bin/time -v` not found — only the bash builtin

On Unity it should be there as /usr/bin/time. If not, install via:

mamba install -n eslab time

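A mamba-installed copy lives in the environment’s bin directory rather than in /usr/bin, so call it as command time -v. The bare word time is the bash keyword, which doesn’t have -v. Wherever /usr/bin/time exists, keep using that explicit path.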

❌ `psutil` not installed

mamba activate eslab
mamba install psutil

If you skipped psutil in Lab 05, add it now.

❌ The `/usr/bin/time -v` output goes to stderr and mixes with my Python output

That’s normal — /usr/bin/time -v writes to stderr. Slurm captures both stdout and stderr into your --output= file, so it all ends up in the log. To split them in a more advanced setup, use --output= and --error= to separate files.

❌ `psutil` peak RSS is much smaller than `/usr/bin/time -v` peak RSS

That can happen if your code allocates and quickly releases memory between the psutil checkpoints. /usr/bin/time -v captures the all-time peak; psutil only captures what was alive when you called it. To improve, add more checkpoints, or sample psutil periodically in a background thread.
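Another option is to have each checkpoint report the process's high-water mark as well as the current value. The sketch below swaps in the standard-library resource module in place of psutil for that purpose; peak_rss_gb is a hypothetical helper, and the unit handling assumes Linux (kilobytes) or macOS (bytes).

```python
import resource
import sys


def peak_rss_gb():
    """High-water-mark RSS of this process so far, in GB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes; macOS reports bytes.
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / 1e9


before = peak_rss_gb()
scratch = b"x" * 200_000_000  # ~0.2 GB, written on creation so the pages are resident
del scratch                   # released again before the next reading
after = peak_rss_gb()

print(f"peak before: {before:.2f} GB, peak after: {after:.2f} GB")
```

The second reading stays high even though scratch has been deleted, which is exactly the information a current-RSS checkpoint misses. Printing this value from memreport() alongside the psutil figure would show both.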

❌ Memory Efficiency from `seff` is 99% but the job didn’t OOM

A job is killed only when it exceeds its memory limit, not when it comes close. --mem=4G reserves 4096 MiB, so a 4080 MiB peak is 99% efficiency and still under the limit: fine, no kill. It does leave almost no margin for run-to-run variation, though. With --mem=5G the same peak would be ~80%; pick the level you want.
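The percentages in this example follow from a single division. A minimal sketch, assuming Memory Efficiency is peak usage divided by requested memory; memory_efficiency is a hypothetical helper, and seff's own rounding may differ slightly.

```python
def memory_efficiency(peak_mib, requested_gib):
    """Peak usage as a percentage of the request (request in GiB, as in --mem=4G)."""
    return 100 * peak_mib / (requested_gib * 1024)


for requested in (4, 5):
    pct = memory_efficiency(4080, requested)
    print(f"--mem={requested}G  ->  {pct:.1f}%")
```

Running the same calculation on your own peak for a few candidate --mem values is a quick way to pick a request before resubmitting.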

Time estimate

Reading: ~25 min
Tasks: ~50 min (mostly running things and reading their output)
Deliverables: ~15 min

Total: ~1.5 hours

Extensions (optional)

Sample `psutil` periodically in a background thread

For better continuous tracking, set up a sampler:

import threading, time, psutil, os

def memlog(interval=5):
    proc = psutil.Process(os.getpid())
    while True:
        rss_gb = proc.memory_info().rss / 1e9
        print(f"[bg-sampler]  RSS: {rss_gb:.2f} GB  at {time.strftime('%H:%M:%S')}", flush=True)
        time.sleep(interval)

threading.Thread(target=memlog, daemon=True).start()

This logs memory every 5 seconds throughout the run, without needing to scatter manual checkpoints.

Try `memray` for fancier profiling

memray is a more powerful memory profiler. Install with mamba install -c conda-forge memray, then:

memray run --output mem.bin mem_intensive.py
memray flamegraph mem.bin

Produces an HTML flamegraph showing where memory accumulates by call site.

Use `sstat` for live tracking during a running job

While a Slurm job is in R state:

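sstat -j <jobid>.batch --format=JobID,MaxRSS,AveRSS,MaxVMSize

Returns the peak resident set size recorded so far (MaxRSS) while the job is still running. The .batch suffix selects the batch step, which is where a plain sbatch script’s processes run; without it, sstat looks for an srun step and may print nothing. Useful for catching a job that’s about to OOM before it dies.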
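To compare a MaxRSS value with your --mem request, it helps to convert it to GiB. This is a sketch under stated assumptions: slurm_mem_to_gib is a hypothetical helper, and it assumes values carry a K, M, G or T suffix (1024-based) or, with no suffix, are plain bytes. Check the format your Slurm version actually prints before relying on it.

```python
# KiB per unit; the suffix set and the "no suffix means bytes" rule are assumptions.
KIB_PER_UNIT = {"": 1 / 1024, "K": 1, "M": 1024, "G": 1024**2, "T": 1024**3}


def slurm_mem_to_gib(value):
    """Convert a memory string such as '1048576K' or '1.5G' to GiB."""
    value = value.strip()
    if not value:
        raise ValueError("empty memory string")
    suffix = value[-1].upper() if value[-1].isalpha() else ""
    number = float(value[:-1]) if suffix else float(value)
    return number * KIB_PER_UNIT[suffix] / 1024**2


for raw in ("1048576K", "512M", "1.5G"):
    print(f"{raw:>10s} -> {slurm_mem_to_gib(raw):.2f} GiB")
```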

What’s next?

You can now measure resources. Lab 11 — Job arrays for many independent tasks scales the same techniques to processing many input files in parallel via a single submission.