Capstone — Reproducible HPC repo

Note

Status: stub. Full content still to be expanded — what’s below is the structure.

Goal

Over the final 3 weeks of the course, take a real problem from your own research and package it as a complete, reproducible HPC repo. The deliverable is a public (or course-private) GitHub repo that someone else with a Unity or OSC account could clone, run, and use to reproduce your results.

This is the capstone where everything from the previous 12 labs comes together.


What counts as a project

A good capstone has three properties:

  1. It does real work. Not a “Hello World” — something you might actually run for your research, even if scaled down. Examples: training a classifier on a subset of your real dataset, running a parameter sweep over a small simulation, computing derived data products from a satellite-imagery file, fitting a statistical model.

  2. It uses cluster resources non-trivially. At least one of: a multi-hour batch job, a job array, or a GPU job. Pure laptop-scale work doesn’t count.

  3. It produces a verifiable artifact. A figure, a trained model file, a results CSV, a derived dataset — something a reader could compare against to verify reproducibility.

If you don’t have a clean fit from your own research, choose from the suggested capstone defaults listed near the end of this page.


Week-by-week structure

Week 13 — Scope and set up

  • Reading: No new reading. Re-read Slurm Best Practices §11 (the pre/post-submission checklist).
  • Tasks:
    1. Write a 1-page project proposal (PROPOSAL.md in your repo): the scientific question, the data, the method, expected resource needs, the deliverable.
    2. Initialize a Git repo (pushed to GitHub, or local-only for now). Add a .gitignore covering Python noise + conda envs.
    3. Write your environment.yml based on what you learned in Labs 5–6.
    4. Set up directory structure: code/, data/ (or pointers to /fs/project/.../data/), slurm/, logs/, results/.
  • Deliverable: a repo containing PROPOSAL.md, environment.yml, .gitignore, and the directory skeleton. Git does not track empty directories, so put a placeholder file such as .gitkeep in each one.
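
The Week 13 tasks can be scaffolded in one pass. The following is a minimal sketch, not a required layout: the repo name `my-capstone`, the package list in `environment.yml`, and the ignore patterns are assumptions to replace with your own, and the environment name `myproject` is borrowed from the README template on this page.

```bash
#!/usr/bin/env bash
# Scaffold the Week 13 skeleton. REPO_DIR and the package list are placeholders.
set -euo pipefail

REPO_DIR="${REPO_DIR:-my-capstone}"   # hypothetical repo name

# Directory skeleton from the task list; .gitkeep lets git track empty dirs.
for d in code data slurm logs results; do
    mkdir -p "${REPO_DIR}/${d}"
    touch "${REPO_DIR}/${d}/.gitkeep"
done

# Ignore Python noise and locally created conda environments.
cat > "${REPO_DIR}/.gitignore" <<'EOF'
__pycache__/
*.py[cod]
.ipynb_checkpoints/
env/
envs/
EOF

# Minimal environment.yml; list the packages and versions you actually use.
cat > "${REPO_DIR}/environment.yml" <<'EOF'
name: myproject
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
EOF

# Proposal stub with one heading per item the task asks for.
cat > "${REPO_DIR}/PROPOSAL.md" <<'EOF'
# Proposal
## Scientific question
## Data
## Method
## Expected resource needs
## Deliverable
EOF

echo "Skeleton created in ${REPO_DIR}"
```

After running it, `cd` into the directory, run `git init`, and commit. If your data lives under /fs/project/.../data/, record that path in the README rather than copying data into `data/`.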

Week 14 — Implement and run

  • Tasks:
    1. Write the code that does the work — Python, R, or whatever fits.
    2. Write Slurm scripts using the diagnostic.slurm template from Lab 10.
    3. Submit a test run on a small subset first (Lab 9 / Lab 10 measurement habit).
    4. Use seff to right-size, then submit the full run.
    5. Capture the outputs.
  • Deliverable: working code + Slurm scripts + outputs from a real run, committed to your repo.
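
The test-run-then-full-run loop above might look like the sketch below. It is not the Lab 10 diagnostic.slurm template (which is not reproduced here), only a bare-bones stand-in. The resource values, the `SUBSET` switch, `code/main.py`, and its flags are all hypothetical placeholders. It also assumes `mamba` is already on the PATH inside the job; on your cluster you may need a `module load` line first.

```bash
#!/usr/bin/env bash
# Write a minimal batch script, then (on a cluster) submit a short test run.
set -euo pipefail

mkdir -p slurm logs   # Slurm does not create the --output directory for you

# Quoted 'EOF' keeps $VARS literal so they expand at job run time, not now.
cat > slurm/main_run.slurm <<'EOF'
#!/usr/bin/env bash
#SBATCH --job-name=capstone
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=logs/%x_%j.out

set -euo pipefail

# SUBSET is a hypothetical switch: "small" for the test run, "full" otherwise.
SUBSET="${SUBSET:-full}"
echo "Job ${SLURM_JOB_ID:-local} on $(hostname), subset=${SUBSET}"

# code/main.py and its flags are placeholders for your own entry point.
mamba run -n myproject python code/main.py \
    --subset "${SUBSET}" \
    --threads "${SLURM_CPUS_PER_TASK:-1}" \
    --outdir results/
EOF

bash -n slurm/main_run.slurm   # syntax check only; nothing is executed

if command -v sbatch >/dev/null 2>&1; then
    # Command-line options override the #SBATCH lines in the script.
    jobid=$(sbatch --parsable --time=00:15:00 \
        --export=ALL,SUBSET=small slurm/main_run.slurm)
    echo "Submitted test job ${jobid}; once it finishes, run: seff ${jobid}"
else
    echo "sbatch not found; submit slurm/main_run.slurm from a login node"
fi
```

Once `seff` reports on the test job, adjust `--cpus-per-task`, `--mem`, and `--time` in the script to match what the job really used, then submit the full run with a plain `sbatch slurm/main_run.slurm` from the repo root.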

Week 15 — Document and share

  • Tasks:
    1. Write a thorough README.md (template below).
    2. Optionally: produce a conda-lock.yml for byte-exact reproducibility.
    3. Push to GitHub (public or course-private).
    4. (Stretch) Have a peer or your PI clone and re-run from scratch to verify reproducibility.
  • Deliverable: the repo URL.
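
For the peer re-run in task 4, one way to make "did it reproduce?" concrete is a checksum manifest of the result files. The sketch below assumes GNU coreutils (`sha256sum`, `sort -z`, `xargs -r`) and a `results/` directory; the file name `SHA256SUMS` and both function names are made up for this example. Bit-identical outputs are not guaranteed for every workload (floating-point results can differ across hardware, and GPU jobs are often non-deterministic), so treat a mismatch as a prompt to compare values within a tolerance rather than as proof of failure.

```bash
#!/usr/bin/env bash
# Record and verify checksums of result files (sketch; names are placeholders).
set -euo pipefail

RESULTS_DIR="${RESULTS_DIR:-results}"

# Author side: write one checksum line per result file, in a stable order.
write_manifest() {
    ( cd "${RESULTS_DIR}" &&
      find . -type f ! -name SHA256SUMS ! -name .gitkeep -print0 |
      sort -z | xargs -0 -r sha256sum > SHA256SUMS )
}

# Peer side: after re-running the pipeline, compare files to the manifest.
# Prints only mismatches and returns non-zero if any file differs.
verify_manifest() {
    ( cd "${RESULTS_DIR}" && sha256sum --check --quiet SHA256SUMS )
}
```

The author calls `write_manifest` after the full run and commits `results/SHA256SUMS`. A peer who has cloned the repo and re-run the job calls `verify_manifest` and checks the exit status.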

README.md template

# <Project Title>

One-paragraph elevator pitch of what this repo does.

## Setup

```bash
mamba env create -f environment.yml
mamba activate myproject
```

For byte-exact reproducibility:
```bash
conda-lock install --name myproject conda-lock.yml
```

## Data

Where the input data lives (`/fs/project/<group>/...` if it's too big to commit; a download script otherwise).

## Run

```bash
sbatch slurm/main_run.slurm
```

Expected resources: `--cpus-per-task=N`, `--mem=NG`, `--time=HH:MM:SS`.
Expected runtime: about N minutes on Unity.

## Results

Where the outputs land. What figures / CSVs / models to expect.

## Cluster notes

Anything Unity- or OSC-specific worth knowing.

Grading rubric (for self/peer assessment)

| Dimension | What “great” looks like |
|---|---|
| Scientific question | Clearly stated; meaningful for your research |
| Reproducibility | mamba env create + sbatch slurm/main_run.slurm should reproduce the full pipeline |
| Resource right-sizing | seff shows >70% CPU efficiency and >70% memory efficiency on the main job |
| Documentation | A new student in your lab could pick this up tomorrow |
| Use of the course content | At least one of: tmux/livenode for the dev cycle, mamba env shared with collaborators, job array, GPU usage, etc. |
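
The right-sizing row can be checked mechanically. The sketch below assumes that `seff` prints lines shaped like `CPU Efficiency: NN.NN% of ...` and `Memory Efficiency: NN.NN% of ...`; verify that against the output on your cluster before relying on it. The function names are invented for this example.

```bash
#!/usr/bin/env bash
# Check seff-style efficiency lines against a threshold (sketch).
set -euo pipefail

# Read a report on stdin and print the percentage from the line whose
# label (the text before ": ") matches exactly, without the % sign.
efficiency() {
    local label="$1"
    awk -F': ' -v label="${label}" '
        $1 == label { sub(/%.*/, "", $2); print $2; exit }
    '
}

# Succeed only when the named efficiency is strictly above the threshold.
meets_threshold() {
    local label="$1" threshold="$2" value
    value=$(efficiency "${label}")
    [ -n "${value}" ] &&
        awk -v v="${value}" -v t="${threshold}" 'BEGIN { exit !(v > t) }'
}
```

Typical use after the main job finishes, with `<jobid>` standing for your own job ID: capture the report once with `report=$(seff <jobid>)`, then run `meets_threshold "CPU Efficiency" 70 <<<"$report"` and `meets_threshold "Memory Efficiency" 70 <<<"$report"`.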

Suggested capstone defaults (if your own research doesn’t fit)

  1. Satellite imagery downscaling: take a small ArcticDEM tile, run a per-pixel transformation in a job array, produce a derived data product.
  2. Hyperparameter sweep on a public dataset: train a small ML model on a UCI or scikit-learn built-in dataset across a sweep of hyperparameter settings, and report the best configuration.
  3. Numerical experiment: sweep a parameter of a small ODE / PDE / Monte Carlo solver across many values, plot the convergence behavior.
  4. Text data pipeline: download a corpus, tokenize, count, produce a frequency analysis — across an array of categories.
  5. Image segmentation on a small dataset: train a U-Net or similar on a small image set (or fine-tune a pretrained model).

(More suggestions will be added based on student/department interest. Email the course maintainers with proposals.)
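
Several of these defaults (the per-tile transformation, the parameter sweep, the per-category text pipeline) fit naturally into a job array, where each array task handles one item. A minimal sketch of the index-to-parameter mapping follows. The file `slurm/params.txt`, the script `slurm/sweep.slurm`, `code/main.py`, and its flags are hypothetical names used only for illustration.

```bash
#!/usr/bin/env bash
# Map a Slurm array index to one line of a parameter file (sketch).
set -euo pipefail

# Print line number INDEX (1-based) of FILE; fail if that line is missing.
param_for_task() {
    local file="$1" index="$2" line
    line=$(sed -n "${index}p" "${file}")
    [ -n "${line}" ] || { echo "no line ${index} in ${file}" >&2; return 1; }
    printf '%s\n' "${line}"
}

# Inside an array job submitted with, e.g.:  sbatch --array=1-4 slurm/sweep.slurm
# each task would pick its own value like this:
#   PARAM=$(param_for_task slurm/params.txt "${SLURM_ARRAY_TASK_ID}")
#   mamba run -n myproject python code/main.py --param "${PARAM}" \
#       --out "results/run_${SLURM_ARRAY_TASK_ID}.csv"
```

Keeping the parameter list in a committed text file, one value per line, means the array range and the sweep definition live in the repo, so a reader can see exactly which runs produced which result file.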


Submission

Submit your repo URL as the final deliverable. If the repo is private, give the course maintainers (or your PI) read access. We don’t need a write-up beyond your README.md and PROPOSAL.md: the repo IS the write-up.