Capstone — Reproducible HPC repo
Status: stub. Full content still to be expanded — what’s below is the structure.
Goal
Over the final 3 weeks of the course, take a real problem from your own research and package it as a complete, reproducible HPC repo. The deliverable is a public (or course-private) GitHub repo that someone else with a Unity or OSC account could clone, run, and reproduce.
This is the capstone where everything from the previous 12 labs comes together.
What counts as a project
A good capstone has three properties:
- It does real work. Not a “Hello World”, but something you might actually run for your research, even if scaled down. Examples: training a classifier on a subset of your real dataset, running a parameter sweep over a small simulation, computing derived data products from a satellite-imagery file, fitting a statistical model.
- It uses cluster resources non-trivially. At least one of: a multi-hour batch job, a job array, or a GPU job. Pure laptop-scale work doesn’t count.
- It produces a verifiable artifact. A figure, a trained model file, a results CSV, or a derived dataset: something a reader could compare against to verify reproducibility.
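One way to make the artifact checkable is to commit a checksum manifest next to it. The sketch below is only an illustration: the `results/summary.csv` file and the `results.sha256` manifest name are hypothetical, and the stand-in CSV exists purely so the example runs end to end. Bitwise comparison only works for deterministic pipelines; floating-point or GPU results may need a tolerance-based comparison instead.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical locations; adjust to your repo layout.
RESULTS_DIR="${RESULTS_DIR:-results}"
MANIFEST="${MANIFEST:-results.sha256}"

mkdir -p "${RESULTS_DIR}"

# Stand-in artifact so this sketch runs end to end.
# In a real project the pipeline itself writes this file.
if [ ! -e "${RESULTS_DIR}/summary.csv" ]; then
    printf 'param,score\n0.1,0.93\n' > "${RESULTS_DIR}/summary.csv"
fi

# Author side: record one checksum per file under the results directory.
# The manifest lives outside that directory so it does not list itself.
find "${RESULTS_DIR}" -type f -print0 | sort -z | xargs -0 sha256sum > "${MANIFEST}"

# Reader side: after re-running the pipeline, confirm the files still match.
# Exits non-zero and names the offending file if anything differs.
sha256sum --check --quiet "${MANIFEST}"
```

Committing the manifest lets a reader run only the last command after reproducing the pipeline.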
If you don’t have a clean fit from your own research, choose from the suggested capstone defaults near the bottom of this page.
Week-by-week structure
Week 13 — Scope and set up
- Reading: No new reading. Re-read Slurm Best Practices §11 (the pre/post-submission checklist).
- Tasks:
  - Write a 1-page project proposal (`PROPOSAL.md` in your repo): the scientific question, the data, the method, expected resource needs, the deliverable.
  - Initialize a GitHub repo (or local-only for now). Add a `.gitignore` covering Python noise + conda envs.
  - Write your `environment.yml` based on what you learned in Labs 5–6.
  - Set up directory structure: `code/`, `data/` (or pointers to `/fs/project/.../data/`), `slurm/`, `logs/`, `results/`.
- Deliverable: `PROPOSAL.md`, `environment.yml`, `.gitignore`, an empty repo with the directory skeleton.
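The Week 13 setup can be scripted in a few lines. This is a minimal sketch, not a required layout script: the `REPO_ROOT` variable and the specific `.gitignore` patterns are assumptions, and `environment.yml` is left for you to write by hand.

```shell
#!/usr/bin/env bash
set -euo pipefail

# REPO_ROOT is a hypothetical variable; it defaults to the current directory.
REPO_ROOT="${REPO_ROOT:-.}"

# Directories named in the Week 13 task list.
for d in code data slurm logs results; do
    mkdir -p "${REPO_ROOT}/${d}"
    # Git does not track empty directories, so add a placeholder file.
    touch "${REPO_ROOT}/${d}/.gitkeep"
done

# Starter .gitignore covering Python noise and conda envs.
# The exact patterns are assumptions; existing files are left untouched.
if [ ! -e "${REPO_ROOT}/.gitignore" ]; then
    cat > "${REPO_ROOT}/.gitignore" <<'EOF'
__pycache__/
*.pyc
.ipynb_checkpoints/
envs/
.conda/
EOF
fi

# Empty placeholder for the proposal, to be filled in this week.
touch "${REPO_ROOT}/PROPOSAL.md"
```

Run it once from the top of a fresh clone, then commit the result as the skeleton.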
Week 14 — Implement and run
- Tasks:
  - Write the code that does the work: Python, R, or whatever fits.
  - Write Slurm scripts using the `diagnostic.slurm` template from Lab 10.
  - Submit a test run on a small subset first (Lab 9 / Lab 10 measurement habit).
  - Use `seff` to right-size, then submit the full run.
  - Capture the outputs.
- Deliverable: working code + Slurm scripts + outputs from a real run, committed to your repo.
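The test-then-full-run cycle might look like the batch script below. It is a sketch under stated assumptions and is not the Lab 10 `diagnostic.slurm` template: the resource values are placeholders, `SUBSET_FRACTION` and the `results/run_info_*.txt` file are hypothetical names, and `sleep 1` stands in for your real workload. The fallbacks for the `SLURM_*` variables let you dry-run it with plain `bash` before submitting.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=capstone-test
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=00:15:00
#SBATCH --output=logs/%x_%j.out
# Resource values above are placeholders to be right-sized after the test run.
# Slurm does not create logs/ for you; it must exist before you submit.
set -euo pipefail

# SUBSET_FRACTION is a hypothetical knob for the small test run.
SUBSET_FRACTION="${SUBSET_FRACTION:-0.05}"
# Fall back to "local" so the script also runs outside Slurm as a dry run.
JOB_TAG="${SLURM_JOB_ID:-local}"

mkdir -p results
start=$(date +%s)

# Replace this stand-in with your real command, for example a hypothetical
#   python code/main.py --subset "${SUBSET_FRACTION}" --out results/
sleep 1

end=$(date +%s)

# Record what ran so the numbers can be compared against seff later.
{
    echo "job=${JOB_TAG}"
    echo "host=$(hostname)"
    echo "cpus=${SLURM_CPUS_PER_TASK:-1}"
    echo "subset_fraction=${SUBSET_FRACTION}"
    echo "elapsed_seconds=$((end - start))"
} > "results/run_info_${JOB_TAG}.txt"
```

Assuming the script is saved as `slurm/test_run.slurm` (a hypothetical name), a test submission could be `sbatch --export=ALL,SUBSET_FRACTION=0.05 slurm/test_run.slurm`, followed by `seff <jobid>` once the job finishes to decide the `--mem`, `--cpus-per-task`, and `--time` values for the full run.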
README.md template
# <Project Title>
One-paragraph elevator pitch of what this repo does.
## Setup
```bash
mamba env create -f environment.yml
mamba activate myproject
```
To install the exact package builds pinned in the lock file:
```bash
conda-lock install --name myproject conda-lock.yml
```
## Data
Where the input data lives (`/fs/project/<group>/...` if it's too big to commit; a download script otherwise).
## Run
```bash
sbatch slurm/main_run.slurm
```
Expected resources: `--cpus-per-task=N`, `--mem=NG`, `--time=HH:MM:SS`.
Expected runtime: about N minutes on Unity.
## Results
Where the outputs land. What figures / CSVs / models to expect.
## Cluster notes
Anything Unity- or OSC-specific worth knowing.
Grading rubric (for self/peer assessment)
| Dimension | What “great” looks like |
|---|---|
| Scientific question | Clearly stated; meaningful for your research |
| Reproducibility | `mamba env create` + `sbatch slurm/main_run.slurm` should reproduce the full pipeline |
| Resource right-sizing | `seff` shows >70% CPU efficiency and >70% memory efficiency on the main job |
| Documentation | A new student in your lab could pick this up tomorrow |
| Use of the course content | At least one of: tmux/livenode for the dev cycle, mamba env shared with collaborators, job array, GPU usage, etc. |
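The right-sizing row of the rubric can be checked mechanically. The helper below is a sketch: `check_efficiency` is a hypothetical function name, and it assumes `seff` prints lines of the form `CPU Efficiency: 80.00% of ...` and `Memory Efficiency: 75.00% of ...`, which is the usual layout but may differ on your cluster's version.

```shell
#!/usr/bin/env bash

# Reads seff output on stdin and succeeds only if both CPU and memory
# efficiency exceed the threshold given as $1 (default: 70 percent).
check_efficiency() {
    awk -v min="${1:-70}" '
        /^CPU Efficiency:/    { cpu = $3; sub(/%/, "", cpu) }
        /^Memory Efficiency:/ { mem = $3; sub(/%/, "", mem) }
        END {
            # Exit code 2 signals that the expected lines were not found.
            if (cpu == "" || mem == "") {
                print "could not parse seff output"
                exit 2
            }
            printf "cpu=%s%% mem=%s%% threshold=%s%%\n", cpu, mem, min
            ok = (cpu + 0 > min + 0 && mem + 0 > min + 0)
            exit (ok ? 0 : 1)
        }'
}

# Example usage on a cluster (job ID is a placeholder):
#   seff <jobid> | check_efficiency 70
```

A non-zero exit status means at least one of the two numbers is at or below the threshold, or the output could not be parsed.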
Suggested capstone defaults (if your own research doesn’t fit)
- Satellite imagery downscaling: take a small ArcticDEM tile, run a per-pixel transformation in a job array, produce a derived data product.
- Hyperparameter sweep on a public dataset: train a small ML model on UCI or sklearn-built-in datasets across a sweep, find the best config.
- Numerical experiment: sweep a parameter of a small ODE / PDE / Monte Carlo solver across many values, plot the convergence behavior.
- Text data pipeline: download a corpus, tokenize, count, produce a frequency analysis — across an array of categories.
- Image segmentation on a small dataset: train a U-Net or similar on a small image set (or fine-tune a pretrained model).
(More suggestions will be added based on student/department interest. Email the course maintainers with proposals.)
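Several of these defaults are sweeps, which map naturally onto a job array. The sketch below shows one common pattern, where each array task reads its own line from a parameter file. Everything project-specific here is an assumption: the `slurm/params.txt` path, the two-column learning-rate and batch-size layout, and the `results/config_<N>.txt` output are hypothetical, and the demo parameter file is written only so the example runs on its own.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=capstone-sweep
#SBATCH --array=1-4
#SBATCH --output=logs/%x_%A_%a.out
set -euo pipefail

# Hypothetical parameter file: one configuration per line.
PARAMS_FILE="${PARAMS_FILE:-slurm/params.txt}"

# Demo file with four configurations so the sketch is self-contained.
if [ ! -e "${PARAMS_FILE}" ]; then
    mkdir -p "$(dirname "${PARAMS_FILE}")"
    printf '0.1 32\n0.1 64\n0.01 32\n0.01 64\n' > "${PARAMS_FILE}"
fi

# Slurm sets SLURM_ARRAY_TASK_ID per task; default to 1 for a local dry run.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"

# Pick the line whose number matches this task's index.
PARAMS="$(sed -n "${TASK_ID}p" "${PARAMS_FILE}")"
if [ -z "${PARAMS}" ]; then
    echo "no parameters on line ${TASK_ID} of ${PARAMS_FILE}" >&2
    exit 1
fi
read -r LEARNING_RATE BATCH_SIZE <<< "${PARAMS}"

# Stand-in for the real work: record which configuration this task owns.
mkdir -p results
echo "task=${TASK_ID} lr=${LEARNING_RATE} batch=${BATCH_SIZE}" > "results/config_${TASK_ID}.txt"
```

Keeping the array range in `--array` equal to the number of lines in the parameter file means every configuration gets exactly one task.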
Submission
Submit your repo URL as the final deliverable. If the repo is private, give the course maintainers (or your PI) read access. We don’t need a write-up beyond your README.md and PROPOSAL.md; the repo IS the write-up.