📖 Review the Lecture notes

Lab 10 - Satellite Image Segmentation

In this lab, you will be guided though the process of segmenting multispectral satellite imagery with different ML and DL algorithms. Specifically, the objective is to label each pixel in satellite images as either land or water. Remember, that is called an image segmentation task.

The lab proceeds along the following steps:

Generate suitable training data.

Download multispectral Landsat imagery that has already been clipped and labeled for your convenience.
Organize into training, validation, and test sets.
Think about pre-processing your data for better performance.
Add some visualization tools to inspect your data right in this notebook.

Classify with the very simplest “black box” scikit-learn linear logistic regression library.
Classify with your own linear logistic regression, coded from scratch without “black box” unknowns.
Classify with a lightweight version of a U-Net Convolutional Neural Network, which is more or less the state-of-the-art for this problem.

Install and import required packages

From earlier labs, you remember how to import python libraries into a python script. To do so, the libraries already need to be installed on your system. For this lab, we will need two libraries that are not installed in google colab by default. Fortunately, jupyter has a convenient way to do so right inside a notebook. The below are basic bash/linux commands that jupyter can intepret by preceding with an exclamation point.

Code

!pip install rasterio
!pip install  rioxarray
!pip -q install contextily pyproj

Collecting rasterio

  Downloading rasterio-1.4.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.1 kB)

Collecting affine (from rasterio)

  Downloading affine-2.4.0-py3-none-any.whl.metadata (4.0 kB)

Requirement already satisfied: attrs in /usr/local/lib/python3.12/dist-packages (from rasterio) (25.4.0)

Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from rasterio) (2025.10.5)

Requirement already satisfied: click>=4.0 in /usr/local/lib/python3.12/dist-packages (from rasterio) (8.3.0)

Collecting cligj>=0.5 (from rasterio)

  Downloading cligj-0.7.2-py3-none-any.whl.metadata (5.0 kB)

Requirement already satisfied: numpy>=1.24 in /usr/local/lib/python3.12/dist-packages (from rasterio) (2.0.2)

Collecting click-plugins (from rasterio)

  Downloading click_plugins-1.1.1.2-py2.py3-none-any.whl.metadata (6.5 kB)

Requirement already satisfied: pyparsing in /usr/local/lib/python3.12/dist-packages (from rasterio) (3.2.5)

Downloading rasterio-1.4.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (22.3 MB)

   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22.3/22.3 MB 40.0 MB/s eta 0:00:00

Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)

Downloading affine-2.4.0-py3-none-any.whl (15 kB)

Downloading click_plugins-1.1.1.2-py2.py3-none-any.whl (11 kB)

Installing collected packages: cligj, click-plugins, affine, rasterio

Successfully installed affine-2.4.0 click-plugins-1.1.1.2 cligj-0.7.2 rasterio-1.4.3

Collecting rioxarray

  Downloading rioxarray-0.19.0-py3-none-any.whl.metadata (5.5 kB)

Requirement already satisfied: packaging in /usr/local/lib/python3.12/dist-packages (from rioxarray) (25.0)

Requirement already satisfied: rasterio>=1.4.3 in /usr/local/lib/python3.12/dist-packages (from rioxarray) (1.4.3)

Requirement already satisfied: xarray>=2024.7.0 in /usr/local/lib/python3.12/dist-packages (from rioxarray) (2025.10.1)

Requirement already satisfied: pyproj>=3.3 in /usr/local/lib/python3.12/dist-packages (from rioxarray) (3.7.2)

Requirement already satisfied: numpy>=1.23 in /usr/local/lib/python3.12/dist-packages (from rioxarray) (2.0.2)

Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from pyproj>=3.3->rioxarray) (2025.10.5)

Requirement already satisfied: affine in /usr/local/lib/python3.12/dist-packages (from rasterio>=1.4.3->rioxarray) (2.4.0)

Requirement already satisfied: attrs in /usr/local/lib/python3.12/dist-packages (from rasterio>=1.4.3->rioxarray) (25.4.0)

Requirement already satisfied: click>=4.0 in /usr/local/lib/python3.12/dist-packages (from rasterio>=1.4.3->rioxarray) (8.3.0)

Requirement already satisfied: cligj>=0.5 in /usr/local/lib/python3.12/dist-packages (from rasterio>=1.4.3->rioxarray) (0.7.2)

Requirement already satisfied: click-plugins in /usr/local/lib/python3.12/dist-packages (from rasterio>=1.4.3->rioxarray) (1.1.1.2)

Requirement already satisfied: pyparsing in /usr/local/lib/python3.12/dist-packages (from rasterio>=1.4.3->rioxarray) (3.2.5)

Requirement already satisfied: pandas>=2.2 in /usr/local/lib/python3.12/dist-packages (from xarray>=2024.7.0->rioxarray) (2.2.2)

Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas>=2.2->xarray>=2024.7.0->rioxarray) (2.9.0.post0)

Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas>=2.2->xarray>=2024.7.0->rioxarray) (2025.2)

Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas>=2.2->xarray>=2024.7.0->rioxarray) (2025.2)

Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas>=2.2->xarray>=2024.7.0->rioxarray) (1.17.0)

Downloading rioxarray-0.19.0-py3-none-any.whl (62 kB)

   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.2/62.2 kB 2.1 MB/s eta 0:00:00

Installing collected packages: rioxarray

Successfully installed rioxarray-0.19.0

👉 Try it yourself Some linux commands are also standard python commands, like pwd. Do you know any linux/bash commands that are not also python? If so, try it out below with and without the preceding exclamation point.

Code

# Example:
!echo "Hello there!"

# try out some other linux commands if you know any.
# Some are perhaps not allowed for security reasons.

Hello there!

As always, I like to keep all my package imports organized in one place, usually in the top.

Code

!pip install osmnx
# -----------------------------
# Standard library
# -----------------------------
import glob
import io
import os
import random
import tarfile
import time
from pathlib import Path
from secrets import randbits

import requests

# -----------------------------
# Core scientific stack
# -----------------------------
import numpy as np
from packaging import version

# -----------------------------
# Geospatial
# -----------------------------
import geopandas as gpd
import osmnx as ox
from osmnx._errors import InsufficientResponseError
import rasterio
import rasterio as rio          # (alias if you prefer rio over rasterio)
import rioxarray
from pyproj import Transformer
from shapely.geometry import box
import contextily as ctx

# -----------------------------
# Visualization
# -----------------------------
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm, ListedColormap
from matplotlib.gridspec import GridSpec
from matplotlib.patches import Rectangle

# -----------------------------
# Machine learning
# -----------------------------
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# -----------------------------
# Deep learning (PyTorch)
# -----------------------------
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

Download labeled satellite image dataset

You will be using a dataset that is derived from the excellent github project Deepwatermap. Deepwatermap is a custom deep neural network that was trained on about 1TB of labeled Landsat images and can run inference over Landsat-7, Landsat-8, and Sentinel-2 images. It was trained on 6 spectral bands: * B2: Blue * B3: Green * B4: Red * B5: Near Infrared (NIR) * B6: Shortwave Infrared 1 (SWIR1) * B7: Shortwave Infrared 2 (SWIR2) and the input images were clipped to pixel dimensions of around $743 \times 744$ pixels.

A model like that takes day or weeks to train on multiple server-grade GPUs. For your benefit, I have stripped the problem down to something manageable within the scope of a single lab session. As you will see, we will still achieve remarkable accuracy.

As a reminder, for an image segmentation problem, every pixel is an individual feature, multiplied by the number of layers (in this case, spectral bands) for each pixel. Also, remember an important disctinction with the image classification problem that you modeled before. Before, you used logistic regression to classify if an image contained a number 0, …, 9; whether it contained certain fashion items, and, finally, what Land-Use-Land-Cover classes images from the EuroSat database belonged to. For image segmentation, we want to label each pixel as either land or water (in this case). The more general application is Land-Use-Land-Cover (LULC) classification of each pixel into multiple categories (like, water, forest, grass, manmade structures, etc.).

❓ Question: Calculate the number of features in a single image clip with the pixel size as given above. And what is the size of target vector $y$? How does this scale when we train on hundreds or thousands of images?

As you can imagine, the above problem is too large to solve in a short amount of lab time. I’ve therefore generated a stripped-down dataset for you that is constructed as follows. We have 1000 image clips with pixel dimensions of 256 $\times$ 256 and 200 validation clips with the same dimensions. The clips have four spectral layers: red, greed, blue, and near-infrared (RGB-NIR). For each clip, there is a corresponding file that has a water mask. To allow for more sophisticated ML/DL algorithms, each pixel is labeled as 0 for land, and labels 1, 2, 3 for water. Those 1, 2, 3 labels correspond to increasing probability of water (“maybe”, “probably”, “definitely”). When training a model, we can turn this into a simple binary label or even allow for “soft labels” that are continuous between 0 and 1. As a starting point, in the examples below the labels are turned into binary labels on-the-fly.

Finally, the original Landsat spectral data, which are reflectances, were saved as 16-bit integers but to save even more storage and memory space, I have reduced the accuracy to 8-bit integers.

❓ Question: What is the range of possible values for an 8-bit versus a 16-bit integer? What does that translate into if you turn those values into floats between 0 and 1?

In addition to the train/validation sets, there is another folder of clips that are 3x larger in each dimension. After we have finished training the model, we can run inference/predictions on those larger tiles to see what the results look like. Those tiles also have associated ground-truth masks, so we can still compute accuracy/error metrics on those inference results.

Let’s start by downloading our dataset:

Code

%%time
URL = "https://github.com/jmoortgat/AI_Class/releases/download/v2025-08-16/dataset_compact.tar.gz"
with requests.get(URL, stream=True) as r:
    r.raise_for_status()
    with open("dataset_compact.tar.gz", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            if chunk: f.write(chunk)

print("Downloaded.")
with tarfile.open("dataset_compact.tar.gz", "r:gz") as tar:
    tar.extractall()
print("Extracted.")

Downloaded.

/tmp/ipython-input-3783961333.py:10: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
  tar.extractall()

Extracted.

Visualize the file structure of train and validation data. Note, this code is an example of what ChatGPT spit out for me without further edits. It’s just intended to give you a “pretty” overview of the folder and file structures.

Code

DATA_ROOT = Path("dataset_compact")  # change if needed
FILE_EXTS = {".tif", ".tiff"}        # which files to count as “clips”

# ---- helpers ----
def human_bytes(n: int) -> str:
    for unit in ["B","KB","MB","GB","TB"]:
        if n < 1024:
            return f"{n:.0f} {unit}"
        n /= 1024
    return f"{n:.1f} PB"

def count_files_and_size(root: Path, exts=FILE_EXTS) -> tuple[int, int]:
    cnt = 0
    size = 0
    for p in root.rglob("*"):
        if p.is_file() and (not exts or p.suffix.lower() in exts):
            cnt += 1
            try:
                size += p.stat().st_size
            except OSError:
                pass
    return cnt, size

def print_tree(root: Path, max_depth: int = 3, exts=FILE_EXTS):
    root = Path(root)
    if not root.exists():
        raise FileNotFoundError(root)

    def _print_dir(d: Path, prefix: str, depth: int, is_last: bool):
        cnt, size = count_files_and_size(d, exts)
        connector = "└─" if is_last else "├─"
        print(f"{prefix}{connector} 📁 {d.name} — {cnt} files, {human_bytes(size)}")
        if depth >= max_depth:
            return
        subdirs = sorted([p for p in d.iterdir() if p.is_dir()])
        next_prefix = prefix + ("   " if is_last else "│  ")
        for i, sd in enumerate(subdirs):
            _print_dir(sd, next_prefix, depth + 1, i == len(subdirs) - 1)

    # header with totals for the whole dataset
    total_cnt, total_size = count_files_and_size(root, exts)
    print(f"📦 {root.resolve()} — {total_cnt} files, {human_bytes(total_size)}")
    # first level
    subdirs = sorted([p for p in root.iterdir() if p.is_dir()])
    for i, sd in enumerate(subdirs):
        _print_dir(sd, "", 1, i == len(subdirs) - 1)

def quick_split_summary(root: Path, exts=FILE_EXTS):
    root = Path(root)
    def count_dir(p):
        if not (root/p).exists():
            return 0
        return sum(1 for f in (root/p).iterdir() if f.is_file() and f.suffix.lower() in exts)

    summary = {
        "train/images": count_dir("train/images"),
        "train/masks" : count_dir("train/masks"),
        "val/images"  : count_dir("val/images"),
        "val/masks"   : count_dir("val/masks"),
        "eval/images" : count_dir("eval_tiles/images"),
        "eval/masks"  : count_dir("eval_tiles/masks"),
    }
    print("\nSummary (immediate files per leaf folder):")
    for k, v in summary.items():
        print(f"  {k:<20} {v:>6} files")
    print(f"  {'TOTAL (clips counted once)':<20} {sum(summary.values()):>6} files")

# ---- run ----
print_tree(DATA_ROOT, max_depth=3, exts=FILE_EXTS)
quick_split_summary(DATA_ROOT, exts=FILE_EXTS)

📦 /content/dataset_compact — 2500 files, 325 MB
├─ 📁 eval_tiles — 100 files, 78 MB
│  ├─ 📁 images — 50 files, 78 MB
│  └─ 📁 masks — 50 files, 646 KB
├─ 📁 train — 2000 files, 205 MB
│  ├─ 📁 images — 1000 files, 202 MB
│  └─ 📁 masks — 1000 files, 3 MB
└─ 📁 val — 400 files, 41 MB
   ├─ 📁 images — 200 files, 41 MB
   └─ 📁 masks — 200 files, 616 KB

Summary (immediate files per leaf folder):
  train/images           1000 files
  train/masks            1000 files
  val/images              200 files
  val/masks               200 files
  eval/images              50 files
  eval/masks               50 files
  TOTAL (clips counted once)   2500 files

👉 Try it yourself: Use some basic linux commands again to check on these files in a more standard way, i.e. without all the above python code. You could ask an LLM for some linux commands that you would use for this (not all of them work in jupyter, though).

Code

# ! pwd
# !ls /content/dataset_compact/eval_tiles/masks
# !ls /content/dataset_compact/eval_tiles/masks | grep -c .tif
# !du -h /content/dataset_compact

Visualize training image clips and land-water masks (labels)

In the next code snippet we define some functions that random select clips and their associated masks from either the training, validation, or inference/test folders and visualizes them with matplotlib. This is another example of convenient code that was autogenerated for me by ChatGPT in one go. This is just “auxiliary” code that you don’t need to understand. Though, you should be able to recognize much of it from earlier labs and lectures.

Code

def _read_img_rgb(path):
    """Read 4-band image [B2,B3,B4,B5] and return RGB as B4,B3,B2, scaled to [0,1]."""
    with rio.open(path) as src:
        arr = src.read()              # (C,H,W) = (4,H,W)
    arr = np.transpose(arr, (1,2,0))  # (H,W,4)
    rgb = arr[..., [2,1,0]]           # B4,B3,B2 (natural color)
    if rgb.dtype == np.uint8:
        return rgb.astype(np.float32) / 255.0
    lo = np.percentile(rgb, 2, axis=(0,1), keepdims=True)
    hi = np.percentile(rgb, 98, axis=(0,1), keepdims=True)
    return np.clip((rgb - lo) / (hi - lo + 1e-6), 0, 1).astype(np.float32)


def _read_mask(path):
    """Read a mask and clip values to {0,1,2,3}."""
    with rio.open(path) as src:
        m = src.read(1).astype(np.uint8)
    return np.clip(m, 0, 3)


def show_samples(split='train', data_root='dataset_compact', num=3, seed=None, title=False):
    """
    Display random samples (top: RGB, bottom: mask) from a split.
    Valid splits: 'train' | 'val' | 'eval_tiles'
    """
    split_map = {
        'train': ('train/images', 'train/masks'),
        'val': ('val/images', 'val/masks'),
        'eval_tiles': ('eval_tiles/images', 'eval_tiles/masks'),
    }
    if split not in split_map:
        raise ValueError(f"Unknown split '{split}'. Use one of {list(split_map)}.")

    img_dir = f"{data_root}/{split_map[split][0]}"
    msk_dir = f"{data_root}/{split_map[split][1]}"

    files = sorted(glob.glob(f"{img_dir}/*.tif"))
    if not files:
        raise FileNotFoundError(f"No images found in {img_dir}")

    rng = random.Random(seed if seed is not None else randbits(64))
    sel = rng.sample(files, k=min(num, len(files)))

    # Discrete colormap for labels: 0=land (gray), 1–3 water confidence (light→dark blue)
    cmap = ListedColormap(["#8c8c8c", "#a6cee3", "#1f78b4", "#08306b"])
    norm = BoundaryNorm([-0.5, 0.5, 1.5, 2.5, 3.5], cmap.N)

    # Figure with dedicated colorbar column
    fig = plt.figure(figsize=(4*len(sel)+1, 7), constrained_layout=True)
    gs = GridSpec(2, len(sel) + 1, figure=fig, width_ratios=[1]*len(sel) + [0.05])

    axes_img = [fig.add_subplot(gs[0, i]) for i in range(len(sel))]
    axes_msk = [fig.add_subplot(gs[1, i]) for i in range(len(sel))]
    cax = fig.add_subplot(gs[:, -1])

    last_im = None
    for j, img_path in enumerate(sel):
        msk_path = img_path.replace("/images/", "/masks/")
        rgb = _read_img_rgb(img_path)
        msk = _read_mask(msk_path)

        ax_img = axes_img[j]
        ax_img.imshow(rgb)
        ax_img.set_xticks([]); ax_img.set_yticks([])
        ax_img.set_xlabel(img_path.split("/")[-1], fontsize=8, labelpad=4)

        ax_msk = axes_msk[j]
        last_im = ax_msk.imshow(msk, cmap=cmap, norm=norm, interpolation="nearest")
        ax_msk.set_xticks([]); ax_msk.set_yticks([])
        ax_msk.set_xlabel(msk_path.split("/")[-1], fontsize=8, labelpad=4)

    if last_im is not None:
        cbar = fig.colorbar(last_im, cax=cax, ticks=[0,1,2,3])
        cbar.ax.set_yticklabels(["land (0)", "water? (1)", "water (2)", "water (3)"])

    fig.set_constrained_layout_pads(w_pad=0.02, h_pad=0.03, wspace=0.02, hspace=0.03)
    if title:
        fig.suptitle(f"{split} samples (top: RGB B4/B3/B2, bottom: labels 0–3)", y=0.995)
    plt.show()

Have a look at some of the examples. Note the optional parameter “seed”. If you specify the seed, it will always pick the same random clips. If you don’t (or change the value of the seed), then re-running the same code cell / function call, will select different random clips each time. The RGB data are shown in the top and the masks in the bottom row. You’ll see that the labels are still 0 for land, and 1 for “maybe” water, 2 for “probably” water, and 3 for “definitely” water, with a colorbar legend on the right.

Code

# Just running the function will tell you wat inputs it expects / allows for.
show_samples

show_samples
def show_samples(split='train', data_root='dataset_compact', num=3, seed=None, title=False)

/tmp/ipython-input-2193594441.pyDisplay random samples (top: RGB, bottom: mask) from a split.
Valid splits: 'train' | 'val' | 'eval_tiles'

👉 Try it yourself: play with some of the options to get a sense of what our training data looks like.

❓ Question: Do you notice any differences when looking at the train/val data as compared to the eval_tiles data?

Code

# show_samples('val', num=4, seed=123) # reproducible selection
# show_samples('train', num=3)         # new random each run
# show_samples('eval_tiles', num=3, seed=123)          # from eval_tiles

Answer to the previous question: You may have noticed/remembered that the evaluation clips are about $3$ larger than the training/validation clips, so they may show a bit more complexity in the water morphologies than the more “zoomed in” smaller clips. Also, the larger clips have some padding on the right and bottom. This is because the source I used ( https://github.com/giswqs/deep-water-map ) has clips of 743 $\times$ 744 but it is more convenient, or necessary depending on the ML algorithm, if the output is a multiple of our training clips 256 $\times$ 256 (specifically, 3$\times$ in each direction).

Code

show_samples('eval_tiles', num=3, seed=123)          # from eval_tiles

Here’s another function that was intended to also add an Open Street Map (OSM) layer to show where the images are located in the Arctic, but it doesn’t quite work as intended yet (the image is on top of the OSM, so you can’t see it; the idea was for the OSM map to be larger than the satellite image clip). You can skip this (or try to fix it ;) ).

Code

def plot_clip_with_vector_osm(base_dir, feature_types=("water", "roads")):
    """
    Pick a random clip from train/val/eval folders and plot image+mask with OSM overlays.
    """
    # --- pick random clip
    img_dir = os.path.join(base_dir, "images")
    mask_dir = os.path.join(base_dir, "masks")
    img_file = random.choice(sorted(os.listdir(img_dir)))
    img_path = os.path.join(img_dir, img_file)
    mask_path = os.path.join(mask_dir, img_file)

    # --- read data
    with rasterio.open(img_path) as src:
        img = src.read([1, 2, 3])  # RGB bands
        img = img.transpose(1, 2, 0)
        img_bounds = src.bounds
        img_crs = src.crs

    with rasterio.open(mask_path) as src:
        mask = src.read(1)

    # --- reproject bounds to WGS84 for OSM
    transformer = Transformer.from_crs(img_crs, "EPSG:4326", always_xy=True)
    minx, miny = transformer.transform(img_bounds.left, img_bounds.bottom)
    maxx, maxy = transformer.transform(img_bounds.right, img_bounds.top)

    # pad the bbox to improve OSM queries
    pad_x = (maxx - minx) * 0.5
    pad_y = (maxy - miny) * 0.5
    poly_wgs = box(minx - pad_x, miny - pad_y, maxx + pad_x, maxy + pad_y)

    # --- get correct API function depending on osmnx version
    if version.parse(ox.__version__) >= version.parse("2.0"):
        get_geoms = ox.features_from_polygon
    else:
        get_geoms = ox.geometries.geometries_from_polygon

    def safe_get_geoms(tags):
        try:
            return get_geoms(poly_wgs, tags=tags)
        except InsufficientResponseError:
            return None

    # --- download features
    gdfs = {}
    if "water" in feature_types:
        gdfs["water"] = safe_get_geoms({"natural": "water"})
    if "roads" in feature_types:
        gdfs["roads"] = safe_get_geoms({"highway": True})

    # --- plot
    fig, axes = plt.subplots(1, 2, figsize=(14, 7))
    ax_img, ax_mask = axes

    ax_img.imshow(img / img.max())  # normalize for plotting
    ax_img.set_title("Image + OSM")
    ax_img.axis("off")

    ax_mask.imshow(mask, cmap="tab20")
    ax_mask.set_title("Mask + OSM")
    ax_mask.axis("off")

    # overlay OSM layers
    for name, gdf in gdfs.items():
        if gdf is not None and not gdf.empty:
            if name == "water":
                gdf.plot(ax=ax_img, facecolor="none", edgecolor="blue", linewidth=1, alpha=0.5)
                gdf.plot(ax=ax_mask, facecolor="none", edgecolor="blue", linewidth=1, alpha=0.5)
            if name == "roads":
                gdf.plot(ax=ax_img, color="black", linewidth=0.8, alpha=0.7)
                gdf.plot(ax=ax_mask, color="black", linewidth=0.8, alpha=0.7)

    plt.tight_layout()
    plt.show()

plot_clip_with_vector_osm("dataset_compact/train", feature_types=("water","roads"))

Code

Data pre-processing

Recap question: What is the difference between 8-bit and 16-bit imagery?

Click here for the answer

An 8-bit image can represent
\[2^8 = 256\]
discrete intensity levels (values 0–255).

A 16-bit image can represent
\[2^{16} = 65,536\]
discrete intensity levels (values 0–65,535).

If we assume reflectance values are scaled into the range $[0,1]$:

8-bit step size ($\Delta$):
\[\Delta_{8} \approx \frac{1}{255} \approx 0.0039\]
16-bit step size ($\Delta$):
\[\Delta_{16} \approx \frac{1}{65,535} \approx 0.000015\]

So 16-bit data has ~256× finer precision than 8-bit.

This matters in scientific imagery (e.g. remote sensing), where subtle reflectance differences can be lost if the data are quantized to only 256 levels.

In short:
- 8-bit: good for display and simple tasks.
- 16-bit: essential for preserving detail, avoiding banding, and keeping dynamic range for analysis.

Let’s explicitly check again what the properties of our dataset are.

❓ Question: Do you know enough linux/bash or can you find out with google/chatgpt how to get the full path of the first (or, really, any) image in the folder of training image clips?

Hint/answer

!ls dataset_compact/train/images/ file is: tile_1204_train_i000161.tif

Here I provide an example of how one way to use the python libraries rioxarray or rasterio to read the geotiff file into an xarray, which is given the variable name arr below. xarray allows many of the same operations as on standard numpy arrays, but it also preserves the geospatial information. Specifically, each pixel doesn’t just have a value of RGB-NIR reflectances, but it also has associated lat/lon coordiate information and the xarray has metadata for coordinate reference information etc.

Code

# I find rioxarray the most straightforward approach:

first_file = "" # you need to find and plug in the full path to any of the train/eval image clips

xarr = rioxarray.open_rasterio(first_file)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.12/dist-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    218             try:
--> 219                 file = self._cache[self._key]
    220             except KeyError:

/usr/local/lib/python3.12/dist-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
     55         with self._lock:
---> 56             value = self._cache[key]
     57             self._cache.move_to_end(key)

KeyError: [<function open at 0x7de4732c36a0>, ('',), 'r', (('sharing', False),), 'c8e6b598-cb42-4ddb-b71b-9d01fa37a19e']

During handling of the above exception, another exception occurred:

CPLE_OpenFailedError                      Traceback (most recent call last)
rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

rasterio/_base.pyx in rasterio._base.open_dataset()

rasterio/_err.pyx in rasterio._err.exc_wrap_pointer()

CPLE_OpenFailedError: : No such file or directory

During handling of the above exception, another exception occurred:

RasterioIOError                           Traceback (most recent call last)
/tmp/ipython-input-4067998613.py in <cell line: 0>()
      3 first_file = "" # you need to find and plug in the full path to any of the train/eval image clips
      4 
----> 5 xarr = rioxarray.open_rasterio(first_file)

/usr/local/lib/python3.12/dist-packages/rioxarray/_io.py in open_rasterio(filename, parse_coordinates, chunks, cache, lock, masked, mask_and_scale, variable, group, default_name, decode_times, decode_timedelta, band_as_variable, **open_kwargs)
   1133         else:
   1134             manager = URIManager(file_opener, filename, mode="r", kwargs=open_kwargs)
-> 1135         riods = manager.acquire()
   1136         captured_warnings = rio_warnings.copy()
   1137 

/usr/local/lib/python3.12/dist-packages/xarray/backends/file_manager.py in acquire(self, needs_lock)
    199             An open file object, as returned by ``opener(*args, **kwargs)``.
    200         """
--> 201         file, _ = self._acquire_with_cache_info(needs_lock)
    202         return file
    203 

/usr/local/lib/python3.12/dist-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    223                     kwargs = kwargs.copy()
    224                     kwargs["mode"] = self._mode
--> 225                 file = self._opener(*self._args, **kwargs)
    226                 if self._mode == "w":
    227                     # ensure file doesn't get overridden when opened again

/usr/local/lib/python3.12/dist-packages/rasterio/env.py in wrapper(*args, **kwds)
    461 
    462         with env_ctor(session=session):
--> 463             return f(*args, **kwds)
    464 
    465     return wrapper

/usr/local/lib/python3.12/dist-packages/rasterio/__init__.py in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, opener, **kwargs)
    354 
    355             if mode == "r":
--> 356                 dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
    357             elif mode == "r+":
    358                 dataset = get_writer_for_path(path, driver=driver)(

rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

RasterioIOError: : No such file or directory

Check all the information contained in such an xarray:

Code

print("Xarray object with metadata:")
print(xarr)

print("\nAttributes (metadata):")
print(xarr.attrs)

print("\n\n\nCoordinates (note: these include spatial info):")
print(xarr.coords)

# To get min/max per band
for i in range(xarr.sizes["band"]):
    band_data = xarr.isel(band=i)
    print(f"Band {i+1}: min={band_data.min().item()}, max={band_data.max().item()}")

Below is a different library but almost the same approach to achieve the same thing using rasterio rather than rioxarray (similar packages):

Code

first_file = "" # fill in; this can be any image clip in the train or eval folders

print("Opening:", first_file)

with rio.open(first_file) as src:
    arr = src.read()   # shape = (bands, height, width)
    print("Shape (bands, height, width):", arr.shape)

    # Loop through bands
    for i in range(arr.shape[0]):
        band = arr[i]
        print(f"Band {i+1}: min={band.min()}, max={band.max()}")

Feature engineering

Once again, the reflectances of the Landsat (or Sentinel) images for each pixel are the features for this ML/DL task.

❓ Question: Are these features suitable for a ML/DL algorithm for logistic classification as-is?
If not, why not, and how should we change those?

Hint

Think about the sigmoid activation function in logistic regression:

\[ \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = \mathbf{w}^\top \mathbf{x} + b \]

What happens if $x$ is very large (like reflectance 199)?

Hint/answer

Large reflectance values make (z) very large, so ((z)) saturates to 1 (or 0).
This kills the gradient updates and breaks learning.

👉 We should scale reflectances before feeding them into logistic regression or other ML/DL models.
Common approaches:
- Min–max scaling to [0–1]
- Percentile stretch (2–98%) followed by [0–1] normalization

Remember what the sigmoid function looks like

Code

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z_vals = np.linspace(-20, 20, 400)
plt.plot(z_vals, sigmoid(z_vals))
plt.title("Sigmoid Function")
plt.xlabel("z")
plt.ylabel("σ(z)")
plt.grid(True)
plt.show()

Here’s a basic demo of how the sigmoid function quickly saturates when using raw 8-bit reflectances (this would be even way worse for the original 16-bit reflectance values), while it behaves much better for normalized reflectance inputs.

Code

reflectance_raw = int(input("Enter a reflectance value between 0 and 255: "))

weight = 0.1
bias = 0.0

z_raw = weight * reflectance_raw + bias
sig_raw = sigmoid(z_raw)

reflectance_scaled = reflectance_raw / 255.0
z_scaled = weight * reflectance_scaled + bias
sig_scaled = sigmoid(z_scaled)

print(f"\nRaw reflectance = {reflectance_raw}")
print(f"  z = {z_raw:.2f}, sigmoid(z) = {sig_raw:.6f}")

print(f"\nScaled reflectance = {reflectance_scaled:.3f}")
print(f"  z = {z_scaled:.6f}, sigmoid(z) = {sig_scaled:.6f}")

Enter a reflectance value between 0 and 255: 89

Raw reflectance = 89
  z = 8.90, sigmoid(z) = 0.999864

Scaled reflectance = 0.349
  z = 0.034902, sigmoid(z) = 0.508725

Here’s a more visual illustration for just the positive range of the sigmoid function and some initial guess of fitted weight and bias:

Code

w = 0.1
bias = 0.0

# Case 1: raw reflectances [0,255]
x_raw = np.linspace(0, 255, 256)
z_raw = w * x_raw + bias
sig_raw = sigmoid(z_raw)

# Case 2: scaled reflectances [0,1]
x_scaled = np.linspace(0, 1, 256)
z_scaled = w * x_scaled + bias
sig_scaled = sigmoid(z_scaled)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(x_raw, sig_raw, color="red")
axes[0].set_title("Raw reflectances [0–255]")
axes[0].set_xlabel("Reflectance value")
axes[0].set_ylabel("Sigmoid output")
axes[0].grid(True)

axes[1].plot(x_scaled, sig_scaled, color="blue")
axes[1].set_title("Scaled reflectances [0–1]")
axes[1].set_xlabel("Reflectance value")
axes[1].set_ylabel("Sigmoid output")
axes[1].grid(True)

plt.tight_layout()
plt.show()

Play around with the cell below to practice these different feature re-scaling ideas.

Code

# 👉 Step 1. Pick any image clip from your dataset (train/val/eval)
# first_file = ""  # <-- fill in with full path to an image, e.g. "dataset_compact/train/images/<>.tif"

# Open with rioxarray (bands, height, width)
xarr = rioxarray.open_rasterio(first_file)
print("Raw shape (bands, H, W):", xarr.shape)
print("Data type:", xarr.dtype)
print("Raw min/max reflectance values:", float(xarr.min()), float(xarr.max()))

# --------------------------------------
# Once again, remember: each pixel’s reflectances in the 4 bands are the *features*.
#
# ❓ Question:
# Are the raw integer reflectances in the 0–255 range directly suitable
# for logistic regression or deep learning?
#
# Hint: Think about how the sigmoid behaves for very large vs. small inputs.
# --------------------------------------

# Step 2. Flatten the array into (H, W, C) form for easier handling
img = xarr.transpose("y", "x", "band").values
print("Reordered shape (H, W, C):", img.shape)

# Step 3. Explore percentile-based reflectance scaling
# 👉 Compute the 2% and 98% percentiles across all pixels:
lo = np.percentile(img, 2, axis=(0,1))
hi = np.percentile(img, 98, axis=(0,1))
print("Per-band 2% percentiles:", lo)
print("Per-band 98% percentiles:", hi)

# ❓ Question:
# For 8-bit reflectances (0–255), what do you expect these values to be close to?
# (Think: about 5 for low, 250 for high).
# This is why in our universal scaling we just fix low=5, high=250.

# Step 4. Apply clipping
img_clipped = np.clip(img, 5, 250)
print("After clipping → min:", img_clipped.min(), "max:", img_clipped.max())

# Step 5. Normalize to [0,1]
img_norm = (img_clipped - 5) / (250 - 5)
print("After normalization → min:", img_norm.min(), "max:", img_norm.max())

# --------------------------------------
# 👉 Suggestion:
# Print out min/max again after each step.
# Try plotting histograms of reflectance values to *see* the effect
# of clipping and scaling.
# --------------------------------------

# Step 6. Process the mask file in a similar way
mask_file = first_file.replace("/images/", "/masks/")
mask = rioxarray.open_rasterio(mask_file).values.squeeze()
print("Mask shape:", mask.shape, "unique labels:", np.unique(mask))

# ❓ Question:
# How can we collapse the 4 classes {0,1,2,3} into just land vs. water?
# Hint: np.where( np.isin(mask, [2,3]), 1, 0 )

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.12/dist-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    218             try:
--> 219                 file = self._cache[self._key]
    220             except KeyError:

/usr/local/lib/python3.12/dist-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
     55         with self._lock:
---> 56             value = self._cache[key]
     57             self._cache.move_to_end(key)

KeyError: [<function open at 0x7de4732c36a0>, ('',), 'r', (('sharing', False),), '57f65f9f-fbbd-4e8e-819e-6b305c1e14e6']

During handling of the above exception, another exception occurred:

CPLE_OpenFailedError                      Traceback (most recent call last)
rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

rasterio/_base.pyx in rasterio._base.open_dataset()

rasterio/_err.pyx in rasterio._err.exc_wrap_pointer()

CPLE_OpenFailedError: : No such file or directory

During handling of the above exception, another exception occurred:

RasterioIOError                           Traceback (most recent call last)
/tmp/ipython-input-2545844640.py in <cell line: 0>()
      3 
      4 # Open with rioxarray (bands, height, width)
----> 5 xarr = rioxarray.open_rasterio(first_file)
      6 print("Raw shape (bands, H, W):", xarr.shape)
      7 print("Data type:", xarr.dtype)

/usr/local/lib/python3.12/dist-packages/rioxarray/_io.py in open_rasterio(filename, parse_coordinates, chunks, cache, lock, masked, mask_and_scale, variable, group, default_name, decode_times, decode_timedelta, band_as_variable, **open_kwargs)
   1133         else:
   1134             manager = URIManager(file_opener, filename, mode="r", kwargs=open_kwargs)
-> 1135         riods = manager.acquire()
   1136         captured_warnings = rio_warnings.copy()
   1137 

/usr/local/lib/python3.12/dist-packages/xarray/backends/file_manager.py in acquire(self, needs_lock)
    199             An open file object, as returned by ``opener(*args, **kwargs)``.
    200         """
--> 201         file, _ = self._acquire_with_cache_info(needs_lock)
    202         return file
    203 

/usr/local/lib/python3.12/dist-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    223                     kwargs = kwargs.copy()
    224                     kwargs["mode"] = self._mode
--> 225                 file = self._opener(*self._args, **kwargs)
    226                 if self._mode == "w":
    227                     # ensure file doesn't get overridden when opened again

/usr/local/lib/python3.12/dist-packages/rasterio/env.py in wrapper(*args, **kwds)
    461 
    462         with env_ctor(session=session):
--> 463             return f(*args, **kwds)
    464 
    465     return wrapper

/usr/local/lib/python3.12/dist-packages/rasterio/__init__.py in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, opener, **kwargs)
    354 
    355             if mode == "r":
--> 356                 dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
    357             elif mode == "r+":
    358                 dataset = get_writer_for_path(path, driver=driver)(

rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

RasterioIOError: : No such file or directory

✅ Takeaways:
- Raw 8-bit reflectances (0–255) → saturate sigmoid, bad for learning.
- Simple min–max scaling (0–1) → much better, keeps values in a useful range.
- Best practice:
Apply a 2–98% percentile stretch to remove outliers, then normalize to [0–1].
This makes features more robust and keeps ML/DL models stable.

To get everyone on the same page moving forward, below is a clean function that does all the required pre-processing on all of the training and validation data in terms of min-max scaling and percentile stretching. It also converts the land-water labels 0, 1, 2, 3 into just binary labels 0 (land) and 1 (water). Finally, I added an option frac_data, which is simply the fraction of our total amount of available training and validation data to use. The default is frac_data=1.0, using all data. For frac_data=0.1, we’d use 100 training images and 20 validation images. This is useful for faster results and in cases where Colab might run out of memory otherwise.

Code

def load_data_and_collapse(base_dir, low=5, high=250, frac_data=1.0, seed=0):
    """
    Load image+mask pairs from a folder, preprocess, flatten into feature vectors,
    and return X (features) and y (binary labels).

    Parameters
    ----------
    base_dir : str
        Path to folder with 'images/' and 'masks/' subfolders
    low, high : int
        Fixed scaling range for reflectances (default [5, 250])
    frac_data : float, optional
        Fraction of image/mask pairs to use (default 1.0 = all).
        E.g., 0.1 uses only 10% of available pairs.
    seed : int, optional
        Random seed for reproducibility when selecting a subset.

    Returns
    -------
    X : np.ndarray, shape (N_pixels, N_bands)
        Flattened features across all selected images
    y : np.ndarray, shape (N_pixels,)
        Flattened binary labels across all selected images
    """
    img_files = sorted(glob.glob(f"{base_dir}/images/*.tif"))
    mask_files = sorted(glob.glob(f"{base_dir}/masks/*.tif"))

    # --- Optionally subsample the dataset ---
    n_files = len(img_files)
    n_select = max(1, int(frac_data * n_files))
    rng = random.Random(seed)
    idx_select = sorted(rng.sample(range(n_files), n_select))

    img_files = [img_files[i] for i in idx_select]
    mask_files = [mask_files[i] for i in idx_select]

    X_list, y_list = [], []

    for img_path, mask_path in zip(img_files, mask_files):
        # --- Load image ---
        img = rioxarray.open_rasterio(img_path)         # (bands, H, W)
        img = img.transpose("y", "x", "band").values   # (H, W, C)

        # Scale reflectances: 5 → 0, 250 → 1
        img = np.clip((img - low) / (high - low), 0, 1)

        # --- Load mask ---
        mask = rioxarray.open_rasterio(mask_path).values.squeeze()  # (H, W)

        # Collapse labels: (0+1 → 0 land), (2+3 → 1 water)
        mask_bin = np.where(np.isin(mask, [2, 3]), 1, 0)

        # --- Flatten ---
        X_list.append(img.reshape(-1, img.shape[-1]))
        y_list.append(mask_bin.ravel())

    # --- Concatenate across all images ---
    X = np.vstack(X_list)
    y = np.hstack(y_list)

    return X, y

Linear logistic regression

Using the function that we create at the end of the Feature Engineering section, let’s load our training and validation data and pre-process them with min-max scaling and percentile stretching:

Code

# Example: load training and validation data
t0 = time.time()
X_train, y_train = load_data_and_collapse("dataset_compact/train",frac_data=0.1)
X_val, y_val     = load_data_and_collapse("dataset_compact/val",frac_data=0.1)
print(f"Data loaded in {time.time()-t0:.2f} seconds")

# Shapes
print("Train features:", X_train.shape)   # (N_pixels_total, N_bands)
print("Train labels:  ", y_train.shape)   # (N_pixels_total,)
print("Val features:  ", X_val.shape)
print("Val labels:    ", y_val.shape)

# Peek at ranges
print("Feature range (train): min =", X_train.min(), ", max =", X_train.max())
print("Unique labels in train:", np.unique(y_train))

Data loaded in 3.15 seconds
Train features: (6553600, 4)
Train labels:   (6553600,)
Val features:   (1310720, 4)
Val labels:     (1310720,)
Feature range (train): min = 0.0 , max = 1.0
Unique labels in train: [0 1]

Training a linear logistic regression model is again deceptively simple if we use the off-the-shelf black box functions from scikit-learn. Specifically, we’ll use LogisticRegression. As explained in earlier labs, these function actually do various optimization internally that you may not always be aware of. In this case, LogisticRegression actually already does some sorts of feature scaling internally to perform better, i.e. it does not only just do a basic regression. It also already uses regularization by default to avoid overfitting. You can learn about some of this by simply typing

? LogisticRegression

in a code cell.

In a later step, I’ll show you again how to achieve the same ‘from scratch’ with just plain algebra and no black box libraries.

Code

# Fit logistic regression
t1 = time.time()

clf = LogisticRegression(max_iter=200, n_jobs=-1)

clf.fit(X_train, y_train)

print(f"Training done in {time.time()-t1:.2f} s")

Training done in 11.44 s

Acknowledge how insane it is that we can train this model in just seconds! Remember from the earlier prompts/questions how many input variables/features we are processing, and we’re not even using a GPU yet.

At this point, our model which we named clf for classification only contains the “fitting parameters” of our best logistic regression fit to our labeled images. All the way in the top, we also loaded another very convenient scikit-learn function: from sklearn.metrics import classification_report. Let’s first see how this works to see how well our trained model fits the training data itself:

Code

y_train_pred = clf.predict(X_train)

print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       0.83      0.92      0.88   4339773
           1       0.81      0.64      0.71   2213827

    accuracy                           0.83   6553600
   macro avg       0.82      0.78      0.79   6553600
weighted avg       0.82      0.83      0.82   6553600

Next, we can do the same for our validation data that were never seen as part of model training:

Code

# Evaluate
t2 = time.time()
y_pred = clf.predict(X_val)
print(f"Prediction done in {time.time()-t2:.2f} s")

print("\nClassification report (binary land=0, water=1):")
print(classification_report(y_val, y_pred))

Prediction done in 0.02 s

Classification report (binary land=0, water=1):
              precision    recall  f1-score   support

           0       0.87      0.96      0.91    844951
           1       0.90      0.74      0.82    465769

    accuracy                           0.88   1310720
   macro avg       0.89      0.85      0.86   1310720
weighted avg       0.88      0.88      0.88   1310720

Remember that the f1-score is the (harmonic) average of precision and recall. The harmonic average is close to the lower value of whatever is averaged, so it is a single conservative metric of performance. What do you see for the overall accuracy in classifying water pixels from RGB-NIR Landsat satellite imagery using our linear logistic regression model?

Pretty great for such a simple model, right?

To my surprise, I accidentally discovered in putting this lab together that the LogisticRegression function actually performs significantly better when we don't do the usual best-practice feature scaling. The reasons are a bit complicated but have to do, again, with various other steps that are hidden inside this function. For example, it uses a “L2 regularized solver” by default rather than the typical gradient-based method, and by scaling from 0-255 to 5-250 and then normalizing, we lost some of the range in the raw data.

Below, we train LogisticRegression again but just on the raw image data.

First, a near identical function to just load the image + mask data without feature scaling:

Code

def load_raw_reflectances(base_folder, frac_data=1.0, seed=0):
    """
    Load image/mask pairs and collapse labels into binary (land=0, water=1).
    Returns raw reflectance values (0–255), no scaling.

    Parameters
    ----------
    base_folder : str
        Path to folder with 'images/' and 'masks/' subfolders
    frac_data : float, optional
        Fraction of image/mask pairs to use (default 1.0 = all).
        E.g., 0.1 uses only 10% of available pairs.
    seed : int, optional
        Random seed for reproducibility when selecting a subset.

    Returns
    -------
    X_out : np.ndarray, shape (N_pixels, N_bands)
        Flattened features across selected images
    y_out : np.ndarray, shape (N_pixels,)
        Flattened binary labels across selected images
    """

    image_files = sorted(glob.glob(f"{base_folder}/images/*.tif"))
    mask_files  = sorted(glob.glob(f"{base_folder}/masks/*.tif"))

    # --- Optionally subsample dataset ---
    n_files = len(image_files)
    n_select = max(1, int(frac_data * n_files))
    rng = random.Random(seed)
    idx_select = sorted(rng.sample(range(n_files), n_select))

    image_files = [image_files[i] for i in idx_select]
    mask_files  = [mask_files[i] for i in idx_select]

    features_all, labels_all = [], []

    for img_file, mask_file in zip(image_files, mask_files):
        # --- Load image ---
        img = rioxarray.open_rasterio(img_file).values.transpose(1, 2, 0)  # (H, W, C)

        # --- Load mask ---
        mask = rioxarray.open_rasterio(mask_file).values[0]  # single band

        # Collapse labels: (0,1) → 0 (land), (2,3) → 1 (water)
        mask_binary = np.where(np.isin(mask, [2, 3]), 1, 0)

        # Flatten
        features_all.append(img.reshape(-1, img.shape[-1]))
        labels_all.append(mask_binary.ravel())

    X_out = np.vstack(features_all)
    y_out = np.hstack(labels_all)

    return X_out, y_out

Code

# --- Load training and validation sets (raw reflectances) ---
t0 = time.time()
X_train_raw, y_train_raw = load_raw_reflectances("dataset_compact/train", frac_data=0.1)
X_val_raw,   y_val_raw   = load_raw_reflectances("dataset_compact/val", frac_data=0.1)
print(f"Raw data loaded in {time.time()-t0:.2f} s")

# --- Fit logistic regression (on raw reflectances) ---
t1 = time.time()
clf_raw = LogisticRegression(max_iter=200, n_jobs=-1)
clf_raw.fit(X_train_raw, y_train_raw)
print(f"Training done in {time.time()-t1:.2f} s")

# --- Evaluate on validation set ---
t2 = time.time()
y_val_pred_raw = clf_raw.predict(X_val_raw)
print(f"Prediction done in {time.time()-t2:.2f} s")

print("\nClassification report (binary land=0, water=1):")
print(classification_report(y_val_raw, y_val_pred_raw))

Raw data loaded in 2.11 s
Training done in 15.15 s
Prediction done in 0.04 s

Classification report (binary land=0, water=1):
              precision    recall  f1-score   support

           0       0.96      0.96      0.96    844951
           1       0.93      0.92      0.92    465769

    accuracy                           0.95   1310720
   macro avg       0.94      0.94      0.94   1310720
weighted avg       0.95      0.95      0.95   1310720

These results are quite amazing!

👉 Assignment: Load a single image and mask from the eval_tiles folder, perform inference/prediction on that image, and compute the error/accuracy metrics versus the corresponding mask of ground-truth labels.

👉 Assignment: Generalize the previous code to loop through each image/mask pair in the eval_tiles folder and print out the accuracy metrics.

Code

# Write your own code here to make a prediction for the larger 768x768 tiles in eval_tiles
# first for just one clip and then for multiple/all
# Analyze how well our model performs on those larger images.

💡 Hint 1: Single eval tile

# pick a single eval image + mask
img_path  = "dataset_compact/eval_tiles/images/<filename>.tif"
mask_path = "dataset_compact/eval_tiles/masks/<filename>.tif"

# load image and mask
img  = rioxarray.open_rasterio(img_path).values.transpose(1,2,0)
mask = rioxarray.open_rasterio(mask_path).values[0]

# collapse labels to binary
mask_bin = np.where(np.isin(mask, [2,3]), 1, 0)

# flatten
X_eval = img.reshape(-1, img.shape[-1])
y_eval = mask_bin.ravel()

# predict with trained classifier
y_pred_eval = clf_raw.predict(X_eval)   # or clf if using scaled version

# metrics
print(classification_report(y_eval, y_pred_eval))

💡 Hint 2: Loop through all eval tiles

image_files = sorted(glob.glob("dataset_compact/eval_tiles/images/*.tif"))
mask_files  = sorted(glob.glob("dataset_compact/eval_tiles/masks/*.tif"))

for img_path, mask_path in zip(image_files, mask_files):
    # load
    img  = rioxarray.open_rasterio(img_path).values.transpose(1,2,0)
    mask = rioxarray.open_rasterio(mask_path).values[0]
    mask_bin = np.where(np.isin(mask, [2,3]), 1, 0)
    
    # flatten
    X_eval = img.reshape(-1, img.shape[-1])
    y_eval = mask_bin.ravel()
    
    # predict
    y_pred_eval = clf_raw.predict(X_eval)   # or clf if using scaled version
    
    # print metrics per tile
    print(f"\nResults for {img_path.split('/')[-1]}:")
    print(classification_report(y_eval, y_pred_eval, digits=3))

Helpful functions

The below actually implements what the previous lab exercise/question was. It adds one more option.

As outlined in the very beginning, the training images for DeepWaterLab had somewhat inconvenient and non-consistent pixel dimension of $743 \times 744$, $744\times 743$, $744\times 744$. Especially for convolutional neural networks, CNN, it is useful/necessary to have standard dimenions as powers of 2 or in our case also multiples or our training tiles. For these reasons, I “padded” these images to $3\times 256 = 768$ pixels in each dimension. However, those padded pixels have no meaning. So in the functions below, I strip those back out and only make predictions and associated error metrics on the pixels that we actually part of the Landsat images. Actually, we can crop to anything $<743$ pixels in each dimension. In the examples I pick $740 \times 740$ pixels.

Code

CROP = 740  #  crop size on larger evaluation images

The next function randomy selects one of the larger evaluation images, shows the mask with soft-labels, the mask with binary labels, our model prediction, and a difference map. Then it also prints out the precision/recall/f1-score for just that one selected test image.

Code

def visualize_random_eval(model, eval_dir="dataset_compact/eval_tiles"):
    img_files  = sorted(glob.glob(os.path.join(eval_dir, "images", "*.tif")))
    mask_files = sorted(glob.glob(os.path.join(eval_dir, "masks", "*.tif")))

    # pick a random eval tile
    idx = random.randrange(len(img_files))
    img_path, mask_path = img_files[idx], mask_files[idx]
    fname = os.path.basename(img_path)

    # --- Load image ---
    img = rioxarray.open_rasterio(img_path).values.transpose(1, 2, 0)
    img = img[:CROP, :CROP, :]
    H, W, C = img.shape
    X = img.reshape(-1, C)

    # --- Load mask ---
    mask = rioxarray.open_rasterio(mask_path).values[0]
    mask = mask[:CROP, :CROP]
    mask_bin = np.where(np.isin(mask, [2, 3]), 1, 0)

    # --- Predict ---
    preds = model.predict(X).reshape(H, W)
    diff  = (preds != mask_bin).astype(np.uint8)

    # --- Plotting ---
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    axes[0].imshow(mask, cmap="tab10")
    axes[0].set_title("Original Mask (0–3)")
    axes[1].imshow(mask_bin, cmap="gray")
    axes[1].set_title("Binary ground truth")
    axes[2].imshow(preds, cmap="gray")
    axes[2].set_title("Predicted")
    axes[3].imshow(diff, cmap="Reds")
    axes[3].set_title("Difference")
    for ax in axes: ax.axis("off")
    plt.tight_layout()
    plt.show()

    # --- Metrics ---
    y_true = mask_bin.ravel()
    y_pred = preds.ravel()
    print(f"Metrics for file: {fname}")
    print(classification_report(y_true, y_pred, digits=3))

Try it out:

Code

# visualize performance on a random eval tile
visualize_random_eval(clf_raw)

Metrics for file: tile_1643_eval_00028.tif
              precision    recall  f1-score   support

           0      0.987     0.997     0.992    103683
           1      0.999     0.997     0.998    443917

    accuracy                          0.997    547600
   macro avg      0.993     0.997     0.995    547600
weighted avg      0.997     0.997     0.997    547600

Next, we make a simple function to compute the error metrics across all 50 evaluation tiles:

Code

def evaluate_on_eval_set(model, eval_dir="dataset_compact/eval_tiles"):
    img_files  = sorted(glob.glob(os.path.join(eval_dir, "images", "*.tif")))
    mask_files = sorted(glob.glob(os.path.join(eval_dir, "masks", "*.tif")))

    all_preds, all_truth = [], []

    for img_path, mask_path in zip(img_files, mask_files):
        # load image cube (HWC) and crop
        img = rioxarray.open_rasterio(img_path).values.transpose(1, 2, 0)
        img = img[:CROP, :CROP, :]
        X   = img.reshape(-1, img.shape[-1])

        # load mask, crop, collapse labels, flatten
        mask = rioxarray.open_rasterio(mask_path).values[0]
        mask = mask[:CROP, :CROP]
        y    = np.where(np.isin(mask, [2, 3]), 1, 0).ravel()

        # predict
        preds = model.predict(X)

        all_preds.append(preds)
        all_truth.append(y)

    y_true = np.concatenate(all_truth)
    y_pred = np.concatenate(all_preds)

    print("Classification report on eval set:")
    print(classification_report(y_true, y_pred))

    return y_true, y_pred

Alright, let’s run this over all 50 larger images in the evaluation folder and print a detailed report on precision, recall, and f1 scores. We can do this for either the model that was trained on the pre-processed training data (cfl) or the model trained on the raw data (clf_raw).

Code

# Evaluate entire eval set
#evaluate_on_eval_set(clf, "dataset_compact/eval_tiles")
evaluate_on_eval_set(clf_raw, "dataset_compact/eval_tiles")

Classification report on eval set:
              precision    recall  f1-score   support

           0       0.87      0.95      0.91  12760824
           1       0.95      0.88      0.91  14619176

    accuracy                           0.91  27380000
   macro avg       0.91      0.91      0.91  27380000
weighted avg       0.91      0.91      0.91  27380000

(array([0, 0, 0, ..., 1, 1, 1]), array([0, 0, 0, ..., 1, 1, 1]))

Again, the accuracy is kind of remarkable.

Next, let’s see what we can do with logistic regression implemented from scratch using only linear algebra and basic python, just like you learned in earlier weeks.

Logistic regression from-scratch, without skikit-learn / black box

Below I repeat a version of our earlier code for linear logistic regression. I’ll keep the code extremely simple and explicit with minimal use of functions and not even bothering with vectorizing the weights/slopes + bias terms together (which is more efficient).

I did add automatic stopping: after each iteration we check how much the error/loss function has decreased. If it doesn’t decrease anymore within a user defined tolerance (the input tol=1e-4 by default), then we exit the gradient descent loop.

Code

# def compute_loss(y, y_pred):
#    """Binary cross-entropy loss."""
#    m = len(y)
#    return -(1/m) * np.sum(
#        y * np.log(y_pred + 1e-9) + (1 - y) * np.log(1 - y_pred + 1e-9)
#    )
# ------------------------
# Logistic Regression Training (SGD / Mini-batch + Early Stopping)
# ------------------------

def train_logistic_regression(X, y, learning_rate=0.01, num_iterations=10, batch_size=32,
                              tol=1e-4, patience=3):
    """
    Train logistic regression using mini-batch gradient descent.
    Includes early stopping and timing info.
    """
    m, n = X.shape
    weights = np.random.randn(n) * 0.01
    bias = 0.0

    best_loss = float('inf')
    no_improve = 0

    start_time_total = time.time()

    for epoch in range(num_iterations):
        epoch_start = time.time()

        # Shuffle the data each epoch
        indices = np.arange(m)
        np.random.shuffle(indices)
        X_shuffled, y_shuffled = X[indices], y[indices]

        # Mini-batch loop
        for i in range(0, m, batch_size):
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            # Forward pass
            linear = np.dot(X_batch, weights) + bias
            y_pred = 1 / (1 + np.exp(-linear))  # sigmoid

            # Gradients
            error = y_pred - y_batch
            grad_w = (1/len(y_batch)) * np.dot(X_batch.T, error)
            grad_b = (1/len(y_batch)) * np.sum(error)

            # Update
            weights -= learning_rate * grad_w
            bias -= learning_rate * grad_b

        # Evaluate full loss at end of epoch
        y_pred_full = 1 / (1 + np.exp(-(np.dot(X, weights) + bias)))
        loss = -(1/m) * np.sum(
            y * np.log(y_pred_full + 1e-9) + (1 - y) * np.log(1 - y_pred_full + 1e-9)
        )

        epoch_end = time.time()
        print(f"Epoch {epoch+1}/{num_iterations}, "
              f"Loss: {loss:.4f}, "
              f"Time: {epoch_end - epoch_start:.2f} sec")

        # Early stopping check
        if loss + tol < best_loss:
            best_loss = loss
            no_improve = 0
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f"Stopping early at epoch {epoch+1} "
                      f"(no improvement for {patience} epochs).")
                break

    total_end = time.time()
    print(f"\nTraining completed in {total_end - start_time_total:.2f} seconds.\n")

    return weights, bias


# ------------------------
# Prediction
# ------------------------

def predict_labels(X, weights, bias, threshold=0.5):
    """Predict binary labels (0 or 1) given learned weights and bias."""
    linear = np.dot(X, weights) + bias
    probs = sigmoid(linear)
    return (probs >= threshold).astype(int)

Define a fancy “class” for this model. In a nutshell, a class lets us group various variable definitions and functions together into one object. A practical implication is that this allows us to swap out our own logistic regression exactly like how scikit-learn’s logisticregression function was used.

Code

# ------------------------
# Training wrapper
# This defines a python "class". Don't worry too much about how that works
# (or ask chatgpt for some explanation!)
# ------------------------

class SimpleLogisticRegressionScratch:
    """Wrapper to look like sklearn's LogisticRegression"""
    def __init__(self, learning_rate=0.1, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.feature_weights = None
        self.bias_term = None

    def fit(self, X, y):
        self.feature_weights, self.bias_term = train_logistic_regression(
            X, y,
            learning_rate=self.learning_rate,
            num_iterations=self.num_iterations
        )

    def predict(self, X):
        return predict_labels(X, self.feature_weights, self.bias_term)

Code

# -------------------------
# Load a subset of training and validation data (scaled reflectances)
# -------------------------
X_train_small, y_train_small = load_data_and_collapse(
    "dataset_compact/train",
    low=5, high=250,
    frac_data=0.1   # take only 10% of training clips
)
X_val_scaled, y_val = load_data_and_collapse(
    "dataset_compact/val",
    low=5, high=250,
    frac_data=0.1   # take only 10% of validation clips
)

# -------------------------
# Train logistic regression from scratch
# -------------------------
scratch_model = SimpleLogisticRegressionScratch(
    learning_rate=0.1,
    num_iterations=50
)

scratch_model.fit(X_train_small, y_train_small)

# Predict on validation set
y_val_pred = scratch_model.predict(X_val_scaled)

# Report metrics on validation set
print("Classification report on validation set (from scratch):")
print(classification_report(y_val, y_val_pred))

Epoch 1/50, Loss: 0.5259, Time: 7.32 sec
Epoch 2/50, Loss: 0.5247, Time: 5.73 sec
Epoch 3/50, Loss: 0.5253, Time: 7.39 sec
Epoch 4/50, Loss: 0.5247, Time: 5.69 sec
Epoch 5/50, Loss: 0.5246, Time: 7.46 sec
Stopping early at epoch 5 (no improvement for 3 epochs).

Training completed in 33.60 seconds.

Classification report on validation set (from scratch):
              precision    recall  f1-score   support

           0       0.87      0.96      0.91    844951
           1       0.90      0.74      0.81    465769

    accuracy                           0.88   1310720
   macro avg       0.89      0.85      0.86   1310720
weighted avg       0.88      0.88      0.88   1310720

👉 Quiz: What are the hyperparameters a user can play with in this setup?

💡 Show Answer

Learning rate (how big each step is during gradient descent).
Fraction of training data used (frac_data parameter when loading).
Lower and upper percentiles for feature scaling (e.g. 2% → value ≈ 5, 98% → value ≈ 250).

👉 Assignment: Adjust each hyperparameter and observe how it changes the results in both computational cost and final performance/accuracy.

💡 Hints

Try different learning rates (e.g. 0.01, 0.1, 1.0) and see how quickly or poorly the model converges.
Change the fraction of training data (frac_data=0.05, 0.1, 0.5, 1.0) to balance speed vs accuracy.
Modify the feature scaling bounds (e.g. low=1, high=254 vs low=5, high=250) and see how this affects stability and performance.

👉 Question: What did you find when testing different scaling bounds?

💡 Show Answer

You should have found that in this case the stretching of the spectral bands between some lower and upper percentile actually results in poorer results.
The best performance may occur when using the full raw range (low=0, high=255).

Why does this happen?
- For models like logistic regression with built-in regularization, rescaling the data can shrink feature values.
- Smaller features → smaller weights → regularization penalty dominates → the model underfits.
- In other words: while percentile-based scaling is often a “best practice” for noisy, heterogeneous datasets, here the raw reflectance values (0–255) already provide a meaningful spread.
- Neural networks might still benefit from percentile scaling, but this simple linear model actually performed better with the unscaled reflectances.

Convolutional Neural Network

🧠 U-Net: Convolutional Neural Network for Image Segmentation

U-Net is a type of Convolutional Neural Network (CNN) designed for semantic segmentation, where the goal is to assign a class label to every pixel in an image.

The architecture has a distinctive U-shaped structure with two main parts:

Encoder (contracting path):
A series of convolution and pooling layers that progressively downsample the image, capturing what is present (the context).
This part works like a standard CNN used for classification.
Decoder (expanding path):
A mirrored sequence of upsampling (deconvolution) layers that reconstruct the spatial details and recover the original image resolution, predicting where each object is located.

Crucially, U-Net uses skip connections between corresponding encoder and decoder layers — passing fine-grained spatial information directly from the downsampling path to the upsampling path.
This allows precise boundary localization, even when objects are small or irregularly shaped.

U-Net was originally developed for biomedical image segmentation, but it has become a standard model for many pixel-level tasks in Earth observation, medical imaging, and computer vision.

🧩 Tiny U-Net: A Simplified U-Shaped CNN for Fast Segmentation

TinyUNet is a lightweight version of the U-Net architecture, built to perform pixel-wise segmentation while keeping the network small and efficient.
It follows the same encoder–decoder pattern but with fewer layers and feature channels.

Structure overview:

Encoder (Downsampling path):
Two convolutional blocks (enc1, enc2) extract spatial features while MaxPool2d layers reduce the resolution.
This captures high-level context — what is in the image.
Bottleneck:
The deepest layer (bottleneck) connects encoder and decoder.
It processes the most abstract features before reconstruction begins.
Decoder (Upsampling path):
Two transpose-convolution layers (up2, up1) gradually restore spatial resolution.
Skip connections (torch.cat) merge encoder features with upsampled ones, preserving fine details — where objects are located.
Final layer:
A 1×1 convolution maps the last feature map to the desired number of output channels (e.g., a single mask).
A sigmoid activation converts outputs to probabilities between 0 and 1.

This “tiny” design retains the core logic of U-Net — context down, detail up, and skip connections — but uses far fewer parameters, making it ideal for: - quick experiments,
- low-resolution imagery, or
- deployment on limited hardware (e.g., laptops or embedded GPUs).

The entire code for the Tiny-UNET is pretty small:

Code

# -------------------------
# Tiny U-Net definition
# -------------------------
class TinyUNet(nn.Module):
    def __init__(self, in_channels=4, out_channels=1):
        super(TinyUNet, self).__init__()

        # Encoder
        self.enc1 = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1),
            nn.ReLU()
        )
        self.pool1 = nn.MaxPool2d(2)

        self.enc2 = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1),
            nn.ReLU()
        )
        self.pool2 = nn.MaxPool2d(2)

        # Bottleneck
        self.bottleneck = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.ReLU()
        )

        # Decoder
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec2 = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1),  # 64 = skip + upsample
            nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1),
            nn.ReLU()
        )

        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = nn.Sequential(
            nn.Conv2d(32, 16, 3, padding=1),  # 32 = skip + upsample
            nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1),
            nn.ReLU()
        )

        self.final = nn.Conv2d(16, out_channels, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        p1 = self.pool1(e1)

        e2 = self.enc2(p1)
        p2 = self.pool2(e2)

        b = self.bottleneck(p2)

        u2 = self.up2(b)
        d2 = self.dec2(torch.cat([u2, e2], dim=1))

        u1 = self.up1(d2)
        d1 = self.dec1(torch.cat([u1, e1], dim=1))

        return torch.sigmoid(self.final(d1))

Next, we define a python class to handle the data loading. This streamlines the process and also converts the data into an efficient torch.tensor format at the end. pytorch is perhaps the most popular python framework to work with deep neural networks, and handling data transfers to/from GPUs; an earlier approach, tensorflow, was developed by Google but it used coding language quite different from python, whereas pytorch looks closer to plain python.

Code

# -------------------------
# Dataset for image tiles
# -------------------------
class EvalTileDataset(Dataset):
    def __init__(self, base_dir, tile_size=64, frac_data=1.0, seed=0):
        img_files  = sorted(glob.glob(os.path.join(base_dir, "images", "*.tif")))
        mask_files = sorted(glob.glob(os.path.join(base_dir, "masks", "*.tif")))

        # optional subsample of clips
        n_files = len(img_files)
        n_select = max(1, int(frac_data * n_files))
        rng = random.Random(seed)
        idx_select = sorted(rng.sample(range(n_files), n_select))

        self.img_files  = [img_files[i] for i in idx_select]
        self.mask_files = [mask_files[i] for i in idx_select]
        self.tile_size  = tile_size
        self.tiles = []  # list of (file_idx, i, j)

        for fidx, (img_path, mask_path) in enumerate(zip(self.img_files, self.mask_files)):
            img = rioxarray.open_rasterio(img_path).values.transpose(1,2,0)  # HWC
            H, W, _ = img.shape
            for i in range(0, H - tile_size + 1, tile_size):
                for j in range(0, W - tile_size + 1, tile_size):
                    self.tiles.append((fidx, i, j))

    def __len__(self):
        return len(self.tiles)

    def __getitem__(self, idx):
        fidx, i, j = self.tiles[idx]

        img = rioxarray.open_rasterio(self.img_files[fidx]).values.transpose(1,2,0)
        mask = rioxarray.open_rasterio(self.mask_files[fidx]).values[0]

        # crop tile
        X_tile = img[i:i+self.tile_size, j:j+self.tile_size, :]
        y_tile = np.where(np.isin(mask[i:i+self.tile_size, j:j+self.tile_size], [2,3]), 1, 0)

        return (
            torch.tensor(X_tile.transpose(2,0,1), dtype=torch.float32),
            torch.tensor(y_tile[np.newaxis, ...], dtype=torch.float32)
        )

Load our data into torch.tensor objects:

Code

# -------------------------
# Build datasets & loaders
# -------------------------
train_ds = EvalTileDataset("dataset_compact/train", tile_size=64, frac_data=0.1)
val_ds   = EvalTileDataset("dataset_compact/val",   tile_size=64, frac_data=0.1)

train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
val_loader   = DataLoader(val_ds, batch_size=16)

Define some training parameters, including an option to train on a GPU when available:

Code

device = "cuda" if torch.cuda.is_available() else "cpu"

model = TinyUNet(in_channels=4, out_channels=1).to(device)

# you may remember that Adam is one way to automatically adjust and optimize the learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# what loss function to minimize. BCE stands for binary-cross-entropy, which is
# the standard loss/error function for logistic regression that we've used from the start
loss_fn = nn.BCELoss()

# how many epochs to train for. you shouldn't need many
n_epochs = 3

Code

%%time
# -------------------------
# Training loop
# -------------------------

for epoch in range(n_epochs):
    model.train()
    train_loss = 0.0
    for Xb, yb in train_loader:
        Xb, yb = Xb.to(device), yb.to(device)
        optimizer.zero_grad()
        y_pred = model(Xb)
        loss = loss_fn(y_pred, yb)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    print(f"Epoch {epoch+1}/{n_epochs}, Loss: {train_loss/len(train_loader):.4f}")

Epoch 1/3, Loss: 0.3167
Epoch 2/3, Loss: 0.2108
Epoch 3/3, Loss: 0.2038
Classification report on validation set (Tiny U-Net):
              precision    recall  f1-score   support

         0.0       0.94      0.97      0.96    844951
         1.0       0.95      0.89      0.92    465769

    accuracy                           0.94   1310720
   macro avg       0.95      0.93      0.94   1310720
weighted avg       0.94      0.94      0.94   1310720

Now that you have trained the U-Net model, we can run .eval() on the validation/test data and see how well this fancy model performed. Note that some of the extra steps below have to do with turning torch.tensor objects back into plain numpy arrays.

Code

# -------------------------
# Evaluation
# -------------------------
model.eval()

all_preds, all_truth = [], []
with torch.no_grad():
    for Xb, yb in val_loader:
        Xb = Xb.to(device)
        preds = model(Xb).cpu().numpy()
        preds = (preds > 0.5).astype(np.uint8).ravel()
        all_preds.append(preds)
        all_truth.append(yb.numpy().ravel())

y_true = np.concatenate(all_truth)
y_pred = np.concatenate(all_preds)

print("Classification report on validation set (Tiny U-Net):")
print(classification_report(y_true, y_pred))

How does the accuracy of the U-Net model compare to our earlier models?

🛰️ Week 10 Lab — Satellite Image Segmentation with CNNs

In this lab, we explored pixel-wise land–water classification using multispectral satellite imagery.
The goal was to label every pixel as land or water, progressing from simple machine-learning models to a compact convolutional neural network.

🧩 Key steps

Data preparation
- Downloaded pre-labeled RGB-NIR image clips from the DeepWaterMap dataset.
- Organized them into train/validation/test splits and inspected them visually.
Feature scaling and preprocessing
- Demonstrated why raw 8-bit reflectances (0–255) are unsuitable for logistic regression due to sigmoid saturation.
- Applied 2–98 % percentile clipping and min–max normalization to map features into a [0, 1] range.
- Collapsed the original four water-confidence labels (0–3) into binary land/water labels.
Modeling progression
- Linear Logistic Regression (scikit-learn): a quick baseline for separating land vs. water using per-pixel features.
- Custom Logistic Regression (from scratch): implemented manually to illustrate weight updates and loss behavior.
- Tiny U-Net CNN: a compact encoder–decoder network with skip connections that performs spatially coherent segmentation.
Evaluation and visualization
- Compared predictions on validation and larger evaluation tiles.
- Used color-coded masks and optional OpenStreetMap (OSM) overlays to assess geographic alignment.

💡 Takeaway points

Each pixel’s spectral bands serve as input features; segmentation extends classification to a per-pixel task.
Normalization is essential—scaled reflectances prevent vanishing or saturated activations.
Simple linear models can capture basic land/water contrast, but CNNs (U-Nets) exploit spatial context for far better accuracy and smoother boundaries.
Skip connections in U-Net architectures help retain fine details lost during down-sampling.
The lab demonstrates a full pipeline: from raw remote-sensing data ➜ preprocessing ➜ feature engineering ➜ model training ➜ visual validation.

In short: this lab bridges traditional machine learning and modern deep learning for Earth-observation imagery, showing why convolutional architectures like U-Net have become the standard for semantic segmentation.