Module 5: Data Preprocessing & Pipelines

What this module covers

This module shifts from the models themselves to the steps that come before modeling: preparing real-world data so that machine learning algorithms can work effectively. These steps are usually called data preprocessing (and overlap with feature engineering), and in practice they often take up a large share of a data scientist’s time.

You’ll learn three essential components:

  • Feature scaling — making sure different numeric features (like lot size vs. year built) are on comparable scales
  • Encoding categorical variables — turning string-based categories (like “Neighborhood” or “Roof Style”) into numeric features using one-hot encoding
  • Imputation — filling in missing values, which are common in real datasets
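As a rough illustration of the three components above, the following minimal sketch applies each one separately to a tiny table. The data and column names (`lot_area`, `year_built`, `roof_style`) are hypothetical and only mimic a housing dataset; the choices of median imputation and standardization are assumptions, not the only reasonable options.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; values and column names are made up for illustration.
df = pd.DataFrame({
    "lot_area": [8450.0, 9600.0, np.nan, 11250.0],
    "year_built": [2003.0, 1976.0, 2001.0, 1915.0],
    "roof_style": ["Gable", "Hip", "Gable", "Gable"],
})
numeric_cols = ["lot_area", "year_built"]

# Imputation: replace the missing lot_area with the column median.
imputed = SimpleImputer(strategy="median").fit_transform(df[numeric_cols])

# Feature scaling: rescale each numeric column to mean 0 and unit variance,
# so lot_area (thousands) and year_built end up on comparable scales.
scaled = StandardScaler().fit_transform(imputed)

# One-hot encoding: one 0/1 column per distinct category value.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["roof_style"]]).toarray()

print(encoder.categories_[0])  # the category order used for the 0/1 columns
print(scaled.round(2))
print(encoded)
```

Here each transformer is fit on the whole table only to keep the example short; with a real train/test split, each one should be fit on the training portion alone.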

You’ll also learn how to combine all of these steps into a clean scikit-learn Pipeline, the standard way to package preprocessing and modeling together in practice. Because a pipeline fits every preprocessing step on the training data only and then reuses those fitted steps on new data, it helps prevent data leakage, makes your workflow reproducible, and is the foundation for proper cross-validation (covered in the next module).
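One possible way to package these steps is sketched below, using `Pipeline` together with `ColumnTransformer` so that numeric and categorical columns get different treatment. The synthetic data, column names, and step names (`impute`, `scale`, `onehot`, `regress`) are assumptions made for illustration and are not taken from the course notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a housing table (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "lot_area": rng.normal(10000, 2000, n),
    "year_built": rng.integers(1900, 2010, n).astype(float),
    "neighborhood": rng.choice(["A", "B", "C"], n),
})
y = (50000 + 10 * X["lot_area"] + 500 * (X["year_built"] - 1900)
     + rng.normal(0, 5000, n))
X.loc[X.index % 10 == 0, "lot_area"] = np.nan  # inject some missing values

# Numeric columns: impute with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: impute with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["lot_area", "year_built"]),
    ("cat", categorical, ["neighborhood"]),
])

# Preprocessing and model travel together as a single estimator.
model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model.fit(X_train, y_train)           # medians, means, categories learned from train only
predictions = model.predict(X_test)   # the same fitted steps are applied to test data
test_r2 = model.score(X_test, y_test)
print(f"Test R^2: {test_r2:.3f}")
```

Because `fit` is called only on the training split, the imputation medians and scaling statistics never see the test rows, which is the leakage protection described above.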

Materials

Slides: Data Preprocessing & Pipelines (pdf)

Lecture: Machine Learning Pipelines — interactive notebook covering feature scaling, encoding, imputation, and Pipeline construction.

Practice: Predicting House Prices with Pipelines — a complete end-to-end exercise using the Ames Housing dataset. You’ll explore the data, handle missing values and categorical variables, train a linear regression model, evaluate it with multiple metrics (\(R^2\), RMSE, MAE, RMSLE), and build a clean pipeline.
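For orientation, the four metrics named above can be computed along these lines. This is a minimal sketch with made-up prices and predictions, not output from the exercise; RMSE and RMSLE are taken here as square roots of scikit-learn’s `mean_squared_error` and `mean_squared_log_error`.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_squared_log_error,
    r2_score,
)

# Hypothetical true prices and model predictions (made-up numbers).
y_true = np.array([200000.0, 150000.0, 320000.0, 95000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 100000.0])

r2 = r2_score(y_true, y_pred)                            # share of variance explained
mae = mean_absolute_error(y_true, y_pred)                # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))       # penalizes large errors more
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))  # error on log(1 + y): relative error

print(f"R^2={r2:.3f}  MAE={mae:.0f}  RMSE={rmse:.0f}  RMSLE={rmsle:.4f}")
```

MAE and RMSE are in the same units as the target (dollars here), while RMSLE depends on the ratio between prediction and truth, so a given percentage miss counts about the same on a cheap house as on an expensive one.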

Prerequisites

Module 4 — logistic regression and classification.


Next module: Module 6: Cross-Validation & Regularization — how to detect and fix overfitting.