Module 5: Data Preprocessing & Pipelines
What this module covers
This module shifts focus from the models themselves to the steps that come before modeling: preparing real-world data so that machine learning can work effectively. These steps are often called data preprocessing or feature engineering, and in practice they often take up a large share of a data scientist’s time.
You’ll learn three essential components:
- Feature scaling — making sure different numeric features (like lot size vs. year built) are on comparable scales
- Encoding categorical variables — turning string-based categories (like “Neighborhood” or “Roof Style”) into numeric features using one-hot encoding
- Imputation — filling in missing values, which are common in real datasets
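As a rough preview, the three components above might look like this in scikit-learn. This is a minimal sketch on a made-up four-row table: the column names (`LotArea`, `YearBuilt`, `RoofStyle`) and values are illustrative assumptions, not the course dataset, and the median strategy is just one of several imputation choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: column names and values are invented for illustration.
df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0],   # one missing value
    "YearBuilt": [2003.0, 1976.0, 2001.0, 1915.0],
    "RoofStyle": ["Gable", "Hip", "Gable", "Gable"],
})

numeric = df[["LotArea", "YearBuilt"]]

# Imputation: fill the missing LotArea with the median of the observed values.
imputed = SimpleImputer(strategy="median").fit_transform(numeric)

# Feature scaling: rescale each column to mean 0 and unit variance,
# so lot size and year built end up on comparable scales.
scaled = StandardScaler().fit_transform(imputed)

# One-hot encoding: one 0/1 column per category found in RoofStyle.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["RoofStyle"]]).toarray()

print(imputed[2, 0])           # the filled-in lot size
print(encoder.categories_[0])  # categories learned from the data
print(encoded)
```

Here each transformer is fitted on the whole toy table for brevity; with real data, the fitting should use training rows only.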
You’ll also learn how to combine all of these steps into a clean scikit-learn Pipeline, the standard way to package preprocessing and modeling together in practice. Because a Pipeline fits its preprocessing steps only on the data passed to `fit`, it helps prevent data leakage from the test set into training. Pipelines also make your workflow reproducible and are the foundation for proper cross-validation (covered in the next module).
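One possible shape for such a Pipeline is sketched below. It assumes synthetic data with hypothetical column names (`LotArea`, `YearBuilt`, `Neighborhood`) and an invented price formula; the step names (`"impute"`, `"scale"`, `"onehot"`, `"preprocess"`, `"regress"`) are arbitrary labels chosen for this example, and the course notebook may structure things differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a housing table (all values are invented).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "LotArea": rng.normal(10000, 2000, n),
    "YearBuilt": rng.integers(1900, 2010, n).astype(float),
    "Neighborhood": rng.choice(["A", "B", "C"], n),
})
y = 50000 + 10 * X["LotArea"] + 500 * (X["YearBuilt"] - 1900) + rng.normal(0, 5000, n)
X.loc[::10, "LotArea"] = np.nan  # knock out every tenth lot size

# Numeric columns: impute, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: impute, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own preprocessing steps.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["LotArea", "YearBuilt"]),
    ("cat", categorical_steps, ["Neighborhood"]),
])

# Preprocessing and model packaged as a single estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("regress", LinearRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)          # medians, means, categories learned from training rows only
predictions = model.predict(X_test)  # the same fitted transforms are reused on test rows
r2 = model.score(X_test, y_test)
print(f"Test R^2: {r2:.3f}")
```

Because the imputer, scaler, and encoder are fitted inside `model.fit`, they never see the test rows, which is the leakage protection described above.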
Materials
Slides: Data Preprocessing & Pipelines (pdf)
Lecture: Machine Learning Pipelines — interactive notebook covering feature scaling, encoding, imputation, and Pipeline construction.
Practice: Predicting House Prices with Pipelines — a complete end-to-end exercise using the Ames Housing dataset. You’ll explore the data, handle missing values and categorical variables, train a linear regression model, evaluate it with multiple metrics (\(R^2\), RMSE, MAE, RMSLE), and build a clean pipeline.
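For orientation, the four metrics named in the practice exercise can be computed roughly as follows. This is a small sketch with made-up prices and predictions, not results from the Ames Housing dataset; RMSE and RMSLE are taken here as square roots of scikit-learn's mean squared error and mean squared log error.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_squared_log_error,
    r2_score,
)

# Invented sale prices and predictions, for illustration only.
y_true = np.array([200000.0, 150000.0, 320000.0, 90000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 100000.0])

r2 = r2_score(y_true, y_pred)                             # share of variance explained
mae = mean_absolute_error(y_true, y_pred)                 # average absolute error, in dollars
rmse = np.sqrt(mean_squared_error(y_true, y_pred))        # penalizes large errors more than MAE
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))   # error on a log scale (relative error)

print(f"R^2={r2:.3f}  MAE={mae:.0f}  RMSE={rmse:.0f}  RMSLE={rmsle:.4f}")
```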
Prerequisites
Module 4 — logistic regression and classification.
Next module: Module 6: Cross-Validation & Regularization — how to detect and fix overfitting.