🧪 (Binary) Logistic Regression with Pipelines, Cross-Validation, and Regularization
In this lab, you will build up logistic regression models step by step and see how to make them more powerful, robust, and reliable.
We’ll start with the basics and then gradually add tools that help us deal with more complex data and avoid overfitting. Along the way, you’ll also learn how to use scikit-learn’s pipelines to keep your workflow neat and reproducible.
What you’ll do in this lab:
Warm-up: Load datasets, split into training and validation sets, and standardize features.
Logistic regression basics: Train simple linear logistic regression models and plot their decision boundaries.
Polynomial feature expansion: Add higher-order polynomial features to capture nonlinear decision boundaries.
Pipelines: Combine preprocessing (scaling, polynomial expansion) and logistic regression into a single reusable pipeline.
Cross-validation (CV): Compare manual CV loops with scikit-learn’s built-in tools (cross_val_score, cross_validate) and see why CV helps detect overfitting/underfitting.
Regularization: Explore Ridge (L2) and Lasso (L1) penalties, learn how the regularization strength λ affects the coefficients, and visualize how features are suppressed.
Hyperparameter tuning: Write your own nested loops to search for the best polynomial degree and λ, and then see how GridSearchCV automates this search.
👉 By the end of this lab, you’ll know how to: - Train logistic regression models with linear and nonlinear features.
- Use cross-validation to evaluate models more reliably.
- Apply regularization to prevent overfitting.
- Automate model selection with pipelines and grid search.
def plot_boundary_poly(clf, X_tr_s, y_tr, poly, title="", padding=0.5, n=300): x_min, x_max = X_tr_s[:,0].min()-padding, X_tr_s[:,0].max()+padding y_min, y_max = X_tr_s[:,1].min()-padding, X_tr_s[:,1].max()+padding xx, yy = np.meshgrid(np.linspace(x_min, x_max, n), np.linspace(y_min, y_max, n)) grid = np.c_[xx.ravel(), yy.ravel()] Z = clf.predict_proba(poly.transform(grid))[:,1].reshape(xx.shape)# note that the poly.transform(grid) above is the only difference w.r.t. the earlier function plt.figure() cs = plt.contourf(xx, yy, Z, levels=np.linspace(0,1,11), alpha=0.25) plt.colorbar(cs, label="P(y=1)") plt.contour(xx, yy, Z, levels=[0.5], linewidths=2) plt.scatter(X_tr_s[:,0], X_tr_s[:,1], c=y_tr, s=25, edgecolor='k', alpha=0.9) plt.title(title); plt.show()
Code
def plot_decision_boundary(clf, X, y, title="", padding=0.5, n=300, ax=None):""" Plot the decision boundary and class-1 probability shading for a classifier or pipeline. This function creates a 2D contour plot showing how a trained model classifies data in feature space. It shades the background according to the predicted probability of class 1 (from 0 to 1), draws the decision boundary where P(y=1) = 0.5, and overlays the actual data points. Parameters ---------- clf : estimator or pipeline The trained classifier (e.g. LogisticRegression, or a Pipeline with scaling and polynomial features). Must support `.predict_proba`. X : array-like of shape (n_samples, 2) Feature matrix with exactly two columns (so results can be plotted in 2D). y : array-like of shape (n_samples,) Labels for the data (0/1 in binary classification). title : str, optional (default="") Title for the plot. padding : float, optional (default=0.5) Extra margin to add around the scatter plot when drawing the decision surface. n : int, optional (default=300) Resolution of the grid used to compute predicted probabilities. Larger values give smoother contours but may be slower. ax : matplotlib.axes.Axes, optional (default=None) Axis to plot on. If None, a new figure and axis are created. Pass an existing axis when using subplots. Returns ------- ax : matplotlib.axes.Axes The axis with the plotted decision boundary and scatter points. Notes ----- - The background color map shows P(y=1), with darker colors indicating higher probability. - The solid contour line marks the decision boundary (P(y=1) = 0.5). - Points are plotted with their true labels for comparison. - If `ax=None`, a colorbar is also added to the figure. """# mesh grid x_min, x_max = X[:,0].min()-padding, X[:,0].max()+padding y_min, y_max = X[:,1].min()-padding, X[:,1].max()+padding xx, yy = np.meshgrid(np.linspace(x_min, x_max, n), np.linspace(y_min, y_max, n)) grid = np.c_[xx.ravel(), yy.ravel()]# predict probabilities Z = clf.predict_proba(grid)[:,1].reshape(xx.shape)# pick axisif ax isNone: fig, ax = plt.subplots(figsize=(6,6)) standalone =Trueelse: standalone =False# plot cs = ax.contourf(xx, yy, Z, levels=np.linspace(0,1,11), alpha=0.25) ax.contour(xx, yy, Z, levels=[0.5], linewidths=2) ax.scatter(X[:,0], X[:,1], c=y, s=25, edgecolor="k", alpha=0.9) ax.set_title(title)if standalone: plt.colorbar(cs, ax=ax, label="P(y=1)") plt.show()return ax
Quick Note on Printing in Python: unformatted vs. formatted
Before we get started, in case you’re not familiar yet with formatted strings / print statements in python, you might want to learn about it:
Unformatted print (comma-separated arguments)
print("Variable A:", A, "Variable B:", B)
Each argument is separated by a space when it is printed. Output looks simple but spacing/formatting is less customizable.
Formatted print (f-strings)
print(f"Variable A: {A}; Variable B: {B}.\n")
Lets you embed variables directly inside the string. You can control punctuation, spacing, line breaks, and even apply formatting (e.g., :.2f for decimals).
👉 f-strings are the modern, recommended way in Python 3.6+. For more details, see the Python docs on f-strings.
Code
# Example of formatted versus unformatted print statements:# Example numberspi =3.141592653589793big =1234567890.123456789tiny =0.000000123456789# --- Unformatted printing ---print("Unformatted print:")print("pi =", pi, "big =", big, "tiny =", tiny)# --- Formatted printing with f-strings ---print("\nFormatted print with f-strings:")print(f"pi = {pi:.2f}, big = {big:.2e}, tiny = {tiny:.3e}")
1) Load the Google Sheets
Here’s a nice and compact way to load in the two datasets what we used in the lecture of Week 4 (find associated codes in: Week_4_Lecture_Logistic_Regression_Classification.ipynb ).
Make a mental note of this: if you want to use Colab for your own research later, you may want to store your own data files in a Google Drive as well so you can access them from a jupyter notebook.
Code
def gsheet_csv(sheet_id, gid=0):returnf"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}"sheetA_id ="1X-jEUKsulZIGilE91PxK3Nse968gGW9wDwg0lLxP14c"sheetB_id ="1eq-Awuub_ChZi5YIqlbq81_RWUUrFBIqRwVeMZerVUI"try: dfA = pd.read_csv(gsheet_csv(sheetA_id, 0)) # This data file happens to have column names on the first row colnames = ["x1", "x2", "y"] dfB = pd.read_csv(gsheet_csv(sheetB_id, 0), header=None, names=colnames) # this file has no column names, so we add manuallyexceptExceptionas e:print("⚠️ Could not fetch Google Sheets. Download CSVs locally.", e) dfA, dfB = pd.DataFrame(), pd.DataFrame()print("Dataset A shape:", dfA.shape)print("Dataset B shape:", dfB.shape)display(dfB.head())
2) Pick two numeric features and target column for each dataset
For each dataset, split the dataframe that we just loaded into a feature array and the target variable (‘y’). It may make sense to carry the ‘A’ and ‘B’ forward in variable names, so you can easily look at just one dataset or the other or mix them up. For now, define X_A, y_A and X_B, y_B as the features and target of problem dataset A/B, respectively, and make sure they are now numpy arrays. This and many other steps in this lab were also done in the lecture notes, so you can look there for ideas.
Code
# Split dfA into XA and yA and convert to numpy array# Split dfB into XB and yB and convert to numpy array
Whenever you do anything like this, it’s a good idea to check the shapes of your new variable and maybe inspect the first rows, for example.
Importantly: check what fraction/percentage of your binay target labels is 0 versus 1. If there are far more samples with one class than another (i.e. mostly 0 or mostly 1 labels), we have to be extra careful with some of our sampling methods.
Code
# Instect shapes/elements of your data
Add scatter plots of what the data look like, labeled by the target feature. Look in the lecture notes how we did that and copy-paste here so you can see which dataset and which and remind yourself of what they look like.
Code
# Add visualization of your data with matplotlib
3) Train/Test split and feature scaling
Split your datasets A and B, each, into 70% training data and 30% validation data.
💡 Hint: options in train_test_split
The function comes from sklearn.model_selection:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(
X, y,
test_size=0.3, # 30% for validation
random_state=42, # makes the split reproducible
stratify=y # (optional) keep class balance in both splits
)
test_size: fraction (0–1) or absolute number of samples for the validation set.
train_size: you can specify instead of test_size.
random_state: fixes the random seed so everyone gets the same split.
stratify: ensures the class proportions are preserved in both train and validation sets (especially useful if classes are imbalanced).
For this lab, instead of just X_train, X_val, y_train, y_val, make sure to use different variable names for the datasets A and B.
Code
# Split feature array XA and target yA into training and validation features and the corresponding training and validation labels# Do the same for dataset B
In the lecture last week, we discussed “min-max” scaling, which scales all features to the [0, 1] range. scikit-learn has an easy way to do this with MinMaxScaler. You could totally use that option for this lab, but for logistic regression, sometimes the “Standard Scaler” performs better.
The StandardScaler transforms each feature (column) of your data so that it has:
Mean = 0
Standard deviation = 1
The formula is:
\[
z = \frac{x - \mu}{\sigma}
\]
where:
- \(x\) = original feature value
- \(\mu\) = mean of that feature (computed from the training data)
- \(\sigma\) = standard deviation of that feature (from the training data)
- \(z\) = the standardized (scaled) value
Why is this useful?
It ensures all features are on the same scale, which helps optimization algorithms converge faster.
It prevents features with large ranges (e.g. “income” in dollars vs. “age” in years) from dominating the model.
It makes regularization (L1/L2) work fairly across all features.
Important:
Fit the scaler only on the training data (scaler.fit(X_train)) and then apply it to both training and validation/test data (scaler.transform(...)). This avoids leaking information from the validation/test sets.
Scale the input features of both datasets A and B using StandardScaler.
💡 Hint: how to use StandardScaler
The scaler comes from sklearn.preprocessing:
from sklearn.preprocessing import StandardScaler# Fit the scaler *only on the training data*scalerA = StandardScaler().fit(XA_tr)# Transform both training and validation dataXA_tr_s = scalerA.transform(XA_tr)XA_val_s = scalerA.transform(XA_val)fit: computes the mean and standard deviation from the training data.transform: applies the scaling (subtract mean, divide by std) to any dataset.
Important: never fit on the validation/test data, only on the training data. Otherwise you leak information from validation into training. Repeat the same procedure for dataset B.
Code
# define StandardScaler for dataset A and 'fit' and then 'transform' to training/validation data for A# Do the same for dataset B
Again, double check that your data still good look (e.g., print min and max of each variable).
Code
# Inspect your data
4) Linear logistic regression (no regularization)
Come up with a name for your fitting model, which is necessary to work with scikit-learn functions. Define your model parameters, and then fit it to your training data. Do the same for dataset A and B.
We have to explicitly specify not to use a regularization penalty, because that is actually used by default (so you have to say penalty=None). You can set max_iter=5000 just to make sure it converges even for polynomial features (default is 100). Setting random_state=RNG_SEED where you define RNG_SEED with some number somewhere (see cell that loads libraries) makes the randomness in the model deterministic (should give same results each time your run it).
FYI, you can also play around with different solvers: - lbfgs
Quasi-Newton optimizer, fast and robust. Works well for multiclass problems and with L2 regularization. Default choice.
liblinear
Coordinate descent on a linear SVM-style formulation. Good for small datasets. Supports L1 or L2 regularization, but only for binary classification.
newton-cg
Newton–Raphson method using conjugate gradients. Handles multinomial loss. Requires more memory than lbfgs.
newton-cholesky
Newton method using Cholesky decomposition. Typically faster than newton-cg for smaller datasets.
sag
Stochastic Average Gradient — efficient for large datasets, but only supports L2 regularization.
saga
Stochastic Average Gradient with variance reduction. Supports both L1 and L2 regularization, and elastic net. Works for large-scale problems.
👉 For most cases, lbfgs is the recommended default. Once we get to polynomial models, we’ll have many features and sometimes lbfgs may not converge; in such a case, try, e.g., newton-cg.
Remember, just like for LinearRegression for numerical regression, after defining our solvers, you fit it to data with something like linA.fit(XA_tr_s, yA_tr).
Define solver and fit for each dataset (A and B).
Code
%%time
Make predictions on your separate test/validation data for A and B.
Use scikit-learn’s accuracy_score(y_true, y_pred) to see the accuracy of your current predictions for each dataset.
Code
You should see that the accuracy on one of the datasets is pretty good but it is really bad for the other dataset.
Do you know why?
💡 Hint
Logistic regression with only a linear model can only draw a straight line as its decision boundary.
If a dataset is close to linearly separable, then a straight line works well — accuracy will be high (around 90%+).
But if the true separation is curved (e.g. circular, quadratic), then a straight line cannot capture it.
The model underfits → low accuracy (close to random guessing).
This is why we later introduce polynomial features (e.g. quadratic terms) so logistic regression can fit curved boundaries.
5) Plotting helper for boundaries
Plot the decision boundaries for the models that you just trained/fitted, for each dataset.
Use the nifty function plot_decision_boundary defined above. Remember that for any function you can always run help(plot_decision_boundary) to get more information on its input and output parameters etc.
Code
help(plot_decision_boundary)
I’ll give you an example of how to plot the decision boundaries for both datasets side-by-side using subplots, which takes advantage of the last argument of the plot_decision_boundary function. The first 3 input arguments may have different names in the code you wrote yourself above, so make sure you understand and edit as needed.
This should again clearly show why a linear model for logistic regression works well for one problem, which is linearly separable, but doesn’t work for the other problem, where we need higher-order features.
Question: Which, if either, of these models is underfitting? Which is overfitting? Which probably has a high bias? And which a high variance?
💡 Answer
The linear logistic regression model applied to data that really require quadratic (curved) features is underfitting. It cannot capture the true decision boundary. This corresponds to high bias (systematic error). At this stage, neither model is clearly overfitting. The polynomial model hasn’t yet shown strong signs of memorizing noise. So we don’t attribute high variance to either model here. In short: Linear model on quadratic data → underfitting, high bias Neither model (yet) → high variance
6) Polynomial features + logistic regression (degree 6)
In order to potentially address underfitting, we will try a higher-order polynomial model for each dataset (even though one of them already seems to be doing okay with a linear model).
Create 6th order polynomial features for each of your datasets. It is okay to use scikitlearn tools like poly = PolynomialFeatures(degree=6, include_bias=False).
💡 Hint: how to use PolynomialFeatures
The transformer comes from sklearn.preprocessing:
from sklearn.preprocessing import PolynomialFeaturespoly = PolynomialFeatures(degree=6, include_bias=False)# Option 1: shortcutXA_tr_poly = poly.fit_transform(XA_tr_s) # fit + transform in one stepXA_val_poly = poly.transform(XA_val_s) # only transform on validation# Option 2: two separate callspoly.fit(XA_tr_s) # learn mapping from training dataXA_tr_poly = poly.transform(XA_tr_s) # apply to trainingXA_val_poly = poly.transform(XA_val_s) # apply to validationfit: learns the mapping (which polynomial combinations to create).transform: applies that mapping to a dataset.fit_transform: convenience method that does both in one go.
Note: (more explanation)
the .fit step may seem confusing (I find it confusing myself). To be clear PolynomialFeatures doesn’t fit anything. This is just the naming scheme that a lot of scikit-learn tools follow. In this case, PolynomialFeatures has to take a look at the feature array basically just to see how many features there are because it will generate not just features \(x_1^2\), \(x_1^3\), $$, \(x_1^6\) but –because we have two features– we also need all the cross terms like \(x_1 x_2\) and \(x_1 x_2^6\), etc. If we had more or fewer features, there would be more/fewer of those cross-terms. So the number of output features is not know by just specifying PolynomialFeatures(degree=6, include_bias=False). It needs to do poly.fit(XA_tr_s) or poly.fit_transform(XA_tr_s) to figure out the proper output. Once it knows that, you can use keep using .transform to apply it to other feature arrays with the same number of features.
⚠️ Important: Always fit the transformer on training data only, then reuse it to transform both training and validation/test data. This ensures no information leaks from validation/test into training.
Code
# define polynomial model with some variable namepoly =# 'fit' + 'transform' or 'fit_transform' it to your training data for dataset A and then also for B# apply the same model to your validation/test data
Next, redo the logistic regression but now applied to your polynomial features.
Code
# you can se the exact same syntax as before to define and run/fit your LogisticRegression models# just .fit it to the appropriate polynomial feature arrays
Just like before, used your trained model to make predictions for the validation data that you set aside and print out the accuracy of those predictions for both datasets.
Code
# obtain precisions for predictions on validation data
How do the accuracies compare to our purely linear logistic regression results?
Plot degree-6 boundaries
To visualize the result of fitting our 6th order logistic regressions models, we want to again see what the decision boundaries look like.
In the top, we defined a convenient function to plot these: plot_boundary_poly(clf, X_tr_s, y_tr, poly, title="", padding=0.5, n=300).
First, use the plot_boundary_poly function to show your training data and the decision boundary for dataset A and B.
This is kind of a temporary way of doing it. In a minute, you’ll define a pipeline, which makes the data preprocessing and plotting much easier.
Code
plot_boundary_poly(polyA, XA_tr_s, yA_tr, poly, "Dataset A — Degree-6")plot_boundary_poly(polyB, XB_tr_s, yB_tr, poly, "Dataset B — Degree-6")
Next, do the same but now for your validation data.
Are the decision boundaries the same or different? Does that make sense?
💡 Answer
The decision boundary in logistic regression is defined by the set of points \(x\) that satisfy
\[
\sum_j \theta_j x_j = 0,
\]
where the coefficients \(\theta_j\) are determined during training.
Once the model is trained, these coefficients are fixed. The decision boundary is therefore completely determined by the trained model and does not depend on whether we are plotting training data or validation/test data.
The scatter points (training vs. validation) will of course change when you switch datasets, but the boundary remains the same. Geometrically, the boundary says:
Points with features \(x\) that lie on one side of the line (or curve, if polynomial features are included) are predicted as class 0.
Points on the other side are predicted as class 1.
So the boundary itself is fixed after training, and we simply use it to classify any future measurements of \(x\).
Suggestion: in the top cell where we imported all the libaries, change RNG_SEED to some other value. Remember, this is what makes randomness reproducible. If you change it to another value, you get different randomness. After you change it, come back to this cell and select it, find the ‘Runtime’ menu in the top, and hit ‘Run before’ This will re-run all the cells until this one.
Do the decision boundaries look the same or different? What does that mean?
Do these decision boundaries look like we’re done and this is the best model for this problem?
7) Pipeline and Cross-Validation
Explore how defining a pipeline can greatly simpify the data pre-processing that we did above.
Here’s a simple example of how we can define a function, which we call make_logreg_pipeline (or whatever you like) that combines the feature scaling, generating polynomial features, and defining what LogisticRegression model we want to run (but not actually run it yet).
Code
# --- define pipeline ---def make_logreg_pipeline(degree=8):return Pipeline([ ("scaler", StandardScaler()), ("poly", PolynomialFeatures(degree=degree, include_bias=False)), ("clf", LogisticRegression( penalty=None, # no regularization solver="lbfgs", # you can chanche this to, e.g. max_iter=20000# crank up iterations )) ])
One reasons to wrap this in a function is that we can choose which of the parameters we want to keep control over. In the example below, make_logreg_pipeline only has one input argument, the degree of polynomial. We set this up with a default value of 8 when no order is specified, i.e. when you just call make_logreg_pipeline(). But in general, you could call make_logreg_pipeline(degree=2), with whatever order you want to try out.
Other parameters would currently always have the same values when you call this function. Specifically, the values of: * penalty=None, # no regularization * solver="lbfgs", # you can chanche this to, e.g. * max_iter=20000 # crank up iterations
You’ll want to change this later in the lab.
As an illustration of how easy it is to use the pipeline, I’ll show below how to fit it to all the data for dataset B (no train/validation split in this illustration). Then, we run the predict step over the same data, just so we can compute the accuracy of the model and plot the data and decision boundary again.
Code
# --- fit pipeline ---# it may be useful to define your own variable for polynomial degree:my_poly_degree =4pipe = make_logreg_pipeline(degree=my_poly_degree)pipe.fit(XB, yB)# --- evaluate accuracy ---y_pred = pipe.predict(XB)acc = accuracy_score(yB, y_pred)print(f"Training accuracy: {acc:.3f}")# now we can use a nicely automatic formatted title as the last argument:# plot_decision_boundary(pipe, XB, yB, title=f"Degree={my_poly_degree} logistic regression")plot_decision_boundary(pipe, XB, yB, title=f"Degree={my_poly_degree} logistic regression")plt.show()
Manual Cross-Validation
Okay, let’s start putting a lot of things together now.
⚠️ Important: what we’re about to do here is a simplified / “fake” version of cross-validation.
We will repeatedly split the dataset into training and validation sets at random, train a model, and check its accuracy and decision boundary. This gives us a sense of variability between different random splits. Proper k-fold cross-validation (where every sample is used exactly once for validation and \(k-1\) times for training) will come next.
Steps for this simplified version:
Define a variable for whatever polynomial degree you’d like to experiment with.
Define how many folds you want to simulate with some variable. Start with something modest until everything works, like 3.
Create a figure that will hold \(k\) panels of the \(k\) decision boundaries computed in each of the folds. You can do this with fig, axes = plt.subplots(1, n_folds, figsize=(4*n_folds, 4), constrained_layout=True), where n_folds would be the variable name for \(k\).
Create a loop, using something like for k in range(n_folds):.
For each iteration of the loop, do a new training:validation split of your data, just like we did before.
Start from the “raw” initial X and y arrays that we downloaded from the spreadsheet.
Focus here on Dataset B (the more complicated one that really needs a curved decision boundary).
Each split should be different. We can ensure this by giving a different random_state each time. The simplest is to just use the loop variable k, but any different value works.
Remember: this is not proper CV, because the splits can overlap — some points may never be used in validation, others may appear multiple times. But it’s a good warm-up to understand the process.
Define your pipeline like we did above, with the polynomial degree controlled by your chosen variable.
Fit the pipeline to your training data.
Compute the accuracy on your validation data for each fold. You can either:
use the shortcut pipe.score(X_val, y_val), or
explicitly compute predictions with pipe.predict(X_val) and then compare to y_val.
Print the accuracy for each fold.
Finally, plot the decision boundaries. The plotting function is set up so you can show them side by side in the larger figure, for example: plot_decision_boundary(pipe, X_val, y_val, title=f"Fold {k+1} val (deg={my_poly_degree})", ax=axes[k])
If you want more than 5 folds, you’ll need to modify the subplot grid to allow multiple rows.
👉 After this exercise, we’ll see how to do proper k-fold CV where each point is guaranteed to be used exactly once as validation.
Code
# Write pseudo-CV code
Again, pay attention to any variations in the accuracy of the model and what the decision boundaries look like for each CV fold. For a ‘good’ model, the accuracy on training and validation data should be similar for each CV fold and also across all the folds, and the decision boundaries should be similar across the folds.
Experiment by changing the order of the polynomial and running your CV again. For what order/degree of polynomial do you see the best results?
Proper CV
In the previous exercise, we did a “fake” version of cross-validation:
each time we randomly split the data into training (70%) and validation (30%) sets, fit a model, and checked performance. This is useful for illustrating variability between splits, but it is not true cross-validation.
Proper k-fold CV works differently:
The dataset is divided once into \(k\)disjoint folds (subsets).
In each round, one fold is used as the validation set, while the other \(k-1\) folds are combined for training.
This is repeated \(k\) times so that every sample appears exactly once in a validation set and \(k-1\) times in a training set.
At the end, you can average the validation accuracy across folds to get a more systematic estimate of performance.
Why do this?
- Guarantees that all data points are used for validation exactly once.
- Reduces the chance that results depend too much on a lucky (or unlucky) random split.
- Provides a fairer, more systematic evaluation of how well the model generalizes.
Disadvantages with small datasets (like ours with 100–117 points):
- If \(k\) is large, each fold is very small (e.g. with 10-fold CV, only ~10 points per validation set).
- Tiny validation sets can make accuracy estimates unstable (high variance).
- Training sets also shrink as \(k\) increases, which may bias the results.
That’s why in small datasets it’s common to limit ourselves to 5-fold CV (so ~20 points per validation set).
In contrast, in the “fake” CV approach (random 70:30 splits), we could repeat the process as many times as we liked, because each validation set always had ~30 points.
To modify your code above to do true CV, instead of using the train_test_split with a random seed, we can use manual index subsetting. I know this can be confusing, so let me give you the essential code, and just make sure you understand it.
What is the total number of measurements in your dataset (start with dataset B, but aferwards also try on A)?
Assuming that n_samples is the number of samples and n_folds defines how many CV folds you want to do, the number of samples in each fold needs to be fold_size = n_samples // n_folds. For example, if we had 100 measurements in total and want to do 5-fold CV, the each fold will have 20 samples. [Figure out for yourself (try in code cell) what the double-dash devision does and what the implications are if n_folds*fold_size is not exactly n_samples.]
Now, the critical difference from before is that inside your loop you can manually split your training and validation data as follows:
start = k * fold_size end = (k +1) * fold_size# --- split data directly by slicing --- X_val, y_val = XB[start:end], yB[start:end] X_tr = np.concatenate([XB[:start], XB[end:]]) y_tr = np.concatenate([yB[:start], yB[end:]])
See if you understand that code. Everything else should work just as in the previous example.
Code
# Write proper CV code
scikit-learn Cross-Validation
Now that you’ve explored CV in some detail and implemented it more or less from scratch yourself, you’ve ‘deserved’ to learn how to use ‘black box’ routines from scikit-learn to make this a lot easier.
The simplest to use, perhaps, is cross_val_score, which does the entire cross-validation in one line of code, once you’ve defined your elegant pre-processing pipeline.
Code
# Run 5-fold cross-validation# uncomment if you want to change the polynomial degree#pipe = make_logreg_pipeline(degree=my_poly_degree)scores = cross_val_score(pipe, XB, yB, cv=5)print("Fold accuracies:", scores)print("Mean accuracy:", scores.mean())
We could also have used the ‘manual’ Cross-Validation loop, but used scikit-learn‘s StratifiedKFold to do the data splits. This is not just more convenient (perhaps), but also a bit more sophisticated. The ’stratified’ refers to the fact that the routine makes sure that each fold has a similar balance of 0 and 1 labels.
Play around some more with the pipeline + cross-validation ‘package’ that you’ve put together by now. Try again what polynomial order gives you the best mean accuracy on validation data for a full cross-validation. Try the same for the dataset A if you haven’t already.
7) Regularization!
Now we have all the basics in place to unleash the magic of regularization and (mostly) automatically figure out what the best polynomial model is to fit our data.
By using scikit-learn pipelines, this is quite straighforward. Just slightly edit your pipeline function. Note that most of the below is some more elaborate documentation for when you run help() on the function. Lasso regularization doesn’t work with all of the solvers, so we make sure it uses one that works. When choosing the penalty, l2 stands for Ridge regularization and l1 is Lasso regularization. Elasticnet is a mix of both. lab is the regularization strength as defined in the lecture. In the ML world, it is perhaps more common to divide the total error by \(\lambda\), such that the MSE term ends up with a factor \(C=1/\lambda\). This is what the solvers expect, so I do that conversion inside the function.
💡 Quick recap: Ridge, Lasso, and Elastic Net
Ridge (L2 regularization)
Penalty: \[
\lambda \sum_j \theta_j^2
\]
- Shrinks coefficients smoothly towards zero.
- Rarely sets any coefficient exactly to zero.
- Good for stability, handles correlated features well.
Lasso (L1 regularization)
Penalty: \[
\lambda \sum_j |\theta_j|
\]
- Some coefficients shrink all the way to zero.
- Performs feature selection.
- Can be unstable if features are highly correlated (may drop one arbitrarily).
Elastic Net (mix of L1 and L2)
Penalty: \[
\lambda \left[ \alpha \sum_j |\theta_j| + (1 - \alpha) \sum_j \theta_j^2 \right]
\]
- Combines Ridge and Lasso.
- \(\alpha = 1\) → pure Lasso.
- \(\alpha = 0\) → pure Ridge.
- \(0 < \alpha < 1\) → mix of both.
- Keeps Lasso’s feature selection, but more stable with correlated features.
Code
def make_regularized_pipeline(degree=8, penalty="l2", lam=1.0, solver="lbfgs"):""" Create a pipeline with scaling, polynomial feature expansion, and logistic regression with configurable regularization. Parameters ---------- degree : int Degree of the polynomial features. penalty : str or None Type of regularization ("l2", "l1", "elasticnet", or None). Note: None = no regularization. lam : float Regularization strength λ (lambda). Larger λ = stronger penalty. Internally, C = 1/λ is passed to scikit-learn. solver : str Optimization algorithm. Default "lbfgs". - "lbfgs" supports only "l2" or no penalty. - For "l1", this function automatically switches to "liblinear". - "elasticnet" requires solver="saga" (not used here). """if lam <=0:raiseValueError("lam must be positive") C =1.0/ lam# Normalize penaltyif penalty isNoneorstr(penalty).lower() =="none": penalty =None# Safety checksif penalty =="l1": solver ="liblinear"elif penalty =="elasticnet":raiseValueError("Elastic Net requires solver='saga'. Not enabled in this lab.")elif penalty notin ("l2", None):raiseValueError(f"Unsupported penalty: {penalty}")return Pipeline([ ("scaler", StandardScaler()), ("poly", PolynomialFeatures(degree=degree, include_bias=False)), ("clf", LogisticRegression( penalty=penalty, C=C, solver=solver, max_iter=20000 )) ])
You can now easily rerun your CV analyses for either dataset and using any of the CV options you code up above. I would recommend using either the ‘fake’ or ‘true’ CV that you coded from scratch, so you can see the decision plots.
Play around with the order of polynomial, change between Ridge and Lasso regularization, and experiment with different values of \(\lambda\) to get the most consistent results.
to plot bar charts of the fitted coefficients for each of the CV folds and see what happens when using different types of regularization and different regularization strengths \(\lambda\).
Code
# Modify your CV code of choice to incorporate regularization into the pipeline
Use CV to find optimal hyper-parameters
You can now combine everything you’ve learned so far to use CV to help you determine what the best model and hyper-parameters are. As always, first we’ll do it from scratch (hard-coded) and then you’ll see a convenient function that does all this for you.
Here’s the recipe to code from scratch:
Define an array with a list of polynomial degrees you want to try
Hint
degrees = [1, 2, 4, 6, 8]
Do the same for different regularization strengths you want to try (Hint: you may have to try up to pretty high values; experiment with your code above first).
Set up a double nested loop with the outer loop running through the different polynomial degrees and the inner loop through the different regularization strengths.
Inside the inner loop, define your pipeline with the degree and \(\lambda\) for a specific iteration, and also choose either Lasso or Ridge for this experiment (in the end, try it for Ridge first and then for Lasso).
Once you have defined your pipeline, we can use our manually coded CV or the easier scikit-learn one
Hint
scores = cross_val_score(pipe, XB, yB, cv=5, scoring="accuracy"); mean_score = scores.mean()
Determine for yourself how you want to keep track of the best mean score on validation data. You can either set up arrays that store the mean accuracy scores for each combination of the parameters (degree and \(\lambda\)) so you can inspect all of the afterwards and then figure out which row has the highest accuracy and for which degree and \(\lambda\). Or, you can use a trick like starting with highest_accuracy = -1 and then for each iteration you can check something like: if mean_score > highest_accuracy: highest_accuracy = mean_score; best_lambda = lambda ; best_degree = degree. In that case, after all loops end, you will only have those ‘best’ values.
Code
# Write code that uses CV to determine optimal polynomial degree and regularization strengthprint("\nBest hyperparameters:")print(f"Degree = {best_d}, λ = {best_lam}")print(f"Best mean validation accuracy = {best_score:.3f}")
Once you have these optimal parameters, you could run your code again for CV with the figures for the decision boundaries to see what they look like in this ‘best’ case. [Note of course that this is only the best option within the parameters that you tested for. ]
Remember, try out both Ridge and Lasso regularization and see if either is better than the other.
Code
# Try both Ridge and Lasso, and apply to both dataset A and B.
Note: relect on what you’ve learned about how to classify dataset A.
GridSearchCV
Everything you’ve done so far can also be done all at once by yet another powerful scikit-learn function: GridSearchCV. As the name kind of suggests, it does exactly what you just did manually. Have a look at the following code and compare the results to your own manually coded version. They should be the same.
Code
from sklearn.model_selection import GridSearchCVparam_grid = {"poly__degree": [2, 4, 6, 8],"clf__C": [1/lam for lam in [0.01, 0.1, 1, 10]] # C = 1/λ}grid = GridSearchCV( make_regularized_pipeline(penalty="l2"), param_grid=param_grid, cv=5, scoring="accuracy")grid.fit(XA, yA)best_params = grid.best_params_best_C = best_params["clf__C"]best_lambda =1/ best_Cprint("Best polynomial degree:", best_params["poly__degree"])print("Best λ (regularization strength):", best_lambda)print("Best mean CV accuracy:", grid.best_score_)
🔎 Summary of What We Did in This Lab
In this lab, we explored logistic regression step-by-step and then built up to more powerful workflows with pipelines and cross-validation. Here’s the roadmap of what you covered:
Data preparation
Loaded datasets from Google Sheets.
Selected two numeric features and a binary target column.
Split into training and validation sets.
Standardized features (fit scaler on training, transform both train/val).
Logistic regression basics
Fit a simple linear logistic regression (no polynomial features).
Learned how to plot decision boundaries and probability shading.
Polynomial feature expansion
Used PolynomialFeatures to add higher-order terms (e.g., degree 6).
Saw how this allowed curved decision boundaries.
Understood that fitting happens on the training set, but the boundary itself is fixed after training.
Pipelines
Built reusable pipelines combining scaling → polynomial features → logistic regression.
Compared manual vs. pipeline approaches.
Cross-validation (CV)
Wrote manual CV loops to train/validate models across multiple folds.
Compared fake CV (random re-splits each time) vs. proper CV (systematic partitioning of the dataset).
Used scikit-learn functions like cross_val_score and cross_validate for easier evaluation.
Regularization
Explored Ridge (L2) and Lasso (L1) penalties.
Understood the role of the regularization strength λ (C = 1/λ in scikit-learn).
Visualized how higher λ shrinks coefficients, and how Lasso can force some coefficients to exactly zero (feature selection).
Added coefficient bar plots to see directly how regularization affects learned parameters.
Hyperparameter tuning
Implemented a manual grid search with nested loops over polynomial degree and λ.
Then introduced GridSearchCV as the convenient pre-packaged version that automates this process.
👉 By the end of this lab, you’ve seen: - How logistic regression extends from linear to nonlinear decision boundaries. - How to prevent overfitting with cross-validation and regularization. - How pipelines and hyperparameter search make workflows easier and less error-prone.
Optional, for the Die Hards
In the lecture, I considered a simple quadratic function and used Ridge and Lasso regularization to demonstrate clearly how we can reduce overfitting and even get automatic feature selection. In this lab, we considered problems of similar simplicity but now for binary classification through non-linear logistic regression.
If you want to practice this further, you can start from the dataset that we used a few weeks ago where we applied non-linear numerical regression to fit geophysical well data. You could now turn that workflow into an elegant reusable pipeline and add in Lasso and/or Ridge regularization to avoid overfitting. By doing so, you may get better results on the completely separate hold-out wells that we used for final model testing.
Alternatively, you could follow a similar approach for our lab from last week and see if regularization can help in determining which ‘real world’ features (rather than polynomial terms) can be eliminated from predicting house prices.