Dear friends!
In the first part of this series, we explored the how and why of regularization. Now, it’s time to dive into the math of the three most popular regularization algorithms: ridge regression, lasso regression, and elastic net. In this article, we will explain the mathematical formulation of each algorithm, discuss their geometric interpretation, and provide examples of how to use them in R. Are you ready? Let’s go! 🚀
Ridge Regression (L2 Regularization)
Ridge regression, a popular form of regularization, is designed to handle multicollinearity in datasets. It does this by adding a penalty term to the least squares loss function that shrinks the size of the coefficients. This penalty discourages large coefficients, keeping the model from becoming overly complex and overfitting the training data. The loss function of ridge regression can be written as L(β) = Σ (yi − xiTβ)² + λ Σ βj², where the first sum runs over the observations and the second over the coefficients:
- L(β) is the loss function.
- yi is the actual value of the target variable for observation i.
- xiT is the transposed feature vector of observation i.
- β is the vector of model parameters or coefficients.
- λ is the regularization parameter, controlling the amount of shrinkage.
- The first term is the usual least squares loss, while the second term is the L2 penalty.
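To make the formula concrete, here is a minimal sketch in R that evaluates the ridge loss for a small made-up dataset; the data, candidate coefficients, and lambda below are purely illustrative:
# Toy data: 10 observations, 2 features
set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)
y <- X %*% c(1.5, -0.5) + rnorm(10)   # toy target values
beta <- c(1.2, -0.3)                  # a candidate coefficient vector
lambda <- 0.5                         # regularization strength
# Least squares loss plus the L2 penalty
ridge_loss <- sum((y - X %*% beta)^2) + lambda * sum(beta^2)
ridge_loss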
Use Cases: Ridge regression is particularly beneficial when features are correlated, leading to multicollinearity issues. It produces stable coefficients even when predictors are highly interrelated. For datasets where the number of observations is smaller than the number of features, ridge regression can still yield reliable coefficient estimates where ordinary least squares cannot.
R Code:
library(ISLR)
library(caret)
library(glmnet)
library(ggplot2)
# Load the Hitters dataset
data(Hitters)
hitters <- na.omit(Hitters)
# Prepare the data
set.seed(123) # for reproducibility
trainIndex <- createDataPartition(hitters$Salary, p = .8, list = FALSE)
trainData <- hitters[trainIndex, ]
testData <- hitters[-trainIndex, ]
# Define the control using a cross-validation approach
trainControl <- trainControl(method = "cv", number = 10)
# Train the model with Ridge (alpha = 0), with normalization
set.seed(123)
ridgeModel <- train(
  Salary ~ ., data = trainData,
  method = "glmnet",
  trControl = trainControl,
  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 0.1, length = 10)),
  preProc = c("center", "scale"), # Normalize (center and scale) the data
  metric = "RMSE"
)
# Coefficients Plot
ridgeCoef <- as.matrix(coef(ridgeModel$finalModel, s = ridgeModel$bestTune$lambda))
ridgeCoef <- ridgeCoef[ridgeCoef != 0, , drop = FALSE] # Remove zero coefficients
# Remove the intercept
ridgeCoef <- ridgeCoef[-1, , drop = FALSE]
# Create a data frame for plotting
coef_df <- data.frame(
  Feature = rownames(ridgeCoef),
  Coefficient = as.vector(ridgeCoef)
)
# Plot
ggplot(coef_df, aes(x = reorder(Feature, Coefficient), y = Coefficient)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Ridge Regression Coefficients",
       x = "Features",
       y = "Coefficients")
In the results from the Ridge regression, each feature's coefficient indicates its association with the target variable, Salary. The features CRuns, CRBI, and CHits stand out with the highest positive coefficients, meaning that players with higher cumulative runs, runs batted in, and hits tend to command higher salaries. On the other hand, significant negative coefficients for features like AtBat, Years, CWalks, and DivisionW suggest that higher values in these attributes might be linked with a reduction in a player's salary. Interestingly, the negative coefficient for AtBat suggests that just being at bat frequently, without achieving corresponding hits or other positive outcomes, may not be as beneficial for a player's salary.
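The preparation code above also sets aside testData, which the snippet itself never uses; one simple way to check hold-out performance (reusing the ridgeModel and testData objects created above) is:
# Evaluate the tuned ridge model on the hold-out test set
ridgePred <- predict(ridgeModel, newdata = testData)
postResample(pred = ridgePred, obs = testData$Salary) # returns RMSE, R-squared, MAE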
Lasso Regression (L1 Regularization)
Lasso regression, a distinct regularization approach, stands out for its ability not only to shrink coefficients but to set some of them exactly to zero. This property makes lasso a useful tool for both regularization and feature selection.
The loss function for Lasso regression can be formulated as L(β) = Σ (yi − xiTβ)² + λ Σ |βj|, with the same notation as above; the only difference from ridge is that the L2 penalty is replaced by the L1 penalty, the sum of the absolute values of the coefficients.
Use Cases: Lasso regression can be used to improve the interpretability, computational efficiency, and generalization performance of machine learning models. By shrinking the coefficients of less important features, often all the way to zero, lasso regression helps to identify the most important predictors of the target variable, reduce the number of features in the model, and prevent overfitting of the training data.
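Reusing the toy X, y, beta, and lambda from the ridge sketch above, the only change needed to evaluate the lasso loss is the penalty term:
# Least squares loss plus the L1 penalty (sum of absolute coefficient values)
lasso_loss <- sum((y - X %*% beta)^2) + lambda * sum(abs(beta))
lasso_loss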
R Code:
# [Prepare Hitters dataset...]
# Train the model with Lasso (alpha = 1), with normalization
set.seed(123)
lassoModel <- train(
  Salary ~ ., data = trainData,
  method = "glmnet",
  trControl = trainControl,
  tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 0.1, length = 10)),
  preProc = c("center", "scale"), # Normalize (center and scale) the data
  metric = "RMSE"
)
# Coefficients Plot
lassoCoef <- as.matrix(coef(lassoModel$finalModel, s = lassoModel$bestTune$lambda))
lassoCoef <- lassoCoef[lassoCoef != 0, , drop = FALSE] # Remove zero coefficients
# Remove the intercept
lassoCoef <- lassoCoef[-1, , drop = FALSE]
# Create a data frame for plotting
coef_df <- data.frame(
  Feature = rownames(lassoCoef),
  Coefficient = as.vector(lassoCoef)
)
# Plot
ggplot(coef_df, aes(x = reorder(Feature, Coefficient), y = Coefficient)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Lasso Regression Coefficients",
       x = "Features",
       y = "Coefficients")
Elastic Net
When it comes to combining the strengths of both L1 and L2 regularization, Elastic Net emerges as a compelling choice. It introduces a penalty that is a blend of L1 and L2 regularization, allowing the model to inherit the feature selection capabilities of Lasso while retaining the stability in coefficient estimation from Ridge.
The Blend of L1 and L2 Regularization: Mathematically, the loss function of Elastic Net can be written as L(β) = Σ (yi − xiTβ)² + λ [ α Σ |βj| + (1 − α) Σ βj² ], where the mixing parameter α (between 0 and 1) controls the balance between the two penalties: α = 1 recovers lasso and α = 0 recovers ridge.
Use Cases: Elastic Net is particularly beneficial when dealing with datasets where features are highly correlated or when there are more features than observations. While lasso tends to pick one feature from a group of correlated features somewhat arbitrarily, Elastic Net is more likely to keep the whole group, shrinking their coefficients towards one another. This intermediate solution can offer a more balanced approach and is less sensitive to small changes in the model's structure.
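Continuing the toy example from the previous sections, the blended penalty can be evaluated directly (alpha = 0.5 here is an arbitrary illustrative choice):
# Weighted mix of the L1 and L2 penalties
alpha <- 0.5 # mixing parameter: 1 = pure lasso, 0 = pure ridge
enet_loss <- sum((y - X %*% beta)^2) +
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) * sum(beta^2))
enet_loss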
R Code:
# [Prepare Hitters dataset...]
# Train the model with Elastic Net (0 < alpha < 1), with normalization
set.seed(123)
elasticNetModel <- train(
  Salary ~ ., data = trainData,
  method = "glmnet",
  trControl = trainControl,
  tuneLength = 10, # Automatically generate 10 values of alpha and lambda to tune
  preProc = c("center", "scale"), # Normalize (center and scale) the data
  metric = "RMSE"
)
# Coefficients Plot
enetCoef <- as.matrix(coef(elasticNetModel$finalModel, s = elasticNetModel$bestTune$lambda))
enetCoef <- enetCoef[enetCoef != 0, , drop = FALSE] # Remove zero coefficients
# Remove the intercept
enetCoef <- enetCoef[-1, , drop = FALSE]
# Create a data frame for plotting
coef_df <- data.frame(
  Feature = rownames(enetCoef),
  Coefficient = as.vector(enetCoef)
)
# Plot
ggplot(coef_df, aes(x = reorder(Feature, Coefficient), y = Coefficient)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Elastic Net Regression Coefficients",
       x = "Features",
       y = "Coefficients")
Geometric Interpretation of Regularization Techniques
An intuitive way to understand regularization in regression analysis is through its geometric interpretation. In the image below, we have three distinct shapes that correspond to Lasso (L1 norm), Ridge (L2 norm), and Elastic Net (combined L1 and L2 norms). The diamond, circle, and rounded diamond shapes give us a visual clue about how each method constrains the possible values of the coefficients.
The elliptical contours in the background represent levels of equal loss based on the model’s errors. A smaller ellipse indicates a better fit to the data, while larger ellipses represent increased error. As we incorporate regularization, we’re searching for the point where these contours first touch our constraint shape, as that’s where we find the balance between fitting our data and applying the penalty.
The green dots mark the exact location where the optimization settles, considering both the loss from the data and the regularization penalty. For Lasso, the sharp corners of the diamond often lead the green dot to rest right on an axis, resulting in certain coefficients being set to zero. Ridge, with its smooth circular boundary, causes the green dot to touch in a more gradual manner, shrinking coefficients without necessarily zeroing them out. Elastic Net, with its hybrid shape, allows the green dot to sometimes align with an axis (zeroing some coefficients) while elsewhere merely shrinking them.
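If you want to reproduce the idea behind the figure yourself, here is a minimal ggplot2 sketch that draws the L1 diamond and the L2 circle (unit constraint regions for two coefficients); the Elastic Net region lies between these two shapes:
# Draw the L1 (diamond) and L2 (circle) constraint regions for two coefficients
theta <- seq(0, 2 * pi, length.out = 200)
circle <- data.frame(x = cos(theta), y = sin(theta), norm = "L2 (ridge)")
diamond <- data.frame(x = c(1, 0, -1, 0, 1), y = c(0, 1, 0, -1, 0), norm = "L1 (lasso)")
ggplot(rbind(circle, diamond), aes(x = x, y = y, colour = norm)) +
  geom_path() +
  coord_equal() +
  theme_minimal() +
  labs(title = "L1 vs L2 constraint regions",
       x = expression(beta[1]),
       y = expression(beta[2]))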
Lastly, interpret the model coefficients with caution: the coefficients of a regularized linear regression model are biased estimates by design, and this is especially noticeable for lasso. Monitor the performance of your models over time, and for the best results combine these regularization techniques with other machine learning tools such as feature engineering and ensemble learning.