Prof. Frenzel
Oct 24, 2023
#KB Resampling Methods — Part 2

Dear friends!

How can you trust a model to make accurate predictions on new data if it has only been trained on a limited training dataset? Cross-validation is a statistical technique that addresses this fundamental challenge in machine learning. It works by systematically evaluating a model's performance on unseen data, providing a reliable estimate of its generalization ability. Ready to explore this further? Let's go! 🚀

Cross-Validation

Cross-validation is a technique widely used in data science and analytics to validate statistical models. It offers a structured way to estimate how a given model will generalize to an independent dataset. While there are several validation strategies, cross-validation stands out for its balance of efficiency and accuracy.

K-Fold Cross-Validation

K-Fold Cross-Validation is a resampling technique used to evaluate machine learning models. It works by dividing the dataset into K equal subsets, or folds. Then, the model is trained on K-1 folds and evaluated on the remaining fold. This process is repeated K times, with each fold being used as the validation set once. The performance of the model on each validation set is averaged to get an overall estimate of the model’s performance.

K-Fold Cross-Validation, K = 5

Choosing the right 'K' plays a significant role in the process. A smaller 'K' means larger validation sets but smaller training sets, so each model is trained on noticeably less data; this tends to increase the bias of the estimate (it is overly pessimistic) while keeping its variance low. Conversely, a larger 'K' means training sets that are close to the full dataset, which decreases bias but tends to increase the variance of the estimate as well as the computational cost. Commonly used values are 5 or 10, though the choice should reflect the specific dataset and problem at hand. The primary goal is to find a value of 'K' that offers a balance, yielding reliable and reproducible validation results without excessive computational costs.
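To make the mechanics concrete, here is a minimal from-scratch sketch of 5-fold Cross-Validation for a simple linear model in base R. It uses the Auto dataset from the ISLR2 package, which we will work with more formally later in this article; caret automates all of this, but the loop shows exactly what happens in each iteration.

library(ISLR2)  # provides the Auto dataset (mpg, horsepower, weight, ...)

set.seed(42)
k <- 5
n <- nrow(Auto)

# Randomly assign each observation to one of the k folds
fold_id <- sample(rep(1:k, length.out = n))

rmse_per_fold <- numeric(k)
for (i in 1:k) {
  train_data <- Auto[fold_id != i, ]  # K-1 folds for training
  valid_data <- Auto[fold_id == i, ]  # held-out fold for validation
  fit <- lm(mpg ~ horsepower + weight, data = train_data)
  pred <- predict(fit, newdata = valid_data)
  rmse_per_fold[i] <- sqrt(mean((valid_data$mpg - pred)^2))
}

rmse_per_fold        # performance on each held-out fold
mean(rmse_per_fold)  # overall cross-validated estimate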

Variants of Cross-Validation

While K-Fold Cross-Validation is a widely used technique, there are other cross-validation methods available that may be better suited for certain datasets or challenges. Three notable variants of K-Fold are Stratified Cross-Validation, Leave-One-Out Cross-Validation (LOOCV), and Leave-Group-Out Cross-Validation (LGOCV). Each of these methods refines the K-Fold approach in a unique way to address specific issues.

📌Stratified Cross-Validation: One of the potential challenges with standard K-Fold Cross-Validation arises when dealing with imbalanced datasets. Suppose there’s a dataset where a particular class is underrepresented. In such cases, there’s a possibility that during some iterations of the K-Fold process, the validation set might not contain any instance of that class. This absence can lead to misleading validation scores and an ill-informed assessment of the model’s capabilities. Stratified Cross-Validation addresses this concern. The method ensures that each fold is a good representative of the entire dataset in terms of class distribution. As a result, the model receives a more consistent and fair evaluation across all iterations.

Stratified Cross-Validation

📌Leave-One-Out Cross-Validation (LOOCV): LOOCV can be seen as the extreme case of K-Fold Cross-Validation where K equals the number of data points, n. The model is trained n times, each time using a single data point for validation and the remaining n-1 data points for training. Because every training set contains almost all of the data, LOOCV gives an approximately unbiased estimate of the model's generalization performance, but it is very computationally expensive. Its drawback is variance: the n fitted models are trained on nearly identical data, so their errors are highly correlated, and the resulting estimate can fluctuate more than that of 5- or 10-fold Cross-Validation.
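As a side note, for least-squares linear models the LOOCV error does not actually require n refits: it can be computed from a single fit using the leverage (hat) values. Here is a minimal sketch, assuming the Auto data and the mpg ~ horsepower + weight model used later in this article.

library(ISLR2)

# Fit once on the full dataset
fit <- lm(mpg ~ horsepower + weight, data = Auto)

# Leverage (hat) value of each observation
h <- hatvalues(fit)

# LOOCV mean squared error via the leave-one-out residuals e_i / (1 - h_i)
loocv_mse <- mean(((Auto$mpg - fitted(fit)) / (1 - h))^2)
sqrt(loocv_mse)  # LOOCV RMSE from a single model fit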

📌Leave-Group-Out Cross-Validation (LGOCV): Many datasets have inherent group structures, such as multiple measurements from the same patient in medical trials. Traditional validation methods can inadvertently mix data from the same group in both the training and validation sets, which can lead to overly optimistic evaluations of a model’s performance. LGOCV addresses this issue by keeping grouped data intact. In each iteration of LGOCV, one (or more) groups of data are set aside for validation. This ensures that groups are never split between training and validation, which maintains the data’s inherent structure and provides a more realistic assessment of the model’s performance on unseen groups.

Selecting the right cross-validation method depends on multiple factors, such as the dataset’s properties, computational constraints, and the desired trade-off between bias and variance. If the classes in the dataset are imbalanced, Stratified Cross-Validation is a good choice. Grouped data can benefit from LGOCV. LOOCV is the most thorough method, but it is also the most computationally expensive. K-Fold or Stratified methods may be more feasible for projects with limited resources. Finally, the intended use of the model dictates the required rigor of validation. Preliminary analyses may be satisfied with K-Fold, but high-stakes applications may necessitate the precision of LOOCV.

Different cross-validation methods

After cross-validation, we evaluate the model’s performance using metrics like RMSE and R-squared for regression problems, and accuracy, precision, and F1 score for classification problems. But it’s not just the values of these metrics that matter, we also need to consider their steadiness across validation folds. A model that performs well on one fold but poorly on another is likely to overfit the training data and not generalize well to new data. Therefore, we need to take a holistic view of the performance metrics, contextualized against the problem we’re trying to solve, so we can truly understand the performance of our model.

👣Practical Applications of K-Fold Cross-Validation

Let’s use R to put these ideas into practice. The Auto dataset consists of various specifications and details for numerous car models, like horsepower, acceleration, weight, and miles-per-gallon (mpg). For this exercise, I will employ K-Fold Cross-Validation to evaluate a linear regression model, attempting to predict a car’s mpg based on its horsepower and weight.

1️⃣Load the Auto dataset in R:

install.packages("ISLR2")
library(ISLR2)

# Load the Auto dataset
data(Auto)
summary(Auto)

2️⃣Applying K-Fold Cross-Validation:

To perform K-Fold Cross-Validation in R, I will leverage the caret package, a flexible and comprehensive machine learning toolkit.

install.packages("caret")
library(caret)

# Set a seed so the fold assignment is reproducible
set.seed(123)

# Define training control using 10-fold CV
train_control <- trainControl(method = "cv", number = 10)

# Train a linear model with scaling (centering and scaling the predictors)
model <- train(mpg ~ horsepower + weight,
               data = Auto,
               method = "lm",
               trControl = train_control,
               preProcess = c("center", "scale"))
model

In this code block, the trainControl function sets up the parameters for 10-fold Cross-Validation. I use preProcess here to scale the predictors, ensuring consistent coefficient interpretation in linear regression. Scaling becomes particularly important for algorithms like K-Nearest Neighbors (KNN) or Support Vector Machines (SVM) because they rely on distances between data points, and disparate scales can skew results. For methods employing gradient descent, such as neural networks or logistic regression, scaled data aids in achieving faster and more stable convergence. It is also important for algorithms that use regularization, such as ridge regression and lasso regression, because it ensures that the penalty on coefficients is applied uniformly.
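As a quick illustration of the point about coefficient interpretation, compare the same model fit on raw versus standardized predictors; this is a small sketch outside of caret, using base R's scale().

# Raw predictors: coefficients are per unit of horsepower and per pound of weight
coef(lm(mpg ~ horsepower + weight, data = Auto))

# Standardized predictors: coefficients are per standard deviation,
# so their magnitudes can be compared directly
coef(lm(mpg ~ scale(horsepower) + scale(weight), data = Auto))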

3️⃣Visualizing Results

To visualize the distribution of performance metrics across different folds and better understand model performance, you can use data visualization tools like ggplot2 to create plots of K-Fold Cross Validation results.

install.packages("ggplot2")
library(ggplot2)

# Extracting the resampling results
resampling_results <- model$resample

# Visualizing RMSE across different folds
ggplot(resampling_results, aes(x = Resample, y = RMSE)) +
  geom_point(size = 3) +
  geom_line(aes(group = 1)) +
  labs(title = "RMSE across 10 Folds", x = "Fold", y = "RMSE") +
  theme_minimal()

K-Fold Cross-Validation Results (RMSE)
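Beyond the plot, a quick numerical summary of the per-fold results helps judge how stable the estimate is across folds, which is exactly the steadiness discussed above.

# Spread of the per-fold RMSE values
mean(resampling_results$RMSE)
sd(resampling_results$RMSE)
range(resampling_results$RMSE)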

🔄Alternative Cross-Validation Methods

Repeated K-Fold Cross Validation

Repeated K-Fold Cross Validation involves running K-Fold Cross Validation multiple times and averaging the results. This provides a more robust estimate by reducing variability.

# Repeated K-Fold Cross Validation
set.seed(123)
train_control_repeated <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model_repeated <- train(mpg ~ horsepower + weight,
                        data = Auto,
                        method = "lm",
                        trControl = train_control_repeated,
                        preProcess = c("center", "scale"))
model_repeated

Stratified K-Fold Cross Validation

Stratified K-Fold ensures that each fold has the same proportion of observations with a given categorical target value as the complete dataset. This matters most for classification problems with imbalanced classes. For the sake of demonstration, I'll bin mpg into "High" and "Low" categories at the median and treat it as a classification target. Note that when the outcome is a factor, caret builds the folds within each class level, so a standard method = "cv" setup already produces (approximately) stratified folds; a quick check follows the code block.

# Stratified K-Fold Cross Validation (for demonstration)
Auto$mpg_binned <- as.factor(ifelse(Auto$mpg > median(Auto$mpg), "High", "Low"))
train_control_stratified <- trainControl(method = "cv", number = 10,
                                         classProbs = TRUE,
                                         summaryFunction = twoClassSummary)
model_stratified <- train(mpg_binned ~ horsepower + weight,
                          data = Auto,
                          method = "glm",
                          metric = "ROC",  # twoClassSummary reports ROC, Sens, and Spec
                          trControl = train_control_stratified,
                          preProcess = c("center", "scale"))
model_stratified
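If you want to see the stratification explicitly, caret's createFolds() builds folds within each level of a factor outcome, so you can inspect the class balance per fold; a small sketch:

# Build 10 stratified folds on the binned outcome
set.seed(123)
folds <- createFolds(Auto$mpg_binned, k = 10)

# Proportion of "High" in each held-out fold vs. the overall proportion
sapply(folds, function(idx) mean(Auto$mpg_binned[idx] == "High"))
mean(Auto$mpg_binned == "High")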

Leave-One-Out Cross Validation (LOOCV)

LOOCV is a special case of K-Fold Cross Validation where K equals the number of data points in the dataset. Each data point gets its turn as the validation set.

# Leave-One-Out Cross Validation (LOOCV)
train_control_loocv <- trainControl(method = "LOOCV")
model_loocv <- train(mpg ~ horsepower + weight,
                     data = Auto,
                     method = "lm",
                     trControl = train_control_loocv,
                     preProcess = c("center", "scale"))
model_loocv

Leave-Group-Out Cross Validation (LGOCV)

In caret, LGOCV (method = "LGOCV") performs repeated random training/test splits, sometimes called Monte Carlo cross-validation: in each iteration a fraction p of the data is used for training and the remainder is held out for validation. The code below draws ten such 80/20 splits. Note that this is not the same as the group-aware validation described earlier, where entire groups are kept together; a sketch of true group-aware folds follows the code block.

# Leave-Group-Out Cross Validation (LGOCV)
set.seed(123)
train_control_lgocv <- trainControl(method = "LGOCV", p = 0.8, number = 10)
model_lgocv <- train(mpg ~ horsepower + weight,
                     data = Auto,
                     method = "lm",
                     trControl = train_control_lgocv,
                     preProcess = c("center", "scale"))
model_lgocv
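When the data contain true groups, as described earlier (for example, repeated measurements per patient), one option is to build group-aware folds with caret's groupKFold() and pass them to trainControl() via the index argument. The Auto dataset has no patient-like grouping, so the sketch below simply treats model year as the grouping variable for illustration; it assumes a reasonably recent caret version that provides groupKFold().

# Group-aware folds: entire model years are held out together
set.seed(123)
group_folds <- groupKFold(Auto$year, k = 5)  # list of training-row indices

train_control_grouped <- trainControl(method = "cv", index = group_folds)
model_grouped <- train(mpg ~ horsepower + weight,
                       data = Auto,
                       method = "lm",
                       trControl = train_control_grouped,
                       preProcess = c("center", "scale"))
model_grouped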

Cross-validation, particularly the K-Fold method and its variants, is a robust model validation technique that helps ensure our models are reliable tools for real-world applications. But model evaluation does not end with validation. The next step is to refine and optimize the model's settings, a process known as hyperparameter tuning. In the next article, we will discuss hyperparameter tuning with the caret package, which offers a formidable framework for this task.
