Prof. Frenzel
13 min read · Apr 14, 2024
What Every Business Analyst Must Know — Part 3: Understanding Multicollinearity

Dear friends!

Consider the challenge faced by a real estate analyst trying to predict house prices. They might choose to include both the size of the house and the number of rooms as predictors. Intuitively, both variables should offer unique insights — larger houses cost more, and more rooms might imply a house suitable for larger families, potentially increasing its value, right? Yet, these two variables often move together — larger houses tend to have more rooms. This overlap is a classic example of multicollinearity.

What is Multicollinearity?

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, to the extent that it becomes difficult to separate their individual effects on the dependent variable. This intercorrelation not only obscures the unique impact of each predictor but also complicates the statistical analysis, making it challenging to draw reliable conclusions. In the case of our real estate analyst, heavy multicollinearity between the size of the house and the number of rooms could lead them to overestimate the importance of one variable while underestimating the other, or it may result in unstable estimates that change with slight modifications to the model or data sample.


Why It’s Important to Understand Multicollinearity

Understanding multicollinearity is important for anyone relying on regression models to make informed decisions or insights. The instability it introduces to the coefficients can be misleading. For instance, in a regression model affected by multicollinearity, the coefficients might fluctuate erratically in response to minor changes in the data or model specifications. This instability can make it difficult to ascertain the true effect of each predictor, leading to potentially erroneous conclusions about which factors are influencing the dependent variable.

Moreover, multicollinearity inflates the standard errors of the estimated coefficients, reducing their precision. This inflation makes it harder to determine, based on p-values, whether a predictor’s coefficient is statistically significantly different from zero. In practical terms, you might fail to identify key drivers of your business problem because multicollinearity masks their statistical significance.

Causes of Multicollinearity

Multicollinearity in data sets can arise from several sources, often linked to the nature of data collection and the inherent relationships among variables. Here are some of the most common causes:

📌Data Collection and Structure

Data Collection Methods: The manner in which data is collected can inadvertently increase multicollinearity. For instance, using similar sources or biased sampling techniques can lead to an overlap in variable characteristics.

High Dimensionality: In datasets with a large number of variables relative to the number of observations (common in finance), multicollinearity becomes more likely. This phenomenon, often referred to as the “curse of dimensionality,” complicates models due to the sheer volume of interrelated information.

📌Natural Correlation

Inherent Relationships: Some variables are naturally correlated due to their inherent relationships. For example, in economics, income and expenditure are typically correlated because higher income leads to higher spending capacity.

Age and Experience: In employment data, age and years of experience are often correlated. Older individuals typically have more years of job experience, leading to multicollinearity in models predicting job performance based on these variables.

📌Geographical and Temporal Factors

Location-Based Variables: Variables that capture geographical information, such as region, zip code, and proximity to landmarks, often correlate with each other. For example, properties close to a city center may share high prices and low crime rates, leading to multicollinearity.

Time Series Data: In datasets that span multiple time periods, temporal variables such as year, month, or economic conditions can be correlated. For example, economic indicators like GDP growth and inflation rates might move in similar patterns over time.

📌Scale and Units of Measurement

Derived or Calculated Variables: Often, variables are not measured directly but are derived from other measurements. For example, body mass index (BMI) is calculated from weight and height. Including all three of these variables in the same model would introduce multicollinearity.

Units of Measurement: Variables measured in similar units or scales, like total floor area and living area in square feet, can also be highly correlated. This is especially true in engineering and construction data.

Detecting Multicollinearity

Beyond the numerical and graphical methods I will elaborate on below, intuition and knowledge play a critical role in detecting multicollinearity. This approach involves using domain knowledge to hypothesize relationships among variables before conducting detailed statistical analysis.

📌Experience and Research

Professionals well-acquainted with the specific characteristics of their industry or sector can often predict which variables might be correlated based on their understanding of the underlying processes or behaviors. For example, in finance, an experienced analyst might anticipate multicollinearity between market capitalization and trading volume, as larger companies tend to have higher trading volumes. Similarly, in healthcare, variables like age and the prevalence of certain medical conditions are naturally expected to be related because the incidence of many diseases increases with age.

In every analytics project, you should also review previous studies and research in the field; they can provide insights into common relationships among variables and serve as a guide to potential multicollinearity. Historical data analysis and the literature can reveal patterns and correlations that appear consistently, and support you in setting up a more efficient development process. Tools like ChatGPT and Google Gemini have become good companions in this field for getting up to speed at a surface level in pretty much every industry.

Numerical Variables

📌Correlation Matrix: A correlation matrix is a useful initial tool for spotting potential multicollinearity. It quantifies the degree to which variables are related. In practice, examining the correlation coefficients between predictors in a matrix format helps identify pairs of variables that might be causing multicollinearity. High correlation coefficients (typically above 0.8 or below -0.8) suggest a strong linear relationship that could potentially distort a regression analysis.

Example: In real estate modeling, an analyst might look at the correlation between variables like square footage, number of rooms, and age of the property. Suppose the correlation between square footage and number of rooms is 0.85; this high value suggests significant multicollinearity.
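
If you prefer to compute this in code rather than by eye, a minimal Python sketch with pandas might look like the following. The data is synthetic and the column names are illustrative, not taken from an actual dataset:

import numpy as np
import pandas as pd

# Synthetic data purely for illustration: square footage and room count
# are generated to be correlated, mirroring the real estate example above
rng = np.random.default_rng(0)
sqft = rng.normal(1800, 400, 200)
rooms = np.round(sqft / 450 + rng.normal(0, 0.5, 200))
age = rng.integers(1, 60, 200)
df = pd.DataFrame({"sqft": sqft, "rooms": rooms, "age": age})

# Pairwise Pearson correlations among the predictors
print(df.corr().round(2))

Any off-diagonal value near ±0.8 or beyond flags a pair of predictors worth investigating further.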

📌Variance Inflation Factor (VIF): The Variance Inflation Factor is a more sophisticated measure used to quantify the severity of multicollinearity. VIF calculates how much the variance of an estimated regression coefficient is increased due to collinearity. A VIF value greater than 5 suggests a problematic level of multicollinearity, although some experts also use a threshold of 10 as an indicator of severe multicollinearity. A high VIF indicates that the predictor is highly collinear with one or more other predictors in the model, which may necessitate further action such as removing variables or redesigning the model.

VIFi = 1 / (1 − R²i), where R²i is the coefficient of determination of a regression of the i-th predictor on all other predictor variables in the model.

Example: Suppose we have a dataset of 10 cars, including their price, age, mileage, and horsepower. We aim to model price based on the other three variables and then determine the VIF for each explanatory variable. Here’s a step-by-step guide for my Excel users:

  • Find the Data Analysis tool in Excel and select Regression to set up your regression model. The objective is to predict price as your dependent variable (y) with age, mileage, and horsepower as independent variables. Excel will generate a regression analysis output, including the R² (R Square) for the model. Next, iterate through your independent variables to calculate the VIF.
  • Run Regression for Each Explanatory Variable: For each explanatory variable, run a regression with that variable as the dependent variable and the others as independents. For example, to calculate the VIF for ‘age’, regress ‘age’ against ‘mileage’ and ‘horsepower’.
  • Obtain R² values and calculate the VIF: Note the R² value from each regression output and plug it into the formula above. If the R² for ‘age’ is 0.63, then VIF = 1 / (1 − 0.63) ≈ 2.70. Repeat this for all variables, changing the dependent and independent variables each time.
  • How to Interpret VIF Values
    - VIF = 1: Indicates no correlation with other variables.
    - 1 < VIF < 5: Suggests moderate correlation, usually not a concern.
    - VIF > 5: Strong correlation likely to impact regression analysis; consider model adjustment.
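
If you prefer scripting over running repeated regressions in Excel, the same VIF logic takes only a few lines of Python with statsmodels. The sketch below uses synthetic car data with illustrative column names, since the original worksheet is not reproduced here:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Synthetic car data for illustration: age and mileage are generated to be correlated
rng = np.random.default_rng(1)
age = rng.integers(1, 15, 50)
mileage = age * 12000 + rng.normal(0, 8000, 50)
horsepower = rng.normal(150, 30, 50)
X = add_constant(pd.DataFrame({"age": age, "mileage": mileage, "horsepower": horsepower}))

# VIFi = 1 / (1 - R²i): each predictor is regressed on the remaining predictors
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],
)
print(vifs.round(2))

Expect high VIFs for age and mileage here, since the two columns carry largely overlapping information.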

Consider Non-Numerical (Categorical) Variables

📌Box Plots: Box plots can be used to visually assess the distribution of a numerical variable across the categories of a categorical variable. Distributions that shift markedly between categories suggest a relationship strong enough that, once the categorical variable is converted to dummy variables, it could introduce multicollinearity.
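
As a quick sketch (synthetic data, illustrative column names), a grouped box plot in Python could look like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic example: square footage by property type (illustrative names)
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "sqft": np.concatenate([rng.normal(1200, 200, 100), rng.normal(2200, 300, 100)]),
    "type": ["condo"] * 100 + ["single_family"] * 100,
})

# Clearly separated boxes hint that 'type' carries much of the same information
# as 'sqft', which can surface as multicollinearity once 'type' is dummy-coded
df.boxplot(column="sqft", by="type")
plt.show()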

📌Scatter Plots with Grouping: Scatter plots can be enhanced by grouping observations with colors or markers for each category, providing a visual representation of how the relationship between numerical variables varies across categories. This is especially useful for spotting non-linear patterns that may introduce issues further along in your analysis.

Example: In marketing data, scatter plots might be used to plot advertising spend against sales, grouped by region to check if spending patterns correlate strongly with sales across regions.
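
A minimal version of such a grouped scatter plot, again with synthetic data and placeholder column names, might be:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic marketing data: ad spend vs. sales, grouped by region (illustrative)
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "ad_spend": rng.uniform(10, 100, 150),
    "region": rng.choice(["North", "South", "West"], 150),
})
df["sales"] = 3 * df["ad_spend"] + rng.normal(0, 20, 150)

# One color per region makes group-specific patterns visible
for region, grp in df.groupby("region"):
    plt.scatter(grp["ad_spend"], grp["sales"], label=region, alpha=0.6)
plt.xlabel("Advertising spend")
plt.ylabel("Sales")
plt.legend()
plt.show()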

📌Contingency Tables (Cross-tabulation): Contingency tables, or cross-tabulations, provide a fundamental way to examine the relationship between categorical variables. By displaying the frequency distribution of categories intersecting across two variables, you can get a sense of potential dependencies, which might translate into multicollinearity when the categories are converted to dummy variables for regression models.

Example: In healthcare, analysts might cross-tabulate data on patient outcomes against treatment types to see if certain treatments are more common for specific outcomes, potentially indicating multicollinearity if used as predictors. For instance, the contingency table shows a notably higher number of patients improving with Treatment A (30 patients) compared to Treatments B and C (10 and 5 patients, respectively). Conversely, Treatment C appears to be the most common treatment for patients whose condition remained unchanged (20 patients) and worsened (15 patients). This pattern suggests that including these treatment types as dummy variables in a regression model could introduce multicollinearity.


Tools:
- Excel: Use PivotTables to create cross-tabulations.
- R: Use the table() function or xtabs() for more complex cross-tabulations.
- Python: Use the crosstab() function from Pandas.
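
In Python, for instance, a cross-tabulation of treatment type against outcome is a one-liner; the values below are toy entries in the spirit of the example above, not the article’s full table:

import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({
    "treatment": ["A", "A", "B", "C", "C", "B", "A", "C"],
    "outcome": ["Improved", "Improved", "Unchanged", "Worsened",
                "Unchanged", "Improved", "Unchanged", "Worsened"],
})
print(pd.crosstab(df["treatment"], df["outcome"]))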

📌Including Interaction Terms: When interaction effects between categorical and continuous variables are suspected of contributing to multicollinearity, interaction terms should be included in the model. Subsequent analysis of VIFs and regression coefficients for these terms can indicate whether their inclusion adequately addresses the issue without exacerbating multicollinearity.

Y = β0 + β1X + β2Z + β3(X × Z) + ϵ, where β0 is the intercept, β1 and β2 are the coefficients for variables X and Z respectively, β3 is the coefficient for the interaction term between X and Z, and ϵ is the error term.

Example: Suppose we are analyzing data on car sales. We want to see if the interaction between the car’s horsepower (HP) and its body style (sedan or SUV) affects the sales price. The interaction term is created by multiplying the horsepower (a continuous variable) by a binary indicator for the body style. For instance, if ‘sedan’ is coded as 0 and ‘SUV’ as 1, the interaction term for each car is horsepower multiplied by the SUV indicator. In Excel, you would simply create the interaction term by entering the formula: =A2*B2 and dragging down for all rows (assume horsepower is in column A and body style (0 for sedan, 1 for SUV) is in column B). In R, you can create an interaction term within the regression model formula:

model <- lm(sales_price ~ hp * suv, data = data)  # hp * suv expands to hp + suv + hp:suv
summary(model)

In Python, you can use Pandas to create a new column for the interaction term and then fit a model using statsmodels or scikit-learn.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Create the interaction term
df['hp_suv_interaction'] = df['hp'] * df['suv']

X = df[['hp', 'suv', 'hp_suv_interaction']]
y = df['sales_price']
model = LinearRegression()
model.fit(X, y)

📌Scenario Analysis: Scenario analysis in regression modeling involves manipulating input variables to assess changes in outputs like coefficient stability, standard errors, p-values, and overall model fit, such as R² and F-test results. Consider our scenario from before where you model house prices using features such as square footage, number of rooms, and property age. You first run a regression with all three variables. Next, you exclude one variable at a time to see how it affects the standard error of the coefficients for the remaining variables. In the full model, the R² is 0.85, showing a good fit. If you remove the number of rooms and the R² only drops slightly to 0.84, but the standard error of the square footage coefficient decreases significantly and its p-value improves, this indicates a multicollinearity issue between square footage and the number of rooms. This analysis helps you assess the model’s robustness and allows reliable interpretations and decisions based on statistical tests.
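
A compact way to run such a scenario comparison in Python is to fit the full and reduced models side by side with statsmodels and compare R² and the standard errors. The sketch below uses synthetic housing data with illustrative column names:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic housing data: sqft and rooms are generated to be correlated (illustrative)
rng = np.random.default_rng(4)
sqft = rng.normal(1800, 400, 300)
rooms = np.round(sqft / 450 + rng.normal(0, 0.5, 300))
age = rng.integers(1, 60, 300)
price = 150 * sqft + 5000 * rooms - 800 * age + rng.normal(0, 30000, 300)
df = pd.DataFrame({"price": price, "sqft": sqft, "rooms": rooms, "age": age})

full = smf.ols("price ~ sqft + rooms + age", data=df).fit()
reduced = smf.ols("price ~ sqft + age", data=df).fit()

# Compare fit and the precision of the sqft coefficient across scenarios
print("R²:", round(full.rsquared, 3), "vs", round(reduced.rsquared, 3))
print("SE(sqft):", round(full.bse["sqft"], 2), "vs", round(reduced.bse["sqft"], 2))

If the R² barely moves while the standard error of the sqft coefficient drops noticeably, that is the multicollinearity signal described above.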

Advanced Techniques

📌Regularization Techniques: Ridge Regression and Lasso are methods that are particularly useful for dealing with multicollinearity at a larger scale. Both work by adding a penalty to the regression model, which reduces the magnitude of the coefficients: Ridge Regression adds the squared magnitudes of the coefficients as a penalty term to the loss function, while Lasso adds their absolute values. This process tends to shrink the impact of the less important variables on the predictive model. Observing how coefficient estimates change when applying these regularization techniques can provide significant insight into the severity of multicollinearity in your model. For a deeper dive into Ridge Regression and Lasso, consider reading my articles Introduction to Regularization and Ridge and Lasso.
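
To see the effect in practice, a short sketch with synthetic data containing two nearly collinear predictors compares ordinary least squares against Ridge and Lasso from scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic predictors: x2 is almost a copy of x1, creating strong collinearity
rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
x3 = rng.normal(size=200)
X = StandardScaler().fit_transform(np.column_stack([x1, x2, x3]))
y = 2 * x1 + 0.5 * x3 + rng.normal(scale=0.5, size=200)

# The penalized models shrink the unstable, overlapping coefficients
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.05)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))

If the OLS coefficients on the two collinear columns swing wildly while Ridge keeps them stable, or Lasso zeroes one of them out, that is a strong hint of multicollinearity.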

📌Principal Component Analysis (PCA): PCA is a technique used in unsupervised machine learning, which means it operates without needing predefined labels or categories for the data. Essentially, PCA helps simplify complex data sets by reducing the number of variables you need to consider, while still retaining the essential parts of all the information. Imagine you have a vast and crowded bookshelf with books piled up in no specific order. PCA is like organizing this shelf so that the most important books are on the top shelf, making them easy to access, and each subsequent shelf has books that are progressively less vital.

In technical terms, PCA transforms the original variables into a new set of variables (the principal components). These components are uncorrelated, meaning they do not overlap in the information they convey, which is beneficial when dealing with multicollinearity in predictive modeling. The first principal component captures the most variance (spread) in the data, the second captures the second most, and so forth, each orthogonal to (i.e., uncorrelated with) the previous ones. This is useful in regression analysis because it allows the most important patterns in the data to emerge while the less important patterns, often just noise, are minimized or ignored. Additionally, it can improve the performance of other machine learning models by eliminating multicollinearity and reducing overfitting, where a model is fit too closely to a limited set of data points.
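
A minimal PCA sketch in Python, using synthetic, overlapping health-style metrics, shows how correlated columns collapse into a few uncorrelated components:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Four synthetic metrics that all share one underlying signal (illustrative)
rng = np.random.default_rng(6)
base = rng.normal(size=(300, 1))
X = np.hstack([base + rng.normal(scale=0.3, size=(300, 1)) for _ in range(4)])

# Standardize, then project onto uncorrelated principal components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
components = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_.round(2))  # share of variance captured per component

The first component will capture most of the shared variance; using it instead of all four raw metrics as a predictor removes the collinearity among them.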

Final Thought

While multicollinearity does not impact the predictive power of a model directly, it significantly complicates the interpretation of model coefficients, which can mislead decision-making processes. The ability to refine and adapt models based on your understanding of multicollinearity ultimately leads to better decision-making processes across various fields of application. After addressing this issue, it is important to perform diagnostic tests, such as re-evaluating Variance Inflation Factors (VIF) and analyzing residual plots, to ensure effective management. Additionally, validate your model with out-of-sample tests to confirm that it remains accurate on new data and that adjustments for multicollinearity do not hinder its generalizability. Ultimately, disentangling relationships in datasets requires repeated analysis.

👣Finance

In the investment industry, analysts often utilize correlation matrices to understand the relationships between different financial assets, such as stocks, bonds, or commodities. For instance, let’s consider an investment analyst looking to build a diversified portfolio. The analyst may start by examining the correlation matrix of stock returns from various sectors such as technology, healthcare, and consumer goods. If the correlation matrix reveals high correlations between stocks in the technology sector, it suggests that these stocks tend to move in tandem, possibly due to similar market influences affecting all technology stocks simultaneously.

This understanding aids in portfolio construction by preventing inadvertent concentration in assets that behave similarly, which increases risk. In practice, an analyst may decide to reduce exposure to highly correlated technology stocks or introduce assets from sectors with lower correlations to these stocks. This approach can enhance portfolio diversification and reduce the risk of significant losses that occur when one sector underperforms.

👣Marketing

In marketing, understanding the interaction between different campaign variables is important, and contingency tables are a useful tool. A marketing manager might analyze the effectiveness of an ad campaign across various channels like social media, email, and television. By using contingency tables to cross-tabulate the campaign channels with customer engagement metrics such as clicks and conversions, the manager can identify if certain channels consistently result in higher engagement.

For instance, the contingency table might reveal that email campaigns have a high frequency of conversions but only within a specific age group. This insight could indicate a dependency between the channel (email) and the demographic characteristic (age). This dependency might introduce multicollinearity if both are included as predictors in a regression model aimed at evaluating the effectiveness of marketing channels.

👣Healthcare

In healthcare data analysis, Principal Component Analysis (PCA) is a powerful tool used to reduce dimensionality in datasets with many correlated variables, such as patient health metrics (blood pressure, cholesterol levels, body mass index, etc.). A healthcare researcher might use PCA to distill these metrics into a smaller set of uncorrelated components, each representing a combination of variables that captures the most variance in the data.

For example, PCA could reveal that the first principal component mainly captures variation in metabolic syndrome-related traits (like blood pressure, waist circumference, and cholesterol levels). This component can then be used as a single predictor in a model predicting cardiovascular risk, rather than using all the highly correlated metrics separately. This approach not only simplifies the model but also helps avoid multicollinearity, making the statistical analyses more reliable and interpretable.

Please find my next article in this series HERE.

Prof. Frenzel

Data Scientist | Engineer - Professor | Entrepreneur - Investor | Finance - World Traveler