Prof. Frenzel
13 min read · Apr 6, 2024
What Every Business Analyst Must Know — Part 2: Dealing With Complexity

A few weeks ago, I published an article about essential statistical measures every Analyst should know and received very positive feedback. So, I decided to create an entire series for every junior analyst out there, focusing more on practical application and collaboration with GenAI.

As data analysts, we often work with large datasets, searching for actionable insights and building predictive capabilities. I’m sure many of you might start by using Python or R to apply Regularization Methods or Dimensionality Reduction algorithms like PCA. However, this article concentrates on the basic principles every analyst should understand before reaching for Machine Learning and AI. Whether you work in descriptive or predictive analytics, for example when building a regression model, you will likely find the following strategies helpful. Most industries continue to rely on Excel, and I firmly believe this will not change soon, so I will specifically address the limitations of Excel here as well.

The job of an analyst is to take what is complex and make it simple.

1️⃣Variable Selection Strategies

Simplification begins with a fundamental tool that must be in every analyst’s vocabulary: correlation. It’s a very useful concept because it helps us with 📌Variable Selection through Correlation Analysis. In the context of predictive modeling, checking for Multicollinearity is a step you cannot skip when working with multiple predictors. This process not only helps in identifying which variables have a stronger linear relationship with the outcome — your dependent variable (y) — but also supports you in prioritizing them in the predictive modeling process. In a few words: Multicollinearity occurs when some of your predictors are too similar to each other, leading to unreliable coefficient estimates and making them difficult to interpret. Excel doesn’t have a built-in multicollinearity check, but you can build a correlation matrix in Excel, or compute variance inflation factors (VIF) in more advanced statistical software, before you start building. If you find pairs of predictors with very high correlation coefficients, consider removing one of them from your model. Ultimately, a careful balance of your variables ensures that each feature contributes unique insights and enhances the reliability of your model.

How to create a Correlation Table in Excel

For example, if “years with company” and “age” are both predictors of employee turnover and highly correlated, you might choose to include only one in your model to maintain its integrity. This is because both variables essentially provide the same information about the employee. Another illustrative pair is “Monthly Income” and “Salary”. These two might also be highly correlated, as they essentially represent the employee’s earnings from slightly different perspectives. This redundancy does not add value to the model and instead can complicate the estimation of parameter effects and their significance. As a general guideline, examine the correlation coefficients between all pairs of independent variables. If any pair shows a coefficient above 0.9 in absolute value (indicating a very strong correlation), this is a sign of multicollinearity that needs attention.

Example: Correlation Table Employee Turnover
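If you prefer to run this check outside Excel, here is a minimal Python sketch of the correlation matrix and the VIF calculation (pandas and statsmodels assumed; the data and column names are made up for illustration):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical employee-turnover predictors (illustrative data only)
df = pd.DataFrame({
    "age":                [25, 32, 41, 29, 50, 38, 45, 27],
    "years_with_company": [1, 6, 15, 3, 22, 10, 18, 2],
    "monthly_income":     [3200, 4100, 5200, 3600, 6800, 4800, 6100, 3300],
})

# 1) Correlation matrix: look for pairs with |r| above 0.9
print(df.corr().round(2))

# 2) Variance inflation factors: as a rule of thumb, VIF above ~5-10 signals multicollinearity
X = add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const").round(2))
```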

Incorporating 📌Domain Knowledge for Variable Selection offers another layer of refinement in the model-building process. This strategy emphasizes the value of theoretical understanding and industry insights over purely statistical methods. Yes, that means you might have to use our good friend Google or our new companion ChatGPT to build up some domain knowledge. This allows you to leverage existing research and domain expertise to ensure that the variables you integrate into the model are not just statistically significant but also relevant and justified within the specific context of the analysis. This approach can significantly enhance the model’s applicability and credibility, particularly in scenarios where statistical evidence might suggest multiple plausible predictors. In the new context of AI and cloud computing, this is one of the key added values you should (and must) deliver as a business analyst.

The practice of 📌Chunking Variables into Groups serves as a practical solution for managing large datasets. This method involves organizing variables into smaller, logically connected clusters, based on their statistical relationships or their relevance to certain themes or aspects of the business issue. For example, in a customer satisfaction study, an analyst might separate variables related to customer service from those related to product quality. Each cluster is then analyzed separately through regression analysis, allowing for a detailed examination of how subsets of variables affect the outcome. After identifying the most significant variables within these clusters, combining them into a final model helps maintain simplicity while capturing essential data patterns. This method allows for a detailed analysis without surpassing Excel’s variable limit in regression models, facilitating a comprehensive yet efficient approach to modeling complex data sets.

📌Prioritizing High-Impact, Low-Complexity Variables. When exploring a long list of explanatory variables, it’s important to identify those that offer high impact with relatively low complexity. This approach is based on the Pareto Principle, or the 80/20 rule, which suggests that about 80% of effects come from 20% of causes. In data analysis, this means a few variables usually account for most of your model’s explanatory power.

Identify the ‘low hanging fruits’ — those variables that promise high impact with relatively low complexity

Start by reviewing all explanatory variables and prioritize those with the potential for high impact that are also technically easier to leverage. For instance, variables that show a clean, almost linear relationship with the dependent variable are prime candidates. They are typically easier to explain and validate, especially if they are numeric rather than categorical, eliminating the immediate need for dummy coding. This not only makes the initial stages of model development simpler but also provides immediate value, allowing for early insights and potential quick wins in the analytical process. As you refine your model, you should then gradually include more complex variables, including those categorical in nature, to capture additional details in your analysis.
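A quick way to shortlist these candidates is to rank the numeric predictors by the strength of their linear relationship with the dependent variable. Here is a minimal Python sketch (pandas assumed; the data and column names are made up):

```python
import pandas as pd

# Hypothetical dataset: 'turnover' is the dependent variable (illustrative only)
df = pd.DataFrame({
    "turnover":        [0, 1, 0, 1, 0, 1, 0, 1],
    "overtime_hours":  [2, 14, 3, 16, 1, 12, 4, 18],
    "commute_minutes": [15, 45, 20, 50, 10, 40, 25, 55],
    "training_hours":  [30, 5, 28, 8, 35, 6, 25, 4],
})

# Rank predictors by the absolute strength of their correlation with the target
ranking = (
    df.corr()["turnover"]   # correlation of every column with the target
      .drop("turnover")     # remove the target's self-correlation
      .abs()                # direction doesn't matter for prioritization
      .sort_values(ascending=False)
)
print(ranking)  # start modeling with the variables at the top of this list
```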

2️⃣Data Simplification Methods

📌Chunking involves organizing variables into coherent groups based on their correlation or thematic relevance to facilitate a focused analysis. For instance, in a customer satisfaction study, variables related to service quality could be grouped separately from those related to product quality. This segmentation allows for a phased approach, where insights from each cluster contribute to the comprehensive analytical model, making sure that we consider each group’s unique contributions. You would then run regressions or conduct descriptive analyses within these smaller sets. This strategy not only makes the dataset more manageable but also aids in identifying which groups of variables are most influential, allowing for a more targeted approach in the regression modeling process. Excel’s SORT function and PivotTables can help you find these clusters efficiently.

Chunk Your Dataset Into Clusters For Analysis
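Outside Excel, the same idea can be sketched in a few lines of Python (pandas assumed; the clusters and survey columns below are hypothetical):

```python
import pandas as pd

# Hypothetical customer-satisfaction survey (illustrative data only)
df = pd.DataFrame({
    "overall_satisfaction": [7, 4, 8, 3, 9, 5],
    "wait_time_score":      [6, 3, 8, 2, 9, 4],   # service-related
    "staff_friendliness":   [7, 5, 8, 3, 9, 5],   # service-related
    "product_durability":   [8, 6, 7, 5, 9, 6],   # product-related
    "product_design":       [7, 5, 8, 4, 8, 5],   # product-related
})

# Chunk the predictors into thematic clusters
clusters = {
    "service": ["wait_time_score", "staff_friendliness"],
    "product": ["product_durability", "product_design"],
}

# Analyze each cluster separately before building the combined model
for name, cols in clusters.items():
    corr = (
        df[cols + ["overall_satisfaction"]]
        .corr()["overall_satisfaction"]
        .drop("overall_satisfaction")
    )
    print(f"{name} cluster correlations with satisfaction:\n{corr.round(2)}\n")
```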

📌Binning of Categories becomes a valuable technique when dealing with categorical variables that contain many levels. This method involves consolidating these categories into broader, more general groups, effectively reducing the number of dummy variables required for regression. Such a strategy simplifies the model and limits the number of variables, making it more manageable within Excel’s framework. Binning not only streamlines the modeling process but is also often the only way to deal with software limitations, such as Excel’s Data Analysis add-in for regression modeling, which can only handle 16 independent variables at a time. For example, a retail analyst examining customer age demographics might bin ages into groups like ‘18–30’, ‘31–45’, and ‘46–60’ to simplify analysis. A marketing analyst might classify a wide range of customer income levels into ‘Low’, ‘Medium’, and ‘High’. Note that unlike chunking, which groups variables, binning consolidates the levels within a categorical variable, streamlining the input for regression analysis. I find the XLOOKUP or IF functions in Excel most helpful for creating these bins.

Example: Binning variables using XLOOKUP
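If you prototype outside Excel, the same binning takes one function call in Python (pandas assumed; the bin edges mirror the age groups above):

```python
import pandas as pd

# Hypothetical customer ages (illustrative only)
ages = pd.Series([22, 29, 35, 41, 47, 58, 19, 60])

# Consolidate many distinct ages into three broad groups
age_groups = pd.cut(
    ages,
    bins=[17, 30, 45, 60],             # intervals (17, 30], (30, 45], (45, 60]
    labels=["18-30", "31-45", "46-60"],
)
print(age_groups.value_counts())
```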

The approach of 📌Aggregating Similar Variables offers a path to managing a dense dataset. This method involves combining variables that measure similar concepts into a single predictor, either through creating a composite score or calculating their average. Such consolidation reduces the dataset’s complexity without sacrificing critical information. This strategy is particularly effective in scenarios where multiple variables contribute to the same underlying phenomenon. For example, consider a customer feedback analysis with multiple questions about satisfaction levels regarding different service aspects, like wait time, staff friendliness, and quality of service. By averaging the responses to these questions for each customer, you create a single “Overall Satisfaction Score.”
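A minimal Python sketch of such a composite score (pandas assumed; the column names are placeholders):

```python
import pandas as pd

# Hypothetical feedback survey with three related satisfaction questions
df = pd.DataFrame({
    "wait_time_satisfaction":   [4, 2, 5, 3],
    "staff_friendliness_score": [5, 3, 4, 3],
    "service_quality_rating":   [4, 2, 5, 2],
})

# Average the related questions into one predictor per customer
df["overall_satisfaction_score"] = df[
    ["wait_time_satisfaction", "staff_friendliness_score", "service_quality_rating"]
].mean(axis=1)

print(df)
```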

With a 📌data-driven approach to Variable Grouping, you identify and combine variables based on similar behavior or patterns across observations. This empirical approach to grouping stands apart by its reliance on data patterns rather than theoretical or predefined relationships. It often offers a deeper understanding of the dataset’s structure and uncovers new insights into the data. For example, consider a retail business analyzing customer transactions to improve stock management and marketing strategies. By examining purchase data, you might discover that certain products tend to be bought together more frequently than others, suggesting a correlation in purchasing behavior. Products with high correlation coefficients can be grouped into a single variable representing a category of products that appeal to similar customer segments. This method goes beyond the methods discussed above because it enables you to group variables across different dimensions, such as time (e.g., products that sell well in particular seasons) or customer demographics (e.g., products preferred by a specific age group). Advanced Excel users might leverage add-ins or external tools for cluster analysis to perform more sophisticated grouping based on multivariate patterns, then import these groups back into Excel for further analysis.
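As a rough illustration of this idea, the sketch below uses Python with scikit-learn’s KMeans rather than the Excel add-ins mentioned above, grouping products by how similarly their sales move together; the product names and figures are made up:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical weekly sales per product (rows = weeks, columns = products)
sales = pd.DataFrame({
    "sunscreen":   [80, 95, 120, 30, 20, 15],
    "beach_towel": [70, 90, 110, 25, 18, 12],
    "umbrella":    [10, 12, 15, 80, 95, 100],
    "rain_boots":  [8, 10, 14, 85, 90, 105],
})

# Cluster products whose sales move together (transpose: one row per product)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(sales.T)

groups = pd.Series(labels, index=sales.columns, name="group")
print(groups)  # products in the same group can be aggregated into one variable
```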

The strength of data-driven variable grouping lies in its ability to adapt dynamically to the underlying data structure, and the resulting composite variables may even turn out to be stronger predictors than any individual metric on its own. This approach can lead to more effective targeting in marketing campaigns, improved product placement, and inventory management that is closely aligned with customer behavior patterns.

3️⃣Handling Categorical Data

Categorical data presents unique challenges in regression analysis because it cannot enter a regression model without transformation. Categorical variables often have several dozen levels (e.g., car maker) or even hundreds of levels (car model), often show non-linear behavior, and meaningful levels may not be represented well enough (see imbalanced datasets).

When the analysis of descriptive data highlights an effect on the y-variable, 📌constructing dummy variables becomes a mandatory step. This process transforms categorical variables into a series of binary variables that can be more easily digested by your regression model, a technique also known as one-hot encoding. The need for dummy variables comes from the fact that regression models need numerical input to evaluate the relationships between variables. Categorical data, which represents qualitative attributes like product categories or geographic regions, does not meet this requirement. Dummy variables allow these attributes to be encoded numerically, making it possible to include them in the analysis.

For example, consider a dataset with a categorical variable representing regions where a company operates, such as North, South, East, and West. To include this in a regression analysis, you would create separate dummy variables for each region except one, which serves as the base or reference category. The choice of the base category does not change the model’s fit, but it is crucial for interpretation, as the effects of the other categories are measured relative to it. If “North” is chosen as the base, then the dummy variables for South, East, and West indicate the presence of each region with a 1 (and 0 if not present), with the understanding that when all dummy variables are 0, the observation belongs to the North region.

In Excel, this transformation can be efficiently managed using the IF function for smaller datasets, e.g. with =IF(A2="South",1,0), or XLOOKUP for larger, more complex data. These functions check each record’s category and assign a 1 if it matches the dummy variable being created and a 0 otherwise.
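If you handle the encoding outside Excel, the same transformation takes one line in Python with pandas (a sketch; the region data is hypothetical):

```python
import pandas as pd

# Hypothetical dataset with a region column
df = pd.DataFrame({"region": ["North", "South", "East", "West", "South", "North"]})

# One-hot encode the region, then drop "North" so it serves as the base category
dummies = pd.get_dummies(df["region"], prefix="region", dtype=int)
dummies = dummies.drop(columns=["region_North"])

print(dummies)  # a row of all zeros means the observation belongs to the North
```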

📌Ordinal Encoding: If the categorical variables are ordinal (i.e., the categories have a natural order such as Very Unsatisfied to Very Satisfied), you might consider ordinal encoding, where each category is assigned a numerical value that reflects its order. This can reduce the number of variables introduced into the model compared to dummy coding. However, this approach assumes equal spacing between the categories, which may not always be appropriate, and it should be used with caution. While this method simplifies the inclusion of ordinal data in models, it can introduce bias if the assumed ordinal relationship does not accurately represent the variable’s impact on the outcome, potentially misleading the analysis.
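A minimal ordinal-encoding sketch in Python (pandas assumed; the five-point scale and its mapping are illustrative):

```python
import pandas as pd

# Hypothetical survey responses on an ordered scale
df = pd.DataFrame({
    "satisfaction": ["Very Unsatisfied", "Neutral", "Satisfied",
                     "Very Satisfied", "Unsatisfied"],
})

# Map each level to a number that preserves the natural order
order = {
    "Very Unsatisfied": 1,
    "Unsatisfied":      2,
    "Neutral":          3,
    "Satisfied":        4,
    "Very Satisfied":   5,
}
df["satisfaction_encoded"] = df["satisfaction"].map(order)

print(df)  # note: this assumes equal spacing between adjacent levels
```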

📌Grouping Low-frequency Categories: This technique is particularly useful when a categorical variable includes several levels that occur infrequently. Combining these into a single “Other” category simplifies your analysis and avoids the pitfalls of overfitting your model with categories that have minimal data. Generally, a category is considered low-frequency if it contains so few observations that its effect on the regression model’s outcomes cannot be reliably estimated. This threshold varies by context but often applies to categories representing less than 5% of the dataset or fewer than 30 observations. The consequences of not addressing low-frequency categories can be significant. In regression analysis, these sparse categories can lead to inflated standard errors and, consequently, unreliable p-values for the associated coefficients. This uncertainty can obscure the true relationship between predictors and the dependent variable and potentially lead to misleading recommendations to your client.

For instance, if you are analyzing sales data across 50 regions and some regions contribute only 1% of the data, these regions might be grouped into an “Other” category. Excel’s XLOOKUP function, along with conditional formatting, can help identify and consolidate these categories efficiently (see the binning example above). This adjustment allows for a more generalized understanding of regional trends without losing focus on the main insights.
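Here is one way to sketch this consolidation in Python (pandas assumed; the 5% cutoff follows the rule of thumb above, and the data is made up):

```python
import pandas as pd

# Hypothetical region labels, some of them very rare
regions = pd.Series(["A"] * 12 + ["B"] * 9 + ["C"] * 2 + ["D"] + ["E"])

# Share of observations per region
freq = regions.value_counts(normalize=True)

# Replace every region below the 5% threshold with "Other"
rare = freq[freq < 0.05].index
consolidated = regions.where(~regions.isin(rare), "Other")

print(consolidated.value_counts())
```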

4️⃣Model Building and Refinement

After simplifying the data and handling categorical variables, the next phase in analytics involves constructing and refining your model to ensure it is both accurate and robust.

Addressing the limitations imposed by Excel’s add-on functionalities requires innovative approaches, one of which is the application of Stepwise Regression within Excel’s limitations. While Excel might not support a fully automated stepwise regression process, you can approximate this technique by manually testing the addition or removal of feature variables based on criteria such as the p-value, typically using a threshold like 0.05 for inclusion. This involves using the regression tool in Excel’s Data Analysis ToolPak, where you can assess the impact of adding or removing variables on the model’s adjusted R-squared value.

Stepwise Regression Methods

📌Manual Forward Selection starts with an empty model, gradually adding variables one at a time based on their predictive power and significance. Begin by selecting the variable that exhibits the strongest correlation with the dependent variable. Subsequently, add more variables one by one, examining the adjusted R-squared value after each addition to assess any improvement in the model’s explanatory power. Because of Excel’s limitations in dealing with many variables at once, you might need to evaluate subsets of variables in separate models. This approach to stepwise regression allows for a thorough exploration of potential predictors, making sure the model stays both robust and interpretable.

Conversely, 📌Manual Backward Elimination begins with a full model, including all potential predictors, and systematically removes the least significant variables one at a time. This process helps you trim down to a leaner set of variables that still captures the essence of your data. However, it requires you to closely monitor changes in model metrics, such as p-values and the adjusted R-squared value, to ensure the model’s predictive accuracy does not deteriorate with the removal of variables. Variables with p-values above a certain threshold (commonly 0.05) do not contribute significantly to the model. You would then remove the variable with the highest p-value and rerun the regression, repeating this process until all remaining variables have p-values below the threshold. ⚠️Many (less effective) analysts lean towards removing all variables with a high p-value without asking questions. That should never be your standard approach. Our job is to be critical and investigate. Understanding the reasons behind a feature’s high p-value is imperative, as it opens the door to potential improvements. Outliers in the data or a wrongly assumed linear relationship can inflate p-values, and both issues are often simple to resolve. So do not rush into systematically excluding variables without careful consideration.
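For readers working in Python rather than Excel, the backward-elimination loop can be sketched as follows (statsmodels assumed; the data and the 0.05 threshold are illustrative, and the caveat above about investigating high p-values still applies):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical dataset: y plus three candidate predictors
rng = np.random.default_rng(42)
n = 100
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),   # pure noise, should get eliminated
})
y = 2.0 * X["x1"] - 1.5 * X["x2"] + rng.normal(scale=0.5, size=n)

# Backward elimination: drop the worst predictor until all p-values < 0.05
predictors = list(X.columns)
while predictors:
    model = sm.OLS(y, sm.add_constant(X[predictors])).fit()
    pvalues = model.pvalues.drop("const")
    worst = pvalues.idxmax()
    if pvalues[worst] < 0.05:
        break                    # every remaining variable is significant
    predictors.remove(worst)     # in practice: investigate before removing!

print(predictors)                # variables kept in the final model
print(model.summary())
```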

5️⃣Validation and Interpretability

In the analysis process, 📌validation and testing for predictive accuracy are important, especially when considering how your model will perform in practical situations. Imagine your model as a student. The training dataset is what it learns from, and the testing dataset is like an exam to assess its knowledge. During model development, using a testing set separate from the training set helps you assess how well your model adapts to new, unseen data (see ‘Resampling’ articles). This is important because a model can perform very well on the training dataset by memorizing it rather than understanding the underlying patterns — similar to a student who memorizes facts without grasping the concepts. This distinction matters most once the model is deployed, because testing on held-out data gives you confidence that the model will remain effective and reliable when analyzing real-world data. So, by splitting your dataset and evaluating your trained model on the testing dataset, you gain insight into the model’s generalizability. It also helps you identify areas for improvement, making it a critical part of the analysis process.
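A minimal sketch of such a train/test split in Python (scikit-learn assumed; the 80/20 split is just a common default):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: one predictor with a noisy linear relationship to y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

# Hold out 20% of the data as the "exam" the model has never seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
print("R-squared on training data:", round(model.score(X_train, y_train), 3))
print("R-squared on unseen data:  ", round(model.score(X_test, y_test), 3))
```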

Lastly: 📌Model Interpretability Over Complexity. A question I often get from students and junior analysts in this field is, “How can I increase my RSQ?” My answer: While complex models may boast high accuracy, they often come at the cost of interpretability. A model’s value lies not just in its predictive power but also in its ability to provide insights that can inform business decisions. A complex model might perform slightly better in terms of predictive accuracy but can become a black box, difficult to understand and act upon. Strive for simplicity — or, in more technical terms, 📌’model parsimony’ — where possible, choosing a model that provides clear, actionable insights.

Model Interpretability Over Complexity

But this search for simplicity does not mean compromising the quality of insights. On the contrary, it often leads to more robust, universally applicable findings. Consider, for example, a retail company aiming to enhance customer loyalty. A complex model might analyze dozens of variables, from purchase history to minute demographic details. However, a simpler model focusing on key factors such as frequency of purchases, average spend, and customer feedback scores can often provide more straightforward, actionable insights. This simpler model, by concentrating on variables that directly correlate with loyalty, allows business managers to easily identify and implement strategies to improve customer retention. Such a model, despite its apparent simplicity, can dramatically improve strategic decision-making processes, demonstrating that sometimes, less is indeed more.

Please find my next article in this series HERE.
