Dear friends!
Have you ever tried a new marketing strategy and questioned whether the boost in sales was a result of your actions or just luck? Think about running an ad campaign and seeing a higher average customer spend. How can you tell if this growth is meaningful or just a coincidence? Or what if you invest in a stock and its price goes up? How do you know if it’s because of your smart prediction or simply market trends?
Statistical Significance
Statistical significance helps you, as a data analyst, determine whether observed effects are the result of actual business decisions or just chance. I believe the value of this information is clear: understanding the concepts behind it allows you to separate meaningful insights from random noise, leading to reliable, evidence-based decisions. Despite all the fancy analytics and machine learning tools out there, many analysts still go back to their college statistics classes to brush up on Probability Theory, Inferential Statistics, Experimental Design, Hypothesis Testing, or Bayesian Statistics.
Designing Experiments and Studies
Setting Up Hypotheses
Formulating clear and testable hypotheses is the first step in any statistical analysis. Hypotheses provide a framework for the research question and guide the analysis. In business analytics, this process begins by defining two competing hypotheses: the null hypothesis (H₀) and the alternative hypothesis (H₁). The null hypothesis typically asserts that there is no effect or no difference in the variable of interest, acting as a default or baseline assumption. In contrast, the alternative hypothesis posits that there is an effect or a difference. This hypothesis represents the outcome you seek to provide evidence for through data analysis.
For example, suppose your company wants to test the effectiveness of a new marketing campaign on sales. The null hypothesis (H₀) might state that the marketing campaign has no impact on sales, implying that any observed changes are due to random variation or other factors. The alternative hypothesis (H₁) would state that the marketing campaign has increased sales, suggesting that the campaign has a measurable effect. These hypotheses set the stage for data collection and analysis, guiding the business in determining whether the new marketing strategy is effective.
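These two hypotheses translate directly into a one-sided test. Here is a minimal sketch in Python with scipy; the weekly sales figures are made up purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical weekly sales (in $1,000s) before and after the campaign
sales_before = np.array([52, 48, 55, 50, 47, 53, 49, 51])
sales_after = np.array([56, 54, 58, 52, 55, 57, 53, 59])

# H0: the campaign has no impact on sales
# H1: the campaign has increased sales (one-sided alternative)
t_stat, p_value = stats.ttest_ind(sales_after, sales_before, alternative="greater")

print(f"t = {t_stat:.2f}, one-sided p = {p_value:.4f}")
print("Reject H0: evidence that the campaign increased sales"
      if p_value < 0.05 else "Fail to reject H0")
```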
Choosing the Right Test
Using the right statistical test is fundamental for achieving valid results and proper hypothesis testing. Different tests are suited to different data types and research questions, and meeting their assumptions helps prevent invalid conclusions. It also strengthens the generalizability of the findings to the broader population. Common tests include the following (each is illustrated in the code sketch after this list):
- T-tests: Used to compare the means of two groups. For instance, a company might use a t-test to compare average sales before and after implementing a new marketing strategy. This test helps determine if the observed difference in means is statistically significant.
- Chi-square tests: Used for categorical data to assess the relationship between two variables. For example, a business might employ a chi-square test to examine whether customer satisfaction levels differ across various regions. This test checks if the observed frequencies differ from expected frequencies under the null hypothesis.
- ANOVA (Analysis of Variance): Used to compare the means of three or more groups. A company could use ANOVA to compare customer satisfaction scores across different product lines. This test determines if at least one group mean is statistically different from the others, indicating a significant effect of the product line on satisfaction.
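To make these three tests concrete, here is a minimal sketch in Python with scipy; the sales figures, satisfaction counts, and scores are simulated purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# T-test: average sales before vs. after a new marketing strategy
sales_before = rng.normal(loc=100, scale=15, size=50)
sales_after = rng.normal(loc=110, scale=15, size=50)
t_stat, p_ttest = stats.ttest_ind(sales_before, sales_after)
print(f"t-test: t = {t_stat:.2f}, p = {p_ttest:.4f}")

# Chi-square test: satisfied vs. not satisfied customers across three regions
observed = np.array([[90, 60, 104],   # satisfied (North, South, East)
                     [10, 40, 26]])   # not satisfied
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square: chi2 = {chi2:.2f}, p = {p_chi2:.4f}")

# ANOVA: customer satisfaction scores across three product lines
line_a = rng.normal(7.5, 1.0, 40)
line_b = rng.normal(7.8, 1.0, 40)
line_c = rng.normal(8.4, 1.0, 40)
f_stat, p_anova = stats.f_oneway(line_a, line_b, line_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")
```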
Advanced Techniques
Non-Parametric Tests
Non-parametric tests are statistical tests that do not assume a specific distribution for the data. These tests are useful when the data doesn’t meet the assumptions required for parametric tests, such as normality.
📌The Mann-Whitney U test, also known as the Wilcoxon rank-sum test, is used to determine whether there is a significant difference between the distributions of two independent groups. Unlike the t-test, it doesn’t require the assumption of normally distributed data. For example, you might use the Mann-Whitney U test to compare customer satisfaction ratings between two different stores, determining if one store significantly outperforms the other.
The image above shows how to visualize the test using the R library ‘ggbetweenstats’. It compares the salary distributions of two job classes: Industrial and Information. The violin plots illustrate the distribution shapes, while the red dots and black lines represent the mean and median salaries, respectively. The Mann-Whitney U test is particularly useful here because the salary data may not be normally distributed, which violates the assumptions of the t-test. The test results indicate a significant difference between the two groups, with the Information job class having higher salaries on average. This visualization helps in understanding the statistical difference and the distribution of salaries across the two job classes.
📌The Kruskal-Wallis test is an extension of the Mann-Whitney U test that allows for comparisons among three or more independent groups. This test ranks all data points and then evaluates if there are statistically significant differences between the groups’ medians. A business could apply this test to compare the effectiveness of three different training programs on employee performance, identifying which program leads to the highest improvement.
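Both tests are available in scipy. A minimal sketch, with invented satisfaction ratings and training scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Mann-Whitney U test: customer satisfaction ratings at two stores
# (ordinal 1-10 ratings, no normality assumption needed)
store_a = rng.integers(4, 10, size=60)
store_b = rng.integers(6, 11, size=60)
u_stat, p_mw = stats.mannwhitneyu(store_a, store_b, alternative="two-sided")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {p_mw:.4f}")

# Kruskal-Wallis test: performance improvement under three training programs
program_1 = rng.normal(5, 2, 30)
program_2 = rng.normal(6, 2, 30)
program_3 = rng.normal(8, 2, 30)
h_stat, p_kw = stats.kruskal(program_1, program_2, program_3)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
```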
Interpreting Results
Understanding P-Values
The p-value represents the probability of obtaining test results at least as extreme as the observed results, assuming that the null hypothesis is true. A lower p-value indicates that the observed data is less likely under the null hypothesis. Typically, a p-value threshold (alpha level) of 0.05 is used, meaning if the p-value is below 0.05, the null hypothesis is rejected.
Let me use this illustration to explain the concept of p-values in statistical significance testing. The plot shows a probability density curve with a green shaded area representing the p-value. This area indicates the probability of obtaining results at least as extreme as the observed data point, assuming the null hypothesis is true. In the example of comparing sales before and after a marketing campaign, a p-value of 0.03 means that, if the campaign truly had no effect, there would be only a 3% chance of observing an increase in sales at least this large. Since this p-value is below the common threshold of 0.05, the null hypothesis is rejected, implying that the marketing campaign had a significant effect (so that is good news for you, money well spent).
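The shaded tail area can also be computed directly: the p-value is the probability mass of the null distribution beyond the observed test statistic. A minimal sketch, assuming a two-sample t-test on simulated before/after sales:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sales_before = rng.normal(100, 15, 40)
sales_after = rng.normal(108, 15, 40)

# Test statistic and p-value for the difference in means
t_stat, p_value = stats.ttest_ind(sales_after, sales_before)

# The p-value equals the shaded tail area under the null t-distribution
df = len(sales_before) + len(sales_after) - 2
tail_area = 2 * stats.t.sf(abs(t_stat), df)   # two-sided tail probability

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, tail area = {tail_area:.4f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```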
Confidence Intervals
Confidence intervals provide a range of values within which the true population parameter is expected to fall, with a certain level of confidence (usually 95%). Unlike a single p-value, confidence intervals give more information about the precision and variability of the estimate. A narrow confidence interval indicates a more precise estimate, whereas a wider interval suggests more variability.
The image shows a scatter plot with sample observations (blue dots), a regression line (red line), and the 95% confidence interval limits (dashed blue lines). The confidence interval indicates the range within which we expect the true regression line to lie with 95% confidence (see the math above). For example, if you were predicting sales based on advertising spend, the confidence intervals give you a range of plausible values for the increase in sales for a given increase in advertising spend. Your regression model might estimate that for every additional $1,000 spent on advertising, the predicted increase in sales ranges from $800 to $1,200 with 95% confidence. This means we are 95% confident that the true increase in sales for each $1,000 spent lies between $800 and $1,200.
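A minimal sketch of how such an interval can be obtained in Python with statsmodels; the advertising and sales numbers are simulated so that the true slope is about $1 of sales per $1 of spend:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated data: sales increase by roughly $1 per $1 of advertising spend
ad_spend = rng.uniform(1_000, 10_000, size=100)
sales = 5_000 + 1.0 * ad_spend + rng.normal(0, 2_000, size=100)

X = sm.add_constant(ad_spend)          # adds the intercept term
model = sm.OLS(sales, X).fit()

slope = model.params[1]
ci_low, ci_high = model.conf_int(alpha=0.05)[1]  # 95% CI for the slope
print(f"Estimated slope: {slope:.2f}")
print(f"95% CI per extra $1,000 of ad spend: "
      f"[{ci_low * 1000:.0f}, {ci_high * 1000:.0f}] dollars of sales")
```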
Statistical vs. Practical Significance
Junior analysts often make this mistake. They think a p-value alone can confirm a hypothesis. However, a p-value only shows the probability of observing the data, or something more extreme, assuming the null hypothesis is true. It does not measure the size of an effect or the importance of a result. A small p-value indicates that the observed data is unlikely under the null hypothesis, leading to its rejection, but it doesn’t confirm the practical significance of the findings. Practical significance assesses the size of the effect and its relevance in a practical context. For business decisions, an effect must be both statistically and practically significant to be valuable.
For instance, consider a regression model evaluating the impact of two different marketing strategies on sales. The model yields a p-value of 0.04 for the coefficient of the new marketing strategy, indicating statistical significance at the 5% level. However, the corresponding effect size reveals that the new strategy only increases sales by 0.5%. While the p-value suggests that the effect is unlikely to be due to chance, the practical impact is minimal. Without considering the effect size, stakeholders might mistakenly believe the new strategy is highly effective. The small increase in sales, although statistically significant, may not justify the costs or efforts of implementing the new strategy. This underscores the importance of evaluating both statistical significance (p-value) and practical significance (effect size) to make well-informed business decisions.
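This gap is easy to demonstrate: with a large enough sample, even a negligible lift produces a tiny p-value. A minimal sketch with simulated data (the 0.5% lift and sample sizes are purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Very large samples with a tiny true difference (~0.5% lift)
old_strategy = rng.normal(loc=100.0, scale=10.0, size=200_000)
new_strategy = rng.normal(loc=100.5, scale=10.0, size=200_000)

t_stat, p_value = stats.ttest_ind(new_strategy, old_strategy)
lift = new_strategy.mean() / old_strategy.mean() - 1
cohens_d = (new_strategy.mean() - old_strategy.mean()) / 10.0

print(f"p-value: {p_value:.2e}")                        # statistically significant
print(f"Lift: {lift:.2%}, Cohen's d: {cohens_d:.3f}")   # practically tiny effect
```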
Types of Errors in Hypothesis Testing
Type I Error
A Type I error, or false positive, occurs when the null hypothesis is incorrectly rejected when it is actually true. This means that the test suggests an effect or difference exists when it doesn’t. The probability of making a Type I error is denoted by the significance level (alpha), commonly set at 0.05. For example, if a business analyst concludes that a new advertisement significantly increases sales based on a p-value of 0.04, but the advertisement has no effect, this is a Type I error. The analyst has identified an effect that doesn’t exist, which could lead to poor business decisions.
Type II Error
A Type II error, or false negative, happens when the null hypothesis is not rejected when it is actually false. This means the test fails to detect an effect or difference that truly exists. The probability of a Type II error is denoted by beta (β), and 1 − β represents the test's power. For example, if a business analyst fails to detect the real impact of a new marketing strategy on sales because the sample size is too small, ending up with a p-value of 0.07, a Type II error has occurred. The analyst misses a potentially valuable strategy.
Balancing Errors
Reducing one type of error typically increases the probability of the other. Lowering the alpha level reduces Type I errors but increases the risk of Type II errors, and vice versa. Analysts must carefully weigh these trade-offs and the specific context of their analysis when determining acceptable error rates. This balance can be managed by adjusting the significance level and using adequate sample sizes. For important business decisions, analysts might use more stringent alpha levels or increase sample sizes to improve test power, minimizing the risk of both errors.
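One way to see this trade-off is to simulate it: run the same test many times when the null hypothesis is true (to estimate the Type I error rate) and when a real effect exists (to estimate power, i.e. 1 − β). A minimal sketch with illustrative numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
alpha, n, n_sim = 0.05, 30, 5_000

def rejection_rate(true_diff):
    """Fraction of simulated t-tests that reject H0 at level alpha."""
    rejections = 0
    for _ in range(n_sim):
        a = rng.normal(100, 15, n)
        b = rng.normal(100 + true_diff, 15, n)
        _, p = stats.ttest_ind(a, b)
        rejections += p < alpha
    return rejections / n_sim

print(f"Type I error rate (no true effect): {rejection_rate(0):.3f}")   # ~ alpha
print(f"Power (true effect of 10):          {rejection_rate(10):.3f}")  # 1 - beta
```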
Importance of Sample Size
Sample size significantly affects detecting statistical significance. Larger sample sizes increase the test power, making it more likely to detect a true effect. Smaller sample sizes can lead to Type II errors, where real effects go undetected. The test power, which is the probability of correctly rejecting the null hypothesis when it is false, is directly influenced by the sample size. Determining the appropriate sample size for a study is important for obtaining reliable results.
Various methods and tools are available to calculate sample sizes, such as power analysis, which considers the desired power level (commonly 0.80), the significance level (alpha), and the expected effect size. The required sample size 𝑛 per group for comparing two means can be expressed as:
𝑛 = 2 (𝑍𝛼/2 + 𝑍𝛽)² / 𝛿²
Where 𝑛 is the required sample size per group, 𝛿 is the standardized effect size, 𝜎 is the standard deviation used to standardize it, 𝑍𝛼/2 is the critical value for a two-tailed test at significance level 𝛼, and 𝑍𝛽 is the critical value for the desired power 1 − 𝛽.
Effect size (𝛿) quantifies the difference between two groups, providing a standardized measure. It is calculated as the difference between the means of the two groups divided by the standard deviation: 𝛿 = (𝜇1 − 𝜇2) / 𝜎. For example, if the mean of group 1 is 120 and the mean of group 2 is 130, with a standard deviation of 15, the effect size is 𝛿 = (120 − 130) / 15 = −0.67. This helps in calculating the sample size required to detect a significant difference.
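In practice this calculation is usually delegated to a power-analysis routine. A minimal sketch using statsmodels, plugging in the effect size from the example above (which comes out to roughly 36 observations per group under these assumptions):

```python
from statsmodels.stats.power import TTestIndPower

# Standardized effect size from the example: (120 - 130) / 15
effect_size = abs(120 - 130) / 15          # about 0.67

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size,
                                   alpha=0.05,
                                   power=0.80,
                                   alternative="two-sided")
print(f"Required sample size per group: {n_per_group:.0f}")
```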
👣 Example
Let’s review the statistical significance (or robustness) of the output of a regression model that was estimated to understand the impact of advertising spending on a company’s sales revenue. The goal is to determine if changes in advertising spending significantly predict changes in sales revenue.
- Formulate Hypotheses:
- Null hypothesis (H₀): Advertising spending has no significant effect on sales revenue.
- Alternative hypothesis (H₁): Advertising spending has a significant effect on sales revenue.
- Collect Data: Gather data on advertising spend and corresponding sales revenue over a specific period.
- Fit the Regression Model: Use a statistical software package to fit a linear regression model where sales revenue is the dependent variable and advertising spend is the independent variable.
The regression equation might look like this:
Sales Revenue = 𝛽0 + 𝛽1 (Advertising Spend) + 𝜖
- Review the Output: Look at the regression output, focusing on the coefficient for advertising spend (𝛽1), its standard error, t-statistic, and p-value.
- Interpret Results:
- If the p-value for the advertising spend coefficient is less than the chosen significance level (e.g., 0.05), reject the null hypothesis. This indicates that advertising spending significantly affects sales revenue.
- Additionally, consider the effect size, which in this context is the magnitude of the 𝛽1 coefficient. This shows the practical impact of advertising spending on sales revenue.
For example, suppose the regression output shows that 𝛽1 = 1.5, with a p-value of 0.02 and a 95% confidence interval of [0.3, 2.7]. This indicates that each dollar spent on advertising is associated with a $1.50 increase in sales revenue, and this effect is statistically significant.
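These steps can be reproduced end to end in Python with statsmodels; the data below is simulated so that the true coefficient is close to the $1.50 used in the example:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)

# Step 2: collect (here, simulate) advertising spend and sales revenue
ad_spend = rng.uniform(500, 5_000, size=80)
sales_revenue = 10_000 + 1.5 * ad_spend + rng.normal(0, 2_500, size=80)

# Step 3: fit the regression  Sales Revenue = b0 + b1 * Advertising Spend + e
X = sm.add_constant(ad_spend)
results = sm.OLS(sales_revenue, X).fit()

# Steps 4-5: review the coefficient, its p-value, and its confidence interval
beta1 = results.params[1]
p_value = results.pvalues[1]
ci_low, ci_high = results.conf_int(alpha=0.05)[1]
print(f"beta1 = {beta1:.2f}, p = {p_value:.4f}, "
      f"95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```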
Next Level: Bayesian Approaches to Significance
Bayesian statistics offers an alternative approach to significance testing that incorporates prior knowledge or beliefs into the analysis. Unlike traditional methods that rely solely on the data at hand, Bayesian methods combine prior information with current data to update the probability of a hypothesis being true. This approach provides a more flexible framework for decision-making, particularly in situations where prior information is available or when dealing with complex models.
Bayes' theorem formalizes this update:
P(H∣E) = P(E∣H) × P(H) / P(E)
Where:
- P(H∣E) is the posterior probability: the probability of hypothesis 𝐻 given the evidence 𝐸. Updated probability of the hypothesis after considering the new evidence.
- P(E∣H) is the likelihood: the probability of evidence 𝐸 given that hypothesis 𝐻 is true. How likely the observed data is under different hypotheses.
- P(H) is the prior probability: the initial probability of hypothesis 𝐻 before seeing the evidence. Represents what is known before observing the data.
- P(E) is the marginal likelihood: the total probability of evidence 𝐸 under all possible hypotheses.
Consider a scenario where a business wants to evaluate the impact of a new advertising campaign on sales. Using Bayesian methods, the analyst can utilize data from previous campaigns to form an initial understanding (prior probability). As new sales data are gathered, the analyst updates the likelihood of the current campaign’s success (posterior probability). Bayesian methods provide a framework for continually refining predictions, allowing businesses to make informed decisions based on the most current information.
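A minimal sketch of this kind of updating, using a Beta-Binomial model for the campaign's conversion rate; the prior parameters and conversion counts are made up for illustration:

```python
from scipy import stats

# Prior from previous campaigns: conversion rate around 5% (Beta(5, 95))
prior_alpha, prior_beta = 5, 95

# New evidence from the current campaign: 70 conversions out of 1,000 visitors
conversions, visitors = 70, 1_000

# Posterior: Beta(prior_alpha + conversions, prior_beta + non-conversions)
post_alpha = prior_alpha + conversions
post_beta = prior_beta + (visitors - conversions)
posterior = stats.beta(post_alpha, post_beta)

print(f"Posterior mean conversion rate: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.ppf(0.025):.3f} - {posterior.ppf(0.975):.3f}")
print(f"P(conversion rate > 5%): {posterior.sf(0.05):.3f}")
```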
For further reading on Bayesian methods and their practical applications, see my article “Frequentist vs Bayesian”.