Recently, I was preparing to send an important bottom-of-funnel (BOFU) email to our audience. I had two subject lines and couldn’t decide which one would perform better.
Of course, I thought, “Let’s do an A/B test!” However, our email marketer quickly pointed out a caveat that I hadn’t considered: our list might be too small to produce a trustworthy result.
This seemed counterintuitive at first. Surely 5,000 subscribers was enough to run a simple test between two subject lines?
This conversation took me down a fascinating rabbit hole into the world of statistical significance and why it is so important in marketing decisions.
While tools like HubSpot’s free statistical significance calculator can make the calculation easier, understanding what they calculate and how it impacts your strategy is invaluable.
Below, I’ll break down statistical significance with a real-world example and give you the tools to make smarter, data-driven decisions in your marketing campaigns.
What is statistical significance?
In marketing, statistical significance occurs when the results of your research show that the relationships between the variables you tested (like conversion rate and landing page type) are not random; they influence each other.
Why is statistical significance important?
Statistical significance is like a truth detector for your data. It helps you determine whether the difference between two options – such as your subject lines – is real or just a coincidence.
Think of it like flipping a coin. If you flip five times and get heads four times, does that mean your coin is biased? Probably not.
But if you flip it 1,000 times and get heads 800 times, you might be on the right track.
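Curious how lopsided those odds really are, here’s a quick illustrative sketch (my own addition, using SciPy’s binomial distribution) comparing the two scenarios:

```python
from scipy.stats import binom

# P(at least 4 heads in 5 flips of a fair coin): sf(3) = P(X >= 4)
p_five_flips = binom.sf(3, n=5, p=0.5)

# P(at least 800 heads in 1,000 flips of a fair coin)
p_thousand_flips = binom.sf(799, n=1000, p=0.5)

print(f"{p_five_flips:.4f}")     # 0.1875 -- happens by chance about 1 in 5 times
print(f"{p_thousand_flips:.1e}") # astronomically small
```

Four heads in five flips happens by chance almost one time in five; 800 heads in 1,000 flips essentially never does.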
That is the role of statistical significance: it separates coincidences from meaningful patterns. This is exactly what our email expert was trying to explain when I suggested A/B testing our subject lines.
Just like the coin toss example, she pointed out that a seemingly significant difference — a 2% gap in open rates, for example — may not tell the whole story.
We needed to understand statistical significance before we could make decisions that could impact our overall email strategy.
She then walked me through her testing process:
- Group A would receive subject line A and group B would receive subject line B.
- She would track open rates for both groups, compare the results, and declare a winner.
“Seems straightforward, right?” she asked. Then she revealed where things get tricky.
She showed me a scenario: Imagine Group A had an open rate of 25% and Group B had an open rate of 27%. At first glance, it looks like subject line B performed better. But can we trust this result?
What if the difference was just due to chance and not because subject line B was actually better?
This question led me down a fascinating path to understanding why statistical significance is so important in marketing decisions. Here’s what I discovered:
Here’s why statistical significance is important
- Sample size affects reliability: My initial assumption that our 5,000 subscribers would be enough was wrong. Split evenly between the two groups, each subject line would only be tested on 2,500 people. With an average open rate of 20%, we would only see around 500 opens per group. That isn’t a big number when you’re trying to spot small differences like a 2% gap. The smaller the sample, the higher the chance that random variability will skew your results.
- The difference may not be real: That opened my eyes. Even if subject line B was opened 10 more times than subject line A, that doesn’t mean it’s definitely better. A statistical significance test can tell us whether that difference is real or could have occurred by chance.
- Making a wrong decision is costly: This really hit home. If we incorrectly concluded that subject line B was better and used it in future campaigns, we could miss opportunities to reach our audience more effectively. Worse, we could waste time and resources implementing a strategy that isn’t actually working.
Through my research, I have found that statistical significance helps avoid reacting to a possible coincidence. It asks a crucial question: “If we were to repeat this test 100 times, how likely is it that we would see the same difference in results?”
If the answer is “very likely,” you can trust the result. If not, it’s time to rethink your approach.
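To make the sample-size lesson concrete, here is a rough sketch in Python using the standard two-proportion sample-size approximation. The 20% vs. 22% open rates and the 80% power target are my illustrative assumptions, not figures from our actual test:

```python
from scipy.stats import norm

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate recipients needed per group to detect open rates
    p1 vs. p2 with a two-sided test at the given alpha and power."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Detecting a 20% vs. 22% open rate (a 2-point gap):
print(round(sample_size_per_group(0.20, 0.22)))  # roughly 6,500 per group
```

Under these assumptions, you would need roughly 6,500 recipients per group (about 13,000 total) to reliably detect a 2-point gap, which is why splitting a 5,000-subscriber list in half falls short.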
Although I was eager to learn the statistical calculations, I first needed to understand a more fundamental question: When should we even do these tests?
How to Test Statistical Significance: My Quick Decision Framework
When deciding whether to conduct a test, use this decision framework to assess whether it is worth the time and effort. Here’s how I break it down.
Run tests when:
- You have a sufficient sample size. The test can reach statistical significance based on the number of users or recipients.
- The change could impact business metrics. For example, testing a new call to action could directly improve conversions.
- You can wait out the entire test period. Impatience can lead to inconclusive results. I always make sure the test has enough time to run its course.
- The difference would justify the implementation costs. If the results lead to meaningful ROI or reduced resource costs, it’s worth testing.
Do not run the test if:
- The sample size is too small. Without sufficient data, the results are neither reliable nor actionable.
- You need immediate results. If a decision is urgent, testing may not be the best approach.
- The change is minimal. Testing small optimizations, such as moving a button a few pixels, often requires enormous sample sizes to produce meaningful results.
- The implementation costs exceed the potential benefits. If the resources required to implement the winning version outweigh the expected benefits, testing is not worth it.
Test prioritization matrix
If you’re juggling multiple testing ideas, I recommend using a prioritization matrix to focus on high-impact opportunities.
High priority tests:
- High-traffic pages. These pages offer the largest sample sizes and the fastest path to significance.
- Key conversion points. Test areas like registration forms or checkout processes that directly impact sales.
- Revenue-generating elements. Headlines, CTAs, or offers that encourage purchases or subscriptions.
- Touchpoints for customer acquisition. Email subject lines, ads, or landing pages that influence lead generation.
Low priority tests:
- Low-traffic pages. These pages take much longer to produce actionable results.
- Smaller design elements. Small stylistic changes often don’t move the needle enough to warrant testing.
- Non-revenue pages. Pages or blog posts without a direct tie to conversions may not require extensive testing.
- Secondary metrics. Testing for vanity metrics like time on page may not align with business goals.
This framework ensures you focus your efforts where they matter most.
But that led to my next big question: How do you actually determine statistical significance once you’ve decided to run a test?
Although the math may sound intimidating, fortunately there are simple tools and methods to get accurate answers. Let’s break it down step by step.
How to calculate and determine statistical significance
- Decide what you want to test.
- Determine your hypothesis.
- Start collecting your data.
- Calculate chi-square results.
- Calculate your expected values.
- See how your results differ from your expectations.
- Find your total.
- Interpret your results.
- Determine statistical significance.
- Report statistical significance to your team.
1. Decide what you want to test.
The first step is to figure out what you want to test. This could be:
- Comparison of conversion rates on two landing pages with different images.
- Test click rates for emails with different subject lines.
- Evaluate conversion rates on various call-to-action buttons at the end of a blog post.
The possibilities are endless, but simplicity is key. Start with a specific piece of content you want to improve and set a clear goal – for example, increasing conversion rates or views.
While you can try more complex approaches like testing multiple variants (multivariate testing), I recommend starting with a simple A/B test. In this example, I compare two variants of a landing page with the aim of increasing conversion rates.
Pro tip: If you’re interested in the difference between A/B and multivariate testing, check out this guide to A/B vs. multivariate testing.
2. Determine your hypothesis.
When it comes to A/B testing, our email expert always makes it a point to start with a clear hypothesis. She explained that a hypothesis helps focus the test and ensure meaningful results.
In this case, since we are testing two email subject lines, the hypothesis could be: “Subject line B will achieve a higher open rate than subject line A.”
Another important step is to set a confidence level before starting the test. A 95% confidence level is standard for most tests because it ensures that the results are statistically reliable and not just based on random chance.
This structured approach makes it easier to interpret your results and take meaningful action.
3. Start collecting your data.
Once you’ve determined what you want to test, it’s time to start collecting data. Since the goal of this test is to find out which subject line performs better for future campaigns, you need to choose an appropriate sample size.
For email, this might mean dividing your list into random sample groups and sending each group a different subject line variation.
For example, if you’re testing two subject lines, split your list evenly and randomly to ensure both groups are comparable.
Determining the correct sample size can be difficult because it varies for each test. A good rule of thumb is to aim for an expected value greater than 5 in every cell of your results table.
This ensures that your results are statistically valid. (I’ll explain how the expected values are calculated below.)
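For illustration, here is a minimal Python sketch of the random, even split described above (the subscriber addresses are placeholders):

```python
import random

# Placeholder subscriber list; in practice this comes from your email platform.
subscribers = [f"user{i}@example.com" for i in range(5000)]

random.seed(42)               # fixed seed so the split is reproducible
random.shuffle(subscribers)   # shuffle first so the groups are comparable

midpoint = len(subscribers) // 2
group_a = subscribers[:midpoint]   # receives subject line A
group_b = subscribers[midpoint:]   # receives subject line B

print(len(group_a), len(group_b))  # 2500 2500
```

Shuffling before splitting is what makes the groups comparable; splitting a list that is sorted alphabetically or by signup date can bake hidden bias into your test.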
4. Calculate the chi-square results.
As I researched how we could analyze our email testing results, I discovered that while there are several statistical tests, the Chi-Square test is particularly well-suited to A/B testing scenarios like ours.
This made perfect sense for our email testing scenario. A chi-square test is used for discrete data, which simply means that the results fall into distinct categories.
In our case, an email recipient will either open the email or not open it – there is no middle ground.
A key concept I needed to understand was the confidence level (and the corresponding alpha of the test). A 95% confidence level is standard, meaning there is only a 5% chance (alpha = 0.05) that the observed relationship is due to chance.
For example, “The results are statistically significant at the 95% confidence level” indicates that the alpha was 0.05, meaning the probability that the results are due to chance is 1 in 20.
My research showed that organizing the data into a simple table is the best place to start.
Since I’m testing two variants (subject line A and subject line B) and two results (opened, not opened), I can use a 2×2 table:
| Result | Subject line A | Subject line B | Total |
|---|---|---|---|
| Open | X (e.g. 125) | Y (e.g. 135) | X + Y |
| Not open | Z (e.g. 375) | W (e.g. 365) | Z + W |
| Total | X + Z | Y + W | N |
This makes it easier to visualize the data and calculate your chi-square results. The totals for each column and row provide a clear overview of the overall results and prepare you for the next step: running the actual test.
While tools like HubSpot’s A/B testing kit can calculate statistical significance for you automatically, understanding the underlying process helps you make better testing decisions. Let’s look at how these calculations actually work:
Performing the chi-square test
After organizing my data into a chart, the next step is to calculate statistical significance using the chi-square formula.
This is what the formula looks like:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
In this formula:
- Σ means to sum (add) all calculated values.
- O represents the observed (actual) values of your test.
- E represents the expected values that you calculate based on the totals in your chart.
To use the formula:
- Subtract the expected value (E) from the observed value (O) for each cell in the chart.
- Square the result.
- Divide the squared difference by the expected value (E).
- Repeat these steps for all cells, then sum all the results (that’s the Σ) to get your chi-square value.
This calculation tells you whether the differences between your groups are statistically significant or likely due to chance.
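To show what those four steps look like in practice, here is a plain-Python sketch using the example counts from the 2×2 table above (125 and 135 opens, 375 and 365 not opened):

```python
# Observed counts from the example 2x2 table
observed = {
    ("open", "A"): 125, ("open", "B"): 135,
    ("not_open", "A"): 375, ("not_open", "B"): 365,
}

n = sum(observed.values())  # 1,000 emails in total

# Row and column totals
row_totals = {r: sum(v for (rr, _), v in observed.items() if rr == r)
              for r in ("open", "not_open")}
col_totals = {c: sum(v for (_, cc), v in observed.items() if cc == c)
              for c in ("A", "B")}

# Expected value per cell, then the per-cell (O - E)^2 / E contributions
chi_square = 0.0
for (r, c), o in observed.items():
    e = row_totals[r] * col_totals[c] / n
    chi_square += (o - e) ** 2 / e

print(round(chi_square, 2))  # 0.52
```

With these example counts the chi-square value comes out to about 0.52; the following steps explain how to compare a value like this against a critical value to judge significance.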
5. Calculate your expected values.
Now it’s time to calculate the expected values (E) for each result in your test. The expected values represent what we would see if there were no connection between the subject line and whether an email is opened – that is, if both variants (A and B) had proportional open rates.
Let’s assume:
- Total number of emails sent = 5,000
- Total opens = 1,000 (20% open rate)
- Subject line A was sent to 2,500 recipients.
- Subject line B was also sent to 2,500 recipients.
To organize the data in a table:
| Result | Subject line A (O) | Subject line B (O) | Total |
|---|---|---|---|
| Open | 550 | 450 | 1,000 |
| Not open | 1,950 | 2,050 | 4,000 |
| Total | 2,500 | 2,500 | 5,000 |
Expected values (E):
To calculate the expected value for each cell, use this formula:
$$E = \frac{\text{row sum} \times \text{column sum}}{\text{total sum}}$$
For example, to calculate the expected number of opens for subject line A:
$$E = \frac{1{,}000 \times 2{,}500}{5{,}000} = 500$$
Repeat this calculation for each cell:
| Result | Subject line A (E) | Subject line B (E) | Total |
|---|---|---|---|
| Open | 500 | 500 | 1,000 |
| Not open | 2,000 | 2,000 | 4,000 |
| Total | 2,500 | 2,500 | 5,000 |
These expected values now represent the baseline that you use in the chi-square formula to compare with the observed values.
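As a quick sanity check, the same expected-value formula can be applied to every cell in a few lines of Python:

```python
row_totals = {"open": 1000, "not_open": 4000}
col_totals = {"A": 2500, "B": 2500}
grand_total = 5000

# Expected value per cell: (row sum x column sum) / total sum
expected = {
    (row, col): rt * ct / grand_total
    for row, rt in row_totals.items()
    for col, ct in col_totals.items()
}
print(expected)
# {('open', 'A'): 500.0, ('open', 'B'): 500.0,
#  ('not_open', 'A'): 2000.0, ('not_open', 'B'): 2000.0}
```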
6. See how your results differ from your expectations.
To calculate the chi-square value, compare the observed frequencies (O) to the expected frequencies (E) in every cell of your table. The formula for each cell is:
$$\frac{(O - E)^2}{E}$$
Steps:
- Subtract the expected value from the observed value.
- Square the result so that positive and negative differences don’t cancel out.
- Divide this squared difference by the expected value.
- Sum all the results for each cell to get your overall chi-square value.
Let’s work through the data from the previous example:
| Result | Subject line A (O) | Subject line B (O) | Subject line A (E) | Subject line B (E) | (O − E)² / E |
|---|---|---|---|---|---|
| Open | 550 | 450 | 500 | 500 | (550 − 500)² / 500 = 5 |
| Not open | 1,950 | 2,050 | 2,000 | 2,000 | (1,950 − 2,000)² / 2,000 = 1.25 |
Now sum the (O − E)² / E values for all four cells. Subject line B’s cells contribute the same amounts here, since (450 − 500)² / 500 = 5 and (2,050 − 2,000)² / 2,000 = 1.25:

$$\chi^2 = 5 + 5 + 1.25 + 1.25 = 12.5$$
This is your overall chi-square value, which indicates how much the observed results differ from expectations.
What does this value mean?
You now compare this chi-square value with a critical value from a chi-square distribution table, based on your degrees of freedom ((number of rows − 1) × (number of columns − 1)) and your confidence level. If your value exceeds the critical value, the difference is statistically significant.
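Instead of reading a printed chi-square distribution table, you can also look up the critical value programmatically; a minimal sketch with SciPy:

```python
from scipy.stats import chi2

df = 1             # (2 rows - 1) * (2 columns - 1) for a 2x2 table
confidence = 0.95  # 95% confidence level (alpha = 0.05)

critical_value = chi2.ppf(confidence, df)
print(round(critical_value, 2))  # 3.84
```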
7. Find your total.
Finally, I sum the results of all the cells in the table to get my chi-square value. This value represents the overall difference between the observed and expected results.
Using the earlier example:
| Result | (O − E)² / E for subject line A | (O − E)² / E for subject line B |
|---|---|---|
| Open | 5 | 5 |
| Not open | 1.25 | 1.25 |
$$\chi^2 = 5 + 5 + 1.25 + 1.25 = 12.5$$
Compare your chi-square value with the distribution table.
To determine whether the results are statistically significant, I compare the chi-square value (12.5) to a critical value from a chi-square distribution table based on:
- Degrees of freedom (df): This is determined by (number of rows − 1) × (number of columns − 1). For a 2×2 table, df = 1.
- Alpha (α): The significance level of the test. At an alpha of 0.05 (95% confidence), the critical value for df = 1 is 3.84.
In this case:
- Chi-square value = 12.5
- Critical value = 3.84
Since 12.5 > 3.84, the results are statistically significant. This indicates that there is a relationship between subject line and open rate.
If the chi-square value were lower…
For example, if the chi-square value had been 0.95, it would fall below 3.84, meaning the results would not be statistically significant. This would suggest that there is no meaningful relationship between subject line and open rate.
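If you would rather not do the arithmetic by hand, SciPy can run the entire test directly on the observed counts. One caveat: pass `correction=False`, because SciPy otherwise applies Yates’ continuity correction to 2×2 tables and returns a slightly different value than the hand calculation:

```python
from scipy.stats import chi2_contingency

observed = [[550, 450],      # opens: subject line A, subject line B
            [1950, 2050]]    # not opened: subject line A, subject line B

chi2_value, p_value, df, expected = chi2_contingency(observed, correction=False)

print(round(chi2_value, 2))  # 12.5, matching the hand calculation
print(round(p_value, 5))     # 0.00041, well below alpha = 0.05
print(expected)              # [[500., 500.], [2000., 2000.]]
```

A p-value of roughly 0.0004 means a gap this large would show up by chance only about 4 times in 10,000 repeats, comfortably clearing the 95% bar.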
8. Interpret your results.
As I delved deeper into statistical testing, I realized that correctly interpreting the results is just as important as administering the tests themselves. Through my research, I discovered a systematic approach to evaluating test results.
Strong results (act immediately)
Results are considered meaningful and actionable if they meet these key criteria:
- 95%+ confidence level. The results are statistically significant and the risk that they are due to chance is minimal.
- Consistent results across all segments. Performance remains consistent across different user groups or populations.
- A clear winner emerges. One version consistently outperforms the other.
- Corresponds to business logic. The results are consistent with expectations or reasonable business assumptions.
If the results meet these criteria, the best course of action is to act quickly: implement the winning variant, document what worked, and plan follow-up tests for further optimization.
Weak results (more data needed)
On the other hand, results are typically considered weak or inconclusive if they have the following characteristics:
- Below 95% confidence level. The results do not meet the threshold for statistical significance.
- Inconsistent across segments. One version performs well with certain groups but poorly with others.
- No clear winner. Both variants show similar performance with no significant difference.
- Contradicts previous tests. The results differ from previous experiments without a clear explanation.
In these cases, the recommended approach is to collect more data by retesting with a larger sample or extending the test duration.
Decision tree for next steps
My research provided a practical decision-making framework for determining next steps after interpreting the results.
If the results are significant:
- Implement the winning version. Roll out the better-performing variant.
- Document findings. Write down what worked and why for future reference.
- Schedule follow-up tests. Build on success by testing related elements (e.g. test headings if subject lines work well).
- Scale to similar areas. Apply insights to other campaigns or channels.
If the results are not significant:
- Continue with the current version. Stick with the existing design or content.
- Plan a larger sample test. Revisit the test with a larger audience to validate the results.
- Test major changes. Experiment with more dramatic variations to increase the likelihood of a measurable impact.
- Focus on other options. Redirect resources to higher priority tests or initiatives.
This systematic approach ensures that every test, whether significant or not, brings valuable insights into the optimization process.
9. Determine statistical significance.
Through my research, I have found that determining statistical significance comes down to understanding how to interpret the chi-square value. Here’s what I learned.
Two key factors determine statistical significance:
- Degrees of freedom (df). This is calculated based on the number of categories in the test. For a 2×2 table, df=1.
- Critical value. This is determined by the confidence level (e.g. 95% confidence has an alpha of 0.05).
Compare values:
The process turned out to be quite straightforward: you compare your calculated chi-square value with the critical value from a chi-square distribution table. For example, with df=1 and a confidence level of 95%, the critical value is 3.84.
What the numbers tell you:
- If your chi-square value is greater than or equal to the critical value, your results are statistically significant. This suggests that the observed differences are real and not due to random chance.
- If your chi-square value is below the critical value, your results are not statistically significant, suggesting that the differences you observe could be due to chance.
What happens if the results are not significant? Through my research, I learned that non-significant results are not failures – they are common and provide valuable insight. Here’s what I discovered about handling such situations.
Check the test setup:
- Was the sample size sufficient?
- Were the variations clear enough?
- Did the test run long enough?
Making decisions with insignificant results:
If the results are inconclusive, there are several productive paths forward.
- Run another test with a larger sample.
- Test for more dramatic variations that may show more noticeable differences.
- Use the data as a basis for future experiments.
10. Report statistical significance to your team.
After you run your experiment, it’s important to communicate the results to your team so everyone understands the results and agrees on next steps.
Using the email subject line example, here’s how I would approach reporting.
- If the results are not significant: I would inform my team that the test showed no statistically significant difference between the two subject lines. This means the choice of subject line is unlikely to impact the open rates of future campaigns. We could either repeat the test with a larger sample or continue with either subject line.
- If the results are significant: I would report that subject line A performed significantly better than subject line B at the 95% confidence level. Based on this result, we should use subject line A for our upcoming campaign to maximize open rates.
When reporting your results, here are some best practices.
- Use clear visuals: Add a summary table or chart that compares observed and expected values alongside the calculated chi-square value.
- Explain the implications: Go beyond the numbers to clarify how the results will influence future decisions.
- Suggest next steps: Whether you’re implementing the winning variant or planning follow-up testing, make sure your team knows what to do.
By presenting results in a clear and actionable way, you help your team make data-driven decisions with confidence.
From Simple Test to Statistical Journey: What I Learned About Data-Driven Marketing
What started as a simple desire to test two email subject lines led me down a fascinating path into the world of statistical significance.
While my initial instinct was to simply split our audience and compare the results, I realized that making truly data-driven decisions requires a more nuanced approach.
Three key insights changed the way I think about A/B testing:
First, sample size is more important than I initially thought. What seems like a large enough audience (even 5,000 subscribers!) may not give you reliable results, especially if you’re looking for small but meaningful differences in performance.
Second, statistical significance is not just a mathematical hurdle – it is a practical tool that helps avoid costly mistakes. Without it, we risk developing scaling strategies that rely more on chance than real improvement.
Finally, I learned that “failed” tests aren’t failures at all. Even if the results are not statistically significant, they provide valuable insights that help design future experiments and prevent us from wasting resources on minimal changes that make no difference.
This journey gave me a new appreciation for the role of statistical rigor in marketing decisions.
While the math may seem intimidating at first, understanding these concepts is the difference between guessing and knowing – between hoping our marketing works and knowing it will.
Editor’s Note: This post was originally published in April 2013 and has been updated for completeness.