Mistakes People Make When A/B Testing and How to Avoid Them

Aaron Shapiro

A great way to make data-driven decisions is through A/B testing. It gives you evidence instead of speculation, so you can find out what your customers really want.

But running a reliable test isn’t as easy as creating two versions of a webpage and waiting for a winner. If you fall into the typical A/B testing traps, your results may not be valid, and you could end up making bad business decisions based on bad data.

Knowing these pitfalls is the first step toward building a strong experimentation program. A well-run test gives clear, useful information; a poorly run test creates confusion and wastes precious resources.

This guide lists the most common A/B testing mistakes and gives clear, concrete ways to avoid them, so you can run experiments that produce accurate results and help your business grow.

1. Testing Without a Reason

When you run experiments without a specific hypothesis, you often get random, inconclusive findings. It’s easy to test things on a whim, like launching a new headline just to “see what happens,” but this approach rarely produces useful insight.

Why it happens: Eagerness for quick wins, or pressure to “just test something,” can crowd out strategic planning.

Why it’s wrong: Without a clear hypothesis, you can’t explain why something worked (or didn’t), or measure success in the first place.

How to fix it:

  • Always start with a precise, testable hypothesis, like “We think that changing X will make Y go up because of Z.”  
  • Make sure every test is tied to a goal, not just an idea tested for the sake of activity.
  • Write down your hypothesis in your pre-launch checklist so that it is clear and in line with the rest of your work.

2. Stopping the Test Before It Reaches Statistical Significance

One of the worst things you can do in A/B testing is stop a test as soon as one version appears to be ahead. Early trends can be deceiving, and without enough data your “winner” may not be real.

Why it happens: Teams may be tempted to end experiments too soon because they are excited about the early results and feel pressure to proceed swiftly.

Why it’s wrong: A sample size or test duration that is too small leads to inaccurate conclusions and raises the chance of false positives and false negatives.

How to fix it:

  • Before you launch, calculate the required sample size and commit to it (see the sketch after this list).
  • Set a minimum test length, such as one or two full business cycles, to account for day-to-day and week-to-week swings.
  • Before drawing any conclusions, use your post-launch checklist to confirm the results are statistically significant (usually 95% confidence or higher).
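
For illustration, here is one way to estimate the required sample size up front using the standard two-proportion approximation. This is a minimal Python sketch, not a substitute for your testing tool’s calculator; the 5% baseline rate and 1-point lift are placeholder assumptions, so substitute your own numbers.

```python
# Rough per-variation sample size for a two-proportion A/B test.
from scipy.stats import norm

def sample_size_per_variation(baseline, lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect an absolute
    `lift` in conversion rate at the given significance and power."""
    p1, p2 = baseline, baseline + lift
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test at 95% confidence
    z_beta = norm.ppf(power)            # 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

# Example: 5% baseline conversion rate, aiming to detect a 1-point lift to 6%.
print(sample_size_per_variation(0.05, 0.01))  # about 8,150 visitors per variation
```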

3. Trying to test too many things at once

If you modify more than one thing in a single test, such as headlines, graphics, and CTAs, you can’t tell which modification produced the outcome.

Why it happens: Teams eager to maximize learning, or those bundling several updates together, often overload a single test.

Why it’s wrong: Changing several variables at once makes it unclear which one actually drove the change in performance.

How to fix it:

  • When you can, just alter one variable at a time.
  • If you need to test more than one thing, think about using a structured multivariate test (MVT) and make sure you have enough traffic.
  • For the sake of transparency, write out exactly what is changing in your pre-launch checklist.

4. Not paying attention to differences between segments

An aggregate result can hide how different groups of users react. Desktop users might respond well while mobile users don’t, or new visitors might behave differently from returning customers.

Why it happens: When you’re short on time, it’s tempting to look only at the aggregate result.

Why it’s wrong: Without segment-level insight, you can ship changes that help some customers but hurt others.

How to fix it:

  • Decide ahead of time which user segments (device, new vs. returning, geography) you will analyze.
  • After the test, break the results out by segment to spot differences or trade-offs, as in the sketch after this list.
  • Add segmentation analysis to your post-launch checklist.
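
As a rough illustration of what that segment breakdown might look like, here is a minimal Python sketch using pandas. The file name and the variant, device, and converted columns are placeholder assumptions, not the export format of any particular testing tool.

```python
# Break conversion rates out by variant and device segment.
import pandas as pd

results = pd.read_csv("ab_test_results.csv")  # hypothetical raw export

by_segment = (
    results
    .groupby(["variant", "device"])["converted"]
    .agg(visitors="count", conversions="sum", rate="mean")
)
print(by_segment)
# A variant that wins overall but loses on mobile shows up here,
# not in the blended top-line number.
```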

5. Ignoring Outside Factors

Tests run during holidays, sales, or sudden traffic spikes can produce results that won’t repeat under normal conditions.

Why it happens: Tests are sometimes launched without checking the broader marketing calendar or other unusual circumstances.

Why it’s wrong: External factors can distort user behavior during the test, leading to conclusions that don’t hold up later.

How to fix it:

  • Before scheduling tests, look for large campaigns, promotions, or events that are out of the ordinary.
  • Note anything unusual that happens during your test, factor it into your analysis, and consider rerunning the test if needed.
  • Add a step to your pre-launch checklist to review upcoming external events.

6. Not writing down and sharing results

When test results and lessons learned aren’t recorded and shared, the organization loses vital knowledge. Teams repeat the same mistakes or miss chances to build on what they’ve learned before.

Why it happens: Teams move quickly and think about “what’s next” instead of what they’ve done.

Why it’s wrong: Not keeping track of results makes it harder for organizations to learn and slows down their efforts to improve in the future.

How to fix it:

  • Make a simple template to write down your hypothesis, the changes you tested, the findings, the statistical significance, and what you learned.
  • Keep results in a place where the whole team can search and get to.
  • Make sure to include documentation on your post-launch checklist.

7. Seeing testing as a one-time project

Long-term growth is hurt when people see A/B testing as a one-time event instead of an ongoing process.

Why it happens: After a few tests, teams deprioritize experimentation and only return to it when things go wrong.

Why it’s wrong: To keep making progress, learn new things, and adjust to how customers act, experimentation should never stop.

How to fix it:

  • Make A/B testing a regular part of your job and set up regular test cycles.
  • Keep an eye on how things are doing over time and check for trends across different experiments.
  • Make testing and learning a regular part of your marketing strategy.

8. Declaring winners based on small increases

A higher conversion rate doesn’t automatically mean you have a winner. To be reliable, the observed difference must be statistically significant.

Why it happens: Teams compare the raw conversion rates of each variation and crown the higher number the winner, without accounting for chance.

Why it’s wrong: If a test doesn’t reach statistical significance, you can’t be sure the result is real; you are making a business decision based on a coin flip.

How to fix it:

  • Set a minimum level of confidence before you start (95% is the industry norm).
  • Don’t declare a winner until your testing tool reports statistical significance at the level you set; if the results aren’t significant, treat them as inconclusive. A minimal significance check is sketched below.
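
For example, a quick way to sanity-check a lift before declaring a winner is a two-proportion z-test. The sketch below uses statsmodels; the conversion counts and visitor numbers are invented purely for illustration.

```python
# Minimal significance check for two conversion rates.
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 530]       # variation A, variation B (made-up numbers)
visitors = [10_000, 10_000]

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"p-value: {p_value:.3f}")

alpha = 0.05  # 95% confidence threshold chosen before the test
if p_value < alpha:
    print("Statistically significant -- safe to call a winner.")
else:
    print("Inconclusive -- treat the lift as noise for now.")
```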

Checklists for A/B Testing Best Practices

Pre-Launch Checklist

  • Is the test hypothesis well-defined and recorded?
  • Is the hypothesis easy to understand and test?
  • Are you only changing one thing, or is your design set up to make it clear how different things work together?
  • Is your main metric linked to an important company goal?
  • Have you figured out how many samples you need and how long the test should last?
  • Has the test been checked for quality on all major browsers and devices?
  • Is QA done and analytics tracking checked for all versions?

Post-Launch Checklist

  • Did the test go as long and have as many people as planned?
  • Has statistical significance (95% confidence or greater) been reached and checked?
  • Were the results for each segment (device, new/returning) looked at for differences?
  • Were there any mistakes in tracking or outside events that could have changed the data?
  • Have you written down and communicated the results, insights, and next steps with your team?

Common Questions (FAQ)

How long should I keep testing?

Run the test long enough to reach your target sample size and span at least one full business cycle (typically seven days). For most organizations, a 14-day test is a safe default because it smooths out daily fluctuations and captures different kinds of user behavior.

Is it ever acceptable to use a confidence level lower than 95%, such as 85% or 90%?

95% is the standard for a reason, but a lower confidence level may be acceptable for low-risk decisions. If you are evaluating a minor change where the cost of being wrong is small, an 85% confidence level can provide a useful directional signal. For any high-impact test, such as a redesign of the checkout sequence, always require 95% confidence or higher.

What should I do if my test results don’t give me a clear answer?

A result that is not statistically significant is not a failure; it’s a chance to learn. It tells you that the change you made wasn’t impactful enough to shift how people use the site. That could mean your hypothesis was wrong or the change was too subtle. Use that information to form a new, bolder hypothesis for your next test.

Conclusion

To create a culture of experimentation that gets genuine results, you need to stay away from these frequent A/B testing mistakes. If you want to go over the basics again, our CRO Statistics Foundations tutorial can help you remember the statistical ideas that make testing reliable.


A Practical Guide to Statistics for Marketers

Aaron Shapiro

You don’t need an advanced degree to make data-driven marketing decisions.

However, a basic grasp of statistics is essential for correctly interpreting campaign results, understanding A/B tests, and drawing reliable conclusions from your analytics. Without it, you risk acting on misleading data, cutting a winning test short, or investing in a strategy that only appeared to work by random chance.

This can lead to costly errors. You might scale a campaign that wasn’t truly effective or claim a victory that was just statistical noise. The good news is that a few core concepts are all you need to avoid these common traps.

This guide is your practical introduction to statistics for marketers. We will cover the essential concepts you need to run smarter, more effective campaigns—no complex equations, just straightforward explanations to help you build confidence in your data.

What We’ll Cover:

  • Why sample size in marketing tests is critical
  • Understanding confidence levels in A/B testing
  • The difference between a real result and random noise
  • How to interpret p-values without the jargon
  • Avoiding the correlation vs. causation trap
  • Why averages can sometimes hide the truth

1. Sample Size: Why More Data Leads to More Trust

One of the most frequent mistakes in marketing analytics is drawing conclusions from a small data set. When your sample size is too low, random fluctuations can create extreme results that aren’t sustainable or real.

Imagine you launch a new ad campaign. On the first day, ten people click through, and four make a purchase. That’s a 40% conversion rate. While impressive, it’s highly unlikely that four out of every ten people will convert. The sample is just too small to be reliable.

As you gather more data, the numbers will almost always regress toward a more realistic, stable average. After collecting 2,000 clicks, you might find that 80 people converted. Your conversion rate is now 4%, a far more accurate and trustworthy metric for forecasting.

Key takeaway: Avoid making decisions until you have collected enough data to minimize the impact of randomness. Dramatic swings in performance with small samples are common and often misleading. For A/B testing, a general rule is to aim for at least a few hundred conversions per variation to ensure your results have a stable foundation.
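
To see this numerically, the sketch below uses statsmodels with the same made-up figures as the example above (4 of 10 clicks versus 80 of 2,000) to show how the confidence interval around a conversion rate narrows as the sample grows.

```python
# Why small samples are untrustworthy: the 95% confidence interval
# around a conversion rate shrinks as data accumulates.
from statsmodels.stats.proportion import proportion_confint

for conversions, clicks in [(4, 10), (80, 2_000)]:
    low, high = proportion_confint(conversions, clicks, alpha=0.05, method="wilson")
    rate = conversions / clicks
    print(f"{conversions}/{clicks}: rate {rate:.1%}, 95% CI {low:.1%} to {high:.1%}")

# 4/10    -> rate 40.0%, but the plausible range spans roughly 17% to 69%
# 80/2000 -> rate 4.0%, with a much tighter range of roughly 3.2% to 5.0%
```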

2. Confidence Levels and Statistical Significance Explained

These two concepts work together to tell you if your A/B test results are dependable. They act as a filter, helping you separate a true change in user behavior from random chance.

Confidence Levels in A/B Testing Explained

A confidence level tells you how certain you can be that your results are not a fluke. In marketing and web optimization, a 95% confidence level is the industry standard. This means if you were to run the same test 100 times, you would see the same winning result in at least 95 of those tests. The remaining 5% represents the risk that your outcome was due to random luck.

  • Higher confidence (e.g., 99%) provides stronger proof but requires more traffic and time.
  • Lower confidence (e.g., 80-90%) can offer directional insights but carries a higher risk of being wrong.

Think of it like a weather forecast. A 95% chance of rain means you should definitely bring an umbrella. An 80% chance means you still might, but you accept a greater possibility of staying dry.


What is Statistical Significance?

Significance is the direct output of your confidence level. If your test result reaches a 95% confidence level, it is considered “statistically significant.”

Let’s say you test a new checkout button. Version A (the original) has a 10% conversion rate, and Version B (the new design) has an 11% rate. Is that 1% lift a real improvement, or is it just statistical noise? Significance testing answers that question. If the result is not statistically significant, you cannot confidently declare Version B a winner, even if its conversion rate is higher.

Key takeaway: Always test until you reach your predetermined confidence level, typically 95%. Acting on non-significant results is equivalent to making a decision based on a coin flip.

3. P-Values: A Simple Definition

The p-value is another misunderstood metric, but its purpose is quite simple. The p-value measures the probability that the results you observed were purely due to random chance.

In short, it’s the probability of a fluke.

  • A p-value of less than 0.05 (p < 0.05) is the standard for significance. It means there is less than a 5% chance that your result is random noise. This corresponds directly to a 95% confidence level.
  • A smaller p-value means stronger evidence. A p-value of 0.01 suggests only a 1% chance that the outcome was random.

It’s important to know what a p-value is not. It doesn’t tell you the probability that your winning variation is the “true” winner or how big the uplift is. It only quantifies the likelihood that random chance created the observed difference.
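
As a rough worked example, here is how that probability could be computed by hand for the checkout-button scenario above (10% vs. 11% conversion) using a pooled two-proportion z-test. The 5,000-visitor sample sizes are assumed purely for illustration.

```python
# Back-of-the-envelope p-value for a 10% vs. 11% conversion rate.
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 500, 5_000    # Version A: 10% (assumed sample size)
conv_b, n_b = 550, 5_000    # Version B: 11% (assumed sample size)

p_a, p_b = conv_a / n_a, conv_b / n_b
pooled = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))   # two-sided: probability of a fluke this large
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
# Here p comes out around 0.10 -- above 0.05, so the 1-point lift is not
# yet distinguishable from random noise at 95% confidence.
```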

4. Correlation vs. Causation: A Critical Distinction

It’s easy to assume that when two things happen at the same time, one must have caused the other. This is the classic trap of confusing correlation with causation.

  • Correlation: Two variables move in the same direction. For example, ice cream sales and sunglass sales both increase during the summer. They are correlated.
  • Causation: One event directly causes another. However, buying sunglasses doesn’t cause people to eat ice cream. The hidden factor is the warm weather, which causes both.

In a marketing context, you might see that revenue increased after you launched a new feature on your website. Did the feature cause the revenue lift? Not necessarily. Perhaps a major holiday occurred, a competitor went offline, or you were featured in a news article.

The only reliable way to prove causation is with a controlled experiment (like an A/B test), where you show the new feature to one group and not to another, keeping all other conditions the same.

5. Beyond Averages: Finding the Real Story in Your Data

The average is a useful starting point, but it can often hide important details. Relying solely on averages can lead to flawed strategies because they smooth over the nuances in customer behavior.

For example, imagine your site’s average order value (AOV) is $120. This could mean most customers spend around $120. Or, it could mean half your customers spend $40 while the other half spend $200. These two scenarios tell very different stories and call for different marketing actions. The first suggests a uniform customer base, while the second indicates distinct segments of low and high spenders.

To get the full picture, look beyond the average. Use tools like medians, distributions, and customer segments to understand your data more deeply. You may discover that new customers have a much lower AOV than returning ones, an insight that would be completely hidden by a single average.
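
Here is a small illustrative sketch of that idea; the order values are invented to mirror the $40-versus-$200 scenario described above.

```python
# How a single average can hide two very different customer groups.
from statistics import mean, median

orders = [40, 38, 42, 41, 39, 198, 202, 205, 196, 199]  # two spending clusters

print(f"mean AOV:   ${mean(orders):.2f}")    # $120.00, the blended average
print(f"median AOV: ${median(orders):.2f}")  # sits between the two clusters

# Split by a hypothetical segment boundary and the story changes:
low_spenders  = [v for v in orders if v < 100]
high_spenders = [v for v in orders if v >= 100]
print(f"low-spender AOV:  ${mean(low_spenders):.2f}")   # about $40
print(f"high-spender AOV: ${mean(high_spenders):.2f}")  # about $200
```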

A Marketer’s Quick Guide to Statistical Thinking

Mastering these basic statistics concepts for marketing will make you a stronger, more confident decision-maker. It’s not about becoming a statistician—it’s about reducing risk and replacing assumptions with evidence.

By building a culture of testing, you empower your team to learn faster and make smarter investments. Every test, even an inconclusive one, provides valuable information about your customers.

Frequently Asked Questions (FAQ)

1. How long should I run an A/B test?
A test should run long enough to collect a sufficient sample size and account for natural business cycles. A common mistake is stopping a test as soon as it reaches significance. Best practice is to run tests for full weekly cycles (e.g., 7, 14, or 21 days) to capture variations in user behavior between weekdays and weekends.

2. Is an 80% or 90% confidence level ever acceptable?
While 95% is the standard, a lower confidence level can be acceptable for low-risk decisions. For example, if you are testing a minor headline change where the cost of being wrong is minimal, an 85% confidence level might be enough to provide a directional signal. For high-stakes decisions, like a checkout redesign, you should always aim for 95% or higher.

3. What if my test result is inconclusive?
An inconclusive result—one that doesn’t reach statistical significance—is a learning opportunity. It tells you that the change you made was not impactful enough to create a detectable difference in user behavior. This might mean your hypothesis was incorrect or the change was too subtle. Use this outcome to iterate on your hypothesis and design a bolder test.
