
How to Run A/B Tests with Small Sample Sizes

By Krissy Tripp

How to deal with small sample sizes is one of the most frequent questions we get from clients, particularly when enterprise experimentation programs scale and individual business units begin testing on their specific site sections.

There are three factors that determine the necessary sample size in traditional frequentist methods:

- Desired (minimum detectable) lift
- Tolerance for risk (confidence and power)
- Current (baseline) conversion rate

Once stakeholders understand all three, they can make more informed decisions about which constraints to relax.

It takes a smaller sample to confidently detect a large difference than a small one. If a test idea is likely to produce a large conversion increase, it will take much less time to reach significance than one expected to move conversion only slightly. This may mean testing big ideas, or multiple changes at once, rather than one change at a time.
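To make the relationship concrete, here is a sketch using the standard two-proportion sample size approximation (the 5% baseline rate and the lift values are illustrative, not from any real program):

```python
import math
from statistics import NormalDist

def sample_size_per_arm(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-proportion z-test."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided confidence
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# A big swing (+20% relative lift) needs far fewer visitors than a
# subtle one (+5%) at the same 5% baseline conversion rate.
n_big = sample_size_per_arm(0.05, 0.20)
n_small = sample_size_per_arm(0.05, 0.05)
```

Note how the sample size scales with the inverse square of the detectable difference: quartering the lift roughly sixteen-folds the required traffic.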

Evolytics offers a free Sample Size Calculator to let you experiment with your necessary sample size, given different lift estimates.

95% confidence is the scientific standard largely by convention: researchers have agreed that seeing a false positive one time in 20 is acceptable bad luck that won't ruin a reputation in peer-reviewed journals. Most businesses aren't publishing their tests, though, and while we do want them to be sure they are rolling out winners, they may be willing to increase their tolerance for a false positive to 1 in 10.

If a team relaxes confidence to 90%, we recommend seasonal holdouts or back testing, but the change certainly increases velocity, and decisions will still be correct far more often than not. Back testing is especially important when increasing velocity like this: at a 1-in-10 false positive rate, if ten "winning" tests ship in a year, basic math suggests roughly one of them was actually a dud.

While we commonly discuss confidence, we rarely discuss power: the likelihood of detecting a change if one truly exists. Decreasing power increases your chance of observing flat results even when there is a real change. Still, lowering power to, say, 80% is less risky than lowering confidence, because the worst case is leaving money on the table by keeping the control, rather than investing in rolling out a bad recipe.
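Both knobs enter the standard frequentist sample size formula through the same multiplier, so you can sketch how much each relaxation buys (the specific alpha and power values below are illustrative):

```python
from statistics import NormalDist

def z_multiplier(alpha, power):
    """(z_alpha/2 + z_power)^2 -- the factor sample size scales with."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return (z_a + z_b) ** 2

standard = z_multiplier(alpha=0.05, power=0.80)       # 95% conf, 80% power
relaxed_alpha = z_multiplier(alpha=0.10, power=0.80)  # 90% confidence
relaxed_power = z_multiplier(alpha=0.05, power=0.70)  # 70% power
```

Either relaxation shrinks the multiplier, and therefore the required sample, by a meaningful fraction relative to the 95%/80% baseline.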

Seeing a 110 index (a +10% relative lift) on a 10% conversion rate is easier than seeing the same lift on a 2% conversion rate. While it's best practice to test against the final purchase, sometimes we can gain traction by testing against easier-to-hit micro-conversions along the funnel. When we use this method, we recommend keeping an eye on the bottom-line metric to ensure we do it no harm. Some A/B testing tools, such as Split, monitor multiple "standard" metrics no matter what primary KPI you choose.
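A sketch with the standard two-proportion sample size approximation shows how much the baseline rate matters; the 10% and 2% baselines below mirror the example above, with a 110 index (+10% relative lift) for both:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-proportion z-test."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(z ** 2 * variance / (p2 - p1) ** 2)

# The same +10% relative lift at two different baseline rates.
n_high_base = sample_size_per_arm(base_rate=0.10, relative_lift=0.10)
n_low_base = sample_size_per_arm(base_rate=0.02, relative_lift=0.10)
```

The 2% baseline needs several times the traffic of the 10% baseline for the identical relative lift, which is exactly why upper-funnel micro-conversions can reach significance sooner.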

Difference of proportions (comparing percentages) is usually a more difficult standard to meet than difference of means (a raw number such as revenue per visitor). While harder to automate, a difference-of-means test can reveal the tide of a test more quickly. Beware, though: it is possible to collect too large a sample with this method. An oversized sample will flag a statistical difference even when it isn't a meaningful or impactful one, such as one treatment recipe averaging $23.25 per order and another averaging $23.24. With enough sample, that gap becomes statistically significant, but finding it has cost you valuable resources and is unlikely to have a positive ROI.
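A quick back-of-the-envelope sketch shows how extreme the sample has to get before that one-cent gap reads as significant (the $10 standard deviation for order value is an assumed, illustrative figure):

```python
import math
from statistics import NormalDist

def n_to_detect_mean_diff(diff, sd, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sample z-test on means,
    assuming equal variance in both arms."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (z * sd / diff) ** 2)

# Visitors per arm needed to reliably flag a $0.01 AOV difference
# when order values have a standard deviation of about $10.
n_needed = n_to_detect_mean_diff(diff=0.01, sd=10.0)
```

The answer runs into the millions of visitors per arm: a textbook case of statistical significance without practical significance.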

Some A/B testing tools, such as Optimize and VWO, already use Bayesian statistics for test readouts. While Bayesian methods are more expensive to compute and more difficult to understand mathematically, they offer a natural way to discuss the probability of a hypothesis, and they don't require a fixed sample size; the estimate simply becomes more precise as the sample grows.
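The mechanics can be sketched with a Beta-Binomial model, the simplest Bayesian treatment of conversion rates (the uniform prior and the conversion counts below are illustrative assumptions, not how any specific tool computes its readout):

```python
import random

def prob_b_beats_a(conversions_a, visitors_a, conversions_b, visitors_b,
                   draws=50_000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1,1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conversions_a,
                                 1 + visitors_a - conversions_a)
        rate_b = rng.betavariate(1 + conversions_b,
                                 1 + visitors_b - conversions_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical results: 50/1000 on control, 65/1000 on the variant.
p_b_wins = prob_b_beats_a(50, 1000, 65, 1000)
```

The output is a direct statement like "the variant beats control with probability X," which stakeholders tend to find easier to act on than a p-value, and it can be recomputed at any sample size.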

Optimizely’s Stats Accelerator uses multi-armed bandit algorithms to call a winning recipe more quickly than traditional frequentist methods. It’s important to note that Stats Accelerator adjusts your traffic allocation dynamically, up to once an hour, but the algorithm does control for Simpson’s Paradox.
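Stats Accelerator’s internals are Optimizely’s own; purely as an illustration of the bandit idea, here is a minimal Thompson-sampling sketch that steers picks toward the arm whose posterior currently looks best (the success/failure counts are made up):

```python
import random

def thompson_pick(arms, rng=random):
    """arms: list of (successes, failures) per recipe.
    Draw one sample from each arm's Beta posterior; pick the best draw."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in arms]
    return draws.index(max(draws))

# With one clearly better recipe, nearly all traffic flows to arm 0.
rng = random.Random(7)
picks = [thompson_pick([(90, 10), (10, 90)], rng) for _ in range(100)]
```

Because the allocation is random but posterior-weighted, weaker arms still receive occasional traffic, which is what lets a bandit keep learning while exploiting the current leader.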

We recommend creating a threshold matrix from A/A experiment results. Doing so shows you the normal level of variance for your primary KPI, and short of statistical significance, it can tell you when a test is behaving abnormally. For instance, if an A/A test shows conversion never varying by more than 5%, and a live test has been +7% for two weeks but is still far from confidence, your team may be comfortable rolling it out, knowing the results are consistently above the normally observed variance. A trendline visualization is particularly helpful for this type of decision.
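One way to build such a threshold is to simulate A/A splits and take a high quantile of the arm-to-arm variation; a minimal sketch, assuming a 5% baseline rate and 2,000 visitors per arm (both figures illustrative):

```python
import random

def aa_relative_diff_threshold(base_rate, n_per_arm, sims=500,
                               quantile=0.95, seed=11):
    """Simulate A/A splits and return the chosen quantile of the
    absolute relative difference between arms -- a 'normal
    variance' threshold for the KPI."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(sims):
        a = sum(rng.random() < base_rate for _ in range(n_per_arm))
        b = sum(rng.random() < base_rate for _ in range(n_per_arm))
        if a > 0:
            diffs.append(abs(b - a) / a)
    diffs.sort()
    return diffs[min(int(quantile * len(diffs)), len(diffs) - 1)]

threshold = aa_relative_diff_threshold(base_rate=0.05, n_per_arm=2000)
```

A sustained lift above this threshold is the "acting out of sorts" signal described above; real programs would run the simulation at their own baseline rate and traffic levels, or use logged A/A data directly.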

Want to run your sample size plan by a Statistician? Get in touch.
