In a previous post, I provided a downloadable A/B testing significance calculator (in Excel). In this post, I provide a calculator that lets you estimate how many days you should run a test in order to obtain statistically significant results. But first, a disclaimer.

**There is no guarantee of results for an A/B test**

When someone asks how long he or she should run an A/B test, the honest answer is: until eternity, or until you get results (whichever is sooner). In an A/B test, you can never say with full confidence that you will get a statistically significant result after running the test for X days. What you can say is that there is an 80% (or 95%, whichever you choose) probability of getting a statistically significant result, if a difference of the size you care about really exists, after X days. And of course it may be that there is in fact no difference in performance between the control and the variation, in which case no matter how long you wait, you will never get a statistically significant result.
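This "80% probability" is the test's statistical power. As a quick illustration (my own sketch, not part of the calculator; all numbers are illustrative), a simulation in which a real 20% lift exists should come out significant in roughly 80% of runs once each variation has seen the 80%-power sample size:

```python
import math
import random

def significant(successes_a, successes_b, n, z_crit=1.96):
    """Two-proportion z-test (equal sample sizes) at a 95% confidence level."""
    p_a, p_b = successes_a / n, successes_b / n
    pooled = (successes_a + successes_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return se > 0 and abs(p_a - p_b) / se > z_crit

random.seed(7)
n = 3900                             # visitors per variation (~80%-power sample size here)
p_control, p_variation = 0.10, 0.12  # a true 20% relative lift exists

runs = 300
wins = sum(
    significant(sum(random.random() < p_control for _ in range(n)),
                sum(random.random() < p_variation for _ in range(n)), n)
    for _ in range(runs)
)
print(f"significant in {wins / runs:.0%} of simulated tests")  # typically close to 80%
```

Run the same simulation with `p_variation = p_control` and the "significant" fraction drops to roughly 5%, which is exactly the false-positive rate a 95% confidence level allows.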

**So, how long should you run your A/B test?**

Download and use the calculator below to find out how many visitors you need to include in the test. There are 4 pieces of information that you need to enter:

- Conversion Rate of original page
- The % difference in conversion rate you want to detect (detecting even the slightest improvement takes much longer)
- Number of variations to test (the more variations you test, the more traffic you need)
- Average daily traffic on your site (optional; used only to convert the visitor count into days)

Once you enter these 4 parameters, the calculator below will work out how many visitors you need to test (for an 80% and a 95% probability of finding the result). You can stop the test once that many visitors have gone through it, but you should never stop earlier than that; stopping early risks concluding the wrong result.

**A/B test duration calculator (Excel spreadsheet)**

Click below to download the calculator:

Download A/B testing duration calculator.

Please feel free to share the file with your friends and colleagues or post it on your blog / twitter.

PS: if you want to do quick calculations, we have a version of this calculator hosted on Google Docs (please make a copy of the sheet into your own account before making any changes to it).

**How does the calculator work?**

Ah! The million-dollar calculator. Explaining how it works in full is beyond the scope of this post, as it gets quite technical (maybe a separate post). But if you have the stomach for it, below is the gist of how we calculate the number of visitors needed to get significant results.

The graph above is taken from an excellent book called *Statistical Rules of Thumb*. Luckily, the chapter on estimating sample size is available to download freely [PDF]. Another excellent source to get more information on sample size estimation for A/B testing is Microsoft’s paper: Controlled Experiments on the Web: Survey and Practical Guide [PDF].

Hope you like the calculator and related resources. Excited to know your feedback and comments!

## Comments (32)

Great addition to the significance calculator! Thanks guys.

Very useful calculator, thanks! Minor note: maybe you should change the text from “expected” improvement in conversion to, say, “targeted” improvement in conversion? You don’t really know what to expect when you’re still setting up the test.

@Ana: it is actually the change in conversion you want to detect. But I agree it can be phrased in a better manner.

Is there a problem with the calculator?

I entered 1% in Expected Improvement in Conversion Rate, and it told me 19,200 days for 85%.

I entered 50% in Expected Improvement in Conversion Rate, and it told me 8 days for 85%.

Isn’t it much easier to improve 1% in conversion rate than 50%? Why is the calculator showing otherwise?

Looking through the chapter you linked to and the Excel document, I can’t identify which sample size calculation you are using. Which one is it?

@Phillip: it is actually the one which says 16*(std/diff)^2

@Tyler: if you want to detect a 50% or larger change, you will need less traffic than if you want to detect even a 2% change in conversion.

So it is a rearrangement of (2.3), using the substitutions in (2.4), and with power set to .95?

@Phillip: the power is 80%, the confidence level is 95%.

In the Excel document, the formula uses 26 where I would have expected to see 16. The table at the top of page 30 (4 of 25 in the pdf) indicates that 26 is the number to use when power of .95 is desired. The entire table is for a confidence level of 95%.

@Phillip: there are two rows in the Excel sheet. 16 corresponds to 80% power; 26 corresponds to 95% power.

Right, the upper formula (J8) uses 16 (corresponding to power of .80), the lower formula (J9) uses 26 (corresponding to power of .95).

Should (D7*D8) just be (D8)? It looks like this should be the difference between the means under the null and alternative hypothesis.

I’m struggling with this because it doesn’t give me anything close to the number I get using the standard sample size calculation (n = p(1-p)z^2/e^2). When I try using formula (2.27) I get an answer that is closer to, but still quite different from, what I am expecting to see. So I’m trying to figure out where the differences are between these various methods of calculating the same thing.

It seems like the standard formula is simple enough that if we are using Excel to calculate the sample size, we may as well use it.

Thanks for the continued responses, they have been helpful.
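One likely source of the discrepancy Phillip describes: n = p(1-p)z²/e² is the margin-of-error formula for estimating a single proportion and contains no power term, while the spreadsheet's rule of thumb bakes in 80% power for a two-sample comparison. A side-by-side sketch with illustrative numbers (my own, not from the thread):

```python
p = 0.10          # baseline conversion rate
lift = 0.20       # detect a 20% relative change
delta = p * lift  # absolute difference = 0.02

# (a) Margin-of-error formula for a single proportion: guarantees a confidence
# interval of half-width delta, but says nothing about the power to detect a lift.
z = 1.96
n_margin = p * (1 - p) * z ** 2 / delta ** 2

# (b) Two-sample formula with power: (z_alpha/2 + z_beta)^2 * (var1 + var2) / delta^2
p2 = p * (1 + lift)
z_beta = 0.8416   # z for 80% power
n_power = (z + z_beta) ** 2 * (p * (1 - p) + p2 * (1 - p2)) / delta ** 2

# (c) van Belle's rule of thumb, 16 * sigma^2 / delta^2 (sigma^2 from the baseline):
n_thumb = 16 * p * (1 - p) / delta ** 2

print(round(n_margin), round(n_power), round(n_thumb))  # 864 3838 3600
```

The rule of thumb (c) is a rounded approximation of the power-aware formula (b), since (1.96 + 0.8416)² ≈ 7.85 ≈ 8 and the two variances roughly double the baseline one; the margin-of-error formula (a) is answering a different question, which is why it comes out several times smaller.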

Paras, I’m not quite understanding here. Why would it be harder to detect a small change (1%) than a huge change (50%)?

@Tyler: detecting a small change (1%) with statistical significance requires a lot of traffic. That is because while you are collecting data, the measured conversion rate has a range: say 6% +/- 2%. To make this range narrower and get a better estimate of the true conversion rate, you need more data. In essence, by saying you want to detect even a 1% change, you are saying that the range has to be very, very small: something like 6% +/- 0.001%.

When you want to detect a larger difference, it doesn’t matter that the range is wider. So if your control is 6% +/- 2% and your variation is 9% +/- 1%, you can still detect that huge 50% change.
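Tyler's question can also be answered numerically with the 16·(σ/Δ)² rule of thumb mentioned earlier in the thread, using the 6% conversion rate from the example above (this is my illustration, not from the original thread):

```python
def visitors_needed(base_rate, relative_lift, numerator=16):
    """Rule-of-thumb visitors per variation: numerator * p(1-p) / delta^2."""
    delta = base_rate * relative_lift  # absolute difference to detect
    return round(numerator * base_rate * (1 - base_rate) / delta ** 2)

base = 0.06  # 6% conversion rate, as in the example above
for lift in (0.01, 0.10, 0.50):
    print(f"detect a {lift:.0%} relative change: "
          f"{visitors_needed(base, lift):,} visitors per variation")
```

Because the required traffic grows with 1/Δ², detecting a 1% relative change at a 6% base rate takes roughly 2.5 million visitors per variation, while a 50% change needs only about a thousand.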

@Phillip: it is D7*D8 because we are calculating delta here: Mean * (% change in mean) gives us the delta. Using the mean alone is not correct. I hope that clarifies it. Feel free to reach me at paras@wingify.com

Hi Paras,

A question: when I use the tool with these numbers (500 visitors, 20% conversion, and a 10% expected uplift), it says it takes 42 days to get significant results. But when I use your significance tool and enter 5,000 visitors with 1,000 conversions for the control and 5,000 visitors with 1,100 conversions for the variation, there is already a significant result. Based on that tool it would take only 10 days to complete the test. I came up with this because the GWO (Google) calculator gave me 10 days and your tool 42 days. I hope you get my point 🙂

Hi Paras, can you explain why this tool gives me 42 days when I use these numbers (500 visits, 20% conversion, and an uplift of 10%) while Google’s calculator gives 10 days?

When I use the numbers in the VWO significance tool, I also get significant results after 10 days. I’m lost 🙂

Hi, I ran the numbers on 2 duration calculators (Google Website Optimizer’s and yours) and got significantly different results. Can you please explain why?

@Jan: I don’t know about the GWO calculator (they may be using a different confidence level and power).

If I’m only testing one change from the Control (such as Control has a headline & Test page doesn’t), are the number of variations in the test equal to 1 or 2?

Knowing this makes a big difference in testing days... basically it doubles them, according to your formula.

Thank you again for your work, I feel smarter already!

@Quan: the number of variations will be two. One is control and one is the actual variation you are testing.

@Paras: Thank you for the quick reply, I greatly appreciate it!

Regarding comment 20: so if I test 3 separate pieces of ad copy, each changing only a single item such as the headline, does it count as 3 tests? If so, that will work just fine; the time frame (given my estimated improvement) matches up with what I’ve been ballparking lately. Good to know.

Oh! That’s great to know that a calculator is available to work out how long the test will take. Thanks for the information on calculating the duration.

Hi,

Thank you for providing this great A/B test duration calculator!

I am also interested in getting the duration estimates when the chance of finding a difference is 90% or 99%.

I found the 90% duration estimate by using Table 2.1 in the book you reference on this page, i.e. in the chapter from “Statistical Rules of Thumb” at http://www.vanbelle.org/chapters/webchapter2.pdf (the two-sample numerator is 21). And I used the formula in your spreadsheet to add a few columns to display the 90% values.

Could you please tell me what the missing row with Power = 0.99 from Table 2.1 is, or point me to a reference that has this information?

Better yet, it would be great to add these extra 2 “confidence levels” to your calculator!

Thanks,

Lori Proos

@Lori: I think this paper from Microsoft has a formula for 90% power http://www.exp-platform.com/Pages/hippo_long.aspx

So with a 95% Confidence Level & 80% Power, the Numerator for Sample Size Equation is 16.

What if I wanted to drop the Confidence Level to 80% (alpha) while keeping 80% Power (beta)? Do you know what the numerator for the sample size equation would be, and what would be the pitfall of doing that?

The reason I ask is that my boss wanted to test at a lower confidence, but I could only adjust the power using Table 2.1 in the Statistical Rules of Thumb link you provided.

@Quan: I’m afraid I don’t know the numerator for an 80% confidence level. I would have to do a literature search for that.

Hi, I’m unclear on what the calculator is telling me. If I start at a 10% conversion rate and put in a 50% expected improvement, taking the conversion rate to 15%, then if there is a statistically significant difference, should I expect to see the variation page at around 15% at or after the suggested duration?

@Segio: it means that if you don’t see at least a 50% performance difference within X days, you should give up on the test. The improvement can show up any time during day 0 to day X, but if it isn’t seen even after day X, you should conclude that the variation performs equivalently to the control (within the confidence level you have set).