# What you really need to know about the mathematics of A/B split testing

Founder and Chairman of Wingify.

Recently, I published an A/B split testing case study where an eCommerce store reduced bounce rate by 20%. Some readers of the blog were worried about the statistical significance of the results. Their main concern was that 125-150 visitors per variation is not enough to produce reliable results. This concern is a typical by-product of a superficial knowledge of the statistics that powers A/B (and multivariate) testing. I’m writing this post to provide an essential primer on the mathematics of A/B split testing so that you never jump to a conclusion about the reliability of a test’s results simply on the basis of the number of visitors.

### What Exactly Goes Behind A/B Split Testing?

Imagine your website as a black box containing balls of two colors (red and green) in unequal proportions. Every time a visitor arrives on your website he takes out a ball from that box: if it is green, he makes a purchase. If the ball is of red color, he leaves the website. This way, essentially, that black box decides the conversion rate of your website.

A key point to note here is that you cannot look inside the box to count the number of balls of different colors in order to determine true conversion rate. You can only *estimate* the conversion rate based on different balls you see coming out of that box. Because conversion rate is an estimate (or a guess), you always have a range for it; never a single value. For example, mathematically, the way you describe a range is:

“Based on the information I have, 95% of the times conversion rate of my website ranges from 4.5%-7%.”

As you would expect, with more visitors you get to observe more balls. Hence, your range gets narrower and your estimate starts approaching the true conversion rate.
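The black-box picture can be sketched as a small simulation. This is only an illustration: the hidden rate of 6% (`TRUE_RATE`), the seed, and the function name `estimate_conversion_rate` are assumptions made up for the example, not values from the case study.

```python
import random


def estimate_conversion_rate(true_rate, visitors, rng):
    """Draw one 'ball' per visitor and return the observed share of green balls."""
    conversions = sum(1 for _ in range(visitors) if rng.random() < true_rate)
    return conversions / visitors


rng = random.Random(42)  # fixed seed so the run is repeatable
TRUE_RATE = 0.06         # hypothetical hidden proportion of green balls

# With more visitors the estimate tends to settle closer to the hidden rate.
for visitors in (100, 1_000, 10_000, 100_000):
    estimate = estimate_conversion_rate(TRUE_RATE, visitors, rng)
    print(f"{visitors:>7} visitors -> estimated rate {estimate:.4f}")
```

In a real test you never know `TRUE_RATE`; the simulation only shows why estimates from small samples wobble more than estimates from large ones.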

### The Maths of A/B Split Testing

Mathematically, the conversion rate is represented by a binomial random variable, which is a fancy way of saying that each visit can have two possible outcomes: conversion or non-conversion. Let’s call the conversion rate *p*. Our job is to estimate the value of *p*, and for that we do *n* trials (or observe *n* visits to the website). After observing those *n* visits, we calculate what fraction of them resulted in a conversion. That value (which we represent from 0 to 1 instead of 0% to 100%) is the conversion rate of your website.

Now imagine that you repeat this experiment multiple times. It is very likely that, due to chance, you will calculate a different value of *p* every single time. Having all the (different) values of *p*, you get a range for the conversion rate (which is what we want for the next step of the analysis). To avoid doing repeated experiments, statistics has a neat trick in its toolbox. There is a concept called *standard error*, which tells how much deviation from the average conversion rate (*p*) can be expected if this experiment is repeated multiple times. The smaller the deviation, the more confident you can be about estimating the true conversion rate. For a given conversion rate (*p*) and number of trials (*n*), standard error is calculated as:

*Standard Error (SE) = Square root of (p * (1-p) / n)*

Without going much into details, to get 95% range for conversion rate multiply the standard error value by 2 (or 1.96 to be precise). In other words, you can be sure with 95% confidence that your true conversion rate lies within this range: *p* % ± 2 * *SE*
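As a rough sketch, the standard error formula and the 95% range translate into a few lines of Python. The function names and the example figures (a 6% rate observed over 1,500 visits) are assumptions chosen for illustration.

```python
import math


def standard_error(p, n):
    """Standard error of a conversion rate p observed over n visits."""
    return math.sqrt(p * (1 - p) / n)


def conversion_range(p, n, z=1.96):
    """Return (low, high) for the range p +/- z * SE; z=1.96 gives the 95% range."""
    margin = z * standard_error(p, n)
    return p - margin, p + margin


# Hypothetical example: 90 conversions out of 1,500 visits (p = 0.06).
low, high = conversion_range(0.06, 1500)
print(f"95% range: {low:.2%} to {high:.2%}")
```

Note that `p` here is a fraction between 0 and 1, so the resulting range is also expressed as fractions before formatting.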

(In Visual Website Optimizer, when we show conversion rate range in reports, we show it for 80%, not 95%. So we multiply standard error by 1.28)
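The multipliers 1.96 and 1.28 are not arbitrary: under the normal approximation they follow from the chosen confidence level. A minimal sketch using Python's standard library (the helper name `z_multiplier` is an assumption for this example):

```python
from statistics import NormalDist


def z_multiplier(confidence):
    """Two-sided multiplier for a confidence level given as a fraction (e.g. 0.95)."""
    # Half of the leftover probability sits in each tail of the normal curve.
    return NormalDist().inv_cdf(0.5 + confidence / 2)


for level in (0.80, 0.95, 0.99):
    print(f"{level:.0%} range -> multiply SE by {z_multiplier(level):.2f}")
```

A lower confidence level gives a smaller multiplier and therefore a narrower range for the same data.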

### What Does it Have to do With Reliability of Results?

In addition to calculating the conversion rate of the website, we also calculate a range for each of its variations in an A/B split test. Because we have already established (with 95% confidence) that the true conversion rate lies within that range, all we have to observe now is the overlap between the conversion rate range of the website (control) and that of its variation. If there is no overlap, you can be confident that the variation is better (or worse, if the variation has a lower conversion rate) than the control. It is that simple.

As an example, suppose the control conversion rate has a range of 6.5% ± 1.5% (5% to 8%) and a variation has a range of 9% ± 1% (8% to 10%). In this case, the two ranges meet only at their endpoints, so there is no real overlap and you can be sure about the reliability of the results.
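The overlap check itself is simple to sketch. One detail worth noticing: in the example above the control range ends at 8% and the variation range starts at 8%, so the two ranges touch at a single point. The sketch below assumes that touching endpoints do not count as overlap; that choice, like the function name, is an assumption of this example.

```python
def ranges_overlap(center_a, margin_a, center_b, margin_b):
    """True if the ranges center +/- margin share more than a single endpoint."""
    low_a, high_a = center_a - margin_a, center_a + margin_a
    low_b, high_b = center_b - margin_b, center_b + margin_b
    # Strict comparisons: ranges that merely touch are treated as not overlapping.
    return low_a < high_b and low_b < high_a


# Control 6.5% +/- 1.5% versus variation 9% +/- 1% (values in percent).
print(ranges_overlap(6.5, 1.5, 9.0, 1.0))
```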

### Do You Call All That Math Simple?

Okay, not really simple, but it is definitely intuitive. To save yourself the trouble of doing all the math, you can either use a tool like Visual Website Optimizer, which automatically does all the number crunching for you, or, if you are doing a test manually (such as for Adwords), use our free A/B split test significance calculator.

### So, What is the Take-Home Lesson Here?

*Always, always, always* use an A/B split testing calculator to determine significance of results before jumping to conclusions. Sometimes you may discount significant results as non-significant solely on the basis of number of visitors (such as you may do for this case study). Sometimes you may think results are significant due to large number of visitors when in fact they are not (such as here). You really want to avoid both scenarios, don’t you?

Great look at the reliability of A/B test results. When you get into quantification and accountability, many designers–the very people who need to be running A/B tests–get discouraged and never take the time to do tests.

Can I confirm the maths for this formula with an example? Suppose the control web page has 1000 visitors of which 100 convert (10% conversion rate) while the variation has 1000 visitors and 150 convert (15%). Would the respective SE be:

Control:

SQRT(0.1 * (1-0.1) / 1000)= 0.00949

SE = 0.00949 * 1.96 = 0.0186

thus 10% ± 1.9% = 8.1% to 11.9%

Variation:

SQRT(0.15 * (1-0.15) / 1000)= 0.01129

SE = 0.01129 * 1.96 = 0.02213

thus 15% ± 2.2% = 12.8% to 17.2%

Thus, since there is no overlap, the variation results are reliable.

Is this correct (or is another number used for n)?

Yes, your calculations look fine to me.
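For anyone who wants to re-run the arithmetic from the comment above, here is a small sketch that recomputes both ranges from the raw counts. The function name `margin_of_error` is an assumption for the example; the figures (100 and 150 conversions out of 1000 visitors each) are the ones from the comment.

```python
import math


def margin_of_error(conversions, visitors, z=1.96):
    """Return (p, SE, z * SE) for a variation with the given counts."""
    p = conversions / visitors
    se = math.sqrt(p * (1 - p) / visitors)
    return p, se, z * se


for label, conversions, visitors in (("Control", 100, 1000), ("Variation", 150, 1000)):
    p, se, margin = margin_of_error(conversions, visitors)
    print(f"{label}: SE={se:.5f}, margin={margin:.4f}, "
          f"range={p - margin:.1%} to {p + margin:.1%}")
```

The value multiplied by 1.96 is the margin around the observed rate; the standard error itself is the number before that multiplication.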

Thanks Paras.

I think what was (and still is) confusing me is that when I tried to verify it using an online calculator (e.g. http://www.dimensionsintl.com/error_calculator.html) for 95% confidence with 1000 for population, 0.1 for proportion and 100 for sample size, it gives me double the ‘standard error’ as my calculations above.

…so I suspect I am misunderstanding something either here or in using other online calculators.

@Duane. That standard error is double in other online calculators because it is +/-. I think they are probably reporting error around mean while in this article I give a range. It is a matter of reporting -x to +x v/s 2x

@Paras. Thanks – makes sense now. The fact that the difference was half/double gave me a suspicion it was something like that, but as I am currently sleep deprived, it wasn’t clicking 🙂

Thanks for a great post and an interesting service – I remember a few years back when services like the Visual Website Optimiser were so expensive, individuals and small companies couldn’t afford them. So nice to see that changing.

Thanks for this great explanation, really helps!

(1) I think there is an error in the formula Duane used for SE (standard error). There is NO 1.96 (the t-value for 95% confidence) in the SE formula.

(2) SE is one standard deviation.

(3) The range is typically reported as +/- t multiplied by SE.

(4) The constant (critical t value) 1.96 is an approximation, assuming normality for n=30; the critical t value changes as n changes.

(5) People use two-sided tests (are the two conversion rates different?, as here) versus one-sided tests (is the new change better than the control?). Using different types of tests will result in different t values. Duane probably sees the effects of different n values resulting in different t-values, and of different types of tests (one-sided vs two-sided).

(6) The normality approximation is reasonable when (a) p is small and (b) the number of conversions exceeds 30, not the number of trials.

Hope this helps,

hi Paras,

if I use the standard error formula given in this post, the numbers I get are not matching the standard error in your image.

I have created an excel spreadsheet here. If there is something wrong with the formula please feel free to make changes: http://spreadsheets.google.com/ccc?key=0AlNACDtsQ-AzdFNzNHBaWHo4aktfUjRIcTJmek9VZXc&hl=en

for the comment above, by image I mean https://vwo.com/blog/wp-content/uploads/2010/01/result.png

Hi Inventov,

Actually, you just calculated SE – remember you need to multiply it by 1.96 to get the 95% range of the conversion rate. In the image, we show the 80% range, which corresponds to a z-score of 1.28.

I have made modifications to your excel sheet and numbers do match now. If you make a great A/B testing spreadsheet, I’d love to share it here on this blog.

-Paras

Thanks. I’ve updated the file. Paras, can you update the column on chances to beat the original with your formula?

I’m curious to know what the math is for the “chance to beat original”. How does that get decided and is it really accurate?

“Chance to beat original” simply measures the overlap between two distributions. If there is 1% overlap between the conversion rate distributions of control and variation, then there is a 99% chance of the variation beating the control.
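One common way to turn that idea into a number is to treat each conversion rate as approximately normal and ask how likely it is that the variation's rate exceeds the control's. The sketch below does that; it is an assumption about how such a figure can be computed, not a confirmed description of the formula Visual Website Optimizer uses, and the function name is made up for the example.

```python
import math
from statistics import NormalDist


def chance_to_beat(control_conv, control_n, variation_conv, variation_n):
    """Approximate probability that the variation's true rate exceeds the control's."""
    p_c = control_conv / control_n
    p_v = variation_conv / variation_n
    se_c = math.sqrt(p_c * (1 - p_c) / control_n)
    se_v = math.sqrt(p_v * (1 - p_v) / variation_n)
    # z-score of the difference, combining both standard errors.
    z = (p_v - p_c) / math.sqrt(se_c ** 2 + se_v ** 2)
    return NormalDist().cdf(z)


# Figures from the earlier comment: 100/1000 for control, 150/1000 for variation.
print(f"{chance_to_beat(100, 1000, 150, 1000):.2%}")
```

When both variations have identical counts the function returns 50%, which matches the intuition that neither is ahead.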

Good stuff, thanks! I have an additional complication however. How do you define p and n when the conversion event may take place at some future point (i.e. not on the same day)?

So let’s say you get 10,000 visitors to your site per day who register. Then, at some future point, they may decide to ‘convert’ (make a purchase for example – can only happen once) at any time between the registration date and a year from that date or more. Of those who will convert, most do so within the first 6 months, and then the conversions trail off. How do you set up this experiment?

Say you expose two groups, Pick-A and Pick-B to two different landing pages and you want to determine the effect of the landing page on the ultimate conversion. So you create a “class” which you define as anyone who visits the landing pages for one month. At that point, the class is defined, but the test continues because they have not yet converted.

My questions are, how do you define the conversion rate (do you average the total conversions over the exposure time of one month?), how do you define the trials (is one trial the first visit to the landing page so that a trial is a unique visitor?), and how long do you wait before you stop the test and decide that you have enough conversion data?

Hi John,

No matter how long your test is running, it won’t affect your conversions. If your visitor converts 6 months after first getting included in the test, it will still count as a conversion (assuming you still have the test running). There are several calculators available on the Internet, including one on our site: https://vwo.com/ab-split-test-duration/

Using these calculators you can calculate how long to wait for the results before giving up.

Thanks Paras!

So just to clarify with an example, let’s say I get 10,000 visitors per day for thirty days, and so I have a total of 300K in my test population. Then, over the next 6-8 months, I get different conversions per month, but in the end I get a total of 3,000 conversions. Do I then use n=300K and p=1%? i.e. do I average the TOTAL conversions over the 30 days I created my population even though they take place on very different timelines?

On a related note, are there rules of thumb about the proximity of conversion events to the page affected? To clarify, in my example, I am making a cosmetic change to a landing page. The nearest conversion event is registration – where they create an account on the landing page itself. That is a same-day event, and it makes a lot of sense that my changes in Pick-B might affect the conversion rate. However, if we now go out 6 months where the user has interacted with many different parts of my site, logging in and out, researching, etc. There are many exogenous factors that affect their purchase decision in that time that I have no influence over – life factors, income, age, competitors, etc. Is it really still valid to test to a conversion so far out based on the color of a button (or similar) far upstream?

My hypothesis is that if there is enough separation between the two events – interaction with the landing page and conversion – then even if Pick-A and Pick-B were exactly the same, I would still likely see a slight difference in conversions between the picks. Are there tests that just don’t make sense to run?

Hi John,

This is interesting. I think ultimately it is up to the test creator to be aware of what his conversion goals are actually going to mean. A period of six months is too long; however, if your test is designed with such a goal in mind, then you could of course take it as a valid goal.

Theoretically if your variations do not have any effect on the six-month goal, you should see no statistical significance in the difference between conversion rates (because visitors were randomly distributed).

But you raise an interesting point that the time horizon must make an impact: perhaps, due to sheer chance, group A experienced better customer service than group B and that is why they converted (and not because of the test variations). The more you lengthen the period, the greater the chance of such unknown variables impacting different groups.

I don’t have a mathematical theory for this (yet), but it is a very interesting point for sure.

-Paras

I think there is one basic flaw with the pure mathematical approach, or at least with this approach – it doesn’t take trending into account! If I see a test ‘graph’ that has a lot of ‘noise’ (both graphs cutting across each other), in other words, if one day one variation is winning and the next day the other is winning, and so on, despite using cumulative data, then I don’t trust the result. For a result to be truly trustworthy or significant, the ‘noise’ must have subsided and the trend must remain the same. In this way, I’d say there are a lot of folks calling tests ‘significant’ when in fact they are not. There is a lot of noise caused by time of day, day of week, holidays, news, etc… and this will muddy your results. I’d love to see a mathematical calculation that takes time/trending into account!

@Anne: you make a good point and it will be great to capture trending into a mathematical number. However, ‘chance to beat original’ or ‘statistical significance’ talks about results in overall context. With these metrics we want to understand what is the likelihood that variation is performing better as compared to control given a specific sample (over a number of days).

What you are asking for is a number that says how consistent the performance is. Those are two different things, but nevertheless consistency can be important too.

Can you explain how you get to this formula, please?

Standard Error (SE) = Square root of (p * (1-p) / n)

I don’t understand how you can calculate the standard error without knowing anything about the variance.

That would be really helpful, thank you!

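@Andi: each visit is a binomial (yes/no) trial, and for a single such trial the variance is p * (1 – p). The variance of the conversion rate averaged over n visits is therefore p * (1 – p) / n, and the standard error is its square root.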
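A quick way to convince yourself of the formula is to actually repeat the experiment many times in a simulation and compare the spread of the observed rates with the formula's prediction. The rate (10%), sample size (1000), number of repetitions and seed below are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def simulated_spread(p, n, experiments, rng):
    """Standard deviation of the observed rate across many repeated experiments."""
    rates = []
    for _ in range(experiments):
        conversions = sum(1 for _ in range(n) if rng.random() < p)
        rates.append(conversions / n)
    return statistics.pstdev(rates)


rng = random.Random(7)  # fixed seed so the run is repeatable
p, n = 0.10, 1000       # hypothetical true rate and visits per experiment

simulated = simulated_spread(p, n, 2000, rng)
formula = math.sqrt(p * (1 - p) / n)
print(f"simulated spread: {simulated:.5f}")
print(f"formula SE:       {formula:.5f}")
```

The two numbers should land close to each other, which is the sense in which the standard error replaces the need to repeat the experiment.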

How much overlap is allowed between the two distributions to be confident that version B is better?

You said that if there is 1% overlap between conversion rate distribution of control and variation, then there is 99% chance of variation beating the control.

What if there is a 5% overlap? In this case, is there a 95% chance of the variation beating the original?

What about a 6% overlap?

Thanks!

@Rafael: it depends on how important results are for an organization and how much risk (of being wrong) it is willing to accept. 99% chance to beat original is always better, but if stakes aren’t high some organizations are okay with 95% chance to beat original too.

Hey Paras, thank you for the answer!

So the “chance to beat the control” can be measured just by measuring this overlap?

For instance, does a 10% overlap mean a 90% chance to beat the control?

And a 15% overlap -> 85% chance

20% overlap -> 80% chance

And so on, so forth. Is this the case, or did I misinterpret it?

@Rafael: yes, your understanding is correct.

Thanks a lot, this was just what I was looking for!

Keep up the good work.

Hi,

Great article! Thanks for sharing. I have one question, what if I want to use metrics not represented by a binomial or normal distribution?

For instance, what happens if I want to compare control vs variation looking at the metric: visits/user?

Thanks,

J

I have a question about the math that goes in to finding the z-score. On the excel sheet, you used the equation: =(control_p-variation_p)/SQRT(POWER(control_se,2)+POWER(variation_se,2)) which = 1.721671363

However, shouldn’t we use the difference between two proportions (the conversion rates) to find the z-score and see if the difference is not 0? This formula would involve calculating the pooled p (conversion rate)…etc.

Also, to find whether or not it’s significantly different, don’t you have to do 1-p or 2(1-p) (for 2 tails) to find alpha and see if alpha is <= 0.05, 0.01, etc?

Thanks!
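For reference, the pooled approach described in the comment above (a two-proportion z-test with a two-tailed p-value) can be sketched as follows. This is a generic textbook formulation under the normal approximation, not the spreadsheet's formula; the function name is an assumption for the example.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate: what both groups would share if there were no real difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, two-tailed p-value = {p_value:.4f}")
```

The result is then compared with the chosen alpha (0.05, 0.01, etc.): a p-value at or below alpha is called significant at that level.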

Great blog thanks for sharing.

What considerations should be taken into account when the A/B/N test has an uneven traffic split, for example in an A/B/C/D test with a traffic split of 70% (existing site), 10%, 10%, 10% respectively?

Just wanted to stress that

“Based on the information I have, 95% of the times conversion rate of my website ranges from 4.5%-7%.”

is an incorrect statement.

A confidence level is NOT confidence in the specific interval. It is confidence in the method for generating the interval, which produces a range of plausible values. It is not a probability or a chance of true value being in that range.

So the above statement means “4.5-7% are all plausible values for the true value for the population represented by this specific sample, because the method I used to generate the confidence interval produces a range that contains the true value 95% of the time”.

It does not mean that if I reran the test, the true value would be 95% certain to lie in the 4.5-7 range. It would be 95% certain to lie in whatever range I calculate for that test, which may be 6-9%, for example. It would depend on sample size and actual performance during that different time period.
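The "confidence in the method" reading can be illustrated with a simulation: run many hypothetical tests against a known true rate, build a 95% range from each one, and count how often the range contains the true rate. The true rate (6%), sample size, number of runs and seed are assumptions picked for the sketch.

```python
import math
import random


def coverage(true_rate, n, experiments, rng, z=1.96):
    """Share of simulated tests whose range p_hat +/- z * SE contains true_rate."""
    hits = 0
    for _ in range(experiments):
        conversions = sum(1 for _ in range(n) if rng.random() < true_rate)
        p_hat = conversions / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= true_rate <= p_hat + margin:
            hits += 1
    return hits / experiments


rng = random.Random(11)  # fixed seed so the run is repeatable
observed_coverage = coverage(0.06, 1000, 2000, rng)
print(f"ranges containing the true rate: {observed_coverage:.1%}")
```

Each simulated test produces a different range, yet roughly 95% of those ranges contain the true rate, which is what the confidence level describes.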

Could you show an example of how to determine the % overlap? Such as if you have two intervals (6.4, 7.3) and (6.8, 8.1) what would be the percent overlap? Thanks!