How reliable are your split test results?
With split testing, there is always a fear at the back of your mind that the results you observe are not real. This fear becomes especially prominent when you see an underdog variation beating your favorite variation by a huge margin. You may start rationalizing that the results are due to chance or that you have too little data. So, how do you make sure that the test results are reliable and can be trusted? Reliability here means that running the test for longer isn’t going to change the results; whatever results you have now are real (and not due to chance).
So, how do you determine the reliability of your A/B test? Hint: you don’t. You let your tool do the work for you. VWO employs a statistical technique where your conversion events are treated as binomial variables. Above a certain sample size (10-15 visitors), a binomial variable can be approximated by a normal distribution. The key point to note is that your conversion rate is a distribution, not a single value. That is, you always get a range (e.g. 31.3% ± 14.8%) for your conversion rate.
As you can see in the image above, a range for the conversion rate is provided in the report. For statistics geeks, this range covers 80% of the total area under the normal distribution. One peculiar property of the range (thanks to a concept called standard error) is that it starts out very wide, but as more data gets collected over time, it becomes narrower and narrower. At the same time, the estimate of the true conversion rate becomes more precise. For example, if on day 2 of your split test you observe that the conversion rate for a variation is 50% ± 25%, after a week you may observe that it has changed to 40% ± 8% (note that the conversion rate has changed and the range has become narrower).
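To make the narrowing concrete, here is a minimal sketch of how such a range could be computed from the normal approximation. The exact formula VWO uses is not spelled out in this post, so treat the function `conversion_range`, the constant `Z_80`, and the visitor counts below as illustrative assumptions.

```python
import math

# Assumption: the reported range spans the central 80% of a normal
# distribution, which corresponds to roughly 1.2816 standard errors
# on either side of the observed rate.
Z_80 = 1.2816


def conversion_range(conversions, visitors, z=Z_80):
    """Return (rate, margin) so the range reads as rate ± margin."""
    rate = conversions / visitors
    # Standard error of a binomial proportion under the normal approximation.
    standard_error = math.sqrt(rate * (1 - rate) / visitors)
    return rate, z * standard_error


# Hypothetical counts: an early snapshot and a later one with more traffic.
early_rate, early_margin = conversion_range(10, 20)
later_rate, later_margin = conversion_range(80, 200)

print(f"Early: {early_rate:.1%} ± {early_margin:.1%}")
print(f"Later: {later_rate:.1%} ± {later_margin:.1%}")
```

Because the visitor count sits in the denominator of the standard error, the margin shrinks roughly with the square root of the traffic: four times the visitors gives about half the width.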
The way to be sure that your results are reliable is to compare the conversion rate ranges of different variations and see if there is NO overlap between them. You can visualize this overlap in the chart above. Observe that there is little or no overlap between the Control and “free download” variations. So you can be pretty sure that this result is reliable and “free download” indeed works better than the control (which in this case was a simple “download”). This overlap (in distributions) can also be calculated numerically, and VWO reports it as the “Chance to Beat Original” metric. If that value is above 95%, you can be pretty confident that the variation is better than the control.
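For the curious, both checks can be sketched in a few lines. This is one common normal-approximation way to estimate the probability that a variation beats the control; it is an assumption that it resembles VWO's own “Chance to Beat Original” calculation, which this post does not detail. The function names and the visitor counts are made up for illustration.

```python
import math

Z_80 = 1.2816  # assumed multiplier for a central 80% range


def standard_error(rate, visitors):
    """Standard error of a conversion rate under the normal approximation."""
    return math.sqrt(rate * (1 - rate) / visitors)


def ranges_overlap(rate_a, visitors_a, rate_b, visitors_b, z=Z_80):
    """True if the two rate ± margin ranges share any values."""
    margin_a = z * standard_error(rate_a, visitors_a)
    margin_b = z * standard_error(rate_b, visitors_b)
    return (rate_a - margin_a <= rate_b + margin_b
            and rate_b - margin_b <= rate_a + margin_a)


def chance_to_beat(control_rate, control_visitors,
                   variation_rate, variation_visitors):
    """Estimate P(variation's true rate > control's true rate)."""
    se_diff = math.sqrt(
        standard_error(control_rate, control_visitors) ** 2
        + standard_error(variation_rate, variation_visitors) ** 2
    )
    if se_diff == 0:
        # Degenerate case (rates of exactly 0 or 1): no spread to compare.
        return 0.5 if variation_rate == control_rate else float(variation_rate > control_rate)
    z = (variation_rate - control_rate) / se_diff
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Hypothetical test: control converts 100 of 1000, variation 140 of 1000.
overlap = ranges_overlap(0.10, 1000, 0.14, 1000)
chance = chance_to_beat(0.10, 1000, 0.14, 1000)
print(f"Ranges overlap: {overlap}")
print(f"Chance to beat original: {chance:.1%}")
```

With these made-up numbers the ranges do not overlap and the estimated chance is above the 95% mark, so the two checks agree. When two variations have identical rates, the same function returns 50%, which is what a coin flip between them should look like.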
If you don’t really trust the statistics (by the way, there is no reason you shouldn’t), you can still be confident about the test results by employing a neat trick. The idea is to have two identical variations (usually of the control) and see if there is any difference in their conversion rates. Of course, a minor difference will be there (due to randomness), but if you see a large difference between the conversion rates of identical variations, you shouldn’t trust the results.
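If you want a feel for what a “minor difference” between identical variations looks like, a small simulation helps. The sketch below is only an illustration: the 10% true conversion rate, the visitor counts, and the cutoff of 1.96 standard errors for calling a difference “large” are all assumptions, not values from VWO.

```python
import math
import random


def simulate_conversions(visitors, true_rate, rng):
    """Count how many simulated visitors convert at the given true rate."""
    return sum(1 for _ in range(visitors) if rng.random() < true_rate)


def difference_is_suspicious(conv_a, visitors_a, conv_b, visitors_b,
                             z_threshold=1.96):
    """True if two supposedly identical variations differ by more than
    z_threshold standard errors (an assumed cutoff for 'large')."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    se_diff = math.sqrt(rate_a * (1 - rate_a) / visitors_a
                        + rate_b * (1 - rate_b) / visitors_b)
    if se_diff == 0:
        return rate_a != rate_b
    return abs(rate_a - rate_b) / se_diff > z_threshold


rng = random.Random(7)  # fixed seed so the run is repeatable
visitors = 10_000
conv_a = simulate_conversions(visitors, 0.10, rng)
conv_b = simulate_conversions(visitors, 0.10, rng)

print(f"Identical variation A: {conv_a / visitors:.2%}")
print(f"Identical variation B: {conv_b / visitors:.2%}")
print("Suspicious difference:",
      difference_is_suspicious(conv_a, visitors, conv_b, visitors))
```

Keep in mind that with a cutoff like this, identical variations will still occasionally get flagged purely by chance, so one flagged run is a prompt to look closer rather than proof that something is broken.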
In upcoming posts, I’m going to describe the maths behind split testing, so stay tuned.