Type I And Type II Errors In A/B Testing And How To Avoid Them
A/B testing involves randomly splitting incoming traffic on your website between multiple variations of a particular webpage to gauge which one positively impacts your critical metrics. Pretty straightforward, right? Well, not so much. While A/B testing might sound simple, the science and math behind its operation and the computation of the results can get quite tricky.
Statistics is the cornerstone of A/B testing, and statistics deals in probabilities. Therefore, you can never be 100% sure of the accuracy of the results you receive or reduce risk to 0%. Instead, you can only increase the likelihood that the test result is true. As a test owner, you should not have to worry about the underlying math, as your tool should take care of it.
Even after following all the essential steps, your test result reports might get skewed by errors that unknowingly creep into the process. Popularly known as Type I and Type II errors, these essentially lead to an incorrect conclusion of tests and/or erroneous declaration of winner and loser. This causes misinterpretation of test result reports, which ultimately misleads your entire optimization program and could cost you conversions and even revenue.
Let’s take a closer look at what exactly we mean by Type I and Type II errors, their consequences, and how you can avoid them.
What are some of the errors that creep into your A/B test results?
Type I errors
Also known as alpha (α) errors or false positives, Type I errors occur when your test seems to succeed and your variation seems to have an impact (better or worse) on the goals defined for the test. However, the uplift or drop is, in fact, only temporary and does not last once you deploy the winning version universally and measure its impact over a significant period. This typically happens when you conclude your test before reaching statistical significance or the pre-decided criteria and rush into rejecting your null hypothesis and accepting the winning variation. The null hypothesis states that the change will have no impact on the given metric or goal. In a Type I error, the null hypothesis is true but gets rejected because of the untimely conclusion of the test or a miscalculation of the criteria for concluding it.
The probability of making a Type I error is denoted by ‘α’ and is tied to the confidence level at which you decide to conclude your test: α equals 1 minus the confidence level. This means that if you conclude your test at a 95% confidence level, you accept a 5% probability of declaring a difference when the null hypothesis is actually true. Similarly, at a 99% confidence level, that probability is 1%. You could call it sheer bad luck, but if you run into an α error even after concluding your test at a 95% confidence level, it means that an event with merely 5% probability has occurred.
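To make the link between α and the decision concrete, here is a minimal sketch of a Frequentist check on two conversion rates. It assumes a pooled two-proportion z-test with a normal approximation, which is a common textbook choice and not necessarily what any particular tool runs. The function name and the visitor and conversion counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the null hypothesis 'both rates are equal'."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


ALPHA = 0.05  # corresponds to a 95% confidence level

# Hypothetical data: control converts 200/2000, variation converts 250/2000.
p_value = two_proportion_p_value(200, 2000, 250, 2000)
reject_null = p_value < ALPHA
print(f"p-value = {p_value:.4f}, reject null hypothesis: {reject_null}")
```

Rejecting the null hypothesis whenever the p-value falls below α is exactly what caps the false positive rate at α, provided the test is evaluated once, at the planned sample size.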
Let’s assume you devise a hypothesis that shifting your landing page CTA to above the fold will lead to an increase in the number of sign-ups. The null hypothesis here is that changing the placement of the CTA would have no impact on the number of sign-ups received. Once the test commences, you get tempted to peek at the results and notice a whopping 45% uplift in sign-ups generated by the variation within a week. You are convinced that the variation is considerably better and end up concluding the test, rejecting the null hypothesis, and deploying the variation universally, only to notice that it no longer has a similar impact but instead has no impact at all. The most likely explanation is that your test result report has been skewed by a Type I error.
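The effect of peeking can be illustrated with a small simulation. The sketch below runs many A/A tests, where both arms share the same true conversion rate and the null hypothesis is therefore true by construction. It then compares two stopping rules: stopping at the first look that appears significant, and evaluating only once at the end. The traffic numbers, the 10% conversion rate, and the pooled z-test are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided threshold for 95% confidence


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False
    return (abs(conv_b - conv_a) / n) / se > Z_CRIT


def simulate_aa_tests(n_tests=400, looks=10, visitors_per_look=100,
                      rate=0.10, seed=7):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    flagged_when_peeking = 0
    flagged_at_end = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        any_look_significant = False
        significant_now = False
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            significant_now = is_significant(conv_a, conv_b, n)
            any_look_significant = any_look_significant or significant_now
        flagged_when_peeking += any_look_significant
        flagged_at_end += significant_now  # result of the last look only
    return flagged_when_peeking / n_tests, flagged_at_end / n_tests


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"False positives when stopping at any significant look: {peeking_rate:.1%}")
print(f"False positives when evaluating once at the end: {fixed_rate:.1%}")
```

Because the final look is one of the looks, the peeking rate can never be lower than the single-look rate, and in runs like this it is typically well above the nominal 5%.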
How to avoid Type I errors
While you cannot completely do away with the possibility of running into a Type I error, you can certainly reduce it. For that, make sure you conclude your tests only once they’ve reached a high enough confidence level. A 95% confidence level is the commonly used standard, and that is what you should aim to achieve. Even after reaching a 95% confidence level, your test results might still be affected by a Type I error (as discussed above). Therefore, you also need to run your tests for long enough to ensure that a good sample size has been tested, thereby increasing the credibility of your test results.
You can use VWO’s A/B testing duration calculator to determine the ideal period for which you must run a particular test. Similarly, you can also calculate your A/B testing sample size to ensure you conclude tests only when you have the lowest chance of ending up with adulterated results.
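For a rough sense of what such calculators compute, here is a minimal sketch of a standard textbook approximation for the sample size of a two-sided, two-proportion test. It is not VWO’s calculator. The function name, the baseline and expected conversion rates, and the daily traffic figure are hypothetical inputs.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, expected_rate,
                              alpha=0.05, power=0.80):
    """Approximate visitors needed in each variation (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate         # minimum effect of interest
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: 10% baseline, hoping to detect a lift to 12%.
visitors_needed = sample_size_per_variation(0.10, 0.12)

DAILY_VISITORS_PER_VARIATION = 500  # assumed traffic, for illustration only
days_needed = ceil(visitors_needed / DAILY_VISITORS_PER_VARIATION)

print(f"Visitors needed per variation: {visitors_needed}")
print(f"Approximate test duration: {days_needed} days")
```

Smaller expected lifts push the required sample size up quickly, since the effect size enters the formula squared in the denominator.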
VWO’s Bayesian model-powered statistics engine, SmartStats, helps you reduce the probability of encountering a Type I error.
Type II errors
Also known as beta (β) errors or false negatives, Type II errors occur when a particular test seems to be inconclusive or unsuccessful, with the null hypothesis appearing to be true. In reality, the variation impacts the desired goal, but the results fail to show it, and the evidence favors the null hypothesis. You, therefore, end up (incorrectly) accepting the null hypothesis and rejecting your hypothesis and variation.
Type II errors usually lead to tests being abandoned and, in the worst cases, to a loss of motivation to pursue the CRO roadmap, as one tends to disregard the effort on the assumption that it made no impact.
‘β’ denotes the probability of making a Type II error. The probability of not running into a Type II error, 1 – β, is the statistical power of the test. The higher the statistical power of your test, the lower the likelihood of encountering a Type II error. If you are running a test at 90% statistical power, there is merely a 10% chance that you might end up with a false negative.
The statistical power of a test is dependent on the statistical significance threshold, sample size, the minimum effect size of interest, and even the number of variations of the test.
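Here’s how they are related: power increases with a larger sample size and a larger minimum effect size of interest, and decreases with a stricter statistical significance threshold (a higher confidence level) or a greater number of variations.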
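These relationships can be sketched numerically. The example below uses a normal approximation for the power of a two-sided, two-proportion test, which is an assumption made for illustration and not the formula of any specific tool. The conversion rates and traffic figures are hypothetical, and the effect of adding variations is modeled only as the same total traffic being split across more arms.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_variation, n_per_arm, alpha=0.05):
    """Approximate probability of detecting a true difference (1 - beta)."""
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_variation * (1 - p_variation) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # The negligible chance of a significant result in the wrong direction
    # is ignored here.
    return NormalDist().cdf(abs(p_variation - p_control) / se - z_crit)


TOTAL_VISITORS = 8000  # hypothetical traffic available for the test

base = approx_power(0.10, 0.12, TOTAL_VISITORS // 2)            # 2 arms
more_traffic = approx_power(0.10, 0.12, TOTAL_VISITORS)         # double the sample
looser_threshold = approx_power(0.10, 0.12, TOTAL_VISITORS // 2, alpha=0.10)
bigger_effect = approx_power(0.10, 0.13, TOTAL_VISITORS // 2)   # larger lift
four_arms = approx_power(0.10, 0.12, TOTAL_VISITORS // 4)       # same traffic, 4 arms

for label, value in [("baseline", base),
                     ("larger sample", more_traffic),
                     ("90% confidence instead of 95%", looser_threshold),
                     ("larger effect size", bigger_effect),
                     ("four arms sharing the traffic", four_arms)]:
    print(f"{label}: power = {value:.2f}")
```

Only the looser-threshold scenario buys its extra power at the cost of a higher Type I error rate. The others change power without touching α.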
Let’s assume that you hypothesize that adding security badges on your payment page would help you decrease the percentage of drop-offs at that stage. You create a variation of the payment page with the security badges and run your test, only to peek at the results 10 days after its commencement. Upon noticing no change in the number of conversions or drop-offs, you decide to conclude the test and declare the null hypothesis to be true. Not convinced by the test results, you decide to rerun the test, only this time you let it run for longer. Consequently, you notice a significant improvement in your conversion goal this time around. What happened the first time was that you encountered a Type II error by concluding the test before the required time.
How to avoid Type II errors
By improving the statistical power of your tests, you can reduce the likelihood of Type II errors. You can do this by increasing your sample size and decreasing the number of variants. Statistical power can also be improved by lowering the statistical significance threshold (for example, concluding at a 90% confidence level instead of 95%), but this, in turn, increases the probability of Type I errors. Since reducing the probability of Type I errors usually takes precedence over avoiding Type II errors (as the consequences can be more severe), it is advisable not to interfere with the statistical significance threshold for the sake of improving power.
VWO SmartStats – the smarter, Bayesian way to business decision-making
Ideally, as a test owner, statistics is not something you should focus on since your quest is not to find the truth with your experiments—your motive is to make a better business decision that generates higher revenue for you. So, the important thing is to work with a tool that helps you make a better, smarter choice—without you having to get into the details of statistics.
As per the Frequentist model of inferential statistics, the conclusion of a test is entirely dependent on reaching statistical significance. If you end a test before it reaches statistical significance, you are likely to end up with a false positive (Type I error).
VWO’s Bayesian model-powered statistics engine, SmartStats, calculates the probability that a variation will beat the control as well as the potential loss that you might incur upon deploying it. VWO shows you this potential loss so you can make an informed choice.
This potential loss also helps to decide when to conclude a particular test. After the conclusion of the test, the variation is declared the winner only if the potential loss of the variation is below a certain threshold. This threshold is determined by taking into account the conversion rate of the control version, the number of visitors that were a part of the test, and a constant value.
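To illustrate the two quantities in general terms, here is a minimal Bayesian sketch. It is not SmartStats’ actual algorithm. It assumes a uniform Beta(1, 1) prior on each conversion rate and estimates both quantities by Monte Carlo sampling. The function name, the counts, and `LOSS_THRESHOLD` are hypothetical, and the threshold is a placeholder rather than the formula VWO uses.

```python
import random


def beat_probability_and_loss(conv_control, n_control,
                              conv_variation, n_variation,
                              samples=20000, seed=11):
    """Estimate P(variation > control) and the expected loss of deploying it."""
    rng = random.Random(seed)
    wins = 0
    total_loss = 0.0
    for _ in range(samples):
        # Posterior draws under the assumed Beta(1, 1) prior.
        rate_control = rng.betavariate(1 + conv_control,
                                       1 + n_control - conv_control)
        rate_variation = rng.betavariate(1 + conv_variation,
                                         1 + n_variation - conv_variation)
        wins += rate_variation > rate_control
        # Loss is the conversion rate given up if the variation is deployed
        # but the control was actually better.
        total_loss += max(rate_control - rate_variation, 0.0)
    return wins / samples, total_loss / samples


# Hypothetical data: control 100/1000, variation 150/1000.
prob_beats_control, expected_loss = beat_probability_and_loss(100, 1000, 150, 1000)

LOSS_THRESHOLD = 0.001  # placeholder value, for illustration only
declare_winner = expected_loss < LOSS_THRESHOLD

print(f"Probability variation beats control: {prob_beats_control:.3f}")
print(f"Expected loss (absolute conversion rate): {expected_loss:.5f}")
print(f"Declare variation the winner: {declare_winner}")
```

Unlike a p-value, both numbers answer questions a test owner might actually ask: how likely the variation is to be better, and how much could be lost if it is not.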
Not only does VWO SmartStats reduce your testing time by 50%, since you do not rely on reaching a set time and sample size to conclude your test, but it also gives you more control over your experiments. It gives you a clear probability, which helps you make decisions based on the type of test you are running. For instance, if you are testing a low-impact change such as a button color, a 90% probability may be good enough to call a variation a winner. If you are testing something at the last step of the funnel, you may want to wait until the probability reaches 99%. You are then in a position to increase your testing velocity by concluding low-impact tests quicker and prioritizing high-impact ones in your roadmap.
A Frequentist-based statistics model will only give you the probability of seeing a difference in variations by assuming that it is an A/A test. This approach, however, assumes that you are doing the test computation only after you have obtained a sufficient sample size. VWO SmartStats does not make this assumption and instead empowers you to make smarter business decisions by reducing the probability of running into Type I and Type II errors. This is because it estimates the probability of the variation beating the control and by how much, along with the associated potential loss, allowing you to continuously monitor these metrics while the test is running.
Since absolute certainty is out of reach with statistics, you cannot eliminate the possibility of your test results being skewed by an error. However, by choosing a robust tool like VWO, you can lower your chances of making errors or reduce the risk associated with these errors to an acceptable level. To understand more about how exactly VWO can keep you from falling prey to such errors, try out VWO’s free trial or request a demo by one of our optimization experts.