Type I and Type II Errors in A/B Testing and How to Avoid Them
A/B testing involves randomly splitting incoming traffic on your website between multiple variations of a particular webpage to gauge which one positively impacts your critical metrics. That’s fairly straightforward, right? Well, not so much. While A/B testing might sound simple, the science and math behind how it is actually conducted and how results are computed can get quite tricky.
Statistics is the cornerstone of A/B testing, and statistics is built on calculating probabilities. Therefore, you can never be 100% sure of the accuracy of the results you receive or reduce risk to 0%. Instead, you can only increase the probability of the test result being true. But as a test owner, you should not need to worry about this; your tool should take care of it.
Even after following all the essential steps, your test result reports can be skewed by some errors that unknowingly creep into the process. Popularly known as Type I and Type II errors, these essentially lead to an incorrect conclusion of tests and/or erroneous declaration of winner and loser. This causes misinterpretation of test result reports, which ultimately misleads your entire optimization program and could cost you conversions and even revenue.
Let’s take a closer look at what exactly we mean by Type I and Type II errors, their consequences, and how you can avoid them.
What are some of the errors that creep into your A/B test results?
Type I errors
Also known as Alpha (α) errors or false positives, Type I errors occur when your test seems to be succeeding and your variation seems to be causing an impact (better or worse) on the goals defined for the test. However, the uplift or drop is, in fact, only temporary and will not last once you deploy the winning version universally and measure its impact over a significant period of time. This typically happens when you conclude your test before reaching statistical significance or the pre-decided criteria, and rush into rejecting your null hypothesis and accepting the winning variation. The null hypothesis states that the change will have no impact on the given metric/goal. In the case of Type I errors, the null hypothesis is true but is rejected because of the untimely conclusion of the test or a miscalculation of the criteria for conclusion.
The probability of making a Type I error is denoted by ‘α’ and is the complement of the confidence level at which you decide to conclude your test. This means that if you conclude your test at a 95% confidence level, you are accepting a 5% probability of declaring a difference when the null hypothesis is actually true. Similarly, at a 99% confidence level, that probability is 1%. You could call it sheer bad luck, but if you run into an α error even after concluding your test at a 95% confidence level, it means that an event with merely 5% probability has occurred.
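To make the link between α and the confidence level concrete, here is a minimal sketch of a pooled two-proportion z-test, one common frequentist way to compare conversion rates. It is an illustration under stated assumptions, not the method used by any particular tool; the function names and the sign-up numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation in the data, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def is_significant(p_value, confidence_level=0.95):
    """Reject the null hypothesis when p falls below alpha = 1 - confidence."""
    alpha = 1 - confidence_level
    return p_value < alpha


# Hypothetical data: 100/1000 sign-ups on control, 130/1000 on the variation.
p = two_proportion_p_value(100, 1000, 130, 1000)
print(round(p, 4))
print(is_significant(p, 0.95), is_significant(p, 0.99))
```

With these made-up numbers, the result is significant at a 95% confidence level but not at 99%, which shows how the chosen confidence level sets the amount of false-positive risk you are willing to accept.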
Let’s assume that you devise a hypothesis that shifting your landing page CTA to above the fold will lead to an increase in the number of sign-ups. The null hypothesis here is that there would be no impact of changing the placement of the CTA on the number of sign-ups received. Once the test commences, you are tempted to peek into the results and notice a whopping 45% uplift in sign-ups generated by the variation within a week. You are convinced that the variation is considerably better and end up concluding the test, rejecting the null hypothesis, and deploying the variation universally – only to notice that it no longer has a similar impact but instead has no impact at all. This simply means that your test result report has been skewed by the Type I error.
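The effect of peeking can be illustrated with a small simulation. The sketch below runs many A/A tests (both arms share the same true conversion rate, so every "winner" is a false positive) and compares two stopping rules: flagging a result at any interim peek versus checking once at the end. All parameters (number of tests, peeks, visitors, the 10% rate) are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests=400, n_peeks=10, visitors_per_peek=200,
                      true_rate=0.10, alpha=0.05, seed=7):
    """Return (false positive rate with peeking, rate with one final check)."""
    rng = random.Random(seed)
    flagged_with_peeking = 0
    flagged_at_end = 0
    for _test in range(n_tests):
        conv_a = conv_b = visitors = 0
        flagged_at_any_peek = False
        last_p = 1.0
        for _peek in range(n_peeks):
            # Both arms convert at the same true rate: the null hypothesis holds.
            conv_a += sum(rng.random() < true_rate for _ in range(visitors_per_peek))
            conv_b += sum(rng.random() < true_rate for _ in range(visitors_per_peek))
            visitors += visitors_per_peek
            last_p = p_value(conv_a, visitors, conv_b, visitors)
            if last_p < alpha:
                flagged_at_any_peek = True
        flagged_with_peeking += flagged_at_any_peek
        flagged_at_end += last_p < alpha
    return flagged_with_peeking / n_tests, flagged_at_end / n_tests


peeking_rate, final_rate = simulate_aa_tests()
print(f"false positives when stopping at any peek: {peeking_rate:.1%}")
print(f"false positives with a single final check: {final_rate:.1%}")
```

Because the final check is also one of the peeks, the peeking rule can never produce fewer false positives than the single final check, and in practice it produces noticeably more.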
How to avoid Type I errors
While you cannot completely do away with the possibility of running into a Type I error, you can certainly reduce it. For that, make sure you conclude your tests only once they’ve reached a high enough confidence level. A 95% confidence level is the commonly accepted standard, and that is what you should aim for. Even after reaching a 95% confidence level, your test results might still be affected by a Type I error (as discussed above). Therefore, you also need to run your tests for long enough that a sufficiently large sample has been tested, thereby increasing the credibility of your test results.
VWO’s A/B testing duration calculator can be used to determine the ideal period for which you must run a particular test. Similarly, you can also calculate your A/B testing sample size to ensure you conclude tests only when you have the minimum chance of ending up with adulterated results.
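For a rough sense of what such calculators work out, here is a minimal sketch based on the textbook sample size formula for comparing two proportions. It is not VWO's calculator and may differ from it; the function names, the default 80% power, and the traffic figures are assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift,
                              confidence_level=0.95, power=0.80):
    """Visitors needed per variation to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def duration_in_days(n_per_variation, n_arms, daily_visitors):
    """Days needed when daily traffic is split evenly across all arms."""
    return ceil(n_per_variation * n_arms / daily_visitors)


# Hypothetical scenario: 10% baseline conversion rate, 10% relative lift,
# control plus one variation, 2,000 visitors entering the test per day.
n = sample_size_per_variation(0.10, 0.10)
print(n, duration_in_days(n, n_arms=2, daily_visitors=2000))
```

Note how sensitive the result is to the minimum lift you want to detect: doubling the relative lift cuts the required sample to roughly a quarter.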
VWO’s Bayesian model powered statistics engine, SmartStats, also helps you reduce the probability of encountering a Type I error. Read more about it here.
Type II errors
Also known as Beta (β) errors or false negatives, Type II errors occur when a particular test seems to be inconclusive or unsuccessful, with the null hypothesis appearing to be true. In reality, the variation has had an impact on the desired goal, but the results fail to show it, and the evidence appears to favor the null hypothesis. You, therefore, end up (incorrectly) accepting the null hypothesis and rejecting your hypothesis and variation.
Type II errors usually lead to the abandonment of tests and discouragement and, in worst cases, lack of motivation to pursue the CRO roadmap as one tends to disregard their efforts, assuming them to have made no impact.
‘β’ denotes the probability of making a Type II error. The probability of not running into a Type II error is 1 − β, which is known as the statistical power of the test. The higher the statistical power of your test, the lower the likelihood of encountering a Type II error. If you are running a test at 90% statistical power, there is merely a 10% chance that you might miss a real effect of the assumed size and end up with a false negative.
The statistical power of a test is dependent on the statistical significance threshold, sample size, the minimum effect size of interest, and even the number of variations of a test.
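Here’s how they are related: all else being equal, statistical power increases with a larger sample size, a larger minimum effect size of interest, and a lower confidence level (that is, a higher α), and it decreases as the number of variations grows.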
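One way to see these relationships concretely is a minimal sketch that computes approximate power for a two-proportion z-test. It uses the normal approximation and, as an assumption made purely for illustration, a Bonferroni correction (α divided by the number of variations compared against the control) to model the effect of extra variations. The function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_of_test(baseline_rate, relative_lift, n_per_variation,
                  confidence_level=0.95, n_variations=1):
    """Approximate power (1 - beta) to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    # Assumption: Bonferroni correction, alpha shared across the variations
    # that are each compared against the control.
    alpha = (1 - confidence_level) / n_variations
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z = ((abs(p2 - p1) * sqrt(n_per_variation)
          - z_alpha * sqrt(2 * p_bar * (1 - p_bar)))
         / sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return NormalDist().cdf(z)


base = power_of_test(0.10, 0.10, 15000)
print(f"baseline scenario:      {base:.2f}")
print(f"half the sample size:   {power_of_test(0.10, 0.10, 7500):.2f}")
print(f"double the lift:        {power_of_test(0.10, 0.20, 15000):.2f}")
print(f"90% confidence level:   {power_of_test(0.10, 0.10, 15000, 0.90):.2f}")
print(f"three variations:       {power_of_test(0.10, 0.10, 15000, n_variations=3):.2f}")
```

In a real test, adding variations usually also reduces the number of visitors each one receives, which lowers power further than this sketch shows.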
Let’s assume that you hypothesize that adding security badges on your payment page would help you decrease the percentage of drop-offs at that stage. You create a variation of the payment page with the security badges and run your test, only to peek at the results 10 days after its commencement. Upon noticing no change in the number of conversions or drop-offs, you decide to conclude the test and declare the null hypothesis to be true. Not convinced by the test results, you decide to rerun the test – only this time you let it run for longer. Consequently, you notice a significant improvement in your conversion goal this time around. What happened the first time was that you had encountered the Type II error by concluding the test before the required sample size could have been tested on.
How to avoid Type II errors
Type II errors can be avoided by improving the statistical power of your tests. This can be done by increasing your sample size and decreasing the number of variants. Statistical power can also be improved by lowering the statistical significance threshold (for example, concluding at 90% confidence instead of 95%), but this in turn increases the probability of Type I errors. Since reducing the probability of Type I errors usually takes precedence over avoiding Type II errors (as its consequences can be more severe), it is advisable not to interfere with the statistical significance threshold for the sake of improving power.
VWO SmartStats – The smarter, Bayesian way to business decision-making
Ideally, as a test owner, statistics is not something you should have to focus on. Your quest is not to find the truth with your experiments; your motive is to make a better business decision, one that generates higher revenue for you. So, the important thing is to work with a tool that helps you make the better, smarter choice without you having to get into the details of statistics.
As per the Frequentist model of inferential statistics, the conclusion of a test is dependent completely on reaching statistical significance. If you end a test before reaching statistical significance, you are likely to end up with a false positive (Type I error).
VWO’s Bayesian model powered statistics engine, SmartStats, calculates the probability that a variation will beat the control, as well as the potential loss that you might incur upon deploying it. Because VWO shows you this potential loss, you can decide with more confidence, knowing the risk involved in deploying the variation universally.
This potential loss is also used to decide when to conclude a particular test. A test is concluded, and the variation is declared the winner only once the potential loss of the variation is below a certain threshold. This threshold is determined by taking into account the conversion rate of the control version, the number of visitors that were a part of the test, and a constant value.
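To give a feel for these two quantities, here is a generic Bayesian sketch that estimates the probability of the variation beating the control and the expected loss from deploying it. It is not SmartStats' actual implementation: it assumes a uniform Beta(1, 1) prior on each conversion rate, uses Monte Carlo sampling, and takes the loss threshold as a hypothetical constant rather than deriving it from the control's conversion rate and visitor count as described above.

```python
import random


def compare_variation_to_control(conv_a, n_a, conv_b, n_b,
                                 draws=20000, seed=11):
    """Return (P(variation beats control), expected loss of deploying it)."""
    rng = random.Random(seed)
    wins = 0
    total_loss = 0.0
    for _ in range(draws):
        # Posterior of each conversion rate under a uniform Beta(1, 1) prior.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
        # Loss is the conversion rate given up when the control is actually better.
        total_loss += max(rate_a - rate_b, 0.0)
    return wins / draws, total_loss / draws


LOSS_THRESHOLD = 0.001  # hypothetical value, chosen only for this example

# Hypothetical data: 100/1000 conversions on control, 130/1000 on the variation.
prob_better, expected_loss = compare_variation_to_control(100, 1000, 130, 1000)
print(f"probability of beating control: {prob_better:.1%}")
print(f"expected loss: {expected_loss:.5f}")
print("declare winner" if expected_loss < LOSS_THRESHOLD else "keep testing")
```

The expected loss weighs how likely the variation is to be worse by how much worse it could be, which is why it can serve as a stopping criterion that reflects business risk rather than statistical significance alone.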
Not only does VWO SmartStats reduce your testing time by 50% (as you do not rely on reaching a set time and sample size to conclude your test), but it also gives you more control over your experiments. It gives you a clear probability, which helps you make decisions based on the type of test you are running. For instance, if you are testing a low-impact change such as a button color, maybe a 90% probability is good enough to call a variation a winner. Or, if you are testing something at the last step of the funnel, you may want to wait until a 99% probability. You’re then in a position to increase your testing velocity by concluding low-impact tests quicker and prioritizing high-impact ones in your roadmap.
A Frequentist-based statistics model will only give you the probability of seeing a difference at least as large as the one observed, assuming that the test is effectively an A/A test (that is, that there is no real difference between the variations). This approach, however, assumes that you do the test computation only after you have obtained a sufficient sample size. VWO SmartStats doesn’t assume this and empowers you to make smarter business decisions by reducing the probability of running into Type I and Type II errors. This is because it estimates the probability of the variation beating the control, by how much, and also the potential loss associated with it, allowing you to continuously monitor these metrics while the test is running.
You cannot completely eliminate the possibility of your test results being skewed due to an unanticipated error, as aiming for absolute certainty is extremely difficult with statistics. However, by choosing a robust tool like VWO, you can lower your chances of making errors or reduce the risk associated with these errors to an acceptable level. To understand more about how exactly VWO can keep you from falling prey to such errors, try out VWO’s 30-day free trial or request a demo by one of our optimization experts.