Statistical Significance

Digital businesses are increasingly adopting a culture of experimentation to support data-driven decisions and improve their KPIs. Before experimenting and testing new ideas, however, it is important to understand what a statistically significant result is. In a hypothesis test, a result is declared statistically significant if it is unlikely to have arisen by random chance alone and can instead be attributed to a genuine effect. With a significant result, you can be reasonably confident that the effect is real and that you did not simply get lucky (or unlucky) in choosing the sample. Whether you reject a hypothesis or fail to reject it, you can never be 100% certain of the outcome. What you can do is fix in advance the level of confidence, or equivalently the significance level, at which you are willing to make the decision.

In a statistical framework, a result is declared statistically significant if the p-value of the test (the probability of obtaining a result at least as extreme as the one observed, assuming there is no real effect) falls below the chosen significance level. In practice, a significance level α (typically 0.05, or 5%) is agreed upon beforehand, a value widely treated as an industry standard. With statistical significance, you can have a certain degree of confidence that the observed effect is real and not due to chance.

How is statistical significance used in testing

An A/B or multivariate (MVT) test is a controlled comparison of how variations perform on a chosen metric, such as conversion rate or average revenue. Before making any critical business decision, such as deploying the variation with the higher metric value, one must ensure that the results are statistically significant.

There are two schools of thought in statistics, Frequentist and Bayesian, which provide competing approaches to hypothesis testing. The Frequentist approach assumes that the metric of choice has a single fixed true value, while the Bayesian approach describes it as a distribution of possible values, each with a certain degree of belief attached.

In Frequentist statistics, the test is run against the null hypothesis (no effect is present among the variations): assuming the null hypothesis is true, one estimates the probability of obtaining a result at least as extreme as the one observed. This probability is called the p-value, and it provides a measure of evidence against the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis. Once the required sample size has been collected in the experiment, if the p-value falls below the set significance level, the observed result is declared statistically significant.
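As a rough illustration of the Frequentist procedure described above, the sketch below computes a two-sided p-value for a difference in conversion rates using a pooled two-proportion z-test. The choice of test, the function name, and the visitor and conversion counts are all assumptions made for this example; they are not taken from any particular platform.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variations share one conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both groups share a single pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a z-score at least this extreme in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


alpha = 0.05  # significance level agreed upon before the test
# Made-up data: 200/5000 conversions for control, 250/5000 for the variation.
p_value = two_proportion_p_value(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
is_significant = p_value < alpha
```

With these made-up numbers the p-value falls below 0.05, so the result would be declared statistically significant at that level.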

How VWO calculates statistical significance

VWO is one of the leading experimentation platforms that use Bayesian statistics. For every variation in a test, VWO calculates two statistics: Probability to be Best (PBB) and Potential Loss (PL).

  • PBB represents the probability that a variation outperforms every other variation. The significance threshold for PBB is 95% (1 - α).
  • PL represents the average loss one is likely to incur by deploying the variation if it turns out to perform worse than the other variations. The decision threshold for PL is called the Threshold of Caring (TOC). TOC is a critical quantity because it represents the loss that a business can afford to sustain if the recommended variation underperforms after deployment. It is estimated as: Metric Value for baseline * Mode of Certainty * 10%.

When PBB and PL both cross their respective thresholds (PBB at or above 95%, PL at or below the TOC), VWO recommends the variation as the better alternative for your business and declares the result statistically significant. Using the PL metric alongside PBB ensures that, even if a test declares a false positive, the overall impact of the error is tolerable for the business. Try VWO’s free statistical significance calculator or request a demo with our product experts to understand VWO reporting in detail.
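To make the two quantities more concrete, here is a minimal Monte Carlo sketch of how a probability to be best and a potential loss could be estimated for a conversion-rate metric. It assumes a Beta posterior with a uniform prior for each variation and treats potential loss as the expected shortfall against the best variation. This is an illustrative model only, not VWO's actual implementation; the function name `pbb_and_pl`, the data, and the `toc` value are hypothetical.

```python
import random


def pbb_and_pl(data, draws=20000, seed=0):
    """data maps name -> (conversions, visitors); returns name -> (pbb, pl)."""
    rng = random.Random(seed)
    names = list(data)
    wins = {name: 0 for name in names}
    loss = {name: 0.0 for name in names}
    for _ in range(draws):
        # One plausible conversion rate per variation, drawn from an
        # assumed Beta posterior (uniform Beta(1, 1) prior).
        sample = {
            name: rng.betavariate(1 + conv, 1 + visitors - conv)
            for name, (conv, visitors) in data.items()
        }
        best = max(sample.values())
        for name in names:
            if sample[name] == best:
                wins[name] += 1
            # Shortfall against the best variation; zero for the winner.
            loss[name] += best - sample[name]
    return {name: (wins[name] / draws, loss[name] / draws) for name in names}


# Made-up data for illustration.
data = {"control": (200, 5000), "variation": (250, 5000)}
results = pbb_and_pl(data)
pbb, pl = results["variation"]
toc = 0.0005  # hypothetical Threshold of Caring, in conversion-rate units
recommended = pbb >= 0.95 and pl <= toc
```

The decision on the last line mirrors the rule described above: the variation is recommended only when its probability to be best is high enough and its potential loss is small enough to be tolerable.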

Pitfalls of statistical significance

After observing sufficient data, it is important to check for statistical significance before drawing any insights, in order to keep Type I and Type II errors in check. However, if the experiment was run incorrectly or the assumptions of the test are violated, a statistical significance check is no longer robust and can lead to an increased error rate. Apart from drawing insights from insufficient sample data, some issues that may occur while running an experiment are:

  1. Inaccuracies in data collection
    • Statistical significance does not account for how reliable the data collection process is. If the collected data is inaccurate, a significant result can be meaningless.
  2. Issues with Randomization
    • If visitors are assigned to variations in a biased rather than truly random way, non-existent effects can appear significant.
  3. Coverage errors
    • Incorrect visitor tagging can result in multiple duplicate data points from a single visitor. This skews the collected data and can make effects seem more or less pronounced than they are.

These errors can significantly influence the insights that one draws from a statistically significant result. One needs to track both sampling and non-sampling errors before making a critical decision. Statistical significance on its own is not a robust criterion for decision-making and should be supplemented with corrections for non-sampling errors.
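One simple safeguard against the duplicate data points mentioned above is to check for repeated visitor IDs before analysing the results. The sketch below assumes each record is a tuple of (visitor_id, variation, converted); this record shape and the function name are hypothetical and would need adapting to the real data source.

```python
from collections import Counter


def deduplicate_visitors(records):
    """Keep the first record per visitor_id and report duplicated ids."""
    counts = Counter(visitor_id for visitor_id, _, _ in records)
    seen = set()
    unique = []
    for record in records:
        visitor_id = record[0]
        if visitor_id not in seen:
            seen.add(visitor_id)
            unique.append(record)
    # Map each duplicated visitor id to the number of times it appeared.
    duplicated = {vid: n for vid, n in counts.items() if n > 1}
    return unique, duplicated


# Made-up records: visitor "v2" was tagged twice.
records = [
    ("v1", "control", False),
    ("v2", "variation", True),
    ("v2", "variation", True),
    ("v3", "control", True),
]
unique, duplicated = deduplicate_visitors(records)
```

A large number of duplicated IDs is itself a warning sign that visitor tagging needs to be fixed before the test results are trusted.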

Some business concerns on statistical significance

A few concerns come up when businesses look at statistical significance, like how to

  • choose the significance level (alpha)
  • choose the appropriate statistical approach
  • contextualize statistical significance in business

Much of the academic literature uses an alpha of 0.05 as the significance level. However, there is no strong mathematical rationale for picking 0.05. The reasoning usually attributed to its originators is that, in many applications, being wrong 1 time in 20 is acceptable. When you need to be more cautious about errors, you can lower the alpha further, at the cost of running the experiment for longer.
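The tradeoff between a stricter alpha and a longer experiment can be sketched with a standard normal-approximation sample size formula for comparing two conversion rates. The 80% power, the baseline and target rates, and the function name are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variation(base_rate, target_rate, alpha=0.05, power=0.8):
    """Approximate visitors needed per variation for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = base_rate * (1 - base_rate) + target_rate * (1 - target_rate)
    effect = target_rate - base_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Detecting a lift from a 4% to a 5% conversion rate (made-up rates).
n_at_005 = visitors_per_variation(0.04, 0.05, alpha=0.05)
n_at_001 = visitors_per_variation(0.04, 0.05, alpha=0.01)
extra_visitors = n_at_001 - n_at_005
```

Under these assumptions, moving from an alpha of 0.05 to 0.01 raises the required sample per variation by roughly half, which translates directly into a longer test at the same traffic level.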

A second concern is selecting an appropriate statistical analysis method. Even after choosing between the Bayesian and Frequentist philosophies, there are numerous ways of performing a hypothesis test. The appropriate test depends on the type of data, the number of data points, and the question being asked. The chosen analysis method also helps determine how data should be collected and what sample size is required, so the statistical testing methodology must be identified during the experiment design itself. If the wrong statistical method is used in an experiment, it can produce meaningless results.

The third concern is how to use the results of the test to make a decision. In testing, statistical significance indicates whether a difference in the performance of the variations can be distinguished from chance, no matter how small that difference is. With a large enough sample, a difference of even 0.00001% can be statistically significant and still be practically meaningless for your business. On the other hand, even a result that shows no significant difference can still be useful to your company. So it makes sense to first identify what is strategically important for your business and then use the result of the statistical test to make a decision.
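The gap between statistical and practical significance can be illustrated by comparing a confidence interval for the difference in conversion rates against a minimum effect that the business actually cares about. The sketch below uses a simple unpooled normal-approximation interval; the traffic numbers and the `min_effect` threshold are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def difference_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation interval for (rate_b - rate_a)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    std_err = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err


# Made-up data: 4.00% vs 4.02% conversion with ten million visitors each.
low, high = difference_interval(400_000, 10_000_000, 402_000, 10_000_000)
min_effect = 0.002  # hypothetical smallest lift worth acting on

# Significant: the interval excludes zero.
statistically_significant = low > 0 or high < 0
# Negligible: every plausible difference is smaller than min_effect.
practically_negligible = -min_effect < low and high < min_effect
```

Here the interval excludes zero, so the difference is statistically significant, yet the whole interval sits well below the assumed minimum effect, so acting on it may not be worth the cost of deployment.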
