Statistical Significance

Digital businesses are increasingly adopting a culture of experimentation to make data-driven decisions and improve their KPIs. Before running experiments and testing new ideas, however, it is important to understand what a statistically significant result is. In a hypothesis test, a result is declared statistically significant when it is unlikely to have arisen by random chance alone and can instead be attributed to a real effect. With a significant result, you can be reasonably confident that the effect is real and that you did not simply get lucky (or unlucky) in choosing the sample. Whether you reject a hypothesis or fail to reject it, you can never be 100% certain of the outcome. What you can do is decide in advance how often you are willing to be wrong; that tolerance is the significance level.

In a statistical framework, a result is declared statistically significant if the p-value of the test (the probability of observing a result at least as extreme as the one seen, assuming there is no real effect) falls below the chosen significance level. In practice, a significance level ɑ (typically 0.05, or 5%) is agreed upon before the test, following common industry convention. With statistical significance, you can have a certain degree of confidence that the observed effect is real and not due to chance.

How is statistical significance used in testing

An A/B or multivariate (MVT) test is a controlled comparison of how variations perform on a chosen metric, such as conversion rate or average revenue. Before making a critical business decision, such as deploying the variation with the higher metric value, one must ensure that the results are statistically significant.

There are two schools of statistics, Frequentist and Bayesian, which provide competing approaches to hypothesis testing. The Frequentist approach assumes that the metric of choice has a single fixed true value, while the Bayesian approach describes it as a distribution of possible values, each with a certain degree of belief attached.

In Frequentist statistics, a test is run against the null hypothesis (no effect is present among the variations). The test estimates the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. This quantity is called the p-value, and it measures the evidence against the null hypothesis: the smaller the value, the stronger the evidence. Once the experiment has collected the required sample size, the observed result is declared statistically significant if the p-value falls below the chosen significance level.
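As a rough illustration of the Frequentist procedure, the sketch below computes a two-sided p-value for a difference in conversion rates using a pooled two-proportion z-test. The choice of test, the function name two_proportion_p_value, and the visitor and conversion counts are all assumptions made for this example; the right test depends on the metric and the experiment design.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variations share one conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Under H0 the best estimate of the common rate pools both samples.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))

alpha = 0.05  # significance level agreed before the test
# Made-up counts: 200 of 5000 visitors convert on control, 250 of 5000 on the variation.
p_value = two_proportion_p_value(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
significant = p_value < alpha
print(f"p-value = {p_value:.4f}, significant at alpha = {alpha}: {significant}")
```

With these made-up counts the p-value falls below 0.05, so the result would be declared significant at that level.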

How VWO calculates statistical significance

VWO is one of the leading experimentation platforms built on Bayesian statistics. For every variation in a test, VWO calculates two statistics: Probability to be Best (PBB) and Potential Loss (PL).

PBB represents the chance that a variation outperforms every other variation. The significance level threshold for PBB is 95% (1-ɑ).
PL represents the average loss one is likely to incur by deploying the variation if it performs worse than the other variations. The decision threshold for PL is called the Threshold of Caring (TOC). TOC is a critical quantity because it represents the loss that one’s business can afford to sustain if the recommended variation underperforms after deployment. It is estimated as Metric Value for baseline * Mode of Certainty * 10%.

When PBB and PL cross their respective thresholds, VWO recommends the variation as a better alternative for your business and declares the result statistically significant. Using the PL metric alongside PBB ensures that even if a test declares a false positive, the overall impact of the error is tolerable for the business. Try VWO’s free statistical significance calculator or request a demo with our product experts to understand VWO reporting in detail.
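To make the two Bayesian statistics concrete, here is a minimal Monte Carlo sketch for a conversion-rate metric. It is not VWO's implementation: the Beta(1, 1) prior, the function name pbb_and_potential_loss, the counts, and the threshold-of-caring value are all assumptions chosen for illustration. The decision rule also assumes that crossing the thresholds means PBB at or above 95% and PL at or below the TOC.

```python
import random

def pbb_and_potential_loss(successes, trials, draws=20000, seed=0):
    """Estimate probability-to-be-best and expected loss for each variation.

    Assumes an independent Beta(1, 1) prior on every conversion rate
    (an illustrative choice, not a documented VWO setting).
    """
    rng = random.Random(seed)
    count = len(successes)
    wins = [0] * count
    loss = [0.0] * count
    for _ in range(draws):
        # One joint draw from the posterior of each variation's rate.
        rates = [rng.betavariate(1 + s, 1 + n - s) for s, n in zip(successes, trials)]
        best = max(rates)
        wins[rates.index(best)] += 1
        for i, rate in enumerate(rates):
            # Loss is zero when variation i is the best in this draw.
            loss[i] += best - rate
    return [w / draws for w in wins], [l / draws for l in loss]

# Made-up counts: index 0 is the baseline, index 1 is the variation.
pbb, pl = pbb_and_potential_loss(successes=[200, 250], trials=[5000, 5000])
toc = 0.0005  # hypothetical threshold of caring, in absolute conversion rate
recommended = [p >= 0.95 and l <= toc for p, l in zip(pbb, pl)]
for i in range(len(pbb)):
    print(f"variation {i}: PBB = {pbb[i]:.3f}, PL = {pl[i]:.6f}, recommended = {recommended[i]}")
```

Potential loss is measured here in the same units as the metric, which is why the threshold is expressed as an absolute conversion rate.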

Pitfalls of statistical significance

Once sufficient data has been observed, it is important to confirm statistical significance before drawing any insights, in order to keep Type-1 and Type-2 errors in check. However, if the experiment was run incorrectly or the assumptions of the test are violated, a statistical significance check is no longer reliable and the error rate can increase. Besides drawing insights from too small a sample, some issues that may occur while running an experiment are:

Inaccuracies in data collection
- Statistical significance does not account for the robustness of the data collection process. If the data is collected inaccurately, a significant result can still be meaningless.
Issues with Randomization
- If visitors are not assigned to variations truly at random, the resulting bias can make non-existent effects appear significant.
Coverage errors
- Incorrect visitor tagging can result in multiple duplicate data points from a single visitor. This skews the collected data and can make effects seem more or less pronounced than they are.

These errors can significantly influence the insights that one draws from a statistically significant result. One needs to track both sampling and non-sampling errors before making a critical decision. Statistical significance on its own is not a robust criterion for decision-making and should be supplemented with corrections for non-sampling errors.
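As one small example of correcting a non-sampling error, the sketch below collapses duplicate rows from the same visitor before computing conversion rates. The row layout (visitor_id, variation, converted), the function names, and the rule of keeping the first-seen variation are hypothetical; a real pipeline would depend on how visitors are tagged.

```python
def dedupe_visitors(rows):
    """Collapse raw (visitor_id, variation, converted) rows to one entry per visitor."""
    by_visitor = {}
    for visitor_id, variation, converted in rows:
        previous = by_visitor.get(visitor_id)
        if previous is None:
            by_visitor[visitor_id] = (variation, converted)
        else:
            # Keep the first-seen variation; the visitor counts as converted
            # if any of their rows recorded a conversion.
            by_visitor[visitor_id] = (previous[0], previous[1] or converted)
    return by_visitor

def conversion_rates(pairs):
    """Conversion rate per variation from (variation, converted) pairs."""
    totals = {}
    for variation, converted in pairs:
        visitors, conversions = totals.get(variation, (0, 0))
        totals[variation] = (visitors + 1, conversions + int(converted))
    return {name: c / n for name, (n, c) in totals.items()}

raw = [
    ("u1", "control", False),
    ("u2", "control", True),
    ("u2", "control", True),   # duplicate rows from one mis-tagged visitor
    ("u2", "control", True),
    ("u3", "variation", True),
    ("u4", "variation", False),
]

naive = conversion_rates((variation, converted) for _, variation, converted in raw)
clean = conversion_rates(dedupe_visitors(raw).values())
print("naive:", naive)
print("deduplicated:", clean)
```

In this toy data the duplicates inflate the control rate, so a comparison on the raw rows would understate the variation's relative performance.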

Some business concerns on statistical significance

A few concerns come up when businesses look at statistical significance, such as how to:

choose the significance level (alpha)
choose the appropriate statistical approach
contextualize statistical significance in business

Much of the academic literature specifies an alpha of 0.05 as the significance level. However, there is no strong mathematical rationale for picking 0.05. The usual justification is that in many applications being wrong 1 time in 20 is acceptable. When you need to be more cautious about errors, you can lower alpha further, at the cost of running the experiment for longer.
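The tradeoff between alpha and experiment length can be seen with a standard sample size approximation for comparing two proportions. This is a sketch under stated assumptions: the 80% power target, the baseline rate, the minimum lift, and the function name sample_size_per_variation are all illustrative choices, not values taken from this article.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, lift_abs, alpha=0.05, power=0.8):
    """Approximate visitors needed per variation for a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline + lift_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / lift_abs ** 2)

# Hypothetical test: 4% baseline conversion rate, detect an absolute lift of 1 point.
sizes = {
    alpha: sample_size_per_variation(baseline=0.04, lift_abs=0.01, alpha=alpha)
    for alpha in (0.10, 0.05, 0.01)
}
for alpha, visitors in sizes.items():
    print(f"alpha = {alpha}: about {visitors} visitors per variation")
```

A stricter alpha raises the critical value, so the required sample, and with it the test duration at a fixed traffic level, goes up.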

A second concern is selecting an appropriate statistical analysis method. Even after choosing between the Bayesian and Frequentist philosophies, there are numerous ways to perform a hypothesis test. The appropriate test framework depends on the type of data, the number of data points, and the question being asked. The analysis method also helps determine how to collect data and what sample size is required, so it must be chosen during experiment design itself. Using the wrong statistical method in an experiment can produce meaningless results.

The third concern is how to use the results of the test to make a decision. In testing, statistical significance indicates whether there is a detectable difference (no matter how small) in the performance of the variations. With enough data, a difference of even 0.00001% can be statistically significant and still be practically meaningless for your business. On the other hand, a test that shows no significant difference can still be useful to your company. So it makes sense to first identify what is strategically important for your business and then use the result of the statistical test to make a decision.
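The gap between statistical and practical significance is easy to reproduce with a large enough sample. The sketch below assumes equal traffic per variation, a pooled z-test, and a hypothetical minimum lift (min_practical_lift) below which the business would not act; all numbers are made up.

```python
import math

def z_test_p_value(rate_a, rate_b, n_per_variation):
    """Two-sided p-value for two observed rates with equal sample sizes."""
    pooled = (rate_a + rate_b) / 2
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n_per_variation)
    return math.erfc(abs(rate_b - rate_a) / std_err / math.sqrt(2))

alpha = 0.05
min_practical_lift = 0.002      # hypothetical: smallest absolute lift worth deploying
rate_a, rate_b = 0.1000, 0.1005  # observed conversion rates, 0.05 points apart

# With ten million visitors per variation, the tiny gap is detectable.
p_value = z_test_p_value(rate_a, rate_b, n_per_variation=10_000_000)
statistically_significant = p_value < alpha
practically_meaningful = (rate_b - rate_a) >= min_practical_lift

print(f"p-value = {p_value:.5f}")
print(f"statistically significant: {statistically_significant}")
print(f"practically meaningful: {practically_meaningful}")
```

The same observed rates with only ten thousand visitors per variation would not reach significance, which shows that significance reflects sample size as much as effect size.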

Explore more Glossary terms

Test Duration

The test duration comes from a sample size calculator, which estimates the number of samples needed to detect an effect within the chosen error bounds. This calculation is performed before running an experiment so that resources can be planned and a timeline for the experiment can be set.

Test Hypothesis

A test hypothesis is a proposed explanation for a phenomenon that can be tested through experiments or observations. It serves as a tentative answer to a research question and is formulated based on prior observations, theories, or logical deductions.

Testing in Production

Testing in production is a standard practice in the modern software development cycle to ensure the release of high-quality products and improve user experience. Using this methodology, marketers can run tests to understand users’ responses to any new feature releases, while product managers can use these data to improve product experiences.

Title Tag

The title tag is an HTML element used to define the title of a web page.

Trust Badges

Trust badges are small logos or icons that convince visitors about the safety and credibility of the website.

Type-1 Error

In an A/B test, if the variations do not actually affect the tested metric differently but the test nevertheless rejects the null hypothesis and reports a statistical difference between them, the result is a Type I error, or false positive.

Type-2 Error

During the hypothesis testing process, when the competing variations affect the experiment's metric differently but the test fails to reject the null hypothesis (representing no effect), it is called a Type II error or a False Negative.

Unique Visitors

Unique visitors are internet users who access websites using a distinct IP address.

Deliver great experiences. Grow faster, starting today.
