Standard deviation is a measure of how spread out the values in a dataset are around their mean (average). A low standard deviation means most values cluster tightly near the mean. A high standard deviation indicates that values are widely scattered.
The formal definition: standard deviation is the square root of the average of the squared differences from the mean. It is expressed in the same units as the original data, which makes it far more interpretable than its close cousin, variance.
It is denoted as σ (sigma in lowercase) for a population and s for a sample.
The one-sentence intuition
On average, how far does any single data point stray from the group’s average? That distance, that typical deviation, is what standard deviation measures.
Before touching formulas, here are three experiments to make standard deviation click.
Experiment 1: Two archery teams
Imagine two archery teams, each of 5 players, each scoring an average of 80 points per round.
Team A scores: 79, 80, 80, 81, 80. They’re remarkably consistent. Standard deviation ≈ 0.6.
Team B scores: 60, 70, 80, 90, 100. Their average is identical, but their performance is wildly unpredictable. Standard deviation ≈ 14.1.
Same mean. Completely different story. Standard deviation tells you the part the average hides.
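A quick sketch of both teams using Python's standard library (`statistics.pstdev` treats the list as a full population, which fits here: each 5-player roster is the whole team):

```python
import statistics

# Each 5-player roster is the entire team, so treat it as a population
team_a = [79, 80, 80, 81, 80]
team_b = [60, 70, 80, 90, 100]

print(statistics.mean(team_a), statistics.mean(team_b))  # both 80
print(round(statistics.pstdev(team_a), 2))               # 0.63
print(round(statistics.pstdev(team_b), 2))               # 14.14
```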
Experiment 2: Two cities, same annual average temperature
San Diego, CA, and Kansas City, MO both average around 63°F (17°C) annually. But San Diego's monthly temperatures stay between 58°F and 72°F, a small standard deviation. Kansas City swings from 26°F in January to 91°F in July, a huge standard deviation.
If you’re packing for a move, the average tells you almost nothing useful. The standard deviation tells you everything.
Experiment 3: Manufacturing bolts
A factory produces bolts specified at exactly 10mm in diameter. Two machines produce bolts with an average diameter of 10mm.
Machine A has σ = 0.01mm; Machine B has σ = 0.5mm.
Machine B’s bolts will frequently fall outside acceptable tolerances and cause failures. The mean passes inspection. The standard deviation catches the defect.
The Core Insight
Means describe the center of data. Standard deviation describes the spread of the data around that center. You almost never have the full picture without both.
Here are two formulas: one for populations, one for samples, and a breakdown of every component.
Population standard deviation
Use this when you have data for every single member of the group you care about (e.g., every student in a specific classroom).
σ = √( Σ(xᵢ − μ)² / N )
where σ = population standard deviation, xᵢ = each value, μ = population mean, N = total count, and Σ = the sum over all values.
Sample standard deviation
Use this when your data is a sample drawn from a larger population. The denominator changes to n − 1 (called Bessel’s correction) to compensate for the fact that samples tend to underestimate the true spread.
s = √( Σ(xᵢ − x̄)² / (n − 1) )
where s = sample standard deviation, x̄ = sample mean, and n − 1 = Bessel’s correction.
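The only difference between the two formulas is the denominator, and Python's `statistics` module exposes both directly:

```python
import statistics

data = [60, 70, 80, 90, 100]

sigma = statistics.pstdev(data)  # population: divides by N      -> ~14.14
s = statistics.stdev(data)       # sample: divides by n - 1      -> ~15.81

print(round(sigma, 2), round(s, 2))
```

Note that the sample version is always slightly larger: dividing by a smaller denominator inflates the estimate to offset the sample's tendency to understate spread.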
Why do we square the differences?
If you simply averaged the raw differences from the mean, they’d always sum to zero; positive and negative deviations cancel out. Squaring them eliminates negative signs and amplifies large deviations (outliers get extra “weight”). Taking the square root at the end brings the result back to the original unit scale.
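A two-line check of why the squaring step is needed:

```python
data = [60, 70, 80, 90, 100]
mu = sum(data) / len(data)  # 80.0

# Raw deviations always cancel out to zero...
print(sum(x - mu for x in data))                      # 0.0

# ...but squared deviations don't, and they weight outliers more heavily
print(sum((x - mu) ** 2 for x in data) / len(data))   # 200.0 (the variance)
```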
Bessel’s Correction (N − 1): Why it matters
When you sample from a population, the values you pick tend to sit closer to the sample mean than the full population's data points sit to the true mean. Dividing by n − 1 instead of n compensates for this systematic underestimate, giving an unbiased estimator of the variance. (Its square root, the sample standard deviation, is still slightly biased, but far less so than with n.) The correction is named after the German astronomer and mathematician Friedrich Bessel (1784–1846).
Step-by-step calculation
Scenario: A teacher records the test scores of 6 students: 72, 85, 90, 68, 95, 78. These 6 students are the entire class (population), so we use σ.
- Step 1: Calculate the mean (μ): μ = (72 + 85 + 90 + 68 + 95 + 78) / 6 = 488 / 6 = 81.33
- Step 2: Subtract the mean from each score and square the result:

| Score | Score − μ | (Score − μ)² |
| --- | --- | --- |
| 72 | −9.33 | 87.11 |
| 85 | +3.67 | 13.44 |
| 90 | +8.67 | 75.11 |
| 68 | −13.33 | 177.78 |
| 95 | +13.67 | 186.78 |
| 78 | −3.33 | 11.11 |

- Step 3: Average the squared differences (divide by N = 6): sum of squared diffs = 551.33 → 551.33 / 6 = 91.89 (this is the variance)
- Step 4: Take the square root: σ = √91.89 ≈ 9.59 points
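The same four steps in Python, which you can use to check the arithmetic:

```python
import math

scores = [72, 85, 90, 68, 95, 78]

# Step 1: mean
mu = sum(scores) / len(scores)                                # ~81.33

# Steps 2-3: squared differences, then their average (the variance)
variance = sum((x - mu) ** 2 for x in scores) / len(scores)   # ~91.89

# Step 4: square root brings the result back to "points"
sigma = math.sqrt(variance)                                   # ~9.59
print(round(sigma, 2))
```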
The 68–95–99.7 empirical rule
For normally distributed data, the standard deviation lets you estimate what percentage of values falls within any given number of standard deviations of the mean.
68% of the data falls within 1σ (μ − σ to μ + σ)
95% of the data falls within 2σ (μ − 2σ to μ + 2σ)
99.7% of the data falls within 3σ (μ − 3σ to μ + 3σ)
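These percentages fall straight out of the normal CDF, which `statistics.NormalDist` (Python 3.8+) computes directly:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, sigma 1
for k in (1, 2, 3):
    coverage = z.cdf(k) - z.cdf(-k)
    print(f"within {k} sigma: {coverage:.1%}")
# within 1 sigma: 68.3%
# within 2 sigma: 95.4%
# within 3 sigma: 99.7%
```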
Standard deviation in CRO & experimentation
A/B testing lives and dies by standard deviation. Understanding it is the difference between shipping winning experiments and making expensive decisions based on statistical noise.
Why σ is the engine of every A/B test
When you run a conversion rate experiment, say, testing a new checkout button, you’re not measuring one person. You’re measuring thousands of visitors, each of whom either converts or doesn’t. Standard deviation quantifies how much natural variability exists in that behavior. That variability is the “noise” your test signal has to overcome to be trustworthy.
Every major component of A/B testing is either derived from σ or gated by it: statistical significance, confidence intervals, minimum detectable effect, and required sample size all depend on knowing, or estimating, the standard deviation of your metric.
The CRO Practitioner’s core equation
n ∝ σ² / δ²
The harder it is to detect a real change (small effect δ, high σ, noisy metric), the more traffic n you need. Standard deviation is the single biggest driver of how long your tests need to run.
Standard deviation for conversion rate metrics
For binary metrics (converted/didn’t convert), the standard deviation of a proportion is a special formula derived from the binomial distribution:
σ = √( p(1 − p) )
where p = baseline conversion rate, 1 − p = non-conversion rate, and σ = standard deviation of a single visitor’s outcome.
This matters because conversion rate variance is not constant; it peaks at p = 0.50 (maximum uncertainty) and shrinks toward zero as p approaches 0 or 1.
A page converting at 5% has σ = √(0.05 × 0.95) = 0.218. A page converting at 50% has σ = √(0.50 × 0.50) = 0.500. The 50% page is more than twice as noisy per visitor, requiring a much larger sample to detect the same absolute lift.
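The per-visitor σ for any baseline rate takes only a couple of lines:

```python
import math

def bernoulli_sigma(p):
    """Sigma of a single converted/didn't-convert observation at baseline rate p."""
    return math.sqrt(p * (1 - p))

print(round(bernoulli_sigma(0.05), 3))  # 0.218
print(round(bernoulli_sigma(0.50), 3))  # 0.5 -- the maximum, at p = 0.5
```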
Sample size: Where σ becomes money
The standard sample size formula for a two-sample proportion test is:
n = 2σ² (Zα/2 + Zβ)² / δ²
where n = visitors per variant, σ² = variance of the metric, δ = minimum detectable effect, Zα/2 = significance threshold (1.96 for 95%), and Zβ = power threshold (0.84 for 80% power).
Practically: double the variance (σ²) of your metric and you need twice the sample size, meaning twice the traffic and twice the runtime. This is why high-variability metrics like revenue-per-visitor require far more traffic than clean binary conversion rates to test reliably.
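A minimal sample-size calculator based on that formula; the 1.96/0.84 defaults correspond to 95% confidence and 80% power, and the baseline and MDE below are illustrative:

```python
import math

def visitors_per_variant(p, mde, z_alpha=1.96, z_beta=0.84):
    """n per variant for a two-sample proportion test.
    p = baseline conversion rate, mde = absolute minimum detectable effect."""
    variance = p * (1 - p)
    return math.ceil(2 * variance * (z_alpha + z_beta) ** 2 / mde ** 2)

# Detect a 1-percentage-point absolute lift on a 5% baseline
print(visitors_per_variant(0.05, 0.01))   # ~7,450 per variant
```

Halve the MDE and the required n quadruples (δ is squared in the denominator), which is why chasing tiny lifts is so expensive.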
The Revenue-Per-Visitor trap
A common CRO mistake in E-commerce experimentation:
A large e-commerce retailer ran an A/B test on a product page redesign for 2 weeks. Their primary metric was revenue per visitor (RPV). Control: $4.20 average RPV. Variant: $4.65 average RPV. The team celebrated a 10.7% lift! They shipped the variant.
What they missed: RPV had a standard deviation of $38.50, driven by occasional large orders ($500–$2,000). With σ so massive relative to the $0.45 mean difference, the test needed roughly 115,000 visitors per variant to reach 95% confidence at 80% power. They had collected 22,000. The “lift” was pure noise.
The fix: either run the test for 6 months, or switch to conversion rate as the primary metric (σ ≈ 0.22, far more tractable) and treat RPV as a guardrail metric. This is a classic case where understanding σ before launching a test would have prevented a costly false positive.
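Plugging the story's numbers into the sample-size formula shows why the test was doomed (σ = $38.50 against an MDE of $0.45), and how much cheaper the binary proxy is (the 0.5-point MDE for the conversion metric is an illustrative choice):

```python
import math

Z2 = (1.96 + 0.84) ** 2   # 95% confidence, 80% power

def n_per_variant(sigma, mde):
    return math.ceil(2 * sigma ** 2 * Z2 / mde ** 2)

print(n_per_variant(38.50, 0.45))    # RPV: ~115,000 visitors per variant
print(n_per_variant(0.218, 0.005))   # 0.5pp lift on a 5% conversion rate: ~30,000
```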
Confidence intervals and what they actually mean
When a test reports “Variant B converted at 4.8% vs Control’s 4.2%, 95% CI [+0.1%, +1.1%]” that interval is constructed entirely from standard deviation:
CI = Observed Lift ± Z × SE_diff where SE_diff = √(σ²A/nA + σ²B/nB)
A wide confidence interval is a direct symptom of high σ relative to sample size. When a CRO tool shows you a wide band of uncertainty, it’s telling you: your metric is noisy, your sample is small, or both. Shipping based on wide CIs is the single most common cause of “winning” experiments that fail to replicate post-launch.
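Reconstructing that interval from its ingredients; the 13,000-per-variant sample size is a hypothetical value chosen to roughly reproduce the quoted CI:

```python
import math

n_a = n_b = 13_000            # hypothetical visitors per variant
p_a, p_b = 0.042, 0.048       # control and variant conversion rates

se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
lift = p_b - p_a
lo, hi = lift - 1.96 * se_diff, lift + 1.96 * se_diff

print(f"lift {lift:.1%}, 95% CI [{lo:+.1%}, {hi:+.1%}]")
# lift 0.6%, 95% CI [+0.1%, +1.1%]
```

Quadruple the sample size and `se_diff` halves, so the band tightens; shrink the sample and the same observed lift becomes indistinguishable from zero.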
Variance reduction techniques: The CRO secret weapon
Because σ directly controls how much traffic you need, sophisticated experimentation teams don’t just accept metric variance; they actively reduce it. Lower σ means faster, cheaper, more reliable tests.
CUPED: The industry’s most powerful σ-reduction tool
CUPED (Controlled-experiment Using Pre-Experiment Data), developed at Microsoft, uses pre-experiment behavior to “explain away” variance in your outcome metric. If a user converted 3 times last month, their future behavior is partly predictable — that predictable part is noise you can remove.
CUPED-adjusted metric: Ŷ = Y − θ × (X − E[X]), where X is the pre-experiment covariate. In practice, CUPED routinely reduces metric variance by 30–70%, which proportionally reduces required sample size. Netflix, Booking.com, Airbnb, and Uber all use CUPED or variants of it in production. It is arguably the highest-leverage statistical technique available to CRO teams.
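A toy simulation of the idea (the data-generating numbers are invented for illustration): pre-experiment spend X partly predicts in-experiment spend Y, and subtracting the predictable part shrinks Var(Y):

```python
import random

random.seed(42)

x = [random.gauss(10, 3) for _ in range(10_000)]   # pre-experiment covariate
y = [0.7 * xi + random.gauss(5, 2) for xi in x]    # outcome metric

def mean(v): return sum(v) / len(v)
def var(v):
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)
def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

theta = cov(x, y) / var(x)                          # regression coefficient
mx = mean(x)
y_cuped = [yi - theta * (xi - mx) for xi, yi in zip(x, y)]  # Y - theta(X - E[X])

print(f"variance reduced by {1 - var(y_cuped) / var(y):.0%}")
```

The mean of `y_cuped` equals the mean of `y`, so the treatment-effect estimate is unchanged; only the noise around it shrinks.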
Other variance reduction approaches used by leading teams include stratified sampling (ensuring variants are balanced across high-variance user segments), ratio metrics (sessions-based rather than user-based aggregation), and trimming outliers in revenue metrics to cap extreme order values that inflate σ.
Novelty effect and variance over time
A frequently overlooked σ issue in CRO: standard deviation is not static over a test’s runtime. Conversion rate variance often inflates in the first few days of a test (novelty effect, returning users behave unusually when they notice a change) and then stabilizes. Teams that peek at results on Day 2 are sampling from a non-representative σ, which is why early “significant” results so often evaporate by the end of the test.
Best practice: estimate σ from historical data before launching, not from live test data. Use that pre-test σ to calculate your required sample size, commit to it, and don’t peek until you’ve reached it.
Segment-level standard deviation: The analysis goldmine
Overall test results mask enormous within-segment variance. A test might show +0% lift overall, but when you decompose by device type, new vs. returning users, or traffic source, you might find mobile users respond at +12% while desktop users respond at −8%. The σ within each segment is often far lower than the overall metric σ, because you’ve removed a source of variation (device behavior).
This is why CRO teams at companies like Booking.com run hundreds of micro-segment analyses post-test. Every unexplained source of variance in your overall metric is a potential personalization opportunity hiding in the data.
The CRO practitioner’s σ checklist
(1) Calculate your metric’s historical σ. (2) Use it to calculate the required sample size, not the runtime. (3) Check whether CUPED or covariate adjustment is available to reduce σ. (4) If using a revenue metric, consider whether σ is tractably small; if not, use a binary proxy metric instead. (5) Pre-commit to your stopping rule and don’t peek. (6) Post-test, inspect segment-level σ for personalization signals.
The Z-Score: Standard deviation’s most useful derivative
When you convert any data point to a z-score, subtracting the mean and dividing by σ, you create a universal, unitless measure of relative position. This is why you can compare an SAT score, a height measurement, and a stock return on the same scale. Z-scores are literally “number of standard deviations from the average.”
A z-score of 0 means exactly average. A z-score of +2 means 2 standard deviations above average (better than ~97.7% of the population in a normal distribution). A z-score of −1.5 means 1.5 standard deviations below average.
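A quick illustration; the SAT and height means and σs below are rough, assumed figures, not official statistics:

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    return (x - mu) / sigma

# Rough, assumed population figures for illustration only
sat = z_score(1350, mu=1050, sigma=200)   # 1.5
height = z_score(74, mu=69, sigma=3)      # ~1.67 -> the height is the rarer feat

# Share of a normal population below z = +2
print(round(NormalDist().cdf(2), 4))      # 0.9772
```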
The single most important thing to remember
Standard deviation only tells you about the spread. It says nothing about the shape of the distribution (normal? skewed?), the direction of the spread, or whether the spread is acceptable. Always pair σ with a visualization of your data, a report of sample size, and domain context. Statistics without context are noise.