Stats playground

Move the sliders to see how the Bayesian threshold-stopping rule behaves under different conditions. Setting true lift to 0 means you're looking at the false-positive rate under the null — slide the threshold from 0.95 to 0.995 and watch the realized FP rate drop from ~60% to ~5%. This is the playground version of the calibration table.

True lift (treatment vs control): +0.0 pp

0pp means the null hypothesis holds — any 'winner' is a false positive.

Posterior threshold: 0.9950

Stop when P(B > A) or P(A > B) crosses this threshold.

Min sample per variant: 20,000

No stopping before each arm has at least this many exposures.

Snapshot cadence: every 4 hours

How often the analysis worker recomputes — more frequent = more peeking opportunities.

Control rate fixed at 10%; treatment rate = 10.00%.

How the simulator works. Each run draws Bernoulli outcomes per arm at the traffic rate matched to the calibration sim (4,000 users/day total), updates the Beta-Binomial posterior at each snapshot, and checks the stopping rule. P(B > A) uses the normal approximation to the Beta posterior (Cook-Forbes, <1e-4 deviation at α+β > 1000). At 200 sims, numbers will fluctuate by about ±2 percentage points from run to run. The repo's nightly stats-correctness.yml runs the same simulation at 10,000 sims for the committed calibration table.
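The loop described above can be sketched roughly as follows. This is a minimal illustration, not the playground's actual code: the function names, the Beta(1, 1) prior, the even traffic split, and the max_days horizon are all assumptions made for the example, and the normal approximation is written as a plain moment-matched version.

```python
import math
import random


def prob_b_beats_a(succ_a, n_a, succ_b, n_b, prior=(1.0, 1.0)):
    """Normal approximation to P(B > A) for two Beta posteriors.

    The Beta(1, 1) prior is an assumption of this sketch.
    """
    def moments(succ, n):
        a = prior[0] + succ
        b = prior[1] + n - succ
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1))
        return mean, var

    mean_a, var_a = moments(succ_a, n_a)
    mean_b, var_b = moments(succ_b, n_b)
    z = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def run_once(rng, lift_pp, threshold, min_sample, cadence_hours,
             control_rate=0.10, users_per_day=4000, max_days=30):
    """Simulate one experiment; return 'A', 'B', or None if nothing stops it.

    max_days is a hypothetical horizon; an even split between arms is assumed.
    """
    rate_a = control_rate
    rate_b = control_rate + lift_pp / 100.0
    per_arm = int(users_per_day * cadence_hours / 24 / 2)  # new users per arm per snapshot
    snapshots = int(max_days * 24 / cadence_hours)
    n = succ_a = succ_b = 0
    for _ in range(snapshots):
        # Draw Bernoulli outcomes for the users who arrived since the last snapshot.
        succ_a += sum(rng.random() < rate_a for _ in range(per_arm))
        succ_b += sum(rng.random() < rate_b for _ in range(per_arm))
        n += per_arm
        if n < min_sample:
            continue  # no stopping before the minimum sample per variant
        p = prob_b_beats_a(succ_a, n, succ_b, n)
        if p >= threshold:
            return "B"
        if 1.0 - p >= threshold:
            return "A"
    return None


def declared_rate(sims, seed=0, **kwargs):
    """Fraction of runs that declare any winner (the FP rate when lift_pp is 0)."""
    rng = random.Random(seed)
    return sum(run_once(rng, **kwargs) is not None for _ in range(sims)) / sims
```

With lift_pp=0, every declared winner is a false positive, so declared_rate estimates the realized FP rate. Under a real lift, power would count only the runs that return "B". The pure-Python draws are slow at production traffic levels, so a smaller users_per_day or max_days is advisable when experimenting.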

Two scenarios worth trying

The plan's draft defaults under the null. Set true lift = 0, threshold = 0.95, min sample = 1,000, cadence = 1 hour. Run the simulation — the realized false-positive rate will land at ~60%. This is exactly the problem the Phase 0 calibration uncovered.
The calibrated tuple under a real lift. Set true lift = 2pp, threshold = 0.995, min sample = 20,000, cadence = 4 hours. Realized power should land at ~80% — the shipped defaults can detect a 20% relative lift in a 10% conversion rate, at production sample sizes, with ~5% FP under the null.
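The arithmetic behind the second scenario can be checked in a few lines. This sketch assumes traffic is split evenly between the two arms, which the page does not state explicitly; the variable names are illustrative only.

```python
control_rate = 0.10      # fixed control conversion rate
lift_pp = 2.0            # absolute lift in percentage points
users_per_day = 4000     # total traffic across both arms
min_sample = 20_000      # minimum exposures per variant
cadence_hours = 4        # snapshot cadence

# A 2 pp absolute lift on a 10% base rate, expressed as a relative lift.
relative_lift = (lift_pp / 100) / control_rate

# Days until each arm reaches the minimum sample, assuming an even split.
days_to_min_sample = min_sample / (users_per_day / 2)

# Snapshots taken per day at this cadence.
snapshots_per_day = 24 / cadence_hours

print(f"relative lift: {relative_lift:.0%}")
print(f"days to min sample: {days_to_min_sample:.0f}")
print(f"snapshots per day: {snapshots_per_day:.0f}")
```

Under these assumptions the stopping rule cannot fire for roughly the first ten days, after which it is checked six times a day.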