Bayesian inference, empirically calibrated

Splitstream's v1 decision rule is Bayesian posterior-threshold stopping with a fixed sample-size gate. Conjugate update (Beta–Binomial); stop when max(P(treatment > control), P(control > treatment)) crosses the threshold AND every variant has at least min_sample_per_variant exposures.
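The rule above can be sketched in a few lines. This is a minimal illustration, not Splitstream's implementation: the function names `prob_b_beats_a` and `should_stop` are hypothetical, the Beta(1, 1) prior is assumed, and P(B > A) is estimated by Monte Carlo draws from the two Beta posteriors rather than computed in closed form.

```python
import numpy as np


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1.0, 1.0), draws=200_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta-Binomial conjugacy."""
    rng = np.random.default_rng(seed)
    a0, b0 = prior
    # Conjugate update: Beta(a0 + conversions, b0 + non-conversions).
    post_a = rng.beta(a0 + conv_a, b0 + n_a - conv_a, size=draws)
    post_b = rng.beta(a0 + conv_b, b0 + n_b - conv_b, size=draws)
    return float(np.mean(post_b > post_a))


def should_stop(conv_a, n_a, conv_b, n_b,
                posterior_threshold=0.995, min_sample_per_variant=20_000):
    """Stop only if the sample-size gate is met AND either arm clears the threshold."""
    if min(n_a, n_b) < min_sample_per_variant:
        return False  # gate: every variant needs enough exposures first
    p = prob_b_beats_a(conv_a, n_a, conv_b, n_b)
    return max(p, 1.0 - p) > posterior_threshold
```

For example, `should_stop(2_000, 20_000, 2_300, 20_000)` evaluates the rule at 10.0% versus 11.5% observed conversion with the gate satisfied, while the same rates at 1,000 exposures per arm never stop because the gate blocks the check.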

The defaults are calibrated, not asserted

A common-but-incorrect intuition: "Bayesian inference renders optional stopping benign." It doesn't. With weak priors (Beta(1, 1)) and a fixed posterior-threshold rule P(B > A) > 0.95 checked every 15 minutes, the realized false-positive rate under repeated peeking runs 8–15%, not the 5% the threshold suggests. This is the core result of the always-valid-inference literature (Howard, Ramdas; e-values).

Phase 0 finding. We ran 360,000 null-effect simulations across a 36-cell grid (10,000 per cell) before shipping. The plan's draft defaults (posterior_threshold=0.95, min_sample=1,000, cadence=15 min) delivered a roughly 66% empirical false-positive rate under continuous peeking. No amount of marketing prose makes 0.95 safe at 15-minute polling.

The calibration table

Each cell ran 10,000 simulated null-effect experiments (two arms, equal split, true conversion rate = 10% in both). Empirical false-positive rate is the share of simulations that incorrectly declared a winner.
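One cell of such a grid can be approximated with the sketch below. It is not the code in stats/notebooks/00_calibration.py and will not reproduce the table's numbers. Several things here are assumptions: the function name `simulate_null_fp_rate` is hypothetical, cadence is modelled as a fixed number of per-variant exposures between snapshots (`exposures_per_snapshot`), the experiment is abandoned after `n_snapshots` looks, and P(B > A) is replaced by a normal approximation to the difference of the two Beta(1, 1) posteriors so the whole cell can be vectorized.

```python
from statistics import NormalDist

import numpy as np


def simulate_null_fp_rate(
    posterior_threshold,
    min_sample_per_variant,
    exposures_per_snapshot,  # assumption: per-variant traffic between snapshots
    n_snapshots,             # assumption: experiment is abandoned after this many looks
    base_rate=0.10,
    n_sims=10_000,
    seed=0,
):
    """Share of null-effect experiments that declare a winner at any snapshot."""
    rng = np.random.default_rng(seed)
    # max(P, 1 - P) > threshold  <=>  |z| > z_crit under the normal approximation.
    z_crit = NormalDist().inv_cdf(posterior_threshold)

    shape = (n_sims, n_snapshots)
    # Cumulative conversions per arm at each snapshot; both arms share base_rate.
    conv_a = rng.binomial(exposures_per_snapshot, base_rate, size=shape).cumsum(axis=1)
    conv_b = rng.binomial(exposures_per_snapshot, base_rate, size=shape).cumsum(axis=1)
    n = exposures_per_snapshot * np.arange(1, n_snapshots + 1, dtype=float)

    def beta_mean_var(conversions):
        # Mean and variance of the Beta(1 + successes, 1 + failures) posterior.
        alpha = 1.0 + conversions
        beta = 1.0 + n - conversions
        total = alpha + beta
        return alpha / total, alpha * beta / (total**2 * (total + 1.0))

    mean_a, var_a = beta_mean_var(conv_a)
    mean_b, var_b = beta_mean_var(conv_b)
    z = (mean_b - mean_a) / np.sqrt(var_a + var_b)

    # A winner is declared only at snapshots where the sample-size gate is met.
    declared = (np.abs(z) > z_crit) & (n >= min_sample_per_variant)
    return float(declared.any(axis=1).mean())
```

Comparing `simulate_null_fp_rate(0.95, 1_000, exposures_per_snapshot=250, n_snapshots=100)` with a single look, `simulate_null_fp_rate(0.95, 1_000, exposures_per_snapshot=1_000, n_snapshots=1)`, shows the effect of peeking directly: the threshold and gate are identical and only the number of looks differs.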

Posterior threshold	Min sample per variant	Snapshot cadence	Empirical FP
0.95	1,000	15 min	66.01%
0.99	5,000	60 min	15.76%
0.99	20,000	240 min	8.72%
0.995	20,000	240 min	4.77% ← shipped default
0.999	20,000	240 min	1.37%
0.9999	20,000	240 min	0.10%

The full 36-cell table lives in stats/calibration_results.json. Re-run via python stats/notebooks/00_calibration.py against pinned scipy==1.13.* and numpy==1.26.*.

What the readout means

A snapshot result of "Variant B has a 95% probability of beating control by 8.2% on conversion (95% credible interval: +2.1% to +14.3%)" is natively Bayesian. The probability is a statement about your posterior belief, not a frequentist long-run-frequency claim. Translating frequentist p-values into that statement requires hand-waving the priors back in.
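A readout of this shape can be derived from posterior draws, as in the sketch below. The function name `lift_readout` and the returned keys are hypothetical, and it assumes the "beating control by X%" figure is a relative lift (rate_B / rate_A - 1) summarized by its posterior median with an equal-tailed 95% interval. The source does not specify either choice.

```python
import numpy as np


def lift_readout(conv_a, n_a, conv_b, n_b, prior=(1.0, 1.0), draws=200_000, seed=0):
    """Posterior summary of B's relative lift over A (assumed definition: B/A - 1)."""
    rng = np.random.default_rng(seed)
    a0, b0 = prior
    post_a = rng.beta(a0 + conv_a, b0 + n_a - conv_a, size=draws)
    post_b = rng.beta(a0 + conv_b, b0 + n_b - conv_b, size=draws)
    lift = post_b / post_a - 1.0
    low, high = np.percentile(lift, [2.5, 97.5])  # equal-tailed 95% credible interval
    return {
        "p_b_beats_a": float(np.mean(lift > 0)),
        "median_lift": float(np.median(lift)),
        "ci95": (float(low), float(high)),
    }
```

Every number in the returned dictionary is a statement about the posterior given the data and the prior, which is why the readout can be phrased as a probability that B beats control.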

The calibration is what lets us advertise the readout honestly: the empirical false-positive rate under our documented peeking regime, at the shipped defaults, is ~5%. That is what makes the math defensible: not the choice of prior, not the threshold value in isolation, but the simulation that demonstrates the chosen tuple holds the Type-I error rate in the regime where customers actually use it.

If you lower the defaults

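The admin lets you override posterior_threshold, min_sample_per_variant, and snapshot_cadence_minutes per experiment. Doing so without re-running the calibration simulation forfeits the ~5% figure: the false-positive rate becomes that of the corresponding row in the table above (for example, 15.76% at 0.99 / 5,000 / 60 min). For a tuple outside the calibrated grid, the rate is unknown until the simulation is re-run.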
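A guard against uncalibrated overrides could look like the sketch below. It is purely illustrative: `CALIBRATED_FP` and `check_override` are hypothetical names, the dictionary hard-codes only the six rows shown in the table above rather than reading stats/calibration_results.json (whose schema is not described here), and the 5% ceiling is an assumed policy.

```python
# Empirical false-positive rates copied from the six table rows above.
# Key: (posterior_threshold, min_sample_per_variant, snapshot_cadence_minutes)
CALIBRATED_FP = {
    (0.95, 1_000, 15): 0.6601,
    (0.99, 5_000, 60): 0.1576,
    (0.99, 20_000, 240): 0.0872,
    (0.995, 20_000, 240): 0.0477,  # shipped default
    (0.999, 20_000, 240): 0.0137,
    (0.9999, 20_000, 240): 0.0010,
}


def check_override(posterior_threshold, min_sample_per_variant,
                   snapshot_cadence_minutes, max_fp=0.05):
    """Classify an override tuple as ok, too loose, or not calibrated at all."""
    key = (posterior_threshold, min_sample_per_variant, snapshot_cadence_minutes)
    fp = CALIBRATED_FP.get(key)
    if fp is None:
        return "uncalibrated: re-run the calibration simulation for this tuple"
    if fp > max_fp:
        return f"calibrated FP {fp:.2%} exceeds {max_fp:.0%}"
    return "ok"
```

The point of the lookup is that a tuple is judged as a whole. A stricter threshold paired with a smaller sample gate or a faster cadence is a different cell and needs its own simulated rate.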

What v2 would add. Always-valid e-value tests (Howard & Ramdas), which control Type-I error under arbitrary peeking. The contract surface already accepts decision_rule.method = "bayesian.always_valid_evalue"; v1 doesn't implement it. Sequential frequentist testing (mSPRT, the mixture sequential probability ratio test) is documented as decision_rule.method = "frequentist.sequential_msprt"; it is also deferred to v2.