Bayesian inference, empirically calibrated
Splitstream's v1 decision rule is Bayesian posterior-threshold stopping with a fixed sample-size gate. Conjugate update (Beta–Binomial); stop when max(P(treatment > control), P(control > treatment)) crosses the threshold AND every variant has at least min_sample_per_variant exposures.
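The rule above can be sketched in a few lines. This is a minimal illustration, not Splitstream's actual implementation: the function names, the Beta(1, 1) prior default, and the Monte Carlo estimate of P(treatment > control) are all assumptions made for the example, and it uses only the Python standard library.

```python
import random

def posterior_params(conversions, exposures, prior=(1.0, 1.0)):
    """Beta-Binomial conjugate update: Beta(a + successes, b + failures)."""
    a, b = prior
    return (a + conversions, b + exposures - conversions)

def prob_treatment_beats_control(control_post, treatment_post, draws=50_000, seed=0):
    """Monte Carlo estimate of P(p_treatment > p_control) from two Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        if rng.betavariate(*treatment_post) > rng.betavariate(*control_post):
            wins += 1
    return wins / draws

def should_stop(control_obs, treatment_obs,
                posterior_threshold=0.995, min_sample_per_variant=20_000):
    """Each *_obs argument is a (conversions, exposures) tuple (hypothetical shape)."""
    # Sample-size gate: every variant must have enough exposures.
    if min(control_obs[1], treatment_obs[1]) < min_sample_per_variant:
        return False
    p = prob_treatment_beats_control(
        posterior_params(*control_obs), posterior_params(*treatment_obs)
    )
    # Two-sided check: either arm may be declared the winner.
    return max(p, 1.0 - p) >= posterior_threshold
```

For example, `should_stop((100, 1_000), (200, 1_000))` returns `False` despite the large observed gap, because neither arm has cleared the sample-size gate.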
The defaults are calibrated, not asserted
A common-but-incorrect intuition: "Bayesian inference renders optional stopping benign." It doesn't. With weak priors (Beta(1, 1)) and a fixed posterior-threshold rule P(B > A) > 0.95 checked every 15 minutes, the realized false-positive rate under repeated peeking runs 8–15%, not 5%. This is the always-valid-inference literature talking (Howard, Ramdas; e-values).
The tuple (posterior_threshold=0.95, min_sample=1,000, cadence=15 min) delivered a 66.49% empirical false-positive rate under continuous peeking. No amount of marketing prose makes 0.95 safe at 15-minute polling.
The calibration table
Each cell ran 10,000 simulated null-effect experiments (two arms, equal split, true conversion rate = 10% in both). Empirical false-positive rate is the share of simulations that incorrectly declared a winner.
| Posterior threshold | Min sample per variant | Snapshot cadence | Empirical FP |
|---|---|---|---|
| 0.95 | 1,000 | 15 min | 66.01% |
| 0.99 | 5,000 | 60 min | 15.76% |
| 0.99 | 20,000 | 240 min | 8.72% |
| 0.995 | 20,000 | 240 min | 4.77% ← shipped default |
| 0.999 | 20,000 | 240 min | 1.37% |
| 0.9999 | 20,000 | 240 min | 0.10% |
The full 36-cell table lives in stats/calibration_results.json. Re-run via python stats/notebooks/00_calibration.py against pinned scipy==1.13.* and numpy==1.26.*.
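For readers who want to see the shape of one calibration cell without opening the notebook, here is a hedged sketch. It is not the contents of stats/notebooks/00_calibration.py. The parameters `exposures_per_snapshot` and `n_snapshots` are hypothetical stand-ins for traffic rate and experiment horizon, which this page does not state, and P(B > A) is computed with a normal approximation to the Beta posteriors to keep the example dependency-free.

```python
import random
from statistics import NormalDist

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1.0, 1.0)):
    """Normal approximation to P(p_B > p_A) under independent Beta posteriors."""
    def moments(conv, n):
        a, b = prior[0] + conv, prior[1] + n - conv
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1))
        return mean, var
    mean_a, var_a = moments(conv_a, n_a)
    mean_b, var_b = moments(conv_b, n_b)
    return NormalDist().cdf((mean_b - mean_a) / (var_a + var_b) ** 0.5)

def empirical_fp_rate(posterior_threshold, min_sample_per_variant,
                      exposures_per_snapshot, n_snapshots,
                      n_sims=10_000, true_rate=0.10, seed=0):
    """Share of null-effect simulations that declare a winner at any snapshot."""
    false_positives = 0
    for i in range(n_sims):
        # One RNG stream per simulation, so early stopping never shifts later runs.
        rng = random.Random(seed * 1_000_003 + i)
        conv_a = conv_b = n = 0
        for _ in range(n_snapshots):
            # Both arms share the same true rate: any declared winner is a false positive.
            conv_a += sum(rng.random() < true_rate for _ in range(exposures_per_snapshot))
            conv_b += sum(rng.random() < true_rate for _ in range(exposures_per_snapshot))
            n += exposures_per_snapshot
            if n < min_sample_per_variant:
                continue  # sample-size gate not yet cleared
            p = prob_b_beats_a(conv_a, n, conv_b, n)
            if max(p, 1.0 - p) >= posterior_threshold:
                false_positives += 1
                break
    return false_positives / n_sims
```

Calling `empirical_fp_rate(0.95, 50, 50, 20, n_sims=200)` checks the rule at all 20 snapshots, while raising `min_sample_per_variant` to 1,000 leaves only the final look. Because each simulation replays the same data stream in both settings, the peeking rate can never come out below the single-look rate.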
What the readout means
A snapshot result of "Variant B has a 95% probability of beating control by 8.2% on conversion (95% credible interval: +2.1% to +14.3%)" is natively Bayesian. The probability is a statement about your posterior belief, not a frequentist long-run-frequency claim. Translating frequentist p-values into that statement requires hand-waving the priors back in.
The calibration is what lets us advertise the readout honestly: the empirical false-positive rate under our documented peeking regime, at the shipped defaults, is ~5%. That's what makes the math defensible: not the choice of prior, not the threshold value in isolation, but the simulation that demonstrates the chosen tuple holds the Type-I error rate in the regime where customers actually use it.
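As an illustration of where the numbers in such a readout could come from, the sketch below draws from the two Beta posteriors and summarizes the relative lift. The function name, the returned keys, the Beta(1, 1) prior, and the Monte Carlo approach are assumptions for this example. The page does not document how Splitstream computes its readout.

```python
import random

def readout(conv_a, n_a, conv_b, n_b, prior=(1.0, 1.0), draws=20_000, seed=0):
    """Posterior summary for B vs. A: win probability and relative-lift interval."""
    rng = random.Random(seed)
    wins = 0
    lifts = []
    for _ in range(draws):
        p_a = rng.betavariate(prior[0] + conv_a, prior[1] + n_a - conv_a)
        p_b = rng.betavariate(prior[0] + conv_b, prior[1] + n_b - conv_b)
        wins += p_b > p_a
        lifts.append(p_b / p_a - 1.0)  # relative lift of B over A
    lifts.sort()
    return {
        "prob_b_beats_a": wins / draws,
        "median_lift": lifts[draws // 2],
        # Equal-tailed 95% credible interval from the sorted posterior draws.
        "ci95": (lifts[int(0.025 * draws)], lifts[int(0.975 * draws)]),
    }
```

Every field is a statement about the posterior given the data and the prior. None of them is a long-run error rate, which is why the calibration simulation is still needed.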
If you lower the defaults
The admin lets you override posterior_threshold, min_sample_per_variant, and snapshot_cadence_minutes per experiment. Doing so without re-running the calibration sim puts the false-positive rate back at whatever the corresponding row of the calibration table reports; at 0.95 / 1,000 / 15 min, that is 66.01%.
Always-valid e-value inference is documented as decision_rule.method = "bayesian.always_valid_evalue" and is slated for v2; v1 doesn't implement it. Sequential frequentist testing (mSPRT) is documented as decision_rule.method = "frequentist.sequential_msprt"; also v2.