A/B testing that doesn't lie about the math.
Splitstream is an experimentation platform with sticky bucketing, mutex experiment groups, Sample Ratio Mismatch detection, guardrail metrics, and a Bayesian stats engine whose decision rule is empirically calibrated, not asserted. Pair with Pennant for the full experimentation stack.
The plan's draft defaults delivered a 66% empirical false-positive rate.
Bayesian threshold-stopping at posterior_threshold=0.95 isn't safe under continuous peeking. We ran 360,000 null-effect simulations across a 36-cell grid before shipping. The calibrated tuple that Splitstream ships as the default is 0.995 / 20,000 / 240 min (posterior threshold / sample size / peek cadence), which lands at 4.77% empirical FP. The calibration is what lets the docs claim the math is right.
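To see why peeking inflates false positives, here is a minimal null-effect simulation. It is a sketch under stated assumptions, not Splitstream's stats engine or its 36-cell grid: it uses a normal approximation to Beta(1,1)-prior posteriors, a seeded PRNG, and hypothetical names (`falsePositiveRate`, `SimConfig`).

```typescript
// Hypothetical sketch: both arms share the same true rate, so every "winner" is a false positive.

// Small seeded PRNG (mulberry32) so runs are reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal CDF via the Abramowitz-Stegun erf approximation (7.1.26).
function normalCdf(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Approximate P(rate_B > rate_A) from Beta(1,1)-prior posteriors, treated as normal.
function probBBeatsA(sA: number, nA: number, sB: number, nB: number): number {
  const mA = (sA + 1) / (nA + 2);
  const mB = (sB + 1) / (nB + 2);
  const vA = (mA * (1 - mA)) / (nA + 3);
  const vB = (mB * (1 - mB)) / (nB + 3);
  return normalCdf((mB - mA) / Math.sqrt(vA + vB));
}

interface SimConfig {
  threshold: number; // posterior threshold for declaring a winner
  peeks: number;     // number of looks at the data
  perPeek: number;   // users added to each arm between looks
  trials: number;    // number of simulated null experiments
  baseRate: number;  // true conversion rate of BOTH arms
  seed: number;
}

// Fraction of null experiments that stop with a declared winner.
function falsePositiveRate(cfg: SimConfig): number {
  const rand = mulberry32(cfg.seed);
  let falsePositives = 0;
  for (let trial = 0; trial < cfg.trials; trial++) {
    let nA = 0, sA = 0, nB = 0, sB = 0;
    for (let peek = 0; peek < cfg.peeks; peek++) {
      for (let i = 0; i < cfg.perPeek; i++) {
        if (rand() < cfg.baseRate) sA++;
        if (rand() < cfg.baseRate) sB++;
      }
      nA += cfg.perPeek;
      nB += cfg.perPeek;
      const p = probBBeatsA(sA, nA, sB, nB);
      if (p > cfg.threshold || p < 1 - cfg.threshold) {
        falsePositives++;
        break; // stop at the first threshold crossing
      }
    }
  }
  return falsePositives / cfg.trials;
}
```

Comparing `peeks: 1, perPeek: 4000` against `peeks: 20, perPeek: 200` at the same threshold keeps the total sample fixed and changes only how often the data is checked.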
Five lines per stack.
Splitstream ships SDKs for TypeScript, React, PHP (Laravel-ready), and Python (Django + FastAPI adapters). The bucketing function is guarded by a shared corpus — every SDK matches byte-for-byte.
import { Splitstream } from "@philiprehberger/splitstream";
const splitstream = new Splitstream({
  apiKey: process.env.SPLITSTREAM_KEY!,
  context: { userId: "alice" },
});
const variant = await splitstream.assign("checkout-v2", { fallback: "control" });
splitstream.track("purchase-completed", { revenue: 49 });
What ships in v0.1.
Sticky-forever assignment
The first /v1/assign call writes the variant to a row in a PostgreSQL table partitioned by experiment_id. Subsequent calls for the same user return the same variant, forever.
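The first-write-wins behavior can be sketched with an in-memory map standing in for the PostgreSQL table. The class and method names here are hypothetical, not the Splitstream API.

```typescript
// Hypothetical sketch: a Map stands in for the assignment table.
class StickyAssignments {
  private rows = new Map<string, string>();

  // Returns the stored variant if one exists; otherwise stores and returns pick().
  assign(experimentId: string, userId: string, pick: () => string): string {
    const key = `${experimentId}\u0000${userId}`;
    const existing = this.rows.get(key);
    if (existing !== undefined) {
      return existing; // sticky: later calls never re-bucket
    }
    const variant = pick();
    this.rows.set(key, variant);
    return variant;
  }
}
```

Even if weights change later and `pick` would now choose differently, a unit that already has a row keeps its original variant.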
Empirically calibrated defaults
360k null-effect simulations picked the threshold/sample/cadence tuple. No hand-waved priors. The calibration JSON ships in the repo.
Mutex experiment groups
A unit assigned to one experiment in the group cannot be assigned to another. SELECT FOR UPDATE race-protects simultaneous calls; holdouts are surfaced.
Sample Ratio Mismatch detection
Chi-squared test on observed vs declared weights, every 5 minutes. p < 0.001 raises a banner — most SRM is a bug, not a real traffic skew.
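A minimal sketch of the SRM check: Pearson's chi-squared statistic on observed counts versus declared weights, compared against the critical value for p < 0.001. The function names are hypothetical, and the lookup table covers only two to five variants to keep the example short.

```typescript
// Chi-squared critical values at alpha = 0.001, keyed by degrees of freedom.
const CRITICAL_AT_0_001: Record<number, number> = {
  1: 10.828,
  2: 13.816,
  3: 16.266,
  4: 18.467,
};

// Pearson chi-squared statistic for observed counts vs declared weights.
function chiSquared(observed: number[], weights: number[]): number {
  if (observed.length !== weights.length || observed.length < 2) {
    throw new Error("need matching observed counts and weights for at least two variants");
  }
  const total = observed.reduce((sum, n) => sum + n, 0);
  const weightSum = weights.reduce((sum, w) => sum + w, 0);
  let stat = 0;
  for (let i = 0; i < observed.length; i++) {
    const expected = (total * weights[i]) / weightSum;
    stat += (observed[i] - expected) ** 2 / expected;
  }
  return stat;
}

// True when the split deviates from the declared weights at p < 0.001.
function hasSampleRatioMismatch(observed: number[], weights: number[]): boolean {
  const df = observed.length - 1;
  const critical = CRITICAL_AT_0_001[df];
  if (critical === undefined) {
    throw new Error("this sketch only tabulates 2 to 5 variants");
  }
  return chiSquared(observed, weights) > critical;
}
```

For a declared 50/50 split, 5,000 vs 5,300 users gives a statistic of about 8.7 (no banner), while 5,000 vs 5,400 gives about 15.4 (banner).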
Guardrail metrics
Each guardrail is evaluated on the lower bound of the 95% credible interval of the relative difference. If that regression bound crosses the threshold, the experiment auto-stops with reason "lost."
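One way to read this rule, sketched under loud assumptions: the guardrail metric is higher-is-better, "regression" means the relative drop of treatment against control, the two posteriors are independent and roughly normal, and the ratio's spread comes from the delta method. None of this is confirmed as Splitstream's actual computation, and all names are hypothetical.

```typescript
// Hypothetical posterior summary for a guardrail metric in one arm.
interface Posterior {
  mean: number;
  sd: number;
}

// Lower bound of the central 95% interval for the relative regression
// (1 - treatment/control), using a delta-method normal approximation.
function regressionLowerBound(control: Posterior, treatment: Posterior): number {
  const ratio = treatment.mean / control.mean;
  const relativeVariance =
    (treatment.sd / treatment.mean) ** 2 + (control.sd / control.mean) ** 2;
  const ratioSd = Math.abs(ratio) * Math.sqrt(relativeVariance);
  const regression = 1 - ratio;
  return regression - 1.96 * ratioSd;
}

// Assumed rule: trip only when even the optimistic end of the interval
// shows a regression larger than the threshold.
function guardrailTripped(control: Posterior, treatment: Posterior, threshold: number): boolean {
  return regressionLowerBound(control, treatment) > threshold;
}
```

Under this reading, a wide interval from sparse data cannot trip the guardrail by itself; the regression has to be credibly larger than the threshold.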
Peek-event audit
Every dashboard open is logged. The audit log surfaces "you peeked 12 times before calling this experiment" alongside the outcome.
Octane hot-path carve-out
/v1/assign and /v1/events run on Laravel Octane (RoadRunner) under an Apache mod_proxy split. mod_php handles low-volume admin + management routes.
Redis-buffered ingest
Synchronous endpoint, async persistence via COPY. Cross-partition dedup via SETNX with a 30-day TTL. Backpressure 429 with Retry-After.
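The dedup and backpressure behavior can be sketched in memory. Here a Map with expiry times stands in for Redis SETNX with a TTL, and an array stands in for the buffer that is later drained by COPY. The class name, the buffer limit, the one-second retry hint, and the choice to acknowledge duplicates rather than reject them are all assumptions for illustration.

```typescript
type IngestResult =
  | { accepted: true; duplicate: boolean }
  | { accepted: false; status: 429; retryAfterSeconds: number };

// Hypothetical in-memory stand-in for the Redis-buffered ingest path.
class IngestBuffer {
  private seenUntil = new Map<string, number>();
  private buffer: string[] = [];
  private ttlMs: number;
  private maxBuffered: number;
  private now: () => number;

  constructor(ttlMs: number, maxBuffered: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.maxBuffered = maxBuffered;
    this.now = now;
  }

  ingest(eventId: string): IngestResult {
    // Backpressure: refuse new work while the buffer is full.
    if (this.buffer.length >= this.maxBuffered) {
      return { accepted: false, status: 429, retryAfterSeconds: 1 };
    }
    // SETNX-like check: only the first sighting within the TTL is buffered.
    const t = this.now();
    const expiresAt = this.seenUntil.get(eventId);
    if (expiresAt !== undefined && expiresAt > t) {
      return { accepted: true, duplicate: true };
    }
    this.seenUntil.set(eventId, t + this.ttlMs);
    this.buffer.push(eventId);
    return { accepted: true, duplicate: false };
  }

  // Drains the buffer; stands in for the async bulk write.
  flush(): string[] {
    const batch = this.buffer;
    this.buffer = [];
    return batch;
  }
}
```

Injecting `now` keeps the TTL logic testable without waiting thirty days.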
Cross-SDK bucketing corpus
TS, PHP, and Python SDKs round-trip the same bucketing corpus in CI. Drift in any one implementation silently poisons experiments; the corpus catches it.
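A sketch of what a corpus-checked bucketing function can look like. The hash (32-bit FNV-1a over UTF-8 bytes), the `experimentKey:unitId` input format, the 10,000-bucket space, and every name below are assumptions for illustration; the source does not state which hash Splitstream uses.

```typescript
interface VariantWeight {
  name: string;
  weight: number;
}

interface CorpusEntry {
  experimentKey: string;
  unitId: string;
  expected: string;
}

const BUCKETS = 10000;

// 32-bit FNV-1a over the UTF-8 bytes of the input (assumed hash, easy to port).
function fnv1a32(input: string): number {
  const bytes = new TextEncoder().encode(input);
  let hash = 0x811c9dc5;
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Deterministically maps a unit to a variant according to the declared weights.
function assignVariant(experimentKey: string, unitId: string, variants: VariantWeight[]): string {
  if (variants.length === 0) {
    throw new Error("at least one variant is required");
  }
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  const bucket = fnv1a32(`${experimentKey}:${unitId}`) % BUCKETS;
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    // Compare bucket/BUCKETS against cumulative/total without dividing.
    if (bucket * total < cumulative * BUCKETS) {
      return v.name;
    }
  }
  return variants[variants.length - 1].name;
}

// Returns the corpus entries this implementation disagrees with.
function corpusMismatches(corpus: CorpusEntry[], variants: VariantWeight[]): CorpusEntry[] {
  return corpus.filter(
    (entry) => assignVariant(entry.experimentKey, entry.unitId, variants) !== entry.expected,
  );
}
```

Running `corpusMismatches` against the same fixture file in every SDK's CI is the whole trick: any non-empty result fails the build before a drifted implementation ships.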
Built by Philip Rehberger.
This is a portfolio demo, not a hosted service. The code is the artifact.