A portfolio demonstration · production-shaped, not production-grade

A/B testing that doesn't lie about the math.

Splitstream is an experimentation platform with sticky bucketing, mutex experiment groups, Sample Ratio Mismatch detection, guardrail metrics, and a Bayesian stats engine whose decision rule is empirically calibrated, not asserted. Pair with Pennant for the full experimentation stack.

Read the docs · API reference · Source on GitHub →

The plan's draft defaults delivered a 66% empirical false-positive rate.

Bayesian threshold-stopping at posterior_threshold=0.95 isn't safe under continuous peeking. We ran 360,000 null-effect simulations across a 36-cell grid before shipping. The calibrated threshold / sample / cadence tuple that Splitstream ships as the default is 0.995 / 20,000 / 240 min, which lands at a 4.77% empirical false-positive rate. The calibration is what lets the docs claim the math is right.

Original draft: 66%
Threshold 0.99: ~15%
Threshold 0.995: 4.77%
Threshold 0.999: 1.4%
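
To make the peeking problem concrete, here is a minimal null-effect simulation sketch. It is not the harness behind the numbers above: the run count is tiny, the posterior comparison uses a normal approximation to Beta(1 + successes, 1 + failures) posteriors, and every name and parameter in it (`simulateNullPeeking`, `batchPerArm`, `baseRate`, and so on) is hypothetical. Both arms share the same true conversion rate, so any "winner" is a false positive.

```typescript
// Seeded PRNG (mulberry32) so the sketch is repeatable.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation.
function normalCdf(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Assumption: normal approximation to P(rate_B > rate_A) under
// Beta(1 + s, 1 + n - s) posteriors, instead of an exact computation.
function probBBeatsA(sA: number, nA: number, sB: number, nB: number): number {
  const meanA = (sA + 1) / (nA + 2);
  const meanB = (sB + 1) / (nB + 2);
  const varA = (meanA * (1 - meanA)) / (nA + 3);
  const varB = (meanB * (1 - meanB)) / (nB + 3);
  return normalCdf((meanB - meanA) / Math.sqrt(varA + varB));
}

interface PeekOptions {
  runs: number;        // number of simulated null experiments
  peeks: number;       // how many times the dashboard is checked per run
  batchPerArm: number; // users added to each arm between peeks
  baseRate: number;    // true conversion rate, identical for both arms
  threshold: number;   // posterior threshold for declaring a winner
  seed: number;
}

function simulateNullPeeking(o: PeekOptions) {
  const rand = mulberry32(o.seed);
  let stoppedAtAnyPeek = 0;
  let decisiveAtFinalLook = 0;
  for (let run = 0; run < o.runs; run++) {
    let nA = 0, sA = 0, nB = 0, sB = 0;
    let crossed = false;
    let decisive = false;
    for (let peek = 0; peek < o.peeks; peek++) {
      for (let i = 0; i < o.batchPerArm; i++) {
        if (rand() < o.baseRate) sA++;
        if (rand() < o.baseRate) sB++;
      }
      nA += o.batchPerArm;
      nB += o.batchPerArm;
      const p = probBBeatsA(sA, nA, sB, nB);
      decisive = p > o.threshold || p < 1 - o.threshold;
      if (decisive) crossed = true; // a peeker would have stopped here
    }
    if (crossed) stoppedAtAnyPeek++;
    if (decisive) decisiveAtFinalLook++; // value from the last peek only
  }
  return {
    anyPeekRate: stoppedAtAnyPeek / o.runs,
    finalLookRate: decisiveAtFinalLook / o.runs,
  };
}

const result = simulateNullPeeking({
  runs: 200, peeks: 20, batchPerArm: 100, baseRate: 0.1, threshold: 0.95, seed: 42,
});
console.log(result);
```

`anyPeekRate` counts runs where the threshold was crossed at any peek; `finalLookRate` counts only the last look. The first can never be lower than the second, which is the mechanism that inflates false positives under continuous peeking.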

Read the calibration table →

Five lines per stack.

Splitstream ships SDKs for TypeScript, React, PHP (Laravel-ready), and Python (Django + FastAPI adapters). The bucketing function is guarded by a shared corpus — every SDK matches byte-for-byte.

import { Splitstream } from "@philiprehberger/splitstream";

const splitstream = new Splitstream({
  apiKey: process.env.SPLITSTREAM_KEY!,
  context: { userId: "alice" },
});

const variant = await splitstream.assign("checkout-v2", { fallback: "control" });
splitstream.track("purchase-completed", { revenue: 49 });

What ships in v0.1.

Sticky-forever assignment

First /v1/assign call writes the variant to a PostgreSQL row partitioned by experiment_id. Every later call for the same user returns the same variant, forever.

Empirically calibrated defaults

360k null-effect simulations picked the threshold/sample/cadence tuple. No hand-waved priors. The calibration JSON ships in the repo.

Mutex experiment groups

A unit assigned to one experiment in the group cannot be assigned to another. SELECT FOR UPDATE race-protects simultaneous calls; holdouts are surfaced.

Sample Ratio Mismatch detection

Chi-squared test on observed vs declared weights, every 5 minutes. p < 0.001 raises a banner — most SRM is a bug, not a real traffic skew.
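
The SRM check described above can be sketched as a chi-squared goodness-of-fit test. This is an illustrative sketch, not Splitstream's code: `checkSrm` is a hypothetical name, and instead of computing an exact p-value it compares the statistic against standard chi-squared critical values for p = 0.001, tabulated here for up to five variants.

```typescript
// Chi-squared critical values at p = 0.001, keyed by degrees of freedom.
// Assumption: a lookup table is used here in place of an exact p-value.
const CHI2_CRITICAL_P001: Record<number, number> = {
  1: 10.828,
  2: 13.816,
  3: 16.266,
  4: 18.467,
};

interface SrmResult {
  statistic: number;
  degreesOfFreedom: number;
  mismatch: boolean;
}

// observed: units actually assigned to each variant
// weights: declared traffic weights, in the same order (any positive scale)
function checkSrm(observed: number[], weights: number[]): SrmResult {
  if (observed.length !== weights.length || observed.length < 2) {
    throw new Error("need matching observed counts and weights for 2+ variants");
  }
  const total = observed.reduce((sum, n) => sum + n, 0);
  const weightSum = weights.reduce((sum, w) => sum + w, 0);

  let statistic = 0;
  for (let i = 0; i < observed.length; i++) {
    const expected = (total * weights[i]) / weightSum;
    statistic += (observed[i] - expected) ** 2 / expected;
  }

  const degreesOfFreedom = observed.length - 1;
  const critical = CHI2_CRITICAL_P001[degreesOfFreedom];
  if (critical === undefined) {
    throw new Error("no tabulated critical value for this many variants");
  }
  return { statistic, degreesOfFreedom, mismatch: statistic > critical };
}

// A declared 50/50 split that arrives as 5,300 vs 4,700 is flagged.
console.log(checkSrm([5300, 4700], [0.5, 0.5]));
```

For the 5,300 / 4,700 example the statistic is 36, well above the 10.828 cutoff for one degree of freedom, so the banner would be raised.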

Guardrail metrics

Lower bound on the 95% credible interval of the relative difference. If the regression bound crosses the threshold, the experiment auto-stops with reason "lost."

Peek-event audit

Every dashboard open is logged. The audit log surfaces "you peeked 12 times before calling this experiment" alongside the outcome.

Octane hot-path carve-out

/v1/assign and /v1/events run on Laravel Octane (RoadRunner) under an Apache mod_proxy split. mod_php handles low-volume admin + management routes.

Redis-buffered ingest

Synchronous endpoint, async persistence via COPY. Cross-partition dedup via SETNX with a 30-day TTL. Backpressure returns HTTP 429 with a Retry-After header.
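
The set-if-absent dedup can be illustrated with an in-memory stand-in. This sketch does not talk to Redis: `InMemorySetNx`, `acceptEvent`, and the `dedup:` key prefix are hypothetical, and the class only mimics the semantics of a Redis `SET key value NX EX 2592000` (30 days in seconds).

```typescript
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// In-memory stand-in for Redis set-if-absent with a TTL.
class InMemorySetNx {
  private expiries = new Map<string, number>();

  // Returns true if the key was absent (or expired) and is now claimed.
  setIfAbsent(key: string, ttlMs: number, nowMs: number): boolean {
    const expiresAt = this.expiries.get(key);
    if (expiresAt !== undefined && expiresAt > nowMs) {
      return false; // still held: this is a duplicate
    }
    this.expiries.set(key, nowMs + ttlMs);
    return true;
  }
}

// Accept an event only the first time its ID is seen within the TTL window.
function acceptEvent(store: InMemorySetNx, eventId: string, nowMs: number): boolean {
  return store.setIfAbsent(`dedup:${eventId}`, THIRTY_DAYS_MS, nowMs);
}

const store = new InMemorySetNx();
console.log(acceptEvent(store, "evt-1", 0)); // first sighting is accepted
console.log(acceptEvent(store, "evt-1", 1000)); // replay inside the window is rejected
```

Because the key is independent of which partition the event lands in, a replayed event is rejected no matter where the original was written.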

Cross-SDK bucketing corpus

TS, PHP, and Python SDKs round-trip the same bucketing corpus in CI. Drift between any two implementations silently poisons experiments; the corpus catches it.
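
To show why byte-level agreement matters, here is a sketch of deterministic bucketing. The page does not name the hash Splitstream uses, so FNV-1a (32-bit) over UTF-8 bytes is an assumption chosen for illustration, as are the `experimentKey:unitId` input format, the 10,000-bucket space, and the names `bucketFor` and `pickVariant`.

```typescript
// FNV-1a 32-bit over UTF-8 bytes. Hashing bytes (not UTF-16 code units)
// is what lets other languages reproduce the same value.
function fnv1a32(input: string): number {
  const bytes = new TextEncoder().encode(input);
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// Assumption: experiment key and unit ID joined with ":" form the hash input.
function bucketFor(experimentKey: string, unitId: string, buckets = 10000): number {
  return fnv1a32(`${experimentKey}:${unitId}`) % buckets;
}

interface Variant {
  name: string;
  weight: number;
}

// Map a bucket to a variant by walking cumulative weights.
function pickVariant(bucket: number, variants: Variant[], buckets = 10000): string {
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (bucket < (cumulative / total) * buckets) {
      return v.name;
    }
  }
  return variants[variants.length - 1].name; // guard against rounding at the edge
}

const variants = [
  { name: "control", weight: 50 },
  { name: "treatment", weight: 50 },
];
const bucket = bucketFor("checkout-v2", "alice");
console.log(bucket, pickVariant(bucket, variants));
```

A shared corpus would then be a list of inputs with their expected buckets that each SDK's test suite replays. A single disagreement, for example from hashing UTF-16 code units instead of UTF-8 bytes, means the same user could land in different variants depending on which SDK served them.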

Built by Philip Rehberger.

This is a portfolio demo, not a hosted service. The code is the artifact.

About Philip · GitHub repo