Forge Benchmark Board
Open methodology. Reproducible. Including the cases where we lose.
Kairon Forge is defensible in 2 of 9 layers: the intel-graph data accumulation and the 6-hook deterministic router composition. Everything else is commodity B2B hygiene. These benchmarks publish the wins and the losses alike.
Cycle source: os/research/compact-api-kernel-diff-2026-05-20.md
Bench #1 — Calibration Brier vs Polymarket midpoint
Brier score of a leave-one-out calibration curve, compared against the raw Polymarket midpoint and a constant 50/50 baseline. Corpus: n=30 stub (v0.1) mimicking favourite-longshot bias. 95% CI from a 1000-iteration bootstrap.
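For readers unfamiliar with the metric, the Brier score is the mean squared difference between a forecast probability and the binary outcome (lower is better). The sketch below is illustrative only: the `Forecast` type and the function names are hypothetical and are not taken from scripts/bench-calibration-brier.ts.

```typescript
// Hypothetical shape for one resolved market: a forecast probability and its outcome.
interface Forecast {
  p: number;       // forecast probability of "yes", in [0, 1]
  outcome: 0 | 1;  // resolved outcome
}

// Brier score: mean of (p - outcome)^2 over all samples.
function brierScore(samples: Forecast[]): number {
  if (samples.length === 0) throw new Error("brierScore: empty sample");
  let sum = 0;
  for (const { p, outcome } of samples) {
    sum += (p - outcome) ** 2;
  }
  return sum / samples.length;
}

// Constant 50/50 baseline: every forecast is replaced by p = 0.5.
function constantBaselineBrier(samples: Forecast[]): number {
  return brierScore(samples.map((s) => ({ p: 0.5, outcome: s.outcome })));
}
```

A constant 50/50 forecast always scores exactly 0.25 regardless of outcomes, which is why it serves as a floor: any forecaster scoring above 0.25 is doing worse than not forecasting at all.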
Bench #3 — Audit replay determinism
100 agent decisions × 10 replays = 1000 trials per sub-test. Wilson 95% CI over the binomial sample. Pure router + seeded LLM stub replay at 100%; the free-temperature control is a negative control, showing that the harness detects non-determinism when it exists.
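The Wilson score interval can be computed directly from the count of matching replays. This is a minimal sketch of the standard formula; the function name and return shape are assumptions, not the API of scripts/bench-audit-determinism.ts.

```typescript
// Wilson score interval for a binomial proportion.
// z defaults to the two-sided 95% normal quantile.
function wilsonInterval(
  successes: number,
  trials: number,
  z = 1.959964,
): { lower: number; upper: number } {
  if (trials <= 0) throw new Error("wilsonInterval: trials must be positive");
  const pHat = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  // Centre of the interval is pulled towards 0.5 relative to pHat.
  const centre = (pHat + z2 / (2 * trials)) / denom;
  const halfWidth =
    (z / denom) *
    Math.sqrt((pHat * (1 - pHat)) / trials + z2 / (4 * trials * trials));
  return {
    lower: Math.max(0, centre - halfWidth),
    upper: Math.min(1, centre + halfWidth),
  };
}

// 1000 identical replays out of 1000 trials.
const perfect = wilsonInterval(1000, 1000);
console.log(perfect.lower.toFixed(4), perfect.upper.toFixed(4));
```

Unlike the naive normal interval, the Wilson interval does not collapse to zero width at 100%: with 1000 of 1000 matching replays the lower bound is about 0.9962, so a perfect score still reports honest residual uncertainty.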
Bench #2 — Cost per task (4-tier matrix)
Per-task USD cost across 4 delivery tiers on a 30-scenario corpus balancing 5 regime profiles (whale_dominant / meme_volatile / macro_anchored / panic_liquidation / dead_liquidity). Cost-aware router escalates to Tier-2 only on anomaly. 1000-iter bootstrap 95% CI on the median. Price-catalog estimates, no real API calls.
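The cost arithmetic can be sketched as below, using the fixed 800-input / 300-output token assumption stated under Limitations. Everything else is hypothetical: the type and function names, the placeholder prices (which are not the list prices used by the benchmark), and the assumption that an escalated task pays for the base-tier call plus one Tier-2 call.

```typescript
// Hypothetical price entry, in USD per million tokens.
interface TierPrice {
  inputUsdPerMTok: number;
  outputUsdPerMTok: number;
}

// Fixed per-inference token assumption from the methodology.
const INPUT_TOKENS = 800;
const OUTPUT_TOKENS = 300;

// Cost of a single inference at a given tier.
function inferenceCostUsd(price: TierPrice): number {
  return (
    (INPUT_TOKENS * price.inputUsdPerMTok +
      OUTPUT_TOKENS * price.outputUsdPerMTok) /
    1_000_000
  );
}

// Assumption: on anomaly the router pays for the base call AND a Tier-2 call.
function taskCostUsd(
  base: TierPrice,
  tier2: TierPrice,
  anomaly: boolean,
): number {
  const baseCost = inferenceCostUsd(base);
  return anomaly ? baseCost + inferenceCostUsd(tier2) : baseCost;
}

// Placeholder prices for illustration only.
const cheap: TierPrice = { inputUsdPerMTok: 1, outputUsdPerMTok: 5 };
const premium: TierPrice = { inputUsdPerMTok: 3, outputUsdPerMTok: 15 };
console.log(taskCostUsd(cheap, premium, false), taskCostUsd(cheap, premium, true));
```

Under this model the median cost per task is driven mostly by how often the router flags an anomaly, since non-escalated tasks pay only the base-tier price.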
Methodology
- Samples: Bench #1 n=30 (stub_v0_1). Bench #2 n=30 scenarios across 5 regime profiles (6 per regime). Bench #3 n=100 decisions x 10 replays = 1000 trials per sub-test.
- Confidence intervals: Bench #1 and Bench #2 use a nonparametric bootstrap (1000 iterations), seeded for reproducibility. Bench #3 uses a Wilson 95% CI over the binomial sample.
- PRNG: mulberry32 seeded (Bench #1 seed 12648430; Bench #2 seed 49239; Bench #3 seed 168630254).
- License: CC-BY-4.0 across JSON + CSV. Reproduce on your own machine without permission.
- Reproduce: npx tsx scripts/bench-calibration-brier.ts, npx tsx scripts/bench-cost-per-task.ts, and npx tsx scripts/bench-audit-determinism.ts.
- Limitations: v0.1 calibration corpus is a deterministic 30-sample stub mimicking favourite-longshot bias; real-market backfill target Q3 2026. Bench #2 uses Anthropic 2026-05 list-price estimates and a fixed 800-input / 300-output token assumption per inference; real-workload measurement target Q3 2026. Bench #3 uses a deterministic LLM stub calibrated against Anthropic temperature=0 + seeded inference; live real-LLM replay determinism target Q3 2026 once provider seed plumbing matures.
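The seeded bootstrap described in the methodology can be sketched as follows. `mulberry32` is the widely circulated 32-bit PRNG of that name; the percentile method for the interval, and the names `median` and `bootstrapCI`, are assumptions, since the page does not say which bootstrap interval the scripts compute.

```typescript
// mulberry32: small seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function median(xs: number[]): number {
  const s = [...xs].sort((x, y) => x - y);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Percentile bootstrap CI (assumed method): resample with replacement,
// recompute the statistic, and read off the alpha/2 and 1 - alpha/2 quantiles.
function bootstrapCI(
  data: number[],
  stat: (xs: number[]) => number,
  seed: number,
  iterations = 1000,
  alpha = 0.05,
): { lower: number; upper: number } {
  if (data.length === 0) throw new Error("bootstrapCI: empty data");
  const rand = mulberry32(seed);
  const stats: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const resample = data.map(() => data[Math.floor(rand() * data.length)]);
    stats.push(stat(resample));
  }
  stats.sort((x, y) => x - y);
  const loIdx = Math.floor((alpha / 2) * iterations);
  const hiIdx = Math.min(iterations - 1, Math.ceil((1 - alpha / 2) * iterations) - 1);
  return { lower: stats[loIdx], upper: stats[hiIdx] };
}

// Made-up sample values; 49239 is the Bench #2 seed listed above.
const sample = [0.002, 0.003, 0.0025, 0.009, 0.0021, 0.0024];
console.log(bootstrapCI(sample, median, 49239));
```

Because the generator is seeded, running the same call twice yields the same interval, which is the property that makes the published confidence intervals reproducible on another machine.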