Forge Benchmark Board

Open methodology. Reproducible. Including the cases where we lose.

greymoth-jp/kairon · Apache-2.0 licensed kernel

Last reviewed: 2026-05-20 · License: CC-BY-4.0

Honest moat

Kairon Forge is defensible in 2 of 9 layers: intel-graph data accumulation and the 6-hook deterministic router composition. Everything else is commodity B2B hygiene. These benchmarks publish both the wins and the losses.

Cycle source: os/research/compact-api-kernel-diff-2026-05-20.md

Bench #1 — Calibration Brier vs Polymarket midpoint

Compares the Brier score of a leave-one-out calibration curve against the raw Polymarket midpoint and a constant 50/50 baseline. Corpus: n=30 stub (v0.1) mimicking favourite-longshot bias. Intervals: 1000-iteration bootstrap 95% CI.

Kairon-calibrated: 0.2506 · 95% CI [0.166, 0.348]

Polymarket midpoint (baseline competitor): 0.1937 · 95% CI [0.143, 0.250]

Naive 0.5 (scoring floor): 0.2500 · 95% CI [0.250, 0.250]

Lower is better (Brier).

Negative result (Rule B Mandatory)

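Kairon underperforms the Polymarket midpoint by 0.0569 Brier (0.2506 vs 0.1937) on this v0.1 stub and loses 15 of 30 markets. It also scores marginally worse than the naive 0.5 baseline (0.2500). The calibration over-corrects on favourite-longshot biased data. Published as-is.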
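For readers unfamiliar with the metric, the following is a minimal sketch of a Brier score computation. It is not the code in scripts/bench-calibration-brier.ts; the function name and the toy forecasts are hypothetical, and the leave-one-out calibration step is not reproduced. It also shows why a constant 0.5 forecast always scores exactly 0.25, which is why the naive baseline has a zero-width interval.

```typescript
// Brier score: mean squared difference between forecast probabilities
// and binary outcomes (1 = event happened, 0 = it did not). Lower is better.
function brierScore(forecasts: number[], outcomes: number[]): number {
  if (forecasts.length !== outcomes.length || forecasts.length === 0) {
    throw new Error("forecasts and outcomes must be non-empty and equal length");
  }
  let sum = 0;
  for (let i = 0; i < forecasts.length; i++) {
    const diff = forecasts[i] - outcomes[i];
    sum += diff * diff;
  }
  return sum / forecasts.length;
}

// Hypothetical toy data for illustration only; NOT the stub_v0_1 corpus.
const outcomes = [1, 0, 1, 1, 0];
const midpointForecasts = [0.8, 0.3, 0.6, 0.9, 0.2];
const naiveForecasts = outcomes.map(() => 0.5);

const midpointBrier = brierScore(midpointForecasts, outcomes);
const naiveBrier = brierScore(naiveForecasts, outcomes); // (0.5)^2 for every market

// The reported gap is plain subtraction of the two published scores.
const reportedGap = (0.2506 - 0.1937).toFixed(4);

console.log(midpointBrier.toFixed(3), naiveBrier.toFixed(3), reportedGap);
```

Any forecast of 0.5 is off by exactly 0.5 whichever way the market resolves, so every resample of the corpus yields 0.25 for the naive baseline.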

Bench #3 — Audit replay determinism

100 agent decisions x 10 replays = 1000 trials per sub-test, with a Wilson 95% CI over the binomial sample. The pure router and the seeded LLM stub both replay byte-equal in 100% of trials. The free-temperature control is a negative control: it shows that the harness detects non-determinism when it exists.
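The replay check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the harness in scripts/bench-audit-determinism.ts: `pureRouter`, `Decision`, and `countByteEqualReplays` are hypothetical names, the router is a toy hash-based function, and the negative control here is a simple drifting counter rather than a free-temperature LLM.

```typescript
// Hypothetical decision shape; the real audit record is not shown in the source.
type Decision = { route: string; score: number };

// Toy pure router: output depends only on the input (FNV-1a style hash).
function pureRouter(input: string): Decision {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return { route: h % 2 === 0 ? "tier1" : "tier2", score: h % 100 };
}

// Serialise a reference decision, replay it `replays` times, and count
// how many replays are byte-for-byte identical to the reference.
function countByteEqualReplays(
  inputs: string[],
  replays: number,
  decide: (input: string) => Decision,
): { equal: number; trials: number } {
  const encoder = new TextEncoder();
  let equal = 0;
  let trials = 0;
  for (const input of inputs) {
    const reference = encoder.encode(JSON.stringify(decide(input)));
    for (let r = 0; r < replays; r++) {
      const replay = encoder.encode(JSON.stringify(decide(input)));
      const same =
        replay.length === reference.length &&
        replay.every((byte, i) => byte === reference[i]);
      if (same) equal++;
      trials++;
    }
  }
  return { equal, trials };
}

const inputs = Array.from({ length: 100 }, (_, i) => `decision-${i}`);
const pureResult = countByteEqualReplays(inputs, 10, pureRouter);

// Stand-in negative control: a counter makes every call differ.
let counter = 0;
const driftResult = countByteEqualReplays(inputs, 10, (input) => ({
  ...pureRouter(input),
  score: counter++,
}));

console.log(pureResult, driftResult);
```

A harness that reported 1000/1000 for the drifting control as well would be unable to detect non-determinism, which is the point of publishing a negative control next to the passing results.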

Pure router: 100.00% · Wilson 95% CI [0.996, 1.000] · 1000/1000 byte-equal

Seeded LLM stub: 100.00% · Wilson 95% CI [0.996, 1.000] · 1000/1000 byte-equal

Free-temp control: 10.00% · Wilson 95% CI [0.083, 0.120] · Negative control
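The Wilson intervals above can be checked with the standard Wilson score formula. The sketch below is a minimal implementation, assuming z ≈ 1.96 for a 95% level; the function name is hypothetical and this is not necessarily how the benchmark script computes it.

```typescript
// Wilson score interval for a binomial proportion.
// Returns bounds as proportions in [0, 1], matching the board's CI notation.
function wilsonInterval(
  successes: number,
  trials: number,
  z = 1.959963984540054, // two-sided 95% normal quantile
): { lower: number; upper: number } {
  if (trials <= 0) throw new Error("trials must be positive");
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return {
    lower: Math.max(0, center - half),
    upper: Math.min(1, center + half),
  };
}

const allEqual = wilsonInterval(1000, 1000);
const control = wilsonInterval(100, 1000);

console.log(allEqual.lower.toFixed(3), allEqual.upper.toFixed(3)); // 0.996 1.000
console.log(control.lower.toFixed(3), control.upper.toFixed(3)); // 0.083 0.120
```

Unlike the plain normal approximation, the Wilson interval stays informative at 1000/1000: the lower bound is 1000 / (1000 + z²) ≈ 0.996 rather than collapsing to a zero-width interval at 1.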

Bench #2 — Cost per task (4-tier matrix)

Per-task USD cost across 4 delivery tiers on a 30-scenario corpus balanced across 5 regime profiles (whale_dominant / meme_volatile / macro_anchored / panic_liquidation / dead_liquidity). The cost-aware router escalates to Tier-2 only on anomaly. 1000-iteration bootstrap 95% CI on the median. Costs are price-catalog estimates; no real API calls were made.

Kairon Free: $0.0000 · Median (Ollama-only) · Zero variable cost · quality trade-off accepted

Kairon Pro: $0.0023 · 95% CI [$0.0000, $0.0023] · Mean $0.0014 · 18/30 paid calls

Kairon Elite: $0.0069 · 95% CI [$0.0000, $0.0069] · Mean $0.0041 · 0.60x naive baseline
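The arithmetic behind these figures can be sketched as follows. The 800-input / 300-output token assumption and the 18/30 paid-call count come from this page. The per-million-token prices are assumptions: the page does not list them, and the values below were chosen only because they reproduce the reported medians. Applying the same 18/30 escalation count to the Elite tier is likewise an assumption, consistent with the reported 0.60x figure. All names are hypothetical.

```typescript
type Price = { inputPerMTok: number; outputPerMTok: number };

// Cost of one inference under the fixed token assumption stated in the methodology.
function costPerInference(price: Price, inputTokens = 800, outputTokens = 300): number {
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1_000_000;
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

function mean(values: number[]): number {
  return values.reduce((acc, v) => acc + v, 0) / values.length;
}

// `paid` scenarios escalate to one paid call; the rest stay on the free path.
function scenarioCosts(price: Price, paid: number, total = 30): number[] {
  const unit = costPerInference(price);
  return Array.from({ length: total }, (_, i) => (i < paid ? unit : 0));
}

// ASSUMED prices in USD per million tokens; not stated in the source.
const proPrice: Price = { inputPerMTok: 1, outputPerMTok: 5 };
const elitePrice: Price = { inputPerMTok: 3, outputPerMTok: 15 };

const proCosts = scenarioCosts(proPrice, 18);
const eliteCosts = scenarioCosts(elitePrice, 18); // 18/30 assumed for Elite

console.log(median(proCosts).toFixed(4), mean(proCosts).toFixed(4)); // 0.0023 0.0014
console.log(median(eliteCosts).toFixed(4), mean(eliteCosts).toFixed(4)); // 0.0069 0.0041
```

Because more than half of the scenarios escalate, the median sits at the full per-call price while the mean is pulled down by the 12 zero-cost scenarios. That would also explain a confidence interval on the median that spans from $0.0000 up to the per-call price.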

Methodology

Samples: Bench #1 n=30 (stub_v0_1). Bench #2 n=30 scenarios across 5 regime profiles (6 per regime). Bench #3 n=100 decisions x 10 replays = 1000 trials per sub-test.
Confidence intervals: Bench #1 and Bench #2 use a nonparametric bootstrap (1000 iterations), seeded for reproducibility. Bench #3 uses a Wilson 95% CI over the binomial sample.
PRNG: mulberry32 seeded (Bench #1 seed 12648430; Bench #2 seed 49239; Bench #3 seed 168630254).
License: CC-BY-4.0 across JSON + CSV. You can reproduce the results on your own machine; no permission is required.
Reproduce: npx tsx scripts/bench-calibration-brier.ts, npx tsx scripts/bench-cost-per-task.ts, and npx tsx scripts/bench-audit-determinism.ts.
Limitations: The v0.1 calibration corpus is a deterministic 30-sample stub mimicking favourite-longshot bias; a real-market backfill is targeted for Q3 2026. Bench #2 uses Anthropic 2026-05 list-price estimates and a fixed 800-input / 300-output token assumption per inference; real-workload measurement is targeted for Q3 2026. Bench #3 uses a deterministic LLM stub calibrated against Anthropic temperature=0 + seeded inference; live real-LLM replay determinism is targeted for Q3 2026, once provider seed plumbing matures.
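As an illustration of the seeded bootstrap described in the methodology, here is a minimal sketch that pairs the widely circulated mulberry32 generator with a percentile bootstrap. The percentile method, the function names, and the toy sample are assumptions; the page does not say which bootstrap interval the scripts use.

```typescript
// mulberry32: small 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap CI for an arbitrary statistic (ASSUMED method).
function bootstrapCI(
  sample: number[],
  statistic: (xs: number[]) => number,
  seed: number,
  iterations = 1000,
  level = 0.95,
): { lower: number; upper: number } {
  const rand = mulberry32(seed);
  const stats: number[] = [];
  for (let it = 0; it < iterations; it++) {
    // Resample with replacement, same size as the original sample.
    const resample = sample.map(() => sample[Math.floor(rand() * sample.length)]);
    stats.push(statistic(resample));
  }
  stats.sort((x, y) => x - y);
  const alpha = (1 - level) / 2;
  return {
    lower: stats[Math.floor(alpha * (iterations - 1))],
    upper: stats[Math.ceil((1 - alpha) * (iterations - 1))],
  };
}

const meanOf = (xs: number[]): number => xs.reduce((acc, v) => acc + v, 0) / xs.length;

// Hypothetical per-market squared errors; NOT the stub corpus.
const toyErrors = [0.04, 0.09, 0.16, 0.01, 0.36, 0.25, 0.49, 0.04, 0.09, 0.01];
const runA = bootstrapCI(toyErrors, meanOf, 12648430); // Bench #1 seed from the page
const runB = bootstrapCI(toyErrors, meanOf, 12648430);

// A constant forecast of 0.5 has squared error 0.25 on every market.
const naiveCI = bootstrapCI(new Array(30).fill(0.25), meanOf, 12648430);

console.log(runA, runB, naiveCI);
```

Seeding the generator is what makes the interval reproducible: two runs with the same seed and sample return identical bounds, while `Math.random()` would not.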