How we measure whether the council helps.
Every probabilistic claim on this site reduces to one of three measurements: Brier-score decomposition, council aggregation gain, and conformal Kelly sizing. This page documents each — formula, implementation path, reproduction command.
What sharpness, calibration, and resolution actually mean.
The Brier score on a binary forecast is the mean-squared error between predicted probability and realized outcome:
BS = (1/n) · Σ (p_i − o_i)²
Murphy (1973) showed BS decomposes into three terms when forecasts are binned by predicted probability:
BS = Reliability − Resolution + Uncertainty
- Reliability measures calibration error. For each probability bin, how far does the average forecast differ from the average outcome? Lower is better. A perfectly calibrated forecaster has Reliability = 0.
- Resolution measures discrimination. How much does the forecast distinguish between events that resolve YES vs NO? Higher is better, but only when calibration is held constant.
- Uncertainty is a property of the sample, not the forecaster. It equals ō · (1 − ō), where ō is the base rate. A forecaster who always predicts the base rate scores exactly the Uncertainty term; any Brier score below it adds value.
Implementation: services/ostrakon/src/ostrakon/brier.py. Tests: services/ostrakon/tests/test_brier.py.
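The decomposition can be checked numerically. The following is a minimal sketch written independently of `brier.py`; the function names, the equal-width binning, and the default of 10 bins are assumptions made for illustration. Note that the identity BS = Reliability − Resolution + Uncertainty is exact only when all forecasts inside a bin share the same value; with continuous forecasts it holds approximately.

```python
import numpy as np


def brier_score(p, o):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    return float(np.mean((p - o) ** 2))


def murphy_decomposition(p, o, n_bins=10):
    """Return (reliability, resolution, uncertainty) using equal-width bins.

    Hypothetical helper: the binning scheme is an assumption, not
    necessarily what the repository does.
    """
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    n = len(p)
    base_rate = o.mean()

    # Map each forecast to a bin index; p == 1.0 falls into the last bin.
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)

    reliability = 0.0
    resolution = 0.0
    for k in range(n_bins):
        mask = bins == k
        n_k = int(mask.sum())
        if n_k == 0:
            continue
        mean_forecast = p[mask].mean()
        mean_outcome = o[mask].mean()
        reliability += n_k * (mean_forecast - mean_outcome) ** 2
        resolution += n_k * (mean_outcome - base_rate) ** 2

    uncertainty = base_rate * (1.0 - base_rate)
    return float(reliability / n), float(resolution / n), float(uncertainty)


# Toy sample: two forecast values, so the identity holds exactly.
forecasts = [0.15] * 4 + [0.85] * 4
outcomes = [0, 0, 0, 1, 1, 1, 1, 0]
rel, res, unc = murphy_decomposition(forecasts, outcomes)
print(brier_score(forecasts, outcomes), rel - res + unc)
```

On the toy sample both printed values agree, and a forecaster who always predicts the base rate of 0.5 gets Reliability = 0 and Resolution = 0, so its Brier score equals the Uncertainty term.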
Why eleven heads beat one.
On a 200-market Manifold sample, the empirical Brier scores are:
| Forecaster | Brier |
|---|---|
| Manifold consensus (free human prior) | 0.126 |
| Single-shot Gemini | 0.260 |
| 5-role council, simple mean aggregation | 0.149 |
The council closes about 83% of the LLM-vs-human gap ((0.260 − 0.149) / (0.260 − 0.126) ≈ 0.83) without beating the human prior. We treat this as an honest empirical finding: LLMs are worse than free human consensus, but aggregating multiple structured roles recovers most of the gap. The aggregation gain of ~0.11 Brier is more than an order of magnitude larger than the effect of any individual data source we have falsification-tested to date (best: −0.004 from Wikipedia attention).
The aggregation rule is a weighted mean over roles. Each role's weight comes from a softmax over its negated Brier score on a calibration set, scaled by a temperature τ, so roles with lower Brier receive more weight:
p_council = Σ w_r · p_r, where w_r = softmax(−τ · Brier_r)
Implementation: services/ostrakon/src/ostrakon/brier_weighted_council.py. Tests: 14 hermetic cases covering edge degeneracies (one role, tied Briers, single-class outcome).
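A minimal sketch of the weighting rule, assuming nothing about the internals of `brier_weighted_council.py`: the function names and the value τ = 10 are hypothetical and chosen only for illustration. It shows the degenerate cases the tests mention: tied Briers give equal weights, a single role gets weight 1, and τ = 0 reduces to the simple mean.

```python
import numpy as np


def softmax_brier_weights(role_briers, tau=10.0):
    """w_r = softmax(-tau * Brier_r); lower Brier means higher weight.

    tau is a hypothetical default, not a value taken from the repository.
    """
    briers = np.asarray(role_briers, dtype=float)
    logits = -tau * briers
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def council_probability(role_probs, role_briers, tau=10.0):
    """Weighted mean of per-role probabilities: p_council = sum_r w_r * p_r."""
    weights = softmax_brier_weights(role_briers, tau)
    return float(np.dot(weights, np.asarray(role_probs, dtype=float)))


# Three illustrative roles: the best-calibrated role (Brier 0.15) dominates.
probs = [0.70, 0.55, 0.40]
briers = [0.15, 0.22, 0.30]
print(softmax_brier_weights(briers))
print(council_probability(probs, briers))
```

Because the weights are non-negative and sum to 1, the council probability always stays between the lowest and highest role forecast.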
Sizing against the lower bound, not the point estimate.
The classical half-Kelly formula sizes positions against a point estimate of the edge. This over-trades whenever the model is miscalibrated. We use a conformal prediction interval (Vovk et al.) and size against the lower bound.
For YES: p_used = max(0, p_council − q̂_α)
For NO : p_used = 1 − min(1, p_council + q̂_α)
half_kelly = max(0, (p_used · b − (1 − p_used)) / b) · 0.5
q̂_α is the empirical (1 − α)-quantile of absolute prediction errors on a held-out calibration set. With α = 0.10, we're sizing against the 90% lower bound — the model has to be confidently right, not just nominally right, before we touch a half-Kelly stake.
Implementation: services/areopagus/src/areopagus/kelly.py:size_position_conformal.
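The sizing rule can be sketched as follows. This is an illustration under stated assumptions, not the body of `size_position_conformal`: the function names are hypothetical, the quantile is the plain empirical one described above (no finite-sample correction), and `b` is taken to be the net odds on a win, which for a contract that pays 1 and costs `price` works out to (1 − price) / price.

```python
import numpy as np


def conformal_quantile(cal_probs, cal_outcomes, alpha=0.10):
    """Empirical (1 - alpha)-quantile of |p - o| on a held-out calibration set."""
    errors = np.abs(
        np.asarray(cal_probs, dtype=float) - np.asarray(cal_outcomes, dtype=float)
    )
    return float(np.quantile(errors, 1.0 - alpha))


def net_odds(price):
    """Assumed payout structure: stake `price`, receive 1 on a win."""
    return (1.0 - price) / price


def half_kelly_conformal(p_council, q_hat, b, side="YES"):
    """Half-Kelly stake fraction sized against the conformal lower bound."""
    if side == "YES":
        p_used = max(0.0, p_council - q_hat)
    elif side == "NO":
        p_used = 1.0 - min(1.0, p_council + q_hat)
    else:
        raise ValueError("side must be 'YES' or 'NO'")
    full_kelly = (p_used * b - (1.0 - p_used)) / b
    return 0.5 * max(0.0, full_kelly)


# Even-odds market (price 0.50), council at 0.70, conformal margin 0.10.
stake = half_kelly_conformal(0.70, 0.10, net_odds(0.50), side="YES")
print(stake)
```

With the same market and margin, a council estimate of 0.55 gives p_used = 0.45 and a stake of zero: the point estimate shows an edge, but the lower bound does not.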
How a source earns the right to influence the prior.
12 Pythia sources are plumbed into apollo.scorer.MarketSnapshot. Each contributes an oracle delta clamped to ±0.05. Combined cap on the full 7-feature stack: ±0.35.
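A small sketch of how the two caps interact. The text above gives only the cap values; the additive combination of deltas, the final clip to [0, 1], and all names here are assumptions made for illustration and may not match `apollo.scorer`.

```python
PER_SOURCE_CAP = 0.05  # from the text: each oracle delta is clamped to +/-0.05
COMBINED_CAP = 0.35    # from the text: cap on the combined adjustment


def clamp(value, cap):
    """Clamp value into the symmetric interval [-cap, +cap]."""
    return max(-cap, min(cap, value))


def apply_oracle_deltas(prior, deltas):
    """Adjust a prior probability by capped oracle deltas.

    Assumption: deltas are summed after per-source clamping, the sum is
    clamped again, and the result is clipped to a valid probability.
    """
    per_source = [clamp(d, PER_SOURCE_CAP) for d in deltas]
    combined = clamp(sum(per_source), COMBINED_CAP)
    return min(1.0, max(0.0, prior + combined))


# One over-eager source (+0.20) is cut to +0.05; a small one passes through.
print(apply_oracle_deltas(0.50, [0.20, -0.01]))
```

Under these assumptions, no single source can move the prior by more than five points, and even if every source pushes in the same direction the total move stays within 35 points.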
A source graduates HOLD → ADOPT when its paired Brier-delta on a ≥200-market resolved sample beats the council baseline by > 0.002, with applicability ≥ 5%. Anything below: HOLD (live for telemetry, oracle delta forced to zero) or UNTESTABLE (insufficient applicable sample).
Currently ADOPTED (2): Wikipedia attention (Δ −0.0040, 33.5% applicable), Nitter crowd_sentiment (Δ −0.0040, 36.0%). All others stay HOLD or UNTESTABLE pending a Polymarket-flavoured corpus.
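The graduation rule can be written as a short decision function. This is a sketch under assumptions: the names are hypothetical, Δ is taken as the source-augmented Brier minus the baseline (so negative means improvement, matching the signs reported above), and a source that falls short on sample size or applicability is mapped to UNTESTABLE. The 200-market sample size in the example is also an assumption.

```python
from dataclasses import dataclass

MIN_MARKETS = 200          # resolved markets required
MIN_IMPROVEMENT = 0.002    # required Brier improvement over the council baseline
MIN_APPLICABILITY = 0.05   # fraction of markets where the source applies


@dataclass
class SourceResult:
    name: str
    n_markets: int
    brier_delta: float    # with-source minus baseline; negative is better
    applicability: float  # share of markets with a usable signal


def classify(result):
    """Return 'ADOPT', 'HOLD', or 'UNTESTABLE' for one source."""
    if result.n_markets < MIN_MARKETS or result.applicability < MIN_APPLICABILITY:
        return "UNTESTABLE"
    if -result.brier_delta > MIN_IMPROVEMENT:
        return "ADOPT"
    return "HOLD"


sources = [
    SourceResult("wikipedia_attention", 200, -0.0040, 0.335),
    SourceResult("nitter_crowd_sentiment", 200, -0.0040, 0.360),
]
for s in sources:
    print(s.name, classify(s))
```

Both reported sources clear the 0.002 threshold by a factor of two, which is why they are the only ones adopted.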
From clone to verified Brier in ~5 minutes.
# clone + install
git clone https://github.com/NAME0x0/Pantheon-Trades.git
cd Pantheon-Trades
pnpm install                              # JS deps
cd contracts && forge test                # 57 contract tests
cd ../services/ostrakon && uv run pytest  # Brier + calibration suite
cd ../boule && uv run pytest              # council aggregation suite

# rerun the 200-market backtest (requires GEMINI_API_KEY in .env)
cd ../..
python scripts/backtest_sources_xml.py \
  --n 200 \
  --provider gemini \
  --model gemini-2.5-flash-lite \
  --council
# artifact lands in artifacts/sources_brier_<timestamp>.json
# expected total cost: ~$0.12 USD on Gemini flash-lite
Every number cited on the demo page traces back to one of these commands. If a run produces a different number than this site claims, that is a bug worth reporting.
What this methodology does not prove.
- Sample bias. Manifold markets skew toward niche meta-bets, technology, and US politics. Brier numbers may not transfer cleanly to Polymarket's real-money distribution. We plan to build a Polymarket-flavoured corpus once the geo-block proxy is deployed.
- Beats LLM, not humans. On this sample, the council improves on a single-shot LLM by 0.11 Brier but still trails Manifold's play-money human consensus by 0.023 Brier. The system's alpha is conditional on outperforming LLMs and on edge sources that we have not yet falsified at scale.
- No live trading. Every Brier reported here is on resolved historical markets, not forward-looking real fills. The transition from backtest to production is the next-largest source of variance and is currently blocked on geo-bypass + builder-code approval.
- No external audit. Contracts pass Foundry + Halmos symbolic verification on the two flagship contracts (PoR + PantheonConstitution). They have not been audited by an external security firm. Do not deploy to mainnet without one.