Methodology · the math behind the marketing

How we measure whether the council helps.

Every probabilistic claim on this site reduces to one of three measurements: Brier-score decomposition, council aggregation gain, and conformal Kelly sizing. This page documents each — formula, implementation path, reproduction command.

§1 · Brier-score decomposition (Murphy 1973)

What sharpness, calibration, and resolution actually mean.

The Brier score on a binary forecast is the mean-squared error between predicted probability and realized outcome:

BS = (1/n) · Σ (p_i − o_i)²

Murphy (1973) showed BS decomposes into three terms when forecasts are binned by predicted probability:

BS = Reliability − Resolution + Uncertainty

Reliability measures calibration error. For each probability bin, how far does the average forecast differ from the average outcome? Lower is better. A perfectly calibrated forecaster has Reliability = 0.
Resolution measures discrimination. How much does the forecast distinguish between events that resolve YES vs NO? Higher is better — but only when calibration is held constant.
Uncertainty is a property of the sample, not the forecaster. It equals ō · (1 − ō) where ō is the base rate. A forecaster who always predicts the base rate scores exactly Uncertainty (Reliability = 0, Resolution = 0); any Brier score below that adds value.
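
A minimal sketch of the binned decomposition, assuming equal-width probability bins and NumPy. The function name `murphy_decomposition` and the `n_bins` default are illustrative and are not taken from brier.py. The identity BS = Reliability − Resolution + Uncertainty is exact only when every forecast in a bin shares one value; with wider bins a small within-bin term remains.

```python
import numpy as np

def murphy_decomposition(p, o, n_bins=10):
    """Return (reliability, resolution, uncertainty) for binary forecasts.

    p: predicted probabilities in [0, 1]; o: realized outcomes (0 or 1).
    """
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    n = len(p)
    base_rate = o.mean()

    # Equal-width bins; a forecast of exactly 1.0 goes into the last bin.
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)

    reliability = 0.0
    resolution = 0.0
    for k in np.unique(bins):
        mask = bins == k
        n_k = mask.sum()
        p_bar = p[mask].mean()  # average forecast in the bin
        o_bar = o[mask].mean()  # observed frequency in the bin
        reliability += n_k * (p_bar - o_bar) ** 2
        resolution += n_k * (o_bar - base_rate) ** 2

    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty


def brier_score(p, o):
    """Mean squared error between forecasts and outcomes."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    return float(np.mean((p - o) ** 2))
```

Calling `murphy_decomposition(p, o)` on forecasts that take only a few distinct values and comparing `rel - res + unc` with `brier_score(p, o)` is a quick sanity check that the three terms add up.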

Implementation: services/ostrakon/src/ostrakon/brier.py. Tests: services/ostrakon/tests/test_brier.py.

§2 · Council aggregation gain

Why many heads beat one.

On a 200-market Manifold sample, the empirical Brier scores are:

Forecaster	Brier
Manifold consensus (free human prior)	`0.126`
Single-shot Gemini	`0.260`
5-role council, simple mean aggregation	`0.149`

The council closes about 83% of the LLM-vs-human gap ((0.260 − 0.149) / (0.260 − 0.126) ≈ 0.83) without beating the human prior. We treat this as an honest empirical finding: on this sample a single-shot LLM is worse than free human consensus, but aggregating multiple structured roles recovers most of the gap. The aggregation gain of ~0.11 Brier is more than an order of magnitude larger than the gain from any individual data source we have tested to date (best: −0.004 from Wikipedia attention).
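
The gap-closure figure is plain arithmetic on the three Brier scores in the table. A short sketch, with the hypothetical helper name `gap_closed`, makes the calculation easy to rerun when the backtest numbers change:

```python
def gap_closed(single_shot, council, human):
    """Fraction of the single-shot-vs-human Brier gap recovered by the council."""
    return (single_shot - council) / (single_shot - human)


# Brier scores from the 200-market Manifold sample reported above.
print(round(gap_closed(0.260, 0.149, 0.126), 3))  # 0.828
```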

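The aggregation rule is a weighted mean over roles, with weights set by a softmax over negative Brier scores measured on a calibration set:

p_council = Σ_r w_r · p_r , where w_r = softmax(−τ · Brier_r)

Here τ is a temperature: larger τ concentrates weight on the roles with the lowest Brier, and τ = 0 reduces to the simple mean.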
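
The weighting rule can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of brier_weighted_council.py: the function names and the default `tau=10.0` are hypothetical, and the real temperature is whatever the calibration run selects.

```python
import numpy as np

def council_weights(brier_by_role, tau=10.0):
    """Softmax over negative Brier scores: lower Brier gets more weight.

    tau is an assumed temperature; tau = 0 gives uniform weights.
    """
    b = np.asarray(brier_by_role, dtype=float)
    logits = -tau * b
    logits = logits - logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def council_probability(p_by_role, weights):
    """Weighted mean of the per-role probabilities."""
    return float(np.dot(weights, np.asarray(p_by_role, dtype=float)))


# Three hypothetical roles with calibration-set Briers and live forecasts.
w = council_weights([0.15, 0.20, 0.30])
p = council_probability([0.70, 0.60, 0.40], w)
```

Tied Brier scores give equal weights, and a single role receives weight 1, which covers two of the edge degeneracies a test suite for this rule would want to pin down.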

Implementation: services/ostrakon/src/ostrakon/brier_weighted_council.py. Tests: 14 hermetic cases covering edge degeneracies (one role, tied Briers, single-class outcome).

§3 · Conformal Kelly (interval-bound sizing)

Sizing against the lower bound, not the point estimate.

The classical half-Kelly formula sizes positions against a point estimate of the edge. This over-trades whenever the model is overconfident. We instead build a conformal prediction interval (Vovk et al.) around the council probability and size against its lower bound.

For YES: p_used = max(0, p_council − q̂_α)
For NO : p_used = 1 − min(1, p_council + q̂_α)

half_kelly = max(0, (p_used · b − (1 − p_used)) / b) · 0.5

q̂_α is the empirical (1 − α)-quantile of absolute prediction errors on a held-out calibration set, and b is the net odds on a winning bet (profit per unit staked). With α = 0.10, we size against the lower end of the 90% interval: the model has to be confidently right, not just nominally right, before we touch a half-Kelly stake.
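
A minimal sketch of the sizing rule follows. It is not the body of `size_position_conformal`: the function names are hypothetical, the quantile is the plain empirical one described above (split conformal prediction often applies a finite-sample correction to the quantile level instead), and the net odds b are assumed to come from a share price in (0, 1) that pays 1 on a win, ignoring fees and spread.

```python
import numpy as np

def conformal_quantile(p_cal, o_cal, alpha=0.10):
    """Empirical (1 - alpha)-quantile of |p - o| on a held-out calibration set."""
    errors = np.abs(np.asarray(p_cal, dtype=float) - np.asarray(o_cal, dtype=float))
    return float(np.quantile(errors, 1.0 - alpha))


def half_kelly_conformal(p_council, q_hat, yes_price, side="YES"):
    """Half-Kelly fraction sized against the conformal lower bound.

    yes_price is the assumed cost of one YES share; a NO share is
    assumed to cost 1 - yes_price.
    """
    if side == "YES":
        p_used = max(0.0, p_council - q_hat)
        cost = yes_price
    else:
        p_used = 1.0 - min(1.0, p_council + q_hat)
        cost = 1.0 - yes_price
    b = (1.0 - cost) / cost  # net odds: profit per unit staked on a win
    full_kelly = (p_used * b - (1.0 - p_used)) / b
    return 0.5 * max(0.0, full_kelly)


# Council says 0.70, interval half-width 0.10, YES trades at 0.50.
stake = half_kelly_conformal(0.70, 0.10, 0.50, side="YES")  # ≈ 0.10 of bankroll
```

With the same price, a council probability of 0.55 yields a zero stake: the lower bound 0.45 sits below the market price, so the point-estimate edge does not survive the interval.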

Implementation: services/areopagus/src/areopagus/kelly.py:size_position_conformal.

§4 · Edge-source falsification protocol

How a source earns the right to influence the prior.

12 Pythia sources are plumbed into apollo.scorer.MarketSnapshot. Each contributes an oracle delta clamped to ±0.05. Combined cap on the full 7-feature stack: ±0.35.
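
The two caps can be illustrated with a short sketch. It assumes the deltas are added to the prior in probability space and that the result is kept inside [0, 1]; neither detail is stated above, and the function names are hypothetical rather than part of `apollo.scorer`.

```python
def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def apply_oracle_deltas(prior, deltas, per_source_cap=0.05, total_cap=0.35):
    """Clamp each source delta, clamp their sum, then shift the prior."""
    total = sum(clamp(d, -per_source_cap, per_source_cap) for d in deltas)
    total = clamp(total, -total_cap, total_cap)
    return clamp(prior + total, 0.0, 1.0)


# One over-eager source (+0.20) and one mild source (-0.01) on a 0.50 prior.
adjusted = apply_oracle_deltas(0.50, [0.20, -0.01])  # ≈ 0.54
```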

A source graduates HOLD → ADOPT when its paired Brier-delta on a ≥200-market resolved sample beats the council baseline by > 0.002 (that is, Δ < −0.002, since lower Brier is better), with applicability ≥ 5%. A source that falls short stays HOLD (live for telemetry, oracle delta forced to zero) or is marked UNTESTABLE (insufficient applicable sample).
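
The gate can be sketched as follows. The thresholds come from the paragraph above; everything else is an assumption made for illustration: the function names are hypothetical, Δ is taken as mean squared error with the source minus mean squared error without it on the same markets, and both a short sample and low applicability are mapped to UNTESTABLE.

```python
def paired_brier_delta(p_with, p_base, outcomes):
    """Mean per-market difference in squared error (negative = source helps)."""
    diffs = [
        (pw - o) ** 2 - (pb - o) ** 2
        for pw, pb, o in zip(p_with, p_base, outcomes)
    ]
    return sum(diffs) / len(diffs)


def classify_source(brier_delta, n_resolved, applicability,
                    min_markets=200, min_gain=0.002, min_applicability=0.05):
    """Return 'ADOPT', 'HOLD', or 'UNTESTABLE' for one candidate source."""
    if n_resolved < min_markets or applicability < min_applicability:
        return "UNTESTABLE"  # assumed mapping: not enough applicable data
    if brier_delta < -min_gain:
        return "ADOPT"
    return "HOLD"  # stays live for telemetry, oracle delta forced to zero


# A delta of -0.0040 at 33.5% applicability on 200 markets clears the gate.
print(classify_source(-0.0040, 200, 0.335))  # ADOPT
```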

Currently ADOPTED (2): Wikipedia attention (Δ −0.0040, 33.5% applicable), Nitter crowd_sentiment (Δ −0.0040, 36.0%). All others stay HOLD or UNTESTABLE pending a Polymarket-flavoured corpus.

§5 · Reproduce every number on this page

From clone to verified Brier in ~5 minutes.

# clone + install
git clone https://github.com/NAME0x0/Pantheon-Trades.git
cd Pantheon-Trades
pnpm install                          # JS deps
cd contracts && forge test            # 57 contract tests
cd ../services/ostrakon && uv run pytest    # Brier + calibration suite
cd ../boule && uv run pytest          # council aggregation suite

# rerun the 200-market backtest (requires GEMINI_API_KEY in .env)
cd ../..
python scripts/backtest_sources_xml.py \
  --n 200 \
  --provider gemini \
  --model gemini-2.5-flash-lite \
  --council
# artifact lands in artifacts/sources_brier_<timestamp>.json
# expected total cost: ~$0.12 USD on Gemini flash-lite

Every number cited on the demo page traces back to one of these commands. If a run produces a different number than this site claims, that is a bug worth reporting.

§6 · Honest limitations

What this methodology does not prove.

Sample bias. Manifold markets skew toward niche meta-bets, technology, and US politics. Brier numbers may not transfer cleanly to Polymarket's real-money distribution. We're working toward the Polymarket-flavoured corpus once the geo-block proxy is deployed.
Beats LLM, not humans. On this sample, the council improves on a single-shot LLM by 0.11 Brier but still trails Manifold's play-money human consensus by 0.023 Brier. The system's alpha is conditional on outperforming LLMs and on edge sources that we have not yet falsified at scale.
No live trading. Every Brier reported here is on resolved historical markets, not forward-looking real fills. The transition from backtest to production is the next-largest source of variance and is currently blocked on geo-bypass + builder-code approval.
No external audit. Contracts pass Foundry + Halmos symbolic verification on the two flagship contracts (PoR + PantheonConstitution). They have not been audited by an external security firm. Do not deploy to mainnet without one.