Counter-evidence · the times we were wrong

Public accountability ledger.

A public list of cases where the council's probability estimate was meaningfully wrong, and cases where the council refused trades that later moved against the refusal. No filtering. No cherry-picking. If you can't see the failures, you can't trust the wins.

§1 · Backtest mis-calibration (top 5 worst Brier)

Where the 200-market backtest got it most wrong.

Top 5 cases ranked by (council_p − outcome)², where outcome is 1 for YES and 0 for NO, from artifacts/sources_brier_20260517T181232Z.json. Each row gives the council's prediction, the actual outcome, and the resulting score. What specifically the council missed in each case is listed after the table.

Question	Category	Council p	Actual	Brier
Will the FTC successfully block the proposed acquisition by Q3 2026?	Regulatory	`82%`	NO	`0.672`
Will US headline CPI print above 3.5% YoY in the May 2026 release?	Macro	`71%`	NO	`0.504`
Will the SEC approve at least one spot Ethereum ETF by 2026-07-31?	Crypto · regulatory	`18%`	YES	`0.672`
Will any major US tech firm announce 10,000+ layoffs in June 2026?	Macro · labor	`34%`	YES	`0.436`
Will Argentina exit the IMF program before end of 2026 Q3?	Sovereign	`9%`	YES	`0.828`
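
The Brier column can be reproduced from the other two columns. The following is a minimal sketch, assuming outcome is coded as 1 for YES and 0 for NO. The `ROWS` layout and the function names are illustrative and are not taken from the artifact file.

```python
# Per-question Brier score: (council_p - outcome)^2.
# ROWS copies the council probabilities and outcomes from the table above;
# the tuple layout is an assumption made for this sketch.
ROWS = [
    ("Regulatory", 0.82, "NO"),
    ("Macro", 0.71, "NO"),
    ("Crypto · regulatory", 0.18, "YES"),
    ("Macro · labor", 0.34, "YES"),
    ("Sovereign", 0.09, "YES"),
]


def brier(council_p: float, actual: str) -> float:
    """Squared error between the forecast and the binary outcome."""
    outcome = 1.0 if actual == "YES" else 0.0
    return (council_p - outcome) ** 2


def worst_first(rows):
    """Attach a Brier score to each row and sort from worst to best."""
    scored = [(cat, p, actual, brier(p, actual)) for cat, p, actual in rows]
    return sorted(scored, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for cat, p, actual, score in worst_first(ROWS):
        print(f"{cat:<22} p={p:.2f} actual={actual:<3} brier={score:.3f}")
```

Sorting by score puts the Sovereign case first and the Macro · labor case last, which is a quick way to check the table against the raw artifact.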

What the council missed — verbatim

Regulatory · council 82% → actual NO
Council weighted historical FTC block rate (61%) without conditioning on the specific commissioner composition. Three of five commissioners had publicly stated pro-merger views during nomination hearings — a fact in the public record that no agent surfaced.
Macro · council 71% → actual NO
Energy basis collapse in late April was visible in real-time DEX swap volumes but Apollo's macro_basis feature is currently HOLD (insufficient sample for ADOPT). Council relied on the lagged FRED CPI nowcast.
Crypto · regulatory · council 18% → actual YES
Cassandra correctly flagged the catalyst window. Athena's synthesis weighted Hades's structural objection too heavily. The market was already pricing 0.62 — a 44 percentage point gap that we treated as edge in the wrong direction. Calibration on regulatory windows is a known weakness.
Macro · labor · council 34% → actual YES
No single source bridged macro labor (FRED) with sector-specific guidance (technical lead/lag). Both features are HOLD on Brier-delta. Adding Polymarket-flavoured falsification may surface this signal.
Sovereign · council 9% → actual YES
Eris (adversarial dissenter) raised the counter-case in Round 2 but with insufficient specificity. The vote-weight policy gave Eris 0.8x against Hades's 2x — adversarial dissent was outweighed by structural caution. Tuning Eris's weight upward in Q4 is on the open issues list.
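
To make the vote-weight effect in the Sovereign case concrete, here is a rough sketch of how a 0.8x weight against a 2x weight pulls a combined estimate. It assumes a plain weighted mean of per-agent probabilities, which may not be how the council actually aggregates. Only the 0.8x and 2x weights come from the text above; the per-agent probabilities are hypothetical placeholders.

```python
def weighted_mean(votes):
    """Combine (probability, weight) pairs into one weighted-average probability."""
    total_weight = sum(weight for _, weight in votes)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * weight for p, weight in votes) / total_weight


# Hypothetical per-agent probabilities, chosen only for illustration.
ERIS_P = 0.60   # adversarial dissenter leaning YES
HADES_P = 0.05  # structural caution leaning NO

# Weights quoted in the text: Eris 0.8x, Hades 2x.
current_policy = weighted_mean([(ERIS_P, 0.8), (HADES_P, 2.0)])

# Same votes with equal weights, to show the direction of the shift.
equal_weights = weighted_mean([(ERIS_P, 1.0), (HADES_P, 1.0)])

if __name__ == "__main__":
    print(f"current policy: {current_policy:.3f}")
    print(f"equal weights:  {equal_weights:.3f}")
```

Under this assumed rule, the lower-weighted dissent moves the combined estimate less than it would with equal weights, which is the mechanism the Sovereign note describes.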

Aggregate context: these 5 cases sit at Brier 0.44–0.83. The full sample mean is 0.149. So these are tail mis-calibrations, not representative cases. But they are the ones a critic should know about before forming an opinion.

§2 · Restraint-reversal queue

Refused trades that moved against the refusal.

Every time the council refuses a trade, the refusal is anchored on Arc Testnet as a Proof of Restraint record. When the underlying market later resolves, we can audit whether the refusal saved money or cost opportunity. This is the no-trade-alpha accountability ledger — and the most important measurement of the system.

Anchored restraints: 20

Resolved + reversed: 0

Reversal rate: 0.0%

The queue is intentionally empty. None of the 20 anchored restraints have aged through resolution yet — most carry resolution dates 30+ days out. When they begin resolving, this section will populate automatically from on-chain log scans. No edits, no manual curation. If the reversal rate climbs above 50%, the system is worse than coin-flipping on what to refuse, and this page will say so prominently.
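
The reversal-rate check described above can be sketched as follows. This is a minimal illustration, not the production log scanner. The `Restraint` record, its field names, and the `status` helper are hypothetical; only the 50% threshold and the count of 20 anchored restraints come from this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Restraint:
    """Hypothetical shape of one anchored Proof of Restraint record."""
    market_id: str
    resolved: bool
    reversed_: bool = False  # True if the market moved against the refusal


def reversal_rate(records: List[Restraint]) -> Optional[float]:
    """Share of resolved restraints that reversed, or None if none have resolved."""
    resolved = [r for r in records if r.resolved]
    if not resolved:
        return None  # avoids reporting 0/0 as a real rate
    return sum(r.reversed_ for r in resolved) / len(resolved)


def status(records: List[Restraint], threshold: float = 0.5) -> str:
    """Readable summary, flagging rates above the coin-flip threshold."""
    rate = reversal_rate(records)
    if rate is None:
        return "no resolved restraints yet"
    if rate > threshold:
        return f"WARNING: reversal rate {rate:.1%} exceeds {threshold:.0%}"
    return f"reversal rate {rate:.1%}"


if __name__ == "__main__":
    # Current state described above: 20 anchored restraints, none resolved.
    queue = [Restraint(f"market-{i}", resolved=False) for i in range(20)]
    print(status(queue))
```

Returning `None` when nothing has resolved is a choice made for this sketch: it keeps an empty queue distinguishable from a genuine 0% reversal rate over resolved restraints.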

The methodology behind these numbers lives at /methodology. The raw artifact is on GitHub. If a row here is missing or wrong, open an issue.