Methodology
How the Intent Eval Platform works, what an Evidence Bundle is, why the eval-set browser ships ahead of the results browser, and what this dashboard refuses to do.
The unification thesis
Every validator in the platform — deterministic gate, behavioral eval, schema check, security scanner — emits its findings as rows inside a common artifact called an Evidence Bundle. The bundle is content-addressed, DSSE-signed, and anchored in the Rekor transparency log. The schema is authored once, in the canonical contracts kernel (intent-eval-core), and consumed by every downstream renderer.
That single binding — every validator emits a bundle — is the architectural pivot the platform's ratification council voted into place in DR-010. Without it, the eval space fragments into incompatible JSON shapes. With it, downstream consumers — including this dashboard — read one shape and trust its provenance via sigstore + Rekor.
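To make the "one shape" idea concrete, here is a minimal TypeScript sketch of validators of different kinds contributing rows to a common bundle. The names `EvidenceRow`, `EvidenceBundle`, `Validator`, `emitBundle`, the `subject` and `outcome` fields, and the example URI path are assumptions for illustration, not the schema published in intent-eval-core. Signing and Rekor anchoring are left out.

```typescript
// Hypothetical shapes for illustration; the real schema lives in intent-eval-core.
type Outcome = "pass" | "fail";

interface EvidenceRow {
  predicate_uri: string; // names the contract this row satisfies
  validator: string;     // which validator produced the row (assumed field)
  outcome: Outcome;      // assumed field
}

interface EvidenceBundle {
  subject: string;       // what was evaluated (assumed field)
  rows: EvidenceRow[];
}

// A validator is anything that can turn a subject into rows.
type Validator = (subject: string) => EvidenceRow[];

// Assumed example URI; only the host comes from the text.
const EXAMPLE_URI = "https://evals.intentsolutions.io/gate-result/v1";

const escapeScan: Validator = () => [
  { predicate_uri: EXAMPLE_URI, validator: "escape-scan", outcome: "pass" },
];

const schemaCheck: Validator = () => [
  { predicate_uri: EXAMPLE_URI, validator: "schema-check", outcome: "fail" },
];

// Every validator, whatever its kind, contributes rows to the same bundle shape.
function emitBundle(subject: string, validators: Validator[]): EvidenceBundle {
  const rows: EvidenceRow[] = [];
  for (const run of validators) {
    for (const row of run(subject)) rows.push(row);
  }
  return { subject, rows };
}

const bundle = emitBundle("repo@abc123", [escapeScan, schemaCheck]);
console.log(bundle.rows.map((r) => `${r.validator}: ${r.outcome}`).join(", "));
// prints "escape-scan: pass, schema-check: fail"
```

A downstream consumer written against `EvidenceBundle` does not need to know which kind of validator produced a given row, which is the point of the binding.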
The six repos
The platform composes via the Evidence Bundle, not via package consolidation. Six repos play distinct roles:
| Repo | Role |
|---|---|
| intent-eval-core | Canonical contracts kernel. TypeScript types, JSON Schemas, Zod validators, state machines for the canonical entities. Published as @intentsolutions/core. No execution, no judges, no runtime — just the contracts. |
| intent-eval-lab | Methodology + specs + the constitution. Blueprints A/B/C, the canonical glossary, every ratified Decision Record. This dashboard's architecture is defined in intent-eval-lab/000-docs/035-AT-DECR-.... |
| intent-audit-harness | Deterministic gates (escape-scan, CRAP score, architecture rules, bias count, Gherkin lint) plus the emit-evidence layer that wraps gate output in Evidence Bundles. |
| j-rig-skill-binary-eval | Behavioral evaluation harness — the 7-layer binary-criteria skill eval-set lives here. |
| intent-rollout-gate | A thin GitHub Action that consumes Evidence Bundles and decides ship / no-ship per repo policy. |
| intent-eval-dashboard | This dashboard. The most visible kernel consumer in the system, deliberately separate from the constitution repo so the spec author and the renderer share no release cycle. |
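As an illustration of how a consumer such as intent-rollout-gate might turn bundle rows into a ship / no-ship decision, the sketch below applies one simple policy: every predicate the policy requires must be present and must not have a failing row. The `RepoPolicy` shape, the field names, and the rule itself are assumptions; the text above does not specify how repo policy is expressed.

```typescript
type Outcome = "pass" | "fail";

interface EvidenceRow {
  predicate_uri: string;
  outcome: Outcome; // assumed field
}

// Hypothetical policy: the predicate URIs a repo requires before shipping.
interface RepoPolicy {
  required: string[];
}

type Decision = { ship: true } | { ship: false; reasons: string[] };

function decide(rows: EvidenceRow[], policy: RepoPolicy): Decision {
  const reasons: string[] = [];
  for (const uri of policy.required) {
    const matching = rows.filter((r) => r.predicate_uri === uri);
    if (matching.length === 0) {
      // Absence of evidence is treated as a blocker, not as a pass.
      reasons.push(`missing evidence for ${uri}`);
    } else if (matching.some((r) => r.outcome === "fail")) {
      reasons.push(`failing evidence for ${uri}`);
    }
  }
  return reasons.length === 0 ? { ship: true } : { ship: false, reasons };
}
```

Returning the reasons alongside the decision keeps a no-ship outcome explainable without re-reading the bundle.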
Where predicate URIs live
Every signed predicate row inside an Evidence Bundle declares a predicate_uri that names the contract it satisfies. Those URIs are normative immutable identifiers. They live exclusively at evals.intentsolutions.io — never on this dashboard's domain. This is a CISO-bound separation reaffirmed in DR-035.
labs.intentsolutions.io renders content about predicate URIs but never declares one. A predicate URI typo in any content rendered here is not a new URI claim; it is a documentation bug.
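The host rule above is mechanical enough to check in code. The following sketch shows one way a content lint could flag a `predicate_uri` that does not live on the canonical host. The function name, the `https` requirement, and the example paths are assumptions; only the hostname comes from the text.

```typescript
// Host taken from the text; requiring https is an assumption of this sketch.
const CANONICAL_HOST = "evals.intentsolutions.io";

function isCanonicalPredicateUri(uri: string): boolean {
  try {
    const parsed = new URL(uri);
    // Compare the full hostname so look-alike suffixes or prefixes do not pass.
    return parsed.protocol === "https:" && parsed.hostname === CANONICAL_HOST;
  } catch {
    return false; // not a parseable absolute URL
  }
}

// Example paths are hypothetical.
console.log(isCanonicalPredicateUri("https://evals.intentsolutions.io/gate-result/v1")); // true
console.log(isCanonicalPredicateUri("https://labs.intentsolutions.io/gate-result/v1"));  // false
```

Comparing the parsed hostname rather than a string prefix avoids accepting something like `https://evals.intentsolutions.io.example.com/`.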
Why the eval-set ships before the results
The eval-set is the spec. If a public eval dashboard renders results rows ahead of publishing the spec those rows attest against, the dashboard is publishing attestations of conformance to a spec that doesn't exist. The first page a hostile reader clicks should be the spec — the eval-set definition, its version history, its lineage, its adversarial audit pointer — not a number.
This sequencing is a Karpathy canonical position adopted into the platform's plan. It changes the engineering order of operations: the eval-set browser ships first; the results browser, the per-repo freshness strip, and the ingest supervision tree are Phase 2 work.
What this dashboard refuses to do
The seven adversarial seats of the ratification council each carry refusal authority on specific surfaces. Their refusals are preserved verbatim in DR-035 § 8. The most consequential for a visitor reading this page:
- No aggregate PASS% across heterogeneous predicates. The CI lint refuses it; the type system refuses it; three separate seats refuse to ratify any rendering that produces it.
- No basicauth on the public origin for operator views. If there is information internal to the operator, it lives on a tailnet-only hostname.
- No predicate URIs declared here. URIs at evals.intentsolutions.io only.
- No render-from-manifest without re-verification. Pinned OIDC subject and workflow_ref per source repo, Rekor inclusion proof, schema validation row-by-row before any HTML is generated.
- No GCP-hosted object storage. Content-addressed Evidence Bundle storage lives on the Intent Solutions VPS at v0.1.0; migration to Backblaze B2 is queued at the 12-month / 100 GB trigger.
- No asymmetric rendering of null-hypothesis results. Both arms of a comparison render with identical layout, font weight, and chart axes. A null result is rendered identically to a positive result. Information design refuses to bury inconvenient outcomes.
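The first refusal in the list above, no aggregate PASS% across heterogeneous predicates, can be sketched as a guard on the aggregation function itself. This is an illustration under assumed names (`Row`, `passRate`, `passRateByPredicate`) and assumed example URIs; it is not the platform's actual CI lint or type-level enforcement.

```typescript
interface Row {
  predicate_uri: string;
  outcome: "pass" | "fail"; // assumed field
}

// A pass rate is only defined over rows that share one predicate.
function passRate(rows: Row[]): number {
  if (rows.length === 0) throw new Error("no rows to aggregate");
  const uri = rows[0].predicate_uri;
  if (rows.some((r) => r.predicate_uri !== uri)) {
    throw new Error("refusing to aggregate across heterogeneous predicates");
  }
  return rows.filter((r) => r.outcome === "pass").length / rows.length;
}

// One alternative a renderer could use: a separate rate per predicate.
function passRateByPredicate(rows: Row[]): Record<string, number> {
  const groups: Record<string, Row[]> = {};
  for (const r of rows) {
    (groups[r.predicate_uri] = groups[r.predicate_uri] || []).push(r);
  }
  const rates: Record<string, number> = {};
  for (const uri of Object.keys(groups)) rates[uri] = passRate(groups[uri]);
  return rates;
}

// Hypothetical URIs for two unrelated predicates.
const A = "https://evals.intentsolutions.io/example-a/v1";
const B = "https://evals.intentsolutions.io/example-b/v1";
const mixedRows: Row[] = [
  { predicate_uri: A, outcome: "pass" },
  { predicate_uri: A, outcome: "fail" },
  { predicate_uri: B, outcome: "pass" },
];

// passRate(mixedRows) would throw; the per-predicate view stays well defined.
console.log(passRateByPredicate(mixedRows));
```

Putting the refusal inside `passRate` means a caller cannot produce a blended number by accident; it has to group first.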
What sequences in Phase 2
The roadmap, in dependency order:
- Schema evolution in intent-eval-core to v0.2.0: adds pre_registration_hash, the retraction/v1 predicate type, and the dashboard-render/v1 predicate type. The latter is sequenced behind the arrival of a second independent verifier.
- Ingest supervision tree: six per-repo workers, each verifying signature, Rekor inclusion, and schema before content-addressing the bundle into local storage.
- Results browser: renders gate-result/v1 rows with the per-row visibility-tier gating and the four-timestamp staleness surface.
- Freshness strip + landing-page mandatory chart: the per-repo decision-mix strip at the top of the landing page.
- Operator-internal view: separate tailnet-only hostname, Tailscale identity, all reports regardless of public visibility tags.
- Retraction protocol: content-addressed denylist, Caddy-level kill-switch, signed retraction/v1 attestations, 4-hour SLO.
- Ops-lite alerting: push notifications, public status route, 7-day-silence pager threshold.
- Phase A.0 pre-registration rendering: symmetric arms, HTML structural diff CI test, blog-only fallback if symmetry cannot be guaranteed on the timeline.
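The ingest step in the roadmap above describes a fixed order: verify signature, verify Rekor inclusion, validate schema, and only then store the bundle. A minimal sketch of that ordering follows. The checks are injected as plain functions so that no real Sigstore or Rekor client API is assumed; all names here (`Checks`, `IngestResult`, `ingest`, `store`) are hypothetical.

```typescript
// Each check is injected; this sketch does not call any real verification library.
interface Checks {
  signatureValid: (raw: string) => boolean;
  rekorInclusionProven: (raw: string) => boolean;
  schemaValid: (raw: string) => boolean;
}

type IngestResult =
  | { accepted: true }
  | { accepted: false; failedStep: "signature" | "rekor" | "schema" };

function ingest(
  raw: string,
  checks: Checks,
  store: (raw: string) => void
): IngestResult {
  if (!checks.signatureValid(raw)) return { accepted: false, failedStep: "signature" };
  if (!checks.rekorInclusionProven(raw)) return { accepted: false, failedStep: "rekor" };
  if (!checks.schemaValid(raw)) return { accepted: false, failedStep: "schema" };
  store(raw); // reached only when all three checks pass
  return { accepted: true };
}

// Usage with stubbed checks: the Rekor step fails, so nothing is stored.
const stored: string[] = [];
const result = ingest(
  "bundle-bytes",
  {
    signatureValid: () => true,
    rekorInclusionProven: () => false,
    schemaValid: () => true,
  },
  (raw) => stored.push(raw)
);
console.log(result, stored.length);
```

Because `store` is the last call, an unverified bundle never reaches local storage, which mirrors the "before content-addressing" wording in the roadmap item.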