Methodology

How the Intent Eval Platform works, what an Evidence Bundle is, why the eval-set browser ships ahead of the results browser, and what this dashboard refuses to do.

The unification thesis

Every validator in the platform — deterministic gate, behavioral eval, schema check, security scanner — emits its findings as rows inside a common artifact called an Evidence Bundle. The bundle is content-addressed, DSSE-signed, and anchored in the Rekor transparency log. The schema is authored once, in the canonical contracts kernel (intent-eval-core), and consumed by every downstream renderer.

That single binding — every validator emits a bundle — is the architectural pivot the platform's ratification council voted into place in DR-010. Without it, the eval space fragments into incompatible JSON shapes. With it, downstream consumers — including this dashboard — read one shape and trust its provenance via sigstore + Rekor.
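As a rough illustration of what "one shape" buys a downstream consumer, the sketch below models a bundle as a list of rows tagged by validator kind and summarizes it with a single function. The interface names, the `digest` and `passed` fields, and the example URI paths are assumptions made for this sketch; the normative schema is whatever intent-eval-core publishes.

```typescript
// Hypothetical shapes for illustration only; not the schema from intent-eval-core.
type ValidatorKind =
  | "deterministic-gate"
  | "behavioral-eval"
  | "schema-check"
  | "security-scanner";

interface EvidenceRow {
  validator: ValidatorKind;
  predicate_uri: string; // names the contract the row satisfies
  passed: boolean;       // assumed field name
}

interface EvidenceBundle {
  digest: string;        // content address (assumed field name)
  rows: EvidenceRow[];
}

type Tally = { passed: number; failed: number };

// One renderer-side function handles rows from every validator kind,
// because every validator emits the same row shape.
function summarize(bundle: EvidenceBundle): Record<ValidatorKind, Tally> {
  const summary: Record<ValidatorKind, Tally> = {
    "deterministic-gate": { passed: 0, failed: 0 },
    "behavioral-eval": { passed: 0, failed: 0 },
    "schema-check": { passed: 0, failed: 0 },
    "security-scanner": { passed: 0, failed: 0 },
  };
  for (const row of bundle.rows) {
    if (row.passed) {
      summary[row.validator].passed += 1;
    } else {
      summary[row.validator].failed += 1;
    }
  }
  return summary;
}

// Example bundle; the digest and URI paths are placeholders.
const exampleBundle: EvidenceBundle = {
  digest: "sha256:placeholder",
  rows: [
    { validator: "deterministic-gate", predicate_uri: "https://evals.intentsolutions.io/gate-result/v1", passed: true },
    { validator: "deterministic-gate", predicate_uri: "https://evals.intentsolutions.io/gate-result/v1", passed: false },
    { validator: "security-scanner", predicate_uri: "https://evals.intentsolutions.io/gate-result/v1", passed: true },
  ],
};

const exampleSummary = summarize(exampleBundle);
```

Adding a fifth validator kind in this sketch would mean extending one union type rather than teaching every consumer a new JSON shape.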

The six repos

The platform composes via the Evidence Bundle, not via package consolidation. Six repos play distinct roles:

Repo | Role
intent-eval-core | Canonical contracts kernel. TypeScript types, JSON Schemas, Zod validators, state machines for the canonical entities. Published as @intentsolutions/core. No execution, no judges, no runtime: just the contracts.
intent-eval-lab | Methodology + specs + the constitution. Blueprints A/B/C, the canonical glossary, every ratified Decision Record. This dashboard's architecture is defined in intent-eval-lab/000-docs/035-AT-DECR-....
intent-audit-harness | Deterministic gates (escape-scan, CRAP score, architecture rules, bias count, Gherkin lint) plus the emit-evidence layer that wraps gate output in Evidence Bundles.
j-rig-skill-binary-eval | Behavioral evaluation harness. The 7-layer binary-criteria skill eval-set lives here.
intent-rollout-gate | A thin GitHub Action that consumes Evidence Bundles and decides ship / no-ship per repo policy.
intent-eval-dashboard | This dashboard. The most visible kernel consumer in the system, deliberately separate from the constitution repo so the spec author and the renderer share no release cycle.
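To make the rollout gate's role concrete, here is a minimal sketch of a ship / no-ship decision over bundle rows. The policy shape (a list of required predicate URIs), the `passed` field, and the rule itself (every required predicate must appear and have no failing row) are assumptions for illustration; the actual intent-rollout-gate policy format is not described on this page.

```typescript
// Hypothetical types; the real policy format is not specified here.
interface GateRow {
  predicate_uri: string;
  passed: boolean;
}

interface RepoPolicy {
  required: string[]; // predicate URIs that must be present and passing
}

type Decision = "ship" | "no-ship";

// Assumed rule: block the rollout if any required predicate is missing
// from the evidence or has at least one failing row.
function decide(rows: GateRow[], policy: RepoPolicy): Decision {
  for (const uri of policy.required) {
    const matching = rows.filter((row) => row.predicate_uri === uri);
    if (matching.length === 0 || matching.some((row) => !row.passed)) {
      return "no-ship";
    }
  }
  return "ship";
}

// Placeholder URI path, used only for this example.
const GATE_URI = "https://evals.intentsolutions.io/gate-result/v1";
const policy: RepoPolicy = { required: [GATE_URI] };

const cleanDecision = decide([{ predicate_uri: GATE_URI, passed: true }], policy);
const failingDecision = decide([{ predicate_uri: GATE_URI, passed: false }], policy);
const missingDecision = decide([], policy);
```

Treating missing evidence the same as failing evidence is a deliberate choice in this sketch: a gate that ships when no bundle arrives would be trivially bypassed.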

Where predicate URIs live

Every signed predicate row inside an Evidence Bundle declares a predicate_uri that names the contract it satisfies. Those URIs are normative immutable identifiers. They live exclusively at evals.intentsolutions.io — never on this dashboard's domain. This is a CISO-bound separation reaffirmed in DR-035.

labs.intentsolutions.io renders content about predicate URIs but never declares one. A mistyped predicate URI in any content rendered here is not a new URI claim; it is a documentation bug.
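The host separation can be checked mechanically. The sketch below is one possible lint, assuming predicate URIs are HTTPS URLs (the scheme requirement and the function name are assumptions, not something this page specifies); it accepts a URI only when its hostname is exactly evals.intentsolutions.io.

```typescript
const PREDICATE_HOST = "evals.intentsolutions.io";

// Returns true only for well-formed HTTPS URLs hosted on the predicate domain.
// Requiring https: is an assumption of this sketch.
function isCanonicalPredicateUri(uri: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(uri);
  } catch {
    return false; // not an absolute URL at all
  }
  return parsed.protocol === "https:" && parsed.hostname === PREDICATE_HOST;
}

// Example path is a placeholder.
const onPredicateHost = isCanonicalPredicateUri("https://evals.intentsolutions.io/gate-result/v1");
const onDashboardHost = isCanonicalPredicateUri("https://labs.intentsolutions.io/gate-result/v1");
const lookalikeHost = isCanonicalPredicateUri("https://evals.intentsolutions.io.example.com/gate-result/v1");
const notAUrl = isCanonicalPredicateUri("gate-result/v1");
```

Comparing the parsed hostname, rather than testing the string prefix, is what rejects the lookalike host in the third example.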

Why the eval-set ships before the results

The eval-set is the spec. If a public eval dashboard renders results rows ahead of publishing the spec those rows attest against, the dashboard is publishing attestations of conformance to a spec that doesn't exist. The first page a hostile reader clicks should be the spec — the eval-set definition, its version history, its lineage, its adversarial audit pointer — not a number.

This sequencing follows a canonical Karpathy position that the platform's plan adopts. It changes the engineering order of operations: the eval-set browser ships first; the results browser, the per-repo freshness strip, and the ingest supervision tree are Phase 2 work.

What this dashboard refuses to do

The seven adversarial seats of the ratification council each carry refusal authority on specific surfaces. Their refusals are preserved verbatim in DR-035 § 8. The most consequential for a visitor reading this page:

What is sequenced for Phase 2

The roadmap, in dependency order:

  1. Schema evolution in intent-eval-core to v0.2.0 — adds pre_registration_hash, the retraction/v1 predicate type, and the dashboard-render/v1 predicate type. The latter is sequenced behind the arrival of a second independent verifier.
  2. Ingest supervision tree — six per-repo workers, each verifying signature, Rekor inclusion, and schema before content-addressing the bundle into local storage.
  3. Results browser — renders gate-result/v1 rows with the per-row visibility-tier gating and the four-timestamp staleness surface.
  4. Freshness strip + landing-page mandatory chart — the per-repo decision-mix strip at the top of the landing page.
  5. Operator-internal view — separate tailnet-only hostname, Tailscale identity, all reports regardless of public visibility tags.
  6. Retraction protocol — content-addressed denylist, Caddy-level kill-switch, signed retraction/v1 attestations, 4-hour SLO.
  7. Ops-lite alerting — push notifications, public status route, 7-day-silence pager threshold.
  8. Phase A.0 pre-registration rendering — symmetric arms, HTML structural diff CI test, blog-only fallback if symmetry cannot be guaranteed on the timeline.
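The ingest worker in item 2 verifies before it stores. A minimal sketch of that ordering follows, with the three checks injected as plain functions so the example stays self-contained. The `Verifiers` interface, the synchronous signatures, the in-memory `Map` store, and the choice of SHA-256 for the content address are all assumptions; real signature and Rekor verification would be asynchronous and would call the relevant sigstore tooling.

```typescript
import { createHash } from "node:crypto";

// Each check receives the raw bundle bytes (as a string here) and returns pass/fail.
type Check = (raw: string) => boolean;

// Hypothetical interface; stand-ins for real signature, Rekor, and schema checks.
interface Verifiers {
  verifySignature: Check;
  verifyRekorInclusion: Check;
  validateSchema: Check;
}

interface IngestResult {
  ok: boolean;
  digest?: string;                 // set only when every check passed
  failedStep?: keyof Verifiers;    // set only when a check failed
}

// Runs the checks in the order the roadmap lists them; the bundle is
// content-addressed and stored only after all three pass.
function ingest(raw: string, verifiers: Verifiers, store: Map<string, string>): IngestResult {
  const steps: (keyof Verifiers)[] = ["verifySignature", "verifyRekorInclusion", "validateSchema"];
  for (const step of steps) {
    if (!verifiers[step](raw)) {
      return { ok: false, failedStep: step };
    }
  }
  // SHA-256 is an assumed hash choice for the content address.
  const digest = createHash("sha256").update(raw, "utf8").digest("hex");
  store.set(digest, raw);
  return { ok: true, digest };
}

// Usage with stub verifiers.
const store = new Map<string, string>();
const allPass: Verifiers = {
  verifySignature: () => true,
  verifyRekorInclusion: () => true,
  validateSchema: () => true,
};
const accepted = ingest('{"rows":[]}', allPass, store);
const rejected = ingest('{"rows":[1]}', { ...allPass, verifyRekorInclusion: () => false }, store);
```

Because the store key is derived from the bytes, ingesting the same bundle twice is idempotent, and a rejected bundle never reaches local storage.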

Reading further