Methodology

How the Intent Eval Platform works, what an Evidence Bundle is, why the eval-set browser shipped ahead of the results browser, and what this dashboard refuses to do.

The unification thesis

Every validator in the platform — deterministic gate, behavioral eval, schema check, security scanner — emits its findings as rows inside a common artifact called an Evidence Bundle. The bundle is content-addressed, DSSE-signed, and anchored in the Rekor transparency log. The schema is authored once, in the canonical contracts kernel (intent-eval-core), and consumed by every downstream renderer.

That single binding — every validator emits a bundle — is the architectural pivot the platform's ratification council voted into place in DR-010. Without it, the eval space fragments into incompatible JSON shapes. With it, downstream consumers — including this dashboard — read one shape and trust its provenance via sigstore + Rekor.
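As a rough illustration of the "one shape" idea, the sketch below has two toy validators return the same bundle type, and a consumer that reads both through a single code path. Every name here except predicate_uri is a hypothetical stand-in (the real schema is authored in intent-eval-core), the example URIs are invented, and signing and content-addressing are left out entirely.

```typescript
// Hypothetical shapes for illustration; the real schema lives in intent-eval-core.
type ValidatorKind =
  | "deterministic-gate"
  | "behavioral-eval"
  | "schema-check"
  | "security-scanner";

interface EvidenceRow {
  predicate_uri: string; // field named on this page
  verdict: "pass" | "fail"; // assumed field name and values
}

interface EvidenceBundle {
  producer: ValidatorKind; // assumed field
  rows: EvidenceRow[];
}

// Whatever a validator checks, it returns the same bundle shape.
type Validator = (input: string) => EvidenceBundle;

// Toy deterministic gate: fails when the input contains an ESC character.
const escapeScan: Validator = (input) => ({
  producer: "deterministic-gate",
  rows: [
    {
      predicate_uri: "https://evals.intentsolutions.io/example/escape-scan/v1", // invented URI
      verdict: input.indexOf("\u001b") === -1 ? "pass" : "fail",
    },
  ],
});

// Toy schema check: passes when the input parses as JSON.
const jsonCheck: Validator = (input) => {
  let verdict: "pass" | "fail" = "pass";
  try {
    JSON.parse(input);
  } catch {
    verdict = "fail";
  }
  return {
    producer: "schema-check",
    rows: [
      {
        predicate_uri: "https://evals.intentsolutions.io/example/json-parse/v1", // invented URI
        verdict,
      },
    ],
  };
};

// A consumer needs one code path, regardless of which validator produced the bundle.
function failingPredicates(bundles: EvidenceBundle[]): string[] {
  const failing: string[] = [];
  for (const bundle of bundles) {
    for (const row of bundle.rows) {
      if (row.verdict === "fail") failing.push(row.predicate_uri);
    }
  }
  return failing;
}
```

Calling failingPredicates on bundles from both validators needs no per-producer branching, which is the property the thesis depends on.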

The six repos

The platform composes via the Evidence Bundle, not via package consolidation. Six repos play distinct roles:

Repo	Role
intent-eval-core	Canonical contracts kernel. TypeScript types, JSON Schemas, Zod validators, state machines for the canonical entities. Published as `@intentsolutions/core`. No execution, no judges, no runtime — just the contracts.
intent-eval-lab	Methodology + specs + the constitution. Blueprints A/B/C, the canonical glossary, every ratified Decision Record. This dashboard's architecture is defined in `intent-eval-lab/000-docs/035-AT-DECR-...`.
intent-audit-harness	Deterministic gates (escape-scan, CRAP score, architecture rules, bias count, Gherkin lint) plus the emit-evidence layer that wraps gate output in Evidence Bundles.
j-rig-skill-binary-eval	Behavioral evaluation harness — the 7-layer binary-criteria skill eval-set lives here.
intent-rollout-gate	A thin GitHub Action that consumes Evidence Bundles and decides ship / no-ship per repo policy.
intent-eval-dashboard	This dashboard. The most visible kernel consumer in the system, deliberately separate from the constitution repo so the spec author and the renderer share no release cycle.

Where predicate URIs live

Every signed predicate row inside an Evidence Bundle declares a predicate_uri that names the contract it satisfies. Those URIs are normative, immutable identifiers. They live exclusively at evals.intentsolutions.io, never on this dashboard's domain. This is a CISO-bound separation reaffirmed in DR-035.

labs.intentsolutions.io renders content about predicate URIs but never declares one. A mistyped predicate URI in content rendered here is not a new URI claim; it is a documentation bug.
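The host rule above can be checked mechanically. The following is a minimal sketch, not the platform's actual lint: the function name is hypothetical, the https requirement is an assumption, and the sample paths are invented for the example.

```typescript
// The only host allowed to declare predicate URIs, per this page.
const PREDICATE_HOST = "evals.intentsolutions.io";

// Hypothetical helper: true only for URIs on the normative host.
// Requiring https is an assumption of this sketch, not a documented rule.
function isOnNormativeHost(candidate: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(candidate);
  } catch {
    return false; // not parseable as an absolute URL
  }
  // Exact hostname comparison, so look-alike suffixes such as
  // "evals.intentsolutions.io.example.com" are rejected.
  return parsed.protocol === "https:" && parsed.hostname === PREDICATE_HOST;
}

// Invented sample paths, for illustration only.
const onEvals = isOnNormativeHost("https://evals.intentsolutions.io/example/v1");
const onLabs = isOnNormativeHost("https://labs.intentsolutions.io/example/v1");
```

Here onEvals is true and onLabs is false. A check like this only tells you where a URI points; it says nothing about whether the URI names a contract that actually exists.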

Why the eval-set shipped before the results

The eval-set is the spec. If a public eval dashboard renders results rows ahead of publishing the spec those rows attest against, the dashboard is publishing attestations of conformance to a spec that doesn't exist. The first page a hostile reader clicks should be the spec — the eval-set definition, its version history, its lineage, its adversarial audit pointer — not a number.

This sequencing is a Karpathy canonical position adopted into the platform's plan. It set the engineering order of operations: the eval-set browser shipped first. The surfaces that were sequenced behind it — the results browser, the per-repo freshness strip on the landing page, and the ingest supervision tree behind /status/ — have since shipped and are live. The argument stands; the queue it described has largely drained.

What this dashboard refuses to do

The seven adversarial seats of the ratification council each carry refusal authority on specific surfaces. Their refusals are preserved verbatim in DR-035 § 8. The most consequential for a visitor reading this page:

No aggregate PASS% across heterogeneous predicates. The CI lint refuses it; the type system refuses it; three separate seats refuse to ratify any rendering that produces it.
No basicauth on the public origin for operator views. If there is information internal to the operator, it lives on a tailnet-only hostname.
No predicate URIs declared here. URIs at evals.intentsolutions.io only.
No render-from-manifest without re-verification. Pinned OIDC subject and workflow_ref per source repo, Rekor inclusion proof, schema validation row-by-row before any HTML is generated.
No GCP-hosted object storage. Content-addressed Evidence Bundle storage lives on the Intent Solutions VPS at v0.1.0; migration to Backblaze B2 is queued at the 12-month / 100 GB trigger.
No asymmetric rendering of null-hypothesis results. Both arms of a comparison render with identical layout, font weight, and chart axes, so a null result looks exactly like a positive one. Information design refuses to bury inconvenient outcomes.
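The first refusal in the list above, no aggregate PASS% across heterogeneous predicates, can be sketched as a runtime guard. This is an illustration under assumptions: the row shape and function name are hypothetical, and the page describes enforcement through a CI lint and the type system, which this sketch does not reproduce.

```typescript
// Hypothetical row shape; only predicate_uri is a name taken from this page.
interface ResultRow {
  predicate_uri: string;
  passed: boolean;
}

type PassCount =
  | { ok: true; predicate_uri: string; passed: number; total: number }
  | { ok: false; reason: string };

// Counts passes only when every row attests against the same predicate.
// Mixed predicates yield a refusal instead of a blended number.
function passCount(rows: ResultRow[]): PassCount {
  if (rows.length === 0) {
    return { ok: false, reason: "no rows" };
  }
  const uri = rows[0].predicate_uri;
  if (!rows.every((row) => row.predicate_uri === uri)) {
    return { ok: false, reason: "rows span more than one predicate_uri" };
  }
  const passed = rows.filter((row) => row.passed).length;
  return { ok: true, predicate_uri: uri, passed, total: rows.length };
}
```

The function returns counts rather than a percentage, so the denominator stays visible and a caller cannot average results across predicates without first stepping around the refusal branch.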

Phase 2 — what shipped, what is still queued

An earlier version of this page listed the Phase 2 roadmap as queued. Most of it has since landed. The original items, in their original dependency order, with honest current state:

Schema evolution in intent-eval-core — shipped, and overtaken. The kernel is published at 0.9.0 (well past the v0.2.0 target this page originally named) and carries pre_registration_hash plus the retraction/v1 and dashboard-render/v1 schema types.
Ingest supervision tree — shipped. Per-repo workers verify signature, Rekor inclusion, and schema before content-addressing each bundle; its USE-method view is public at /status/.
Results browser — shipped. /results/ renders verified gate-result/v1 rows per source repo.
Freshness strip + landing-page mandatory chart — shipped. The per-repo decision-mix strip is live at the top of the landing page; an hour of silence renders as loudly as a failure.
Operator-internal view — shipped. It lives on a tailnet-only hostname, which is why there is nothing to link here.
Retraction machinery — shipped in this repo: content-addressed denylist and tombstone rendering, with the retraction/v1 schema type in the kernel.
Ops-lite alerting — shipped. Public status route at /status/ plus silence-threshold liveness alerting.
Phase A.0 pre-registration rendering — shipped. Both arms render symmetrically on the Evidence Bench scorecard, enforced by an arm-symmetry CI lint.
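The ingest order described in the list above (signature, then Rekor inclusion, then schema, all before a bundle is content-addressed) can be sketched as a short-circuiting pipeline. The names below are hypothetical and the checks are stubs; real signature and inclusion-proof verification would call out to Sigstore tooling and would most likely be asynchronous.

```typescript
// A named verification step. The run function is injected so the
// pipeline itself stays independent of any particular verifier.
interface Check {
  name: string;
  run: (bundle: unknown) => boolean;
}

type IngestOutcome =
  | { accepted: true }
  | { accepted: false; failedAt: string };

// Runs checks in order and stops at the first failure, so later
// steps never see a bundle that failed an earlier one.
function ingest(bundle: unknown, checks: Check[]): IngestOutcome {
  for (const check of checks) {
    if (!check.run(bundle)) {
      return { accepted: false, failedAt: check.name };
    }
  }
  return { accepted: true };
}

// Stub checks that record when they were called (illustration only).
const calls: string[] = [];
function stub(name: string, result: boolean): Check {
  return {
    name,
    run: () => {
      calls.push(name);
      return result;
    },
  };
}

const outcome = ingest({}, [
  stub("signature", true),
  stub("rekor-inclusion", false),
  stub("schema", true),
]);
```

In this run the Rekor stub fails, so outcome reports failedAt as "rekor-inclusion" and the schema stub is never called. Only a bundle that clears every step would go on to be content-addressed.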

Still queued: the signed dashboard-render/v1 attestation, in which the site signs its own render. The schema type exists in the kernel; the attestation itself remains sequenced behind the arrival of a second independent verifier. Until that lands, this page says so rather than implying the render is signed.