Eval Sets

Versioned, lineage-tracked specifications of what the Intent Eval Platform measures. The eval-set is the spec — every signed Evidence Bundle is an attestation about conformance to one of these.

Each eval-set is a complete document: its definition, its version history, its lineage to any predecessor, a pointer to an adversarial audit (when one exists), and the full list of tests it includes. Lineage is content-addressed, so a hash mismatch breaks any attempt to render an older run against a newer eval-set.
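As a rough illustration of what content-addressed lineage implies, the sketch below hashes a canonical JSON encoding of an eval-set document and refuses to render a run whose recorded digest does not match. The field names (`predecessor`, `eval_set_digest`), the canonicalization scheme, and the helper functions are assumptions made for this example; the platform's actual encoding and schema are not described on this page.

```python
import hashlib
import json


def content_address(doc: dict) -> str:
    """Digest of a canonical JSON encoding (sorted keys, no extra whitespace).

    Assumption: the real platform may canonicalize and hash differently.
    """
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


def can_render(run: dict, eval_set: dict) -> bool:
    """A run renders only against the exact eval-set content it was recorded against."""
    return run["eval_set_digest"] == content_address(eval_set)


# Hypothetical eval-set documents; v2 points at v1 by digest, not by name.
v1 = {"name": "example-eval-set", "version": "0.1.0",
      "predecessor": None, "tests": ["t1", "t2"]}
v2 = {"name": "example-eval-set", "version": "0.2.0",
      "predecessor": content_address(v1), "tests": ["t1", "t2", "t3"]}

# A run recorded against v1 carries v1's digest.
old_run = {"system": "example-system", "eval_set_digest": content_address(v1)}

print(can_render(old_run, v1))  # digest matches
print(can_render(old_run, v2))  # digest differs, so rendering is refused
```

Because the predecessor link is itself a digest, editing an older eval-set after the fact would change its hash and break the link recorded in its successor.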

How to read this page

An eval-set tagged active is the currently authoritative version. A draft tag means the methodology is open for review; the predicate URI it would attest against is reserved but not yet declared. A deprecated tag means a successor eval-set has replaced it; renderings of old runs against deprecated eval-sets are preserved for the audit trail but are marked as deprecated.
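The three tags can be read as a small state table. The following is a minimal sketch of that reading, not the platform's data model: the `Status` enum, the `describe` helper, and its keys are hypothetical, and treating a deprecated eval-set's predicate URI as already declared is an assumption (it was presumably active before being replaced).

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    DRAFT = "draft"
    DEPRECATED = "deprecated"


def describe(status: Status) -> dict:
    """Summarize what each tag implies, following the prose above."""
    return {
        # Only the active version is authoritative.
        "authoritative": status is Status.ACTIVE,
        # Draft: URI reserved but not yet declared.
        # Assumption: a deprecated eval-set was declared while it was active.
        "predicate_uri_declared": status is not Status.DRAFT,
        # Old runs against deprecated eval-sets are kept but marked.
        "renderings_marked": status is Status.DEPRECATED,
    }


for status in Status:
    print(status.value, describe(status))
```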

Current eval-sets

Scorecards

A scorecard is a result, not a spec. Each row is a measurement of one system against an eval-set, built to ship as a signed, Rekor-anchored Evidence Bundle. The eval-set above defines what is measured; a scorecard records what happened when something was measured against it. We keep them separate so a result is never mistaken for the specification it was measured against.

Coming next

These eval-sets are queued for v0.2.0 publication once their authoring lands upstream:

Neither will be published here until its specification is complete, its lineage is recorded, and an adversarial audit has been documented. We refuse to publish demo or skeleton eval-sets that would become the canonical example of a predicate URI before the methodology is sound.

How to contribute

Eval-sets are authored upstream in the repo they measure. Open an issue or pull request against the source repos linked above. When a new eval-set is ratified, it lands here automatically on the next daily cron refresh.