Eval Sets
Versioned, lineage-tracked specifications of what the Intent Eval Platform measures. The eval-set is the spec — every signed Evidence Bundle is an attestation about conformance to one of these.
Each eval-set is a complete document: its definition, its version history, its lineage to any predecessor, a pointer to an adversarial audit (when one exists), and the full list of tests it includes. Lineage is content-addressed: a run recorded against one version cannot be rendered against a newer eval-set, because the hash mismatch breaks the rendering.
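As a rough illustration of what content-addressed lineage implies, the sketch below hashes an eval-set document and checks a run against it. The source does not state the platform's hashing scheme, so SHA-256 over canonical JSON is an assumption here, and every field name (eval_set_hash, predecessor, tests) is hypothetical.

```python
import hashlib
import json


def content_hash(doc: dict) -> str:
    """Hash a document by content: SHA-256 over canonical JSON (assumed scheme)."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


def lineage_matches(run_record: dict, eval_set: dict) -> bool:
    """True only if the run was measured against exactly this eval-set content."""
    return run_record["eval_set_hash"] == content_hash(eval_set)


# Hypothetical eval-set versions; v2 points at v1 by hash, not by name.
v1 = {"name": "example-eval-set", "version": "0.1.0",
      "predecessor": None, "tests": ["t1", "t2"]}
v2 = {"name": "example-eval-set", "version": "0.2.0",
      "predecessor": content_hash(v1), "tests": ["t1", "t2", "t3"]}

# A run recorded against v1 carries v1's hash.
run = {"system": "example-system", "eval_set_hash": content_hash(v1)}

print(lineage_matches(run, v1))  # the run renders against the version it measured
print(lineage_matches(run, v2))  # a newer eval-set has a different hash
```

Because the predecessor link is itself a hash, changing any earlier version would change every hash downstream of it, which is what makes a mismatch detectable.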
How to read this page
An eval-set tagged active is the currently authoritative version. A draft tag means the methodology is open for review; the predicate URI it would attest against is reserved but not yet declared. A deprecated tag means a successor eval-set has replaced it; renderings of old runs against deprecated eval-sets are preserved for the audit trail but marked as deprecated.
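The three tags can be read as a small state model. The sketch below is only one way to encode the rules described above; the Tag enum and the keys returned by describe are hypothetical names, not part of the platform.

```python
from enum import Enum


class Tag(Enum):
    ACTIVE = "active"
    DRAFT = "draft"
    DEPRECATED = "deprecated"


def describe(tag: Tag) -> dict:
    """Summarise what each tag implies, following the reading guide above."""
    return {
        # Only the active version is authoritative.
        "authoritative": tag is Tag.ACTIVE,
        # A draft's predicate URI is reserved but not yet declared.
        "predicate_uri_reserved_only": tag is Tag.DRAFT,
        # Old renderings against a deprecated eval-set are kept but marked.
        "mark_old_renderings": tag is Tag.DEPRECATED,
    }


for tag in Tag:
    print(tag.value, describe(tag))
```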
Current eval-sets
- j-rig binary-criteria skill evaluation, 7 layers (active)
Score every Claude skill yes/no across seven layers — package integrity, trigger quality, functional quality, regression protection, baseline value, model variance, rollout safety. Never gradients. The eval-set that gives this platform its name.
Scorecards
A scorecard is a result, not a spec. Each row is a measurement of one system against an eval-set, built to ship as a signed, Rekor-anchored Evidence Bundle. The eval-set above defines what is measured; a scorecard records what happened when something was measured against it. We keep them separate so a result is never mistaken for the specification it was measured against.
- Evidence Bench: j-rig-bench scorecard (draft)
Signed, reproducible AI eval results. Each row is built to ship as a sigstore-signed, Rekor-anchored Evidence Bundle — reproducibility via signature, not just transparency. First row: Skill Refiner Phase A.0 (PROCEED). Complements gbrain-evals on an orthogonal dimension; does not restate it.
Coming next
These eval-sets are queued for v0.2.0 publication once their authoring lands upstream:
- audit-harness deterministic gate-set (jeremylongshore/intent-audit-harness): escape-scan, CRAP score, architecture rules, bias count, Gherkin lint. The infrastructure layer underneath every IEP repo.
- Skill Refiner Phase A.0 baseline eval-set (jeremylongshore/intent-eval-lab): the null-hypothesis spec that gates Skill Refiner Phase A. Pre-registered ahead of the run per DR-028.
Neither will be published here until its specification is complete, its lineage is recorded, and an adversarial audit has been documented. We refuse to publish demo or skeleton eval-sets that would become the canonical example of a predicate URI before the methodology is sound.
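The three publication conditions above amount to a simple gate. The following is a minimal sketch of that rule under assumed names; Candidate, blocking_reasons, and publishable are hypothetical and do not describe the platform's actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A queued eval-set and the three conditions it must meet (hypothetical model)."""
    name: str
    spec_complete: bool
    lineage_recorded: bool
    audit_documented: bool


def blocking_reasons(c: Candidate) -> list:
    """Return every unmet condition; an empty list means nothing blocks publication."""
    reasons = []
    if not c.spec_complete:
        reasons.append("specification incomplete")
    if not c.lineage_recorded:
        reasons.append("lineage not recorded")
    if not c.audit_documented:
        reasons.append("adversarial audit not documented")
    return reasons


def publishable(c: Candidate) -> bool:
    """All three conditions must hold; there is no partial or demo publication."""
    return not blocking_reasons(c)


# Illustrative values only; they do not reflect the real status of any eval-set.
skeleton = Candidate("example-skeleton", spec_complete=True,
                     lineage_recorded=True, audit_documented=False)
ready = Candidate("example-ready", True, True, True)

print(publishable(skeleton), blocking_reasons(skeleton))
print(publishable(ready), blocking_reasons(ready))
```

Returning the list of unmet conditions, rather than a bare boolean, keeps the reason for a refusal visible in the audit trail.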
How to contribute
Eval-sets are authored upstream in the repo they measure. Open an issue or pull request against the source repos linked above. When a new eval-set is ratified, it lands here automatically on the next daily cron refresh.