Eval Sets

Versioned, lineage-tracked specifications of what the Intent Eval Platform measures. The eval-set is the spec — every signed Evidence Bundle is an attestation about conformance to one of these.

Each eval-set is a complete document: its definition, its version history, its lineage to any predecessor, a pointer to an adversarial audit (when one exists), and the full list of tests it includes. Lineage is content-addressed, so an older run cannot be silently rendered against a newer eval-set: the hash mismatch breaks the rendering.
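A minimal sketch of what content-addressed lineage can look like, assuming SHA-256 over a canonical JSON encoding. The field names (`predecessor_hash`, `eval_set_hash`) and the helper functions are hypothetical illustrations, not the platform's actual schema.

```python
import hashlib
import json
from typing import Optional


def content_hash(document: dict) -> str:
    """Return a SHA-256 hex digest over a canonical JSON encoding."""
    canonical = json.dumps(document, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_eval_set(name: str, version: str, tests: list,
                  predecessor: Optional[dict] = None) -> dict:
    """Build an eval-set record whose hash covers its content and its lineage."""
    body = {
        "name": name,
        "version": version,
        "tests": tests,
        # Hypothetical field: the predecessor is referenced by hash, not by name.
        "predecessor_hash": predecessor["hash"] if predecessor else None,
    }
    return {**body, "hash": content_hash(body)}


def can_render(run: dict, eval_set: dict) -> bool:
    """A run renders only against the exact eval-set content it was measured on."""
    return run["eval_set_hash"] == eval_set["hash"]


v1 = make_eval_set("example-set", "1.0.0", ["test-a", "test-b"])
v2 = make_eval_set("example-set", "2.0.0", ["test-a", "test-b", "test-c"],
                   predecessor=v1)
old_run = {"eval_set_hash": v1["hash"], "result": "pass"}

print(can_render(old_run, v1))  # True
print(can_render(old_run, v2))  # False
```

Because the predecessor hash is part of the hashed body, changing any ancestor changes every descendant's hash, which is what makes the lineage tamper-evident in this sketch.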

How to read this page

An eval-set tagged active is the currently authoritative version. A draft tag means the methodology is open for review, and the predicate URI it would attest against is reserved but not yet declared. A deprecated tag means a successor eval-set has replaced it; renderings of old runs against deprecated eval-sets are preserved for the audit trail but marked as deprecated.
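The three tags can be read as a small state table. The sketch below is one way to encode it; the `Status` enum and `describe` helper are hypothetical names, and treating the predicate URI as "declared" for every non-draft status is an assumption the page does not state outright.

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    DRAFT = "draft"
    DEPRECATED = "deprecated"


def describe(status: Status) -> dict:
    """Summarize what a status tag implies for a reader of this page."""
    return {
        # Only the active version is authoritative.
        "authoritative": status is Status.ACTIVE,
        # Drafts reserve their predicate URI; assumption: other statuses have declared it.
        "predicate_uri": "reserved" if status is Status.DRAFT else "declared",
        # Old runs against deprecated eval-sets stay visible but are marked.
        "old_runs_marked": status is Status.DEPRECATED,
    }


for status in Status:
    print(status.value, describe(status))
```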

Current eval-sets

j-rig binary-criteria skill evaluation, 7 layers (active)

version 2.1.0 · last changed 2026-07-19 · source jeremylongshore/j-rig-skill-binary-eval@main

Score every Claude skill yes/no across seven layers — package integrity, trigger quality, functional quality, regression protection, baseline value, model variance, rollout safety. Never gradients. The eval-set that gives this platform its name.
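As an illustration of "yes/no, never gradients", the sketch below stores one boolean per layer and rejects anything else. The `SkillScore` class is hypothetical, and it deliberately does not model how layer answers combine into a final verdict, since this page does not specify that rule.

```python
from dataclasses import dataclass

# The seven layers, in the order listed on this page.
LAYERS = (
    "package integrity",
    "trigger quality",
    "functional quality",
    "regression protection",
    "baseline value",
    "model variance",
    "rollout safety",
)


@dataclass(frozen=True)
class SkillScore:
    """Hypothetical record: one yes/no answer per layer for a single skill."""
    skill: str
    layers: dict  # layer name -> bool

    def __post_init__(self):
        if set(self.layers) != set(LAYERS):
            raise ValueError("every layer must be answered exactly once")
        for name, value in self.layers.items():
            # Never gradients: 0.7 or "mostly" is not an answer.
            if not isinstance(value, bool):
                raise TypeError(f"layer {name!r} must be True or False")

    def failed_layers(self) -> list:
        """Return the layers answered 'no', in canonical order."""
        return [name for name in LAYERS if not self.layers[name]]


answers = {name: True for name in LAYERS}
answers["model variance"] = False
score = SkillScore("example-skill", answers)
print(score.failed_layers())  # ['model variance']
```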

Scorecards

A scorecard is a result, not a spec. Each row is a measurement of one system against an eval-set, built to ship as a signed, Rekor-anchored Evidence Bundle. The eval-set above defines what is measured; a scorecard records what happened when something was measured against it. We keep them separate so a result is never mistaken for the specification it was measured against.
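Because each row is Rekor-anchored, a reader can look up the log entry behind a quoted index. The sketch below only builds a lookup URL and makes no network call. It assumes the public Sigstore instance at `rekor.sigstore.dev` and its v1 REST endpoint `/api/v1/log/entries?logIndex=`; this page does not say which endpoint the platform itself queries, and `rekor_entry_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumption: the public Sigstore Rekor instance and its v1 API.
REKOR_BASE_URL = "https://rekor.sigstore.dev"


def rekor_entry_url(log_index: int, base_url: str = REKOR_BASE_URL) -> str:
    """Build the URL that fetches a Rekor log entry by its log index."""
    if isinstance(log_index, bool) or not isinstance(log_index, int) or log_index < 0:
        raise ValueError("log_index must be a non-negative integer")
    query = urlencode({"logIndex": log_index})
    return f"{base_url}/api/v1/log/entries?{query}"


# Log index quoted on this page for the Phase A.0 PROCEED result.
print(rekor_entry_url(1689291334))
# https://rekor.sigstore.dev/api/v1/log/entries?logIndex=1689291334
```

Fetching that URL returns the log entry; verifying the signature and inclusion proof is a separate step that this sketch does not attempt.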

Evidence Bench — j-rig-bench scorecard (draft)

version 2.1.0 · last changed 2026-07-19 · attestation 1 of 1 signed

Signed, reproducible AI eval results. Each row is built to ship as a sigstore-signed, Rekor-anchored Evidence Bundle: reproducibility via signature, not just transparency. First row: Skill Refiner Phase A.0 (PROCEED). Complements gbrain-evals on an orthogonal dimension; does not restate it.

Dogfood — the platform evaluates its own shipped skills (draft)

version 2.1.0 · last changed 2026-07-19 · attestation 2 of 2 signed

The 7-layer skill-binary-eval methodology run against Intent Solutions' own published CoreWeave skills: the harness that gates external skills, gating ours. Both verdicts are keyless-signed and anchored in the public Rekor log: coreweave-gpu-node-forensics (SHIP, index 2085904207) and coreweave-gpu-cost-leak-hunter (BLOCK, index 2091983416). The BLOCK is an honest failing grade of our own skill. It was promoted from held only after noise-robust 5-sample majority judging made the verdict reproduce 7/7, and the per-vote record is inside the signed evidence.
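A minimal sketch of majority judging over repeated samples, to show why it dampens judge noise. The vote data, the function names, and the rule that a verdict "reproduces" when every repeated run yields the same majority are all assumptions for illustration; how the platform counts its 7/7 is not defined on this page.

```python
from collections import Counter


def majority_verdict(votes: list) -> str:
    """Return the verdict backed by a strict majority of the sampled votes."""
    verdict, count = Counter(votes).most_common(1)[0]
    if count * 2 <= len(votes):
        raise ValueError("no strict majority among the votes")
    return verdict


def reproduces(runs: list) -> bool:
    """True if every run's majority verdict is the same."""
    return len({majority_verdict(votes) for votes in runs}) == 1


# Hypothetical per-vote records: three repeated runs, five judge samples each.
runs = [
    ["BLOCK", "BLOCK", "SHIP", "BLOCK", "BLOCK"],
    ["BLOCK", "SHIP", "BLOCK", "BLOCK", "SHIP"],
    ["BLOCK", "BLOCK", "BLOCK", "BLOCK", "BLOCK"],
]

print([majority_verdict(votes) for votes in runs])  # ['BLOCK', 'BLOCK', 'BLOCK']
print(reproduces(runs))  # True
```

Individual samples in the example disagree, yet each run's majority is stable. Keeping the raw votes alongside the verdict is what lets a reader audit that stability rather than take it on trust.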

Coming next

These eval-sets are queued for v0.2.0 publication once their authoring lands upstream:

audit-harness deterministic gate-set (jeremylongshore/intent-audit-harness) — escape-scan, CRAP score, architecture rules, bias count, Gherkin lint. The infrastructure layer underneath every IEP repo. The gates themselves are shipped and already emit signed gate-result rows (rendered in this site's results browser); what is queued here is the versioned eval-set specification of that gate-set.
Skill Refiner Phase A.0 baseline eval-set (jeremylongshore/intent-eval-lab) — The null-hypothesis spec that gated Skill Refiner Phase A, pre-registered ahead of the run per DR-028. The run has since completed — its signed result (Phase A.0 PROCEED, Rekor log index 1689291334) is the first row on the Evidence Bench scorecard; what is queued here is the eval-set spec page itself, pending its documented adversarial audit.

Neither will be published here until its specification is complete, its lineage is recorded, and an adversarial audit has been documented. We refuse to publish demo or skeleton eval-sets that would become the canonical example of a predicate URI before the methodology is sound.

How to contribute

Eval-sets are authored upstream in the repo they measure. Open an issue or pull request against the source repos linked above. When a new eval-set is ratified, it lands here automatically on the next daily cron refresh.
