j-rig binary-criteria skill evaluation, 7 layers active

Score every Claude skill yes/no across seven layers — package integrity, trigger quality, functional quality, regression protection, baseline value, model variance, rollout safety. Never gradients. The eval-set that gives the Intent Eval Platform its name.

id: j-rig-binary-eval-7-layer
version: 2.1.0
status: active
predicate URI: evals.intentsolutions.io/skill-binary-eval/v1 (reserved; declared at Phase 2)
predecessor: none — initial baseline
upstream source: jeremylongshore/j-rig-skill-binary-eval
last changed: 2026-07-19
adversarial audit: queued — placeholder for puxu.3 follow-up

The thesis: binary, not gradient

Skill evaluation produces a vector of yes/no findings, never a number. The reason is straightforward: a numerical "skill score" collapses incommensurable failure modes (a broken trigger, a missed regression case, a model-variance flake) into a single dimension, so the consumer cannot tell what failed. A binary vector with one entry per layer preserves the failure-mode taxonomy.
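To make the argument concrete, here is a minimal TypeScript sketch. The layer keys, the `BinaryVector` type, and the `naiveScore` and `failedLayers` helpers are hypothetical names chosen for illustration; they are not taken from the j-rig codebase. The sketch shows two skills with different failures receiving the same composite score, while the vector keeps them apart.

```typescript
// Hypothetical layer keys for illustration; not the upstream schema.
const LAYERS = ["L1", "L2", "L3", "L4", "L5", "L6", "L7"] as const;
type Layer = (typeof LAYERS)[number];
type BinaryVector = Record<Layer, boolean>;

// The kind of composite the thesis rejects: fraction of layers passed.
function naiveScore(v: BinaryVector): number {
  const passed = LAYERS.filter((l) => v[l]).length;
  return passed / LAYERS.length;
}

// Lists the layers that failed, which the scalar cannot express.
function failedLayers(v: BinaryVector): Layer[] {
  return LAYERS.filter((l) => !v[l]);
}

const allPass: BinaryVector = {
  L1: true, L2: true, L3: true, L4: true, L5: true, L6: true, L7: true,
};
const brokenTrigger: BinaryVector = { ...allPass, L2: false };
const varianceFlake: BinaryVector = { ...allPass, L6: false };

// Both skills get the same number, yet they failed for unrelated reasons.
console.log(naiveScore(brokenTrigger) === naiveScore(varianceFlake));
console.log(failedLayers(brokenTrigger), failedLayers(varianceFlake));
```

A consumer who only sees the shared score has no way to distinguish a trigger defect from a sampling flake; the vector answers that question directly.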

This is the same reason this dashboard refuses to publish an aggregate PASS% across heterogeneous predicates. NOT_APPLICABLE is not PASS. ADVISORY is not PASS. Composition across predicate semantics is metric laundering.
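One way to avoid that kind of laundering is to keep a separate count per outcome and never fold the buckets together. The sketch below is illustrative only: `PASS`, `NOT_APPLICABLE`, and `ADVISORY` are named above, while the `FAIL` value, the `Tally` type, and the `tally` function are assumptions made for the example.

```typescript
// PASS, NOT_APPLICABLE and ADVISORY are named in the text; FAIL is assumed.
type Outcome = "PASS" | "FAIL" | "NOT_APPLICABLE" | "ADVISORY";
type Tally = Record<Outcome, number>;

// Counts each outcome in its own bucket; nothing is folded into PASS
// and no percentage is derived from the counts.
function tally(outcomes: readonly Outcome[]): Tally {
  const counts: Tally = { PASS: 0, FAIL: 0, NOT_APPLICABLE: 0, ADVISORY: 0 };
  for (const o of outcomes) {
    counts[o] += 1;
  }
  return counts;
}

const counts = tally(["PASS", "NOT_APPLICABLE", "ADVISORY", "PASS", "FAIL"]);
console.log(counts);
```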

The seven layers

Layer	What it checks
L1 — Package integrity	Does the skill load? Is the manifest valid? Are referenced files present?
L2 — Trigger quality	Does the skill activate when it should and stay silent when it shouldn't? Tested with positive + negative trigger cases.
L3 — Functional quality	When activated, does the skill produce the expected output shape? Are exact-match assertions satisfied?
L4 — Regression protection	Do prior-good behaviors still hold? Pinned reference outputs verified against current run.
L5 — Baseline value	Does the skill outperform the naive-prompt baseline? If the model can do the task as well without the skill, the skill's value is unproven.
L6 — Model variance	Multi-seed stability — does the skill produce consistent output across model temperature / sampling variation within the declared bound?
L7 — Rollout safety	Cost-bounded, refusal-rate-bounded, and safety-property-preserving under the declared deployment envelope.
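As a rough illustration of how a layer reduces to a single yes/no, the sketch below covers L2 and L6 only. Everything in it is an assumption: the `TriggerCase` shape, the `activates` callback standing in for a real skill router, and the choice to measure variance as the share of runs that disagree with the most common output. The upstream harness may define these checks differently.

```typescript
// Hypothetical shapes for illustration only; the upstream harness may differ.
interface TriggerCase {
  prompt: string;
  shouldActivate: boolean; // true = positive case, false = negative case
}

// L2 sketch: passes only if every positive case activates the skill
// and every negative case leaves it silent.
function triggerQualityPasses(
  cases: readonly TriggerCase[],
  activates: (prompt: string) => boolean,
): boolean {
  return cases.every((c) => activates(c.prompt) === c.shouldActivate);
}

// L6 sketch: passes only if the share of runs that disagree with the most
// common output stays within the declared bound (a fraction from 0 to 1).
function modelVariancePasses(outputs: readonly string[], declaredBound: number): boolean {
  if (outputs.length === 0) return false; // no runs, nothing demonstrated
  let modal = 0;
  for (const o of outputs) {
    const n = outputs.filter((x) => x === o).length;
    if (n > modal) modal = n;
  }
  const disagreement = 1 - modal / outputs.length;
  return disagreement <= declaredBound;
}

// Toy activation rule standing in for a real skill router.
const toyActivates = (prompt: string): boolean =>
  prompt.toLowerCase().indexOf("changelog") !== -1;

const triggerOk = triggerQualityPasses(
  [
    { prompt: "Write a changelog entry", shouldActivate: true },
    { prompt: "What is the weather?", shouldActivate: false },
  ],
  toyActivates,
);
// Three of four runs agree, so disagreement is 0.25 against a 0.1 bound.
const varianceOk = modelVariancePasses(["a", "a", "a", "b"], 0.1);
console.log({ triggerOk, varianceOk });
```

Note that the bound is an input to the check, not an output: the layer still reports only whether the declared bound held.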

What an evaluation produces

A single skill run produces seven binary findings — one per layer — packaged inside an Evidence Bundle. Each finding is a separate predicate-attestation row, signed via sigstore, anchored in the Rekor transparency log. There is no aggregate "skill score." A consumer reading the bundle sees the seven yes/no findings and decides for themselves which combinations matter for their use.
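The shape described above might be modelled roughly as follows. This is a sketch under stated assumptions: the `Finding` and `EvidenceBundle` interfaces, their field names, and the `buildBundle` helper are hypothetical, and the real Evidence Bundle and `gate-result/v1` schemas live upstream. Signing and Rekor anchoring are deliberately left out.

```typescript
// Hypothetical shapes; the real Evidence Bundle and gate-result/v1 schemas
// live upstream and may differ. Signing and log anchoring are out of scope.
const LAYERS = ["L1", "L2", "L3", "L4", "L5", "L6", "L7"] as const;
type Layer = (typeof LAYERS)[number];

interface Finding {
  layer: Layer;
  pass: boolean; // binary by construction: no score field exists
}

interface EvidenceBundle {
  skillId: string;
  findings: Finding[]; // exactly one row per layer, never an aggregate
}

// Builds a bundle and rejects missing or duplicated layers.
function buildBundle(skillId: string, findings: readonly Finding[]): EvidenceBundle {
  for (const layer of LAYERS) {
    const rows = findings.filter((f) => f.layer === layer).length;
    if (rows !== 1) {
      throw new Error(`expected exactly one finding for ${layer}, got ${rows}`);
    }
  }
  return { skillId, findings: findings.slice() };
}

const bundle = buildBundle(
  "example-skill",
  LAYERS.map((layer) => ({ layer, pass: layer !== "L5" })),
);
console.log(bundle.findings.map((f) => `${f.layer}:${f.pass ? "yes" : "no"}`).join(" "));
```

Leaving any score field out of the type is the design point: a consumer cannot read an aggregate that the bundle never carries.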

Example reading: a skill that passes L1–L4 but fails L5 (no baseline lift) is a skill that works but isn't useful. A skill that passes L5 but fails L7 (cost or refusal envelope breach) is useful in principle but unshippable. The platform refuses to collapse those into one number.
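The two readings above can be expressed as a consumer-side policy. The `Findings` type and the `read` function below are hypothetical and belong to the consumer, not the platform, which emits no verdict of its own.

```typescript
// Hypothetical consumer-side policy; the platform itself emits no verdict.
type Layer = "L1" | "L2" | "L3" | "L4" | "L5" | "L6" | "L7";
type Findings = Record<Layer, boolean>;

// Mirrors the two example readings in the text; every other combination
// is left for the consumer to interpret.
function read(f: Findings): string {
  const works = f.L1 && f.L2 && f.L3 && f.L4;
  if (works && !f.L5) return "works but adds no proven value over the baseline";
  if (f.L5 && !f.L7) return "useful in principle but unshippable";
  return "no canned reading; inspect the seven findings directly";
}

const noLift: Findings = {
  L1: true, L2: true, L3: true, L4: true, L5: false, L6: true, L7: true,
};
const envelopeBreach: Findings = { ...noLift, L5: true, L7: false };

console.log(read(noLift));
console.log(read(envelopeBreach));
```

A different consumer could weigh the same seven findings differently, which is why the policy sits outside the bundle.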

Version history

Version	Date	Change
`2.1.0`	2026-06-15	Skill-frontmatter validation cut over to the kernel `@intentsolutions/core` `authoring/v1` single source of truth; rollout decision logic extracted into the published `@intentsolutions/rollout-gate` package; provider measurement adapters added.
`2.0.0`	2026-06-12	Breaking: full predicate-body migration from the `v0.1.0-draft` shape to the kernel `gate-result/v1` normative shape (DR-018 Option α). See upstream `MIGRATION.md`.
`1.1.0` / `1.2.0`	2026-05-26 / 2026-06-08	Stub-provider explicit opt-in; emit-evidence produces signed `gate-result/v1` rows consumed by this site's results browser.
`1.0.0`	2026-05-19	Initial release. Seven layers locked. Relicensed from MIT to Apache 2.0 (see PR #73).
`0.x`	2026-03 → 2026-05	Pre-1.0 iterations of the layer taxonomy. See upstream commit history.

Adversarial audit

Queued. An adversarial audit of this eval-set against a curated corpus of skills with known failure modes is scheduled as part of the puxu.3 follow-up cluster. When complete, the audit report will be linked here and the eval-set status badge will reflect the audit outcome.

Until the audit lands, the eval-set is active for engineering use but flagged with this note. The associated predicate URI evals.intentsolutions.io/skill-binary-eval/v1 remains reserved and is not yet declared, per the methodology.

Source and references

jeremylongshore/j-rig-skill-binary-eval — upstream source repo (TypeScript pnpm monorepo)
Blueprint B — runtime architecture + 13-entity canonical domain model + the gate-result/v1 predicate contract
DR-010 — the binding unification thesis (every validator emits an Evidence Bundle)