j-rig binary-criteria skill evaluation, 7 layers active
Score every Claude skill yes/no across seven layers — package integrity, trigger quality, functional quality, regression protection, baseline value, model variance, rollout safety. Never gradients. The eval-set that gives the Intent Eval Platform its name.
The thesis: binary, not gradient
Skill evaluation produces a vector of yes/no findings, never a number. The reason is straightforward: a numerical "skill score" composites incommensurable failure modes (a broken trigger, a missed regression case, a model-variance flake) into a single dimension where the consumer cannot tell what failed. A binary vector per layer preserves the failure-mode taxonomy.
This is the same reason this dashboard refuses to publish an aggregate PASS% across heterogeneous predicates. NOT_APPLICABLE is not PASS. ADVISORY is not PASS. Composition across predicate semantics is metric laundering.
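The laundering problem can be made concrete with a minimal sketch. Everything here is illustrative: the `Status` union, the `Finding` shape, and the predicate names are assumptions for the example (the text names only PASS, NOT_APPLICABLE, and ADVISORY; `FAIL` is added as the obvious counterpart), not the platform's actual types.

```typescript
// Hypothetical status vocabulary; FAIL is assumed as the counterpart of PASS.
type Status = "PASS" | "FAIL" | "NOT_APPLICABLE" | "ADVISORY";

interface Finding {
  predicate: string; // illustrative predicate name, not a real predicate URI
  status: Status;
}

// Group findings by status instead of collapsing them into one percentage.
function tally(findings: Finding[]): Record<Status, string[]> {
  const out: Record<Status, string[]> = {
    PASS: [],
    FAIL: [],
    NOT_APPLICABLE: [],
    ADVISORY: [],
  };
  for (const f of findings) out[f.status].push(f.predicate);
  return out;
}

// The move the text warns against: counting every non-FAIL as a pass.
function launderedPassPercent(findings: Finding[]): number {
  const nonFail = findings.filter((f) => f.status !== "FAIL").length;
  return (100 * nonFail) / findings.length;
}

const findings: Finding[] = [
  { predicate: "trigger-quality", status: "PASS" },
  { predicate: "baseline-value", status: "NOT_APPLICABLE" },
  { predicate: "rollout-safety", status: "ADVISORY" },
  { predicate: "regression-protection", status: "FAIL" },
];

const byStatus = tally(findings);
console.log(launderedPassPercent(findings)); // 75, although only one predicate passed
console.log(byStatus.PASS); // ["trigger-quality"]
```

The grouped view keeps the four statuses apart, so a reader can see that the "75" is built from one real pass, one predicate that did not apply, and one advisory.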
The seven layers
| Layer | What it checks |
|---|---|
| L1 — Package integrity | Does the skill load? Is the manifest valid? Are referenced files present? |
| L2 — Trigger quality | Does the skill activate when it should and stay silent when it shouldn't? Tested with positive + negative trigger cases. |
| L3 — Functional quality | When activated, does the skill produce the expected output shape? Are exact-match assertions satisfied? |
| L4 — Regression protection | Do prior-good behaviors still hold? Pinned reference outputs verified against current run. |
| L5 — Baseline value | Does the skill outperform the naive-prompt baseline? If the model can do the task as well without the skill, the skill's value is unproven. |
| L6 — Model variance | Multi-seed stability — does the skill produce consistent output across model temperature / sampling variation within the declared bound? |
| L7 — Rollout safety | Cost-bounded, refusal-rate-bounded, and safety-property-preserving under the declared deployment envelope. |
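As an illustration of how one layer reduces to a yes/no finding, here is a rough sketch of an L2 trigger-quality check over positive and negative cases. The `TriggerCase`, `ActivationProbe`, and `LayerFinding` names are hypothetical, and the toy probe stands in for whatever mechanism actually observes skill activation; the real j-rig implementation may be structured quite differently.

```typescript
// All names here are hypothetical; this is not the j-rig API.
interface TriggerCase {
  prompt: string;
  shouldActivate: boolean; // true = positive case, false = negative case
}

// Stand-in for "did the skill activate on this prompt?"
type ActivationProbe = (prompt: string) => boolean;

interface LayerFinding {
  layer: "L2";
  pass: boolean;
  failures: string[]; // prompts where observed activation disagreed with the expectation
}

function checkTriggerQuality(
  cases: TriggerCase[],
  probe: ActivationProbe,
): LayerFinding {
  const failures = cases
    .filter((c) => probe(c.prompt) !== c.shouldActivate)
    .map((c) => c.prompt);
  // Binary by construction: one wrong activation or missed trigger fails the layer.
  return { layer: "L2", pass: failures.length === 0, failures };
}

// Toy probe: pretend the skill activates whenever the prompt mentions "invoice".
const toyProbe: ActivationProbe = (prompt) =>
  prompt.toLowerCase().includes("invoice");

const triggerCases: TriggerCase[] = [
  { prompt: "Generate an invoice for order 1182", shouldActivate: true },
  { prompt: "Summarize this meeting transcript", shouldActivate: false },
  // Negative case that the toy probe gets wrong.
  { prompt: "Why was my invoice rejected?", shouldActivate: false },
];

const l2 = checkTriggerQuality(triggerCases, toyProbe);
console.log(l2.pass); // false
console.log(l2.failures); // ["Why was my invoice rejected?"]
```

The finding carries the failing prompts alongside the boolean, so the result stays binary while still telling the reader which case broke.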
What an evaluation produces
A single skill run produces seven binary findings — one per layer — packaged inside an Evidence Bundle. Each finding is a separate predicate-attestation row, signed via sigstore, anchored in the Rekor transparency log. There is no aggregate "skill score." A consumer reading the bundle sees the seven yes/no findings and decides for themselves which combinations matter for their use.
Example reading: a skill that passes L1–L4 but fails L5 (no baseline lift) is a skill that works but isn't useful. A skill that passes L5 but fails L7 (cost or refusal envelope breach) is useful in principle but unshippable. The platform refuses to collapse those into one number.
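The two readings above can be sketched as a consumer-side policy over the seven findings. This is one hypothetical consumer's interpretation, not part of the platform: the `Findings` shape and `readFindings` function are assumptions for illustration, the actual Evidence Bundle schema is not reproduced here, and the L6 branch is an invented extra reading that the text does not state.

```typescript
// Hypothetical shapes; not the Evidence Bundle schema.
type LayerId = "L1" | "L2" | "L3" | "L4" | "L5" | "L6" | "L7";
type Findings = Record<LayerId, boolean>;

// One consumer's reading of the vector. Another consumer may weigh the
// layers differently, which is why no aggregate is published.
function readFindings(f: Findings): string {
  const works = f.L1 && f.L2 && f.L3 && f.L4;
  if (!works) return "does not work reliably";
  if (!f.L5) return "works but isn't useful";
  if (!f.L7) return "useful in principle but unshippable";
  // Illustrative extra reading, not taken from the text.
  if (!f.L6) return "useful but unstable across samples";
  return "no objection under this policy";
}

const allPass: Findings = {
  L1: true, L2: true, L3: true, L4: true, L5: true, L6: true, L7: true,
};
const noLift: Findings = { ...allPass, L5: false };
const envelopeBreach: Findings = { ...allPass, L7: false };

console.log(readFindings(noLift)); // "works but isn't useful"
console.log(readFindings(envelopeBreach)); // "useful in principle but unshippable"
```

Both example skills have six of seven layers passing, yet they call for different decisions; a single number would make them indistinguishable.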
Version history
| Version | Date | Change |
|---|---|---|
| 1.0.0 | 2026-05-21 | Initial release. Seven layers locked. Relicensed from MIT to Apache 2.0 (see PR #73). |
| 0.x | 2026-03 → 2026-05 | Pre-1.0 iterations of the layer taxonomy. See upstream commit history. |
Adversarial audit
Queued. An adversarial audit of this eval-set against a curated corpus of skills with known failure modes is scheduled as part of the puxu.3 follow-up cluster. When complete, the audit report will be linked here and the eval-set status badge will reflect the audit outcome.
Until the audit lands, the eval-set is active for engineering use but carries this note. The associated predicate URI evals.intentsolutions.io/skill-binary-eval/v1 remains reserved but not yet declared, per the methodology.
Source and references
- jeremylongshore/j-rig-skill-binary-eval — upstream source repo (TypeScript pnpm monorepo)
- Blueprint B — runtime architecture + 13-entity canonical domain model + the gate-result/v1 predicate contract
- DR-010 — the binding unification thesis (every validator emits an Evidence Bundle)