
j-rig binary-criteria skill evaluation, 7 layers active

Score every Claude skill yes/no across seven layers — package integrity, trigger quality, functional quality, regression protection, baseline value, model variance, rollout safety. Never gradients. The eval-set that gives the Intent Eval Platform its name.

id: j-rig-binary-eval-7-layer
version: 1.1.0
status: active
predicate URI: evals.intentsolutions.io/skill-binary-eval/v1 (reserved; declared at Phase 2)
predecessor: none (initial baseline)
upstream source: jeremylongshore/j-rig-skill-binary-eval
last changed: 2026-05-31
adversarial audit: queued (placeholder for puxu.3 follow-up)

The thesis: binary, not gradient

Skill evaluation produces a vector of yes/no findings, never a number. The reason is straightforward: a numerical "skill score" composites incommensurable failure modes (a broken trigger, a missed regression case, a model-variance flake) into a single dimension where the consumer cannot tell what failed. A binary vector per layer preserves the failure-mode taxonomy.

This is the same reason this dashboard refuses to publish an aggregate PASS% across heterogeneous predicates. NOT_APPLICABLE is not PASS. ADVISORY is not PASS. Composition across predicate semantics is metric laundering.
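To make the laundering concern concrete, here is a minimal Python sketch. The `Status` enum and both function names are hypothetical illustrations, not part of the platform's API. It contrasts a per-status tally, which keeps the semantics separate, with the aggregate percentage that this page argues against.

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    # Hypothetical status labels, taken from the terms used in the text.
    PASS = "PASS"
    FAIL = "FAIL"
    NOT_APPLICABLE = "NOT_APPLICABLE"
    ADVISORY = "ADVISORY"


def tally_by_status(statuses):
    """Count findings per status without folding any status into PASS."""
    return Counter(statuses)


def naive_pass_percent(statuses):
    """The anti-pattern: everything that is not FAIL gets counted as a pass."""
    not_failed = sum(1 for s in statuses if s is not Status.FAIL)
    return 100.0 * not_failed / len(statuses)


statuses = [Status.PASS, Status.NOT_APPLICABLE, Status.ADVISORY, Status.FAIL]
tally = tally_by_status(statuses)
laundered = naive_pass_percent(statuses)
```

In this toy input only one finding is an actual PASS, yet the naive percentage reports 75. The tally keeps the four statuses distinct, so a reader can see what the percentage hides.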

The seven layers

Layer | What it checks
L1: Package integrity | Does the skill load? Is the manifest valid? Are referenced files present?
L2: Trigger quality | Does the skill activate when it should and stay silent when it shouldn't? Tested with positive + negative trigger cases.
L3: Functional quality | When activated, does the skill produce the expected output shape? Are exact-match assertions satisfied?
L4: Regression protection | Do prior-good behaviors still hold? Pinned reference outputs verified against current run.
L5: Baseline value | Does the skill outperform the naive-prompt baseline? If the model can do the task as well without the skill, the skill's value is unproven.
L6: Model variance | Multi-seed stability: does the skill produce consistent output across model temperature / sampling variation within the declared bound?
L7: Rollout safety | Cost-bounded, refusal-rate-bounded, and safety-property-preserving under the declared deployment envelope.
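As an illustration of how a layer reduces to a yes/no finding, the sketch below covers the L2 and L6 rows. Everything in it is an assumption made for the example: the function names, the `toy_trigger` stand-in, and especially the reading of the "declared bound" as a cap on the number of distinct outputs. The upstream repository may define these checks differently.

```python
def trigger_quality(should_activate, positive_cases, negative_cases):
    """L2 sketch: True only if the trigger fires on every positive case
    and stays silent on every negative case."""
    fires_when_needed = all(should_activate(p) for p in positive_cases)
    silent_otherwise = not any(should_activate(n) for n in negative_cases)
    return fires_when_needed and silent_otherwise


def model_variance(outputs, max_distinct):
    """L6 sketch: True if repeated runs yield at most `max_distinct`
    different outputs (an assumed form of the declared bound)."""
    return len(set(outputs)) <= max_distinct


def toy_trigger(prompt):
    """Hypothetical stand-in for a skill trigger: fires on the word 'invoice'."""
    return "invoice" in prompt.lower()


l2_ok = trigger_quality(
    toy_trigger,
    positive_cases=["Parse this invoice", "Total the INVOICE lines"],
    negative_cases=["Write a poem about autumn"],
)
l6_ok = model_variance(["42", "42", "42"], max_distinct=1)
```

Both checks return a plain boolean. A single missed positive case or a single false activation flips L2 to no, with no partial credit.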

What an evaluation produces

A single skill run produces seven binary findings — one per layer — packaged inside an Evidence Bundle. Each finding is a separate predicate-attestation row, signed via sigstore, anchored in the Rekor transparency log. There is no aggregate "skill score." A consumer reading the bundle sees the seven yes/no findings and decides for themselves which combinations matter for their use.

Example reading: a skill that passes L1–L4 but fails L5 (no baseline lift) is a skill that works but isn't useful. A skill that passes L5 but fails L7 (cost or refusal envelope breach) is useful in principle but unshippable. The platform refuses to collapse those into one number.
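The example reading above can be sketched in Python as follows. The `Layer` enum, `failed_layers`, and `meets` are hypothetical names chosen for illustration. The real Evidence Bundle format, its signing, and its Rekor anchoring are not modeled here.

```python
from enum import Enum


class Layer(Enum):
    # Member names are illustrative; the order follows the seven layers.
    L1_PACKAGE_INTEGRITY = 1
    L2_TRIGGER_QUALITY = 2
    L3_FUNCTIONAL_QUALITY = 3
    L4_REGRESSION_PROTECTION = 4
    L5_BASELINE_VALUE = 5
    L6_MODEL_VARIANCE = 6
    L7_ROLLOUT_SAFETY = 7


def failed_layers(findings):
    """Return the layers whose finding is False, in layer order.
    Raises if any layer is missing, since a run yields one finding per layer."""
    missing = [layer for layer in Layer if layer not in findings]
    if missing:
        raise ValueError(f"missing findings for: {[m.name for m in missing]}")
    return [layer for layer in Layer if not findings[layer]]


def meets(findings, required):
    """Consumer-side policy: True if every layer the consumer cares about passed."""
    return all(findings[layer] for layer in required)


# A skill that works but shows no baseline lift: passes everything except L5.
works_not_useful = {layer: True for layer in Layer}
works_not_useful[Layer.L5_BASELINE_VALUE] = False

core_layers = {
    Layer.L1_PACKAGE_INTEGRITY,
    Layer.L2_TRIGGER_QUALITY,
    Layer.L3_FUNCTIONAL_QUALITY,
    Layer.L4_REGRESSION_PROTECTION,
}
```

Here `failed_layers(works_not_useful)` names L5 directly, and `meets` lets each consumer supply the set of layers that matters to them. No step produces a number.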

Version history

Version | Date | Change
1.0.0 | 2026-05-21 | Initial release. Seven layers locked. Relicensed from MIT to Apache 2.0 (see PR #73).
0.x | 2026-03 to 2026-05 | Pre-1.0 iterations of the layer taxonomy. See upstream commit history.

Adversarial audit

Queued. An adversarial audit of this eval-set against a curated corpus of skills with known failure modes is scheduled as part of the puxu.3 follow-up cluster. When complete, the audit report will be linked here and the eval-set status badge will reflect the audit outcome.

Until the audit lands, the eval-set is active for engineering use but carries this flag. The associated predicate URI evals.intentsolutions.io/skill-binary-eval/v1 remains reserved but not yet declared, per the methodology.

Source and references