Evidence Bench — j-rig-bench scorecard draft

A scorecard where the row is not the point; the signature on the row is. Every result here is built to ship as a cryptographically signed, Rekor-anchored Evidence Bundle: a third party can verify that the measurement happened, on what input, and that it produced exactly this output, without trusting us. That is a different axis from "how good is the score." The question here is whether the score is attestable.

How this differs from a memory-eval leaderboard

gbrain-evals measures retrieval quality of memory systems and earns reproducibility the right way — sealed qrels, pinned judges, seeded runs. That is reproducibility via transparency: re-run it yourself and you get the same number. Evidence Bench adds the orthogonal layer — reproducibility via signature: a tamper-evident attestation that this specific run produced this specific result, anchored in a public transparency log. We are not re-scoring what gbrain-evals scores, and we are not claiming to beat it. We sign what a board like that can only assert. The two compose.

id: j-rig-bench
kind: scorecard (signed results) — distinct from the j-rig 7-layer eval-set that defines what is measured
status: draft — first row landed and signed; predicate URI reserved, not yet declared
predicate URI: evals.intentsolutions.io/skill-binary-eval/v1 (reserved; declared at Phase 2 — never declared under labs.*)
upstream source: jeremylongshore/j-rig-skill-binary-eval
attestation: SIGNED — Rekor log index 1689291334 (public production transparency log) — see provenance

The board

Each row is one system measured by the platform. The attestation column is the load-bearing one. Its states are:

signed: an Evidence Bundle exists, is Sigstore-signed, and is anchored in the Rekor transparency log at a citable index
pending: the measurement exists but the signing pipeline has not yet run against it
measuring: the run itself is in progress
queued: the measurement is planned but no run is in progress
emitting: the system itself emits signed bundles on an ongoing basis (the QMD row, on every push to main)

There is deliberately no composite score column: composing incommensurable findings into one number is the failure mode this platform exists to refuse.

System	What is measured	Outcome	Attestation
Skill Refiner — Phase A.0 baseline	Does an eval-guided Refiner mechanism produce real lift over a naive best-of-K prompt baseline? Pre-registered null-hypothesis test.	PROCEED — refiner lift significant, naive below descope bar (full story + proof)	signed
gbrain-stack	j-rig 7-layer behavioral eval run against the gbrain memory stack on a shared fixture corpus	queued — run not yet started	queued
MemPalace	Same j-rig 7-layer run, same corpus — cross-system comparison row	queued — run not yet started	queued
Supermemory	Same j-rig 7-layer run, same corpus — cross-system comparison row	queued — run not yet started	queued
ICOS — intentional cognition OS	Functional eval surface emitting a signed Evidence Bundle on each eval run	queued — run not yet started	queued
QMD — governed team memory	Provenance-integrity, govern-decision (per-check precision/recall, fail-closed), and retrieval-ratchet (golden-query Recall@10) evaluators emitting signed gate-result bundles on every push to main since 2026-07-16	live — verified signed rows on the results browser	emitting

Cross-system rows (gbrain-stack, MemPalace, Supermemory) are listed as comparison targets. When their runs start, they will be measured by the same harness on the same corpus, and the results will be published win or lose. We will link to each system's own published numbers rather than restate them.

Phase A.0 baseline — both arms, side by side

👉 New here? Read the plain-English walkthrough: what we did, how we did it, and the proof — the question, the method, the result, and how to verify it yourself, no jargon.

The first row is a pre-registered experiment: Arm A is a naive best-of-K prompt baseline; Arm B is the eval-guided Refiner mechanism. Both arms are shown with identical structure and no winner styling — the reader draws the conclusion from the numbers, not from the layout. A null result would have been rendered in this exact same shape.

Measure	Arm A — Naive (best-of-K)	Arm B — Refiner
Mean score change vs source (pp)	`−6.1`	`+0.58`
Paired specimens (n)	`60`	`60`
Fraction of pre-registered 1.5 pp projection	`0%` (clamped; negative)	`38.9%`
Per-stratum direction (A / B / C)	`B > A` in all three strata, each significant at α=0.05

Paired test (Arm B vs Arm A)	Value
Per-specimen mean difference, B − A	`+6.68 pp`
Primary test	Wilcoxon signed-rank, one-sided (B > A)
Test statistic	`W = 1755.5`
p-value	`2.56 × 10⁻¹⁰`
Effect size	Cohen's d = `1.49` (large)
Decision rule	DESCOPE iff naive ≥ 70% of projected refiner lift — not fired
Total API cost	`$0.00` of $20 ceiling (free-tier)
Outcome	PROCEED
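The fraction-of-projection row and the DESCOPE rule are simple arithmetic, sketched below. This is a minimal illustration, not the lab's analysis code: the function and constant names are hypothetical, and it assumes the "projected refiner lift" in the rule is the pre-registered 1.5 pp projection. It covers only the DESCOPE check; the PROCEED outcome also rests on the significance test. Inputs are the rounded means from the table, so the Refiner fraction comes out near 38.7%; the reported 38.9% presumably reflects the unrounded mean.

```python
PROJECTED_LIFT_PP = 1.5   # pre-registered projection, from the table
DESCOPE_FRACTION = 0.70   # threshold stated in the decision rule


def fraction_of_projection(lift_pp, projected_pp=PROJECTED_LIFT_PP):
    """Share of the projected lift achieved, clamped to 0 when lift is negative."""
    return max(0.0, lift_pp / projected_pp)


def descope_fired(naive_lift_pp, projected_pp=PROJECTED_LIFT_PP):
    """True iff the naive arm reaches at least 70% of the projected lift."""
    return naive_lift_pp >= DESCOPE_FRACTION * projected_pp


# Rounded arm means from the table (percentage points vs source).
naive_pp, refiner_pp = -6.1, 0.58

print(f"naive fraction:   {fraction_of_projection(naive_pp):.1%}")
print(f"refiner fraction: {fraction_of_projection(refiner_pp):.1%}")
print(f"descope bar:      {DESCOPE_FRACTION * PROJECTED_LIFT_PP:.2f} pp")
print(f"descope fired:    {descope_fired(naive_pp)}")
```

With the naive arm at −6.1 pp against a bar of about 1.05 pp, the rule does not fire, which matches the "not fired" entry in the table.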

Honest reading. The significance is unambiguous (d = 1.49, p ≈ 2.6 × 10⁻¹⁰): the Refiner reliably beats the naive baseline specimen-by-specimen. The magnitude is modest — the Refiner reached 38.9% of its own pre-registered projection, so downstream work budgets against an empirical anchor of ~0.6 pp, not the 1.5 pp projection. PROCEED means "the mechanism is real and worth building," not "the mechanism is finished." Full analysis: DR-036.
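For readers who want to see what the paired test computes, here is a small standard-library sketch of the Wilcoxon signed-rank statistic, a one-sided p-value, and a paired effect size. It rests on several assumptions: W is taken as the sum of ranks of the positive differences, the p-value uses the plain normal approximation without tie or continuity correction, and Cohen's d is computed as mean difference over the standard deviation of the differences (the source does not say which variant was used). The data below are toy values for illustration, not the Phase A.0 specimens.

```python
import math
from statistics import mean, stdev


def wilcoxon_w_plus(diffs):
    """Return (W+, n): sum of ranks of positive differences, zeros dropped."""
    ordered = sorted((d for d in diffs if d != 0), key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over a run of tied absolute values.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    return w_plus, len(ordered)


def one_sided_p(w_plus, n):
    """Normal-approximation p-value for H1: differences tend to be positive."""
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


def cohens_d_paired(diffs):
    """Paired effect size: mean difference over the SD of the differences."""
    return mean(diffs) / stdev(diffs)


# Toy per-specimen differences (B minus A), illustration only.
toy_diffs = [1.0, 2.0, -0.5, 3.0, 1.5]
w, n = wilcoxon_w_plus(toy_diffs)
print(w, n, one_sided_p(w, n), cohens_d_paired(toy_diffs))
```

Feeding the reported W = 1755.5 and n = 60 into `one_sided_p` (assuming no zero differences were dropped) gives a p-value of the same order of magnitude as the reported one. An exact match is not expected, since the original analysis may have applied tie or continuity corrections or an exact method.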

Provenance — signed, and verifiable by you

The whole point of Evidence Bench is that a row carries a signature, not just a number. The Phase A.0 Evidence Bundle is signed against the public production Sigstore transparency log and anchored at Rekor log index 1689291334. Signing was keyless: no long-lived private key exists, and the signer is the reproducible GitHub Actions workflow identity github.com/jeremylongshore/intent-eval-lab/.github/workflows/sign-evidence-bundle.yml, not a person's machine. This is the same public-good infrastructure (Fulcio + Rekor) behind npm provenance and SLSA.

Rekor log index: 1689291334 (raw transparency-log entry, JSON)
entry UUID: 108e9186…51503e2ef1… (canonical permanent link)
signed bundle: evidence-bundle.sigstore.json (Fulcio cert + signature + inclusion proof)
integrated: 2026-06-01 04:45 UTC
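The raw log entry can also be pulled programmatically. The sketch below assumes the entry lives on the public Rekor v1 instance at rekor.sigstore.dev and that its REST endpoint returns a JSON object keyed by entry UUID; the helper names are hypothetical, and nothing here replaces the cosign verification, which is what actually checks the signature.

```python
import json
import urllib.request
from datetime import datetime, timezone

REKOR_BASE = "https://rekor.sigstore.dev"  # assumption: public Rekor v1 instance


def entry_url(log_index):
    """Build the lookup URL for a log index."""
    return f"{REKOR_BASE}/api/v1/log/entries?logIndex={log_index}"


def fetch_entry(log_index):
    """Download the raw entry JSON (requires network access)."""
    with urllib.request.urlopen(entry_url(log_index), timeout=30) as resp:
        return json.load(resp)


def summarize(response):
    """Extract UUID, log index, and integration time from an entry response."""
    uuid, entry = next(iter(response.items()))
    integrated = datetime.fromtimestamp(entry["integratedTime"], tz=timezone.utc)
    return {
        "uuid": uuid,
        "log_index": entry["logIndex"],
        "integrated_utc": integrated.isoformat(),
    }


# Usage (not run here because it needs the network):
#   print(summarize(fetch_entry(1689291334)))
```

The integration time and UUID it prints should agree with the values listed above.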

You do not have to trust us. Verify it yourself with cosign (no account, no keys) against the committed bundle:

# Check the signature, the signing workflow identity, and the OIDC issuer for the bundle
cosign verify-blob \
  --new-bundle-format \
  --bundle evidence-bundle.sigstore.json \
  --certificate-identity 'https://github.com/jeremylongshore/intent-eval-lab/.github/workflows/sign-evidence-bundle.yml@refs/heads/main' \
  --certificate-oidc-issuer 'https://token.actions.githubusercontent.com' \
  evidence-bundle.json

What the signature attests, and what it does not. It proves that the four Phase A.0 result artifacts (identified by SHA-256 digest) are authentic and unaltered, and that this workflow identity signed them at the logged time. It does not declare a predicate URI: predicate_uri_set is empty by design, and the skill-binary-eval/v1 predicate stays reserved. Declaring a predicate on production Rekor is gated on three things: DNSSEC, CAA pinning, and a normative SPEC.md. Two of the three have cleared: DNSSEC validates end-to-end for evals.intentsolutions.io (DS record at the registry), and CAA is pinned. The one remaining gate is the normative SPEC.md. Nor does the signature claim the statistics are "correct": a signature attests authenticity, not the truth of the science. The science stands on the pre-registered analysis, not on the signature.
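Because the attestation covers artifacts by SHA-256 digest, a reader holding local copies can recompute the digests and compare them with the signed ones. The sketch below is one way to do that. The artifact filenames and the layout of digests inside evidence-bundle.json are not specified here, so the expected mapping is supplied by the caller, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks so large artifacts fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_digests(expected, root="."):
    """Return the names whose file is missing or whose digest differs.

    expected maps a relative filename to its hex digest, as copied by the
    reader from the signed bundle (the mapping itself is an assumption).
    """
    failures = []
    for name, want in expected.items():
        path = Path(root) / name
        if not path.is_file() or sha256_file(path) != want.lower():
            failures.append(name)
    return failures
```

An empty list means every local file matches its listed digest. This checks integrity of the local copies only; the cosign command is still what ties those digests to the signer and the log.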

That distinction is the discipline that separates an evidence board from a marketing board: we sign exactly what we can honestly attest — integrity, identity, time — and we refuse to dress a signature up as a claim it cannot support. (For the record: this row read pending until the signature was real. We do not render attestation we do not have.)

Source and references

j-rig 7-layer eval-set — the specification of what the cross-system rows measure
DR-036 — Phase A.0 PROCEED decision record + full statistical analysis
gbrain-evals — the memory-retrieval eval board this scorecard complements (different dimension; we link, we do not restate)
DR-010 — the binding unification thesis: every validator emits an Evidence Bundle