Evidence Bench — j-rig-bench scorecard draft

A scorecard where the row is not the point — the signature on the row is. Every result here is built to ship as a cryptographically signed, Rekor-anchored Evidence Bundle: a third party can verify that the measurement happened, on what input, producing exactly this output — without trusting us. That is a different axis from "how good is the score." It is whether the score is attestable.

How this differs from a memory-eval leaderboard

gbrain-evals measures retrieval quality of memory systems and earns reproducibility the right way — sealed qrels, pinned judges, seeded runs. That is reproducibility via transparency: re-run it yourself and you get the same number. Evidence Bench adds the orthogonal layer — reproducibility via signature: a tamper-evident attestation that this specific run produced this specific result, anchored in a public transparency log. We are not re-scoring what gbrain-evals scores, and we are not claiming to beat it. We sign what a board like that can only assert. The two compose.

id: j-rig-bench
kind: scorecard (signed results), distinct from the j-rig 7-layer eval-set that defines what is measured
status: draft (first row landed; signature layer pending)
predicate URI: evals.intentsolutions.io/skill-binary-eval/v1 (reserved; declared at Phase 2, never declared under labs.*)
upstream source: jeremylongshore/j-rig-skill-binary-eval
attestation: SIGNED, Rekor log index 1689291334 (public production transparency log); see provenance

The board

Each row is one system measured by the platform. The attestation column is the load-bearing one: signed means an Evidence Bundle exists, is sigstore-signed, and is anchored in the Rekor transparency log at a citable index; pending means the measurement exists but the signing pipeline has not yet run against it; measuring means the run itself is in progress. There is deliberately no composite score column — composing incommensurable findings into one number is the failure mode this platform exists to refuse.

| System | What is measured | Outcome | Attestation |
|---|---|---|---|
| Skill Refiner, Phase A.0 baseline | Does an eval-guided Refiner mechanism produce real lift over a naive best-of-K prompt baseline? Pre-registered null-hypothesis test. | PROCEED: refiner lift significant, naive below descope bar (full story + proof) | signed |
| gbrain-stack | j-rig 7-layer behavioral eval run against the gbrain memory stack on a shared fixture corpus | measuring now | measuring |
| MemPalace | Same j-rig 7-layer run, same corpus; cross-system comparison row | measuring now | measuring |
| Supermemory | Same j-rig 7-layer run, same corpus; cross-system comparison row | measuring now | measuring |
| ICOS (intentional cognition OS) | Functional eval surface emitting a signed Evidence Bundle on each eval run | measuring now | measuring |
| QMD (governed team memory) | Memory-utility, dedup-catch-rate, and provenance-integrity evaluators emitting signed bundles on each curation cycle | measuring now | measuring |

Cross-system rows (gbrain-stack, MemPalace, Supermemory) are listed as comparison targets. They are measured by the same harness on the same corpus, published win or lose, and we link to each system's own published numbers rather than restating them.
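
The three attestation states can be modeled as a small data structure. The following is a minimal sketch, not the platform's actual schema: the `Attestation` enum, the `BoardRow` class, and every field name are hypothetical, chosen only to mirror the column definitions given for the board.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Attestation(Enum):
    SIGNED = "signed"        # bundle exists, is signed, and is anchored in Rekor
    PENDING = "pending"      # measurement exists, signing has not run yet
    MEASURING = "measuring"  # the run itself is still in progress


@dataclass(frozen=True)
class BoardRow:
    # Hypothetical row shape; there is deliberately no composite score field.
    system: str
    measured: str
    outcome: str
    attestation: Attestation
    rekor_log_index: Optional[int] = None

    def __post_init__(self) -> None:
        # A row may only read "signed" if it can cite a Rekor log index.
        if self.attestation is Attestation.SIGNED and self.rekor_log_index is None:
            raise ValueError("signed rows must cite a Rekor log index")
        if self.attestation is not Attestation.SIGNED and self.rekor_log_index is not None:
            raise ValueError("only signed rows may cite a Rekor log index")


baseline = BoardRow(
    system="Skill Refiner, Phase A.0 baseline",
    measured="Refiner lift over a naive best-of-K baseline",
    outcome="PROCEED",
    attestation=Attestation.SIGNED,
    rekor_log_index=1689291334,
)
```

Making the constructor reject a signed row without a log index is one way to enforce, in code, the rule that attestation is never rendered before it exists.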

Phase A.0 baseline — both arms, side by side

👉 New here? Read the plain-English walkthrough: what we did, how we did it, and the proof — the question, the method, the result, and how to verify it yourself, no jargon.

The first row is a pre-registered experiment: Arm A is a naive best-of-K prompt baseline; Arm B is the eval-guided Refiner mechanism. Both arms are shown with identical structure and no winner styling — the reader draws the conclusion from the numbers, not from the layout. A null result would have been rendered in this exact same shape.

| Measure | Arm A: Naive (best-of-K) | Arm B: Refiner |
|---|---|---|
| Mean score change vs source (pp) | −6.1 | +0.58 |
| Paired specimens (n) | 60 | 60 |
| Fraction of pre-registered 1.5 pp projection | 0% (clamped; negative) | 38.9% |

Per-stratum direction (strata A / B / C): B > A in all three strata, each significant at α = 0.05.

| Paired test (Arm B vs Arm A) | Value |
|---|---|
| Per-specimen mean difference, B − A | +6.68 pp |
| Primary test | Wilcoxon signed-rank, one-sided (B > A) |
| Test statistic | W = 1755.5 |
| p-value | 2.56 × 10⁻¹⁰ |
| Effect size | Cohen's d = 1.49 (large) |
| Decision rule | DESCOPE iff naive ≥ 70% of projected refiner lift; not fired |
| Total API cost | $0.00 of $20 ceiling (free-tier) |
| Outcome | PROCEED |
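
For readers unfamiliar with the paired statistics reported here, the sketch below computes a Wilcoxon signed-rank statistic and a paired effect size from per-specimen differences. It rests on assumptions: it takes W to be the sum of ranks of the positive differences and Cohen's d to be the mean difference divided by the standard deviation of the differences, which may not be the exact conventions used in the registered analysis. The function names and the toy data are illustrative only and are not the experiment's data.

```python
from statistics import mean, stdev


def wilcoxon_w_plus(diffs):
    """Sum of the ranks of positive differences (zeros dropped, ties share average ranks)."""
    nonzero = [d for d in diffs if d != 0]
    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of tied absolute values.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    return sum(r for r, d in zip(ranks, ordered) if d > 0)


def cohens_d_paired(diffs):
    """Paired effect size: mean difference over the sample SD of the differences."""
    return mean(diffs) / stdev(diffs)


# Toy B - A differences in percentage points (made up for illustration).
toy_diffs = [1.0, 2.0, -0.5, 3.0, 2.0]
w_plus = wilcoxon_w_plus(toy_diffs)
d_z = cohens_d_paired(toy_diffs)
```

Under the same assumption about W, the largest possible rank sum for 60 nonzero pairs is 60 × 61 / 2 = 1830, so a reported W = 1755.5 would sit close to the top of its range.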

Honest reading. The significance is unambiguous (d = 1.49, p ≈ 2.6 × 10⁻¹⁰): the Refiner reliably beats the naive baseline specimen-by-specimen. The magnitude is modest — the Refiner reached 38.9% of its own pre-registered projection, so downstream work budgets against an empirical anchor of ~0.6 pp, not the 1.5 pp projection. PROCEED means "the mechanism is real and worth building," not "the mechanism is finished." Full analysis: DR-036.
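
The projection fraction and the DESCOPE rule are simple arithmetic, sketched below from the rounded table values. The function and constant names are hypothetical, and the clamping of negative lift to zero is inferred from the "clamped; negative" note in the table rather than taken from the analysis code.

```python
PROJECTED_LIFT_PP = 1.5   # pre-registered projection for the Refiner, in pp
DESCOPE_THRESHOLD = 0.70  # share of the projection the naive arm must reach


def fraction_of_projection(observed_pp, projected_pp=PROJECTED_LIFT_PP):
    # Negative lift is clamped to 0 (assumed from the table's "clamped; negative").
    return max(0.0, observed_pp / projected_pp)


def descope_fires(naive_pp, projected_pp=PROJECTED_LIFT_PP):
    # DESCOPE iff the naive arm reaches at least 70% of the projected refiner lift.
    return naive_pp >= DESCOPE_THRESHOLD * projected_pp


naive_fraction = fraction_of_projection(-6.1)    # Arm A, rounded table value
refiner_fraction = fraction_of_projection(0.58)  # Arm B, rounded table value
fired = descope_fires(-6.1)
```

With the rounded +0.58 pp the Refiner fraction comes out near 38.7%; the published 38.9% presumably reflects the unrounded mean. The naive arm would have needed at least 1.05 pp of lift to trigger DESCOPE, and it measured −6.1 pp.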

Provenance — signed, and verifiable by you

The whole point of Evidence Bench is that a row carries a signature, not just a number. The Phase A.0 Evidence Bundle is signed against the public production sigstore transparency log and anchored at Rekor log index 1689291334. Signing was keyless — no long-lived private key exists; the signer is the reproducible GitHub Actions workflow identity github.com/jeremylongshore/intent-eval-lab/.github/workflows/sign-evidence-bundle.yml, not a person's machine. This is the same public-good infrastructure (Fulcio + Rekor) behind npm provenance and SLSA.

Rekor log index: 1689291334 (raw transparency-log entry, JSON)
entry UUID: 108e9186…51503e2ef1… (canonical permanent link)
signed bundle: evidence-bundle.sigstore.json (Fulcio cert + signature + inclusion proof)
integrated: 2026-06-01 04:45 UTC
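
The raw log entry can also be inspected programmatically. The sketch below builds the lookup URL for the public Rekor instance's log-entry endpoint and reads the integration time from a response. It makes no network call; the helper names are hypothetical, and the `sample` response is a made-up stand-in (placeholder UUID key, seconds chosen to match the minute shown above), not the real entry.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

REKOR_BASE = "https://rekor.sigstore.dev"  # public production Rekor instance


def entry_url(log_index: int) -> str:
    # Rekor v1 REST API: look up a log entry by its index.
    return f"{REKOR_BASE}/api/v1/log/entries?{urlencode({'logIndex': log_index})}"


def integrated_at(response: dict) -> datetime:
    # The response is keyed by entry UUID; a lookup by index yields one entry.
    (entry,) = response.values()
    return datetime.fromtimestamp(entry["integratedTime"], tz=timezone.utc)


# Hypothetical response shape for illustration only (not the real entry).
sample = {"<entry-uuid>": {"logIndex": 1689291334, "integratedTime": 1780289100}}

url = entry_url(1689291334)
when = integrated_at(sample)
```

Fetching `url` with any HTTP client and passing the decoded JSON to `integrated_at` should give the timestamp to compare against the "integrated" value listed here.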

You do not have to trust us. Verify it yourself with cosign (no account, no keys) against the committed bundle:

cosign verify-blob \
  --new-bundle-format \
  --bundle evidence-bundle.sigstore.json \
  --certificate-identity 'https://github.com/jeremylongshore/intent-eval-lab/.github/workflows/sign-evidence-bundle.yml@refs/heads/main' \
  --certificate-oidc-issuer 'https://token.actions.githubusercontent.com' \
  evidence-bundle.json

What the signature attests, and what it does not. It proves the four Phase A.0 result artifacts (by SHA-256 digest) are authentic and unaltered, and that this workflow identity signed them at the logged time. It does not declare a predicate URI: predicate_uri_set is empty by design. The skill-binary-eval/v1 predicate stays reserved, because production-Rekor predicate declaration is gated on a normative SPEC.md plus DNSSEC and CAA clearing for evals.intentsolutions.io, none of which has happened yet. And it does not claim the statistics are "correct": a signature attests authenticity, not the truth of the science. The science stands on the pre-registered analysis, not on the signature.
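
Because the attestation covers artifacts by SHA-256 digest, checking a local copy of an artifact against a published digest is a separate, simple step from the cosign signature check. The following is a minimal sketch under stated assumptions: how digests are laid out inside evidence-bundle.json is not described here, so the expected digest is passed in as a plain hex string, and the helper names and the demo file are hypothetical.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_hex(path):
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_matches(path, expected_hex):
    # Compare in constant time; the expected digest comes from the signed bundle.
    return hmac.compare_digest(sha256_hex(path), expected_hex.lower())


# Demo on a throwaway file containing b"abc" (standard SHA-256 test vector).
ABC_DIGEST = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "artifact.bin"
    demo.write_bytes(b"abc")
    demo_ok = artifact_matches(demo, ABC_DIGEST)
    demo.write_bytes(b"abd")  # simulate tampering
    demo_tampered = artifact_matches(demo, ABC_DIGEST)
```

A match shows only that the bytes are the ones that were signed. As the paragraph above says, it says nothing about whether the statistics inside them are right.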

That distinction is the discipline that separates an evidence board from a marketing board: we sign exactly what we can honestly attest — integrity, identity, time — and we refuse to dress a signature up as a claim it cannot support. (For the record: this row read pending until the signature was real. We do not render attestation we do not have.)

Source and references