Evidence Bench — j-rig-bench scorecard draft
A scorecard where the row is not the point — the signature on the row is. Every result here is built to ship as a cryptographically signed, Rekor-anchored Evidence Bundle: a third party can verify that the measurement happened, on what input, producing exactly this output — without trusting us. That is a different axis from "how good is the score." It is whether the score is attestable.
gbrain-evals measures retrieval quality of memory systems and earns reproducibility the right way — sealed qrels, pinned judges, seeded runs. That is reproducibility via transparency: re-run it yourself and you get the same number. Evidence Bench adds the orthogonal layer — reproducibility via signature: a tamper-evident attestation that this specific run produced this specific result, anchored in a public transparency log. We are not re-scoring what gbrain-evals scores, and we are not claiming to beat it. We sign what a board like that can only assert. The two compose.
The board
Each row is one system measured by the platform. The attestation
column is the load-bearing one: signed means an Evidence Bundle exists,
is sigstore-signed, and is anchored in the Rekor transparency log at a citable index;
pending means the measurement exists but the signing pipeline has not yet
run against it; measuring means the run itself is in progress. There is
deliberately no composite score column — composing incommensurable
findings into one number is the failure mode this platform exists to refuse.
| System | What is measured | Outcome | Attestation |
|---|---|---|---|
| Skill Refiner — Phase A.0 baseline | Does an eval-guided Refiner mechanism produce real lift over a naive best-of-K prompt baseline? Pre-registered null-hypothesis test. | PROCEED — refiner lift significant, naive below descope bar (full story + proof) | signed |
| gbrain-stack | j-rig 7-layer behavioral eval run against the gbrain memory stack on a shared fixture corpus | measuring now | measuring |
| MemPalace | Same j-rig 7-layer run, same corpus — cross-system comparison row | measuring now | measuring |
| Supermemory | Same j-rig 7-layer run, same corpus — cross-system comparison row | measuring now | measuring |
| ICOS — intentional cognition OS | Functional eval surface emitting a signed Evidence Bundle on each eval run | measuring now | measuring |
| QMD — governed team memory | Memory-utility, dedup-catch-rate, and provenance-integrity evaluators emitting signed bundles on each curation cycle | measuring now | measuring |
Cross-system rows (gbrain-stack, MemPalace, Supermemory) are listed as comparison targets. They are measured by the same harness on the same corpus, published win or lose, and we link to each system's own published numbers rather than restating them.
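The three attestation states defined for the board can be read as a small decision function. The sketch below is illustrative only: `RowStatus`, its field names, and `attestation_label` are hypothetical and do not come from the platform's code. It encodes one reading of the definitions: a row is `signed` only when a signed bundle and a citable Rekor index both exist.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RowStatus:
    """Hypothetical record of what is known about one scorecard row."""
    run_complete: bool              # has the measurement finished?
    bundle_signed: bool             # does a sigstore-signed Evidence Bundle exist?
    rekor_log_index: Optional[int]  # citable Rekor index, if anchored


def attestation_label(status: RowStatus) -> str:
    """Map a row's status to the value shown in the attestation column."""
    if not status.run_complete:
        return "measuring"
    # "signed" requires both the signature and a citable log index.
    if status.bundle_signed and status.rekor_log_index is not None:
        return "signed"
    return "pending"


print(attestation_label(RowStatus(True, True, 1689291334)))  # signed
print(attestation_label(RowStatus(True, False, None)))       # pending
print(attestation_label(RowStatus(False, False, None)))      # measuring
```

Under this reading, a row with a signature but no log index still renders as `pending`, which matches the rule that attestation is never shown before it is real.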
Phase A.0 baseline — both arms, side by side
👉 New here? Read the plain-English walkthrough: what we did, how we did it, and the proof — the question, the method, the result, and how to verify it yourself, no jargon.
The first row is a pre-registered experiment: Arm A is a naive best-of-K prompt baseline; Arm B is the eval-guided Refiner mechanism. Both arms are shown with identical structure and no winner styling — the reader draws the conclusion from the numbers, not from the layout. A null result would have been rendered in this exact same shape.
| Measure | Arm A — Naive (best-of-K) | Arm B — Refiner |
|---|---|---|
| Mean score change vs source (pp) | −6.1 | +0.58 |
| Paired specimens (n) | 60 | 60 |
| Fraction of pre-registered 1.5 pp projection | 0% (clamped; negative) | 38.9% |
| Per-stratum direction (A / B / C) | B > A in all three strata, each significant at α=0.05 | |
| Paired test (Arm B vs Arm A) | Value |
|---|---|
| Per-specimen mean difference, B − A | +6.68 pp |
| Primary test | Wilcoxon signed-rank, one-sided (B > A) |
| Test statistic | W = 1755.5 |
| p-value | 2.56 × 10⁻¹⁰ |
| Effect size | Cohen's d = 1.49 (large) |
| Decision rule | DESCOPE iff naive ≥ 70% of projected refiner lift — not fired |
| Total API cost | $0.00 of $20 ceiling (free-tier) |
| Outcome | PROCEED |
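The arithmetic behind the decision rule and the "fraction of projection" row can be sketched as follows. This is a minimal sketch, not the analysis code from DR-036: the function names are hypothetical, the clamp is assumed to apply only at zero, and the inputs are the rounded values from the tables above. With those rounded inputs Arm B comes out near 38.7%, so the published 38.9% presumably reflects unrounded means.

```python
PROJECTED_LIFT_PP = 1.5   # pre-registered projection, from the table
DESCOPE_THRESHOLD = 0.70  # "naive >= 70% of projected refiner lift"


def fraction_of_projection(lift_pp: float,
                           projected_pp: float = PROJECTED_LIFT_PP) -> float:
    """Share of the projected lift achieved, clamped at zero for negative lifts."""
    return max(0.0, lift_pp / projected_pp)


def descope_fires(naive_lift_pp: float,
                  projected_pp: float = PROJECTED_LIFT_PP,
                  threshold: float = DESCOPE_THRESHOLD) -> bool:
    """True when the naive arm alone reaches the descope bar."""
    return naive_lift_pp >= threshold * projected_pp


arm_a_pp, arm_b_pp = -6.1, 0.58  # rounded mean score changes from the table

print(f"Arm A fraction: {fraction_of_projection(arm_a_pp):.1%}")
print(f"Arm B fraction: {fraction_of_projection(arm_b_pp):.1%}")
print(f"B - A: {arm_b_pp - arm_a_pp:+.2f} pp")
print(f"DESCOPE fires: {descope_fires(arm_a_pp)}")
```

The descope bar works out to 0.70 × 1.5 = 1.05 pp. The naive arm's −6.1 pp is far below it, which is why the rule did not fire.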
Honest reading. The significance is unambiguous (d = 1.49, p ≈ 2.6 × 10⁻¹⁰): the Refiner reliably beats the naive baseline specimen-by-specimen. The magnitude is modest — the Refiner reached 38.9% of its own pre-registered projection, so downstream work budgets against an empirical anchor of ~0.6 pp, not the 1.5 pp projection. PROCEED means "the mechanism is real and worth building," not "the mechanism is finished." Full analysis: DR-036.
Provenance — signed, and verifiable by you
The whole point of Evidence Bench is that a row carries a signature, not just a
number. The Phase A.0 Evidence Bundle is signed against the
public production sigstore transparency log and anchored at
Rekor log index
1689291334. Signing was keyless — no long-lived private key
exists; the signer is the reproducible GitHub Actions workflow identity
github.com/jeremylongshore/intent-eval-lab/.github/workflows/sign-evidence-bundle.yml,
not a person's machine. This is the same public-good infrastructure (Fulcio + Rekor)
that backs npm provenance and Sigstore-based SLSA provenance generators.
You do not have to trust us. Verify it yourself with cosign (no account, no keys) against the committed bundle:
cosign verify-blob \
  --new-bundle-format \
  --bundle evidence-bundle.sigstore.json \
  --certificate-identity 'https://github.com/jeremylongshore/intent-eval-lab/.github/workflows/sign-evidence-bundle.yml@refs/heads/main' \
  --certificate-oidc-issuer 'https://token.actions.githubusercontent.com' \
  evidence-bundle.json
What the signature attests — and what it does not. It proves the
four Phase A.0 result artifacts (by SHA-256 digest) are authentic and unaltered, and
that this workflow identity signed them at the logged time. It does
not declare a predicate URI — predicate_uri_set is empty by
design. The skill-binary-eval/v1 predicate stays reserved:
production-Rekor predicate declaration is gated on a normative SPEC.md plus
DNSSEC and CAA clearing for evals.intentsolutions.io, none of which has happened yet.
And it does not claim the statistics are "correct" — a signature attests
authenticity, not the truth of the science. The science stands on the pre-registered
analysis, not on the signature.
That distinction is the discipline that separates an evidence board from a marketing
board: we sign exactly what we can honestly attest — integrity, identity, time — and
we refuse to dress a signature up as a claim it cannot support. (For the record: this
row read pending until the signature was real. We do not render
attestation we do not have.)
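The integrity part of that claim (artifacts identified by SHA-256 digest) can be checked independently once the signature itself has been verified with cosign. The sketch below assumes you already hold a mapping from artifact file name to expected hex digest. That mapping, the file names, and the helper names are hypothetical; the actual layout of `evidence-bundle.json` is not described here and is not assumed.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_hex(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_artifacts(expected: dict[str, str], root: Path) -> list[str]:
    """Return the names of artifacts whose on-disk digest differs from the expected one."""
    return [name for name, want in expected.items()
            if sha256_hex(root / name) != want]


# Demonstration on a throwaway file, not on the real Phase A.0 artifacts.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    artifact = root / "result.json"  # hypothetical artifact name
    artifact.write_bytes(b'{"outcome": "PROCEED"}')
    expected = {"result.json": sha256_hex(artifact)}

    print(mismatched_artifacts(expected, root))  # no mismatches

    artifact.write_bytes(b'{"outcome": "DESCOPE"}')  # simulate tampering
    print(mismatched_artifacts(expected, root))  # result.json is flagged
```

A digest match shows only that the bytes are unaltered. As stated above, it says nothing about whether the statistics inside them are correct.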
Source and references
- j-rig 7-layer eval-set — the specification of what the cross-system rows measure
- DR-036 — Phase A.0 PROCEED decision record + full statistical analysis
- gbrain-evals — the memory-retrieval eval board this scorecard complements (different dimension; we link, we do not restate)
- DR-010 — the binding unification thesis: every validator emits an Evidence Bundle