Phase A.0 — what we did, how we did it, and the proof

This is the human-readable trail behind one signed row on the Evidence Bench scorecard. Read it top to bottom and you will understand exactly what was tested, how, what happened, and how to confirm — without trusting us — that the result is genuine.

1. The question we asked

We are building a tool called the Skill Refiner — it takes a Claude "skill" file and automatically improves it through a propose-edit / score / accept-only-if-better loop. Before spending months building that machinery, we asked the honest, uncomfortable question:

Does the Refiner loop actually do anything a model can't already do if you just ask it nicely with some examples?

If "just ask the model" works almost as well, the Refiner isn't worth building — and we committed, in writing and ahead of time, to publish that result either way, win or lose. That pre-commitment is what keeps this honest: we couldn't quietly bury a disappointing answer.

2. How we tested it

A head-to-head, on the same 60 skill files, scored by the same judge:

Read this first — the exact scope of what we tested

Each skill has two parts: a short config header (the SKILL.md frontmatter: name, version, license, tags, and 4 other fields, 8 in total) and a longer instruction body below it. In this experiment, the AI was only allowed to edit the config header; the body was left untouched on both sides.

One honest wrinkle to know going in: the 100-point quality rubric scores the whole skill, and most of those points (~75 of 100) come from the body — structure, length, worked examples. Since the body never changed, those ~75 points were frozen and identical before and after, on both arms. So every score change you'll see below comes purely from the ~15–25 points the config header controls. The numbers measure "did editing the header help or hurt" — not "did the whole skill get better."

Why that's still a fair test: both approaches edited only the header and were scored identically, so the ~75 body points cancel out and the comparison between them is apples-to-apples. What it's NOT (yet): proof that the Refiner improves whole skills — testing a body-editing version on the full rubric is the natural next experiment (see the bottom of this page). We start with the cheapest, cleanest question first, and say plainly what it does and doesn't show.
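The cancellation argument can be checked with a few lines of arithmetic. This is a sketch only: it assumes the rubric total is simply body points plus header points, the 75-point body is the approximate figure quoted above, and the header scores are made-up illustrative values, not numbers from the run.

```python
# Illustrative model: whole-skill score = frozen body points + header points.
BODY_POINTS = 75  # approximate figure from the text; identical on both arms


def total_score(header_points: int, body_points: int = BODY_POINTS) -> int:
    """Whole-skill score under the simple additive model assumed here."""
    return body_points + header_points


def score_change(header_before: int, header_after: int,
                 body_points: int = BODY_POINTS) -> int:
    """Change in the whole-skill score when only the header is edited."""
    return (total_score(header_after, body_points)
            - total_score(header_before, body_points))


# Hypothetical header scores. The body term drops out of the difference,
# so the change is the same whatever the frozen body happens to score.
assert score_change(20, 12) == -8
assert score_change(20, 12, body_points=60) == -8
```

Under that model, any before/after difference is a header difference, which is why the comparison between the two arms is not distorted by the body.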

What it does, Arm A ("just ask"): One shot. Hand the model the skill + K worked examples, take whatever it returns.
What it does, Arm B (the Refiner): The full loop. Propose edits, score them, keep an edit only if every quality dimension improves, retry up to 3×.
Model (both arms): The same model, Llama-3.1 (70B) via NVIDIA's free tier. Temperature 0 (repeatable).
Scored by (both arms): The same independent marketplace-tier validator, run on every before/after pair.
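The two arms can be sketched side by side in a few lines. This is a minimal illustration, not the lab's actual code: `propose` and `score` are hypothetical stand-ins for the model call and the validator, "retry up to 3×" is read as three attempts in total, and "every quality dimension improves" is read as "no dimension drops and at least one rises" (the frozen body dimensions cannot rise, so a stricter reading would never accept anything).

```python
from typing import Callable, Dict

Scores = Dict[str, float]  # rubric dimension name -> points


def is_better(before: Scores, after: Scores) -> bool:
    """Assumed acceptance rule: nothing gets worse, something gets better."""
    no_regression = all(after[k] >= before[k] for k in before)
    some_gain = any(after[k] > before[k] for k in before)
    return no_regression and some_gain


def just_ask(header: str, propose: Callable[[str], str]) -> str:
    """Arm A sketch: one shot, keep whatever comes back."""
    return propose(header)


def refine(header: str,
           propose: Callable[[str], str],
           score: Callable[[str], Scores],
           max_tries: int = 3) -> str:
    """Arm B sketch: propose, score, keep the edit only if it is better."""
    baseline = score(header)
    for _ in range(max_tries):
        candidate = propose(header)
        if is_better(baseline, score(candidate)):
            return candidate  # accepted
    return header  # every attempt rejected: the original survives


# Toy stand-ins: the "validator" counts tags, the "model" deletes one.
toy_score = lambda h: {"tags": float(h.count("tag"))}
toy_propose = lambda h: h.replace("tag", "", 1)

assert just_ask("tag tag", toy_propose) == " tag"             # degraded, kept anyway
assert refine("tag tag", toy_propose, toy_score) == "tag tag"  # degraded, discarded
```

The only structural difference between the two functions is the `is_better` check, which is the point of the experiment.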

Two design choices make this credible. The decision rule was written down before the run ("if 'just ask' gets within 70% of the Refiner's projected gain, kill the Refiner") — so we couldn't move the goalposts after seeing the numbers. And it cost $0.00 on a free model, which means the result can't be dismissed as "you just threw a bigger, pricier model at it."
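Written as code, a pre-registered rule of this kind leaves no room for interpretation after the fact. The sketch below is one possible reading, not the lab's actual script: it treats "within 70%" as "just-ask achieves at least 70% of the projected gain", and the projected gain is a hypothetical input because this page does not state it.

```python
def preregistered_verdict(just_ask_gain: float,
                          projected_refiner_gain: float,
                          kill_fraction: float = 0.70) -> str:
    """Return 'KILL' if just-ask reaches the threshold share of the projection."""
    if projected_refiner_gain <= 0:
        raise ValueError("projected gain must be positive")
    share = just_ask_gain / projected_refiner_gain
    return "KILL" if share >= kill_fraction else "PROCEED"


# Illustrative inputs only; 5.0 is a made-up projection.
print(preregistered_verdict(just_ask_gain=4.0, projected_refiner_gain=5.0))   # KILL
print(preregistered_verdict(just_ask_gain=-6.1, projected_refiner_gain=5.0))  # PROCEED
```

Note that under this reading a negative just-ask gain can never reach a positive threshold, so it yields PROCEED for any positive projection.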

The 60 skills we tested

These weren't 60 random, unrelated tools — they were 60 skill config files (SKILL.md), deliberately chosen in three groups of 20 to span the full quality range. That spread is what makes the result honest: we tested good skills, already-passing skills, AND broken ones — not a cherry-picked easy set. (20 more were held back entirely, never shown to either approach.)

Group A — Your own hand-authored skills (20 skills)

cleanup-code · agent-creator · contribute · sitrep · validate-mcp · implement-tests · validate-skillmd · repo-sweep · appaudit · validate-agent · release · deep-research · status-update-writer · product-researcher · doc-filing · repo-dress · validate-hook · validate-marketplace · ui-ux-pro-max · email

Group B — Marketplace skills that already passed (20 skills)

supabase-schema-from-requirements · onenote-upgrade-migration · collecting-infrastructure-metrics · vastai-webhooks-events · lokalise-incident-runbook · analyzing-logs · sentry-debug-bundle · notion-upgrade-migration · maintainx-core-workflow-a · langchain-prompt-engineering · twinmind-local-dev-loop · notion-ci-integration · plugin-creator · 000-jeremy-content-consistency-validator · notion-advanced-troubleshooting · langchain-content-blocks · firecrawl-local-dev-loop · splitting-datasets · notion-multi-env-setup · validating-api-schemas

Group C — Marketplace skills that were failing (the highest-value targets) (20 skills)

forge-diagnose · miro-observability · volt-recon · linear-deploy-integration · openevidence-prod-checklist · coderabbit-observability · intercom-debug-bundle · veeva-webhooks-events · flux-query · windsurf-flows-automation · groq-observability · lindy-sdk-patterns · runway-prod-checklist · abridge-deploy-integration · helm-plan · windsurf-performance-tuning · openevidence-ci-integration · hex-security-basics · fathom-ci-integration · canva-load-scale

2½. Watch it happen — one real skill, start to finish

Abstract numbers are hard to feel. So here is exactly what happened to one of the 60 skills — a real, public marketplace skill called notion-ci-integration — pulled straight from the run logs.

Step 1 — the starting point. The whole skill scored 86 / 100 on the rubric (most of that earned by its body, which stays fixed throughout).

Step 2 — the exact instruction we sent the AI (this is the literal prompt, trimmed):

You are improving a SKILL.md frontmatter file... The IS marketplace tier
requires 8 fields: name, description, allowed-tools, version, author,
license, compatibility, tags.

Here are 8 exemplar(s) of frontmatter that pass the validator:
  [8 good examples pasted in here]

Now improve the following SKILL.md to maximize its validator score.
Return ONLY the updated frontmatter block.

  [the notion-ci-integration header that needs improving]

In plain words: "Here are 8 good examples. Now make this one better. Go." That is "Arm A" — just ask the AI, take whatever it returns.
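For readers who want to reproduce Arm A, the prompt can be assembled mechanically from K exemplars and one target header. This is a sketch that mirrors only the trimmed prompt shown above; the function name is hypothetical, and the text the page elides ("...") is left elided rather than guessed.

```python
from typing import List

# The 8 required fields, as listed in the prompt above.
REQUIRED_FIELDS = ["name", "description", "allowed-tools", "version",
                   "author", "license", "compatibility", "tags"]


def build_just_ask_prompt(target_frontmatter: str, exemplars: List[str]) -> str:
    """Assemble an Arm A style prompt from K passing exemplars and one target."""
    k = len(exemplars)
    parts = [
        # The page trims this opening; the elided text is not reconstructed.
        "You are improving a SKILL.md frontmatter file... "
        f"The IS marketplace tier requires {len(REQUIRED_FIELDS)} fields: "
        + ", ".join(REQUIRED_FIELDS) + ".",
        f"Here are {k} exemplar(s) of frontmatter that pass the validator:",
        "\n\n".join(exemplars),
        "Now improve the following SKILL.md to maximize its validator score.\n"
        "Return ONLY the updated frontmatter block.",
        target_frontmatter,
    ]
    return "\n\n".join(parts)


# Placeholder inputs, not real exemplars from the run.
prompt = build_just_ask_prompt("name: demo-skill", ["name: passing-example"] * 8)
assert "Here are 8 exemplar(s)" in prompt
```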

Step 3 — what the AI sent back. It rewrote the header: added two tags (automation, github-actions) and changed the compatibility line. It was confident it had improved the file.

Step 4 — we re-scored the AI's version with the same rubric:

started at: 86 / 100
after the AI's header edit: 72 / 100
result: −14 points. The header edit made it WORSE (the body never changed, so the whole drop is the AI's header edit).

That is the whole problem in one example: just asking the AI to "improve the header" often makes it worse — and nothing catches the mistake, because nobody checked the edit before keeping it.

Step 5 — now the Refiner, on a different skill (one that started at 53/100). Same AI, same kind of edit — but this time the edit is scored before it's kept:

started at: 53 / 100
Refiner's scored edit: 62 / 100 (+9)
the check's decision: ACCEPTED. The edit improved the score, so keep it.

The Refiner's superpower isn't the AI — it's the check around the AI: score every proposed edit, and only keep it if it genuinely improved things, otherwise throw it away. When the AI degrades a skill (like Step 4), that safety net catches it and discards it. That is the entire difference between the two arms — and it's why "just ask the model" loses.
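Applied to the two worked examples above, the check reduces to a single comparison. This is a simplified sketch that compares total scores only; the real check is described as working per quality dimension, and the function name is our own.

```python
from typing import Tuple


def gate(score_before: int, score_after: int) -> Tuple[str, int]:
    """Return (decision, kept_score) for one proposed edit."""
    if score_after > score_before:
        return ("ACCEPTED", score_after)
    return ("REJECTED", score_before)  # discard the edit, keep the original


print(gate(86, 72))  # ('REJECTED', 86): the Step 4 edit would be thrown away
print(gate(53, 62))  # ('ACCEPTED', 62): the Step 5 edit is kept
```

With the gate in place, the worst an edit can do is nothing; without it, the −14 from Step 4 goes straight into the result.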

3. What happened

The Refiner won, clearly and measurably:

Arm A (just ask): header edits made scores worse on average (−6.1 points)
Arm B (Refiner): header edits made scores better on average (+0.58 points)
head-to-head gap: +6.68 points in the Refiner's favor, per skill
is it real or luck? p < 0.000001. If the two arms were truly equal, a gap this large would turn up by chance less than one time in a million.
how big an effect? Cohen's d = 1.49, a "large" effect by the standard yardstick (0.8 and above counts as large)
specimens: 60 skill files, each tested both ways
cost: $0.00
verdict: PROCEED. Build the Refiner; the loop earns its keep.
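For anyone re-deriving these figures from the raw logs, the paired statistics can be computed with the standard library alone. This is a sketch under assumptions: it takes "Cohen's d" to mean the paired version (mean of the per-skill differences divided by their standard deviation), which this page does not confirm, and the per-skill numbers below are toy values, not data from the run.

```python
import math
import statistics
from typing import Dict, Sequence


def paired_summary(arm_a_deltas: Sequence[float],
                   arm_b_deltas: Sequence[float]) -> Dict[str, float]:
    """Summarise a paired comparison: each skill is scored under both arms."""
    if len(arm_a_deltas) != len(arm_b_deltas):
        raise ValueError("each skill needs a score change under both arms")
    diffs = [b - a for a, b in zip(arm_a_deltas, arm_b_deltas)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)   # sample standard deviation
    cohens_d = mean_diff / sd_diff      # paired effect size (assumed variant)
    t_stat = cohens_d * math.sqrt(n)    # paired t statistic, n - 1 d.o.f.
    return {"n": n, "mean_diff": mean_diff, "cohens_d": cohens_d, "t": t_stat}


# Toy per-skill score changes for five imaginary skills.
summary = paired_summary([-5, -7, -6, -4, -8], [1, 0, 1, 0, 1])
assert summary["n"] == 5

# The headline gap is just the difference of the two reported means.
assert math.isclose(0.58 - (-6.1), 6.68)
```

The p-value then comes from the t distribution with n − 1 degrees of freedom, which needs a statistics package beyond the standard library.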

The honest footnote. The Refiner won decisively, but it only reached about 39% of the gain we'd hoped for going in. So the verdict is "the idea works and is worth building" — not "the idea is finished." We say that plainly rather than rounding up, because a result you can't trust isn't worth signing.

4. The proof — verify it without trusting us

Anyone can claim a result. What makes this different: the result files are sealed into a tamper-evident public record. We took the exact result files, fingerprinted them, and signed that fingerprint into the Sigstore public transparency log (Rekor), the same public-good system behind npm and Google's supply-chain security. No account, no secret key; the signer is our automated lab pipeline, not a person who could fake it.

That gives you a permanent receipt proving these exact results existed, unaltered, at a known time:

the public receipt: Rekor log index 1689291334, the tamper-evident entry, readable by anyone, forever
signed at: 2026-06-01 04:45 UTC
the signed bundle: evidence-bundle.sigstore.json (signature + certificate + proof, in one file)
verify it yourself: step-by-step verification guide (one cosign command, no account)

What the signature proves: these are the real, unaltered result files, signed by our lab pipeline at that time. What it does not claim: it does not assert the science is beyond critique — a signature attests authenticity, not correctness. The science stands on the pre-registered method below, open for anyone to challenge.
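The "fingerprint" idea is easy to see in miniature. The sketch below is not the verification procedure (that is the cosign command in the guide); it only illustrates, assuming a SHA-256 digest, why a signed fingerprint pins the exact bytes of a file. The file contents are made up.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """SHA-256 digest of the exact bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, signed_fingerprint: str) -> bool:
    """True only if the bytes still hash to the fingerprint that was signed."""
    return fingerprint(data) == signed_fingerprint


# Hypothetical result file and a copy with one character changed.
original = b'{"verdict": "PROCEED"}'
tampered = b'{"verdict": "PROCEEDS"}'

signed = fingerprint(original)
assert is_unaltered(original, signed)
assert not is_unaltered(tampered, signed)  # any edit changes the fingerprint
```

What the transparency log adds on top is the public, timestamped record of which fingerprint was signed, and by whom.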

5. See everything — down to the raw logs

Every layer is public. Click down as deep as you want:

6. Why this matters — the one-line version

We tested whether our Refiner's "score-the-edit-and-only-keep-it-if-better" loop beats just-asking-the-model on the config-header edit. It won clearly, on a free model at zero cost, and we sealed the proof into a public log so nobody has to take our word for it.

That last part is the whole game. Plenty of tools publish a number. Almost none let you cryptographically confirm the number wasn't quietly edited after the fact. That confirmable receipt is what "audit-first" actually means, and it's the layer a scoreboard of unsigned numbers can't offer.

7. What this does not prove yet — and the next experiment

Being honest about the edges is the point of an audit-first lab, so plainly:

We publish the limits next to the result on purpose. A number you can't trust isn't worth signing — and neither is a claim that hides its own scope.
