37 changes: 37 additions & 0 deletions benchmark-leakage-audit-assistant/README.md
@@ -0,0 +1,37 @@
# Benchmark Leakage Audit Assistant

This self-contained module adds an AI research assistant component for pre-release benchmark hygiene. It helps reviewers catch evaluation leakage before a paper, model, or scientific benchmark result is published.

## What It Checks

- Train/test overlap by record ID or normalized content fingerprint
- Benchmark contamination in the training corpus
- Final holdout or test set use during model selection
- Missing split provenance, such as a deterministic split method, seed, or manifest hash
- Missing reproducibility packet evidence, such as lockfiles, manifests, a code archive, and a preregistration record
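
As an illustration of the first check, the sketch below flags test records that share an ID or a content fingerprint with a training record. It is not the module's implementation: the helper names `fingerprint` and `findOverlap`, the normalization rule (lowercase, collapsed whitespace), and the choice of SHA-256 are all assumptions made for the example.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical normalization: lowercase, collapse whitespace runs, trim.
// The module's real normalization may differ.
function fingerprint(text) {
  const normalized = String(text).toLowerCase().replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// Returns one entry per test record that matches a training record,
// first by record ID, then by normalized content fingerprint.
function findOverlap(trainRecords, testRecords) {
  const trainIds = new Set();
  const trainPrints = new Map();
  for (const record of trainRecords) {
    trainIds.add(record.id);
    trainPrints.set(fingerprint(record.text), record.id);
  }

  const overlaps = [];
  for (const record of testRecords) {
    if (trainIds.has(record.id)) {
      overlaps.push({ kind: "id", trainId: record.id, testId: record.id });
      continue;
    }
    const matchId = trainPrints.get(fingerprint(record.text));
    if (matchId !== undefined) {
      overlaps.push({ kind: "fingerprint", trainId: matchId, testId: record.id });
    }
  }
  return overlaps;
}

// Illustrative records, not taken from the module's test suite.
const trainRecords = [
  { id: "a", text: "Reserved  cohort sample." },
  { id: "b", text: "Other record" }
];
const testRecords = [
  { id: "c", text: "reserved cohort sample." },
  { id: "b", text: "Different text" }
];
console.log(findOverlap(trainRecords, testRecords));
```

Checking IDs before fingerprints keeps the report short when both match. A fingerprint match alone catches records that were re-imported under a new ID.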

## Run It

```bash
node benchmark-leakage-audit-assistant/test.js
node benchmark-leakage-audit-assistant/demo.js
```

The module uses only Node.js standard library APIs.

## Public API

```js
const { auditBenchmarkLeakage } = require("./index.js");

const audit = auditBenchmarkLeakage(project);
console.log(audit.summary.releaseDecision);
console.log(audit.findings);
console.log(audit.reviewerPacket.tasks);
```

The audit returns a release decision of `pass`, `needs-remediation`, or `block`, plus reviewer-ready findings with evidence and remediation tasks.
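
One way to use the release decision is to gate a CI job on it. The following is a minimal sketch under stated assumptions: the three decision strings come from this README, while the helper name `exitCodeForDecision` and the exit-code mapping are hypothetical choices, not part of the module.

```javascript
// Hypothetical mapping from release decision to process exit code.
// Only the decision strings themselves come from the README.
const EXIT_CODES = { pass: 0, "needs-remediation": 1, block: 2 };

function exitCodeForDecision(decision) {
  if (!Object.prototype.hasOwnProperty.call(EXIT_CODES, decision)) {
    throw new Error(`Unknown release decision: ${decision}`);
  }
  return EXIT_CODES[decision];
}

// In a CI script, the decision would come from the audit result, e.g.:
//   process.exitCode = exitCodeForDecision(audit.summary.releaseDecision);
for (const decision of ["pass", "needs-remediation", "block"]) {
  console.log(`${decision} -> ${exitCodeForDecision(decision)}`);
}
```

Throwing on an unrecognized decision means that a future change to the decision vocabulary fails the job loudly instead of passing silently.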

## Demo

The included `demo.gif` shows the module blocking a release candidate with train/test overlap, benchmark contamination, held-out tuning, weak split provenance, and missing reproducibility artifacts.
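
To make "weak split provenance" and "missing reproducibility artifacts" concrete, here is a rough sketch of how such gaps could be listed for a project shaped like the one in `demo.js`. The field names mirror that demo project. The helper names `isBlank` and `listProvenanceGaps` and the rule that any blank field or non-`true` artifact counts as a gap are assumptions, and the audit in `index.js` may apply different rules.

```javascript
// Treats null, undefined, and whitespace-only values as blank.
function isBlank(value) {
  return value === null || value === undefined || String(value).trim() === "";
}

// Lists blank split-provenance fields and artifacts not marked `true`.
// Assumed rule for illustration; not the module's actual logic.
function listProvenanceGaps(project) {
  const gaps = [];
  const split = project.split || {};
  for (const field of ["method", "seed", "manifestHash"]) {
    if (isBlank(split[field])) gaps.push(`split.${field}`);
  }
  const artifacts = project.artifacts || {};
  for (const [name, present] of Object.entries(artifacts)) {
    if (present !== true) gaps.push(`artifacts.${name}`);
  }
  return gaps;
}

// Same `split` and `artifacts` values as the demo project.
const sampleProject = {
  split: { method: "manual export", seed: "", manifestHash: null },
  artifacts: {
    rawDataManifest: true,
    splitManifest: false,
    environmentLock: false,
    codeArchive: true,
    preregistration: false
  }
};
console.log(listProvenanceGaps(sampleProject));
```

This sketch only checks that a split method is recorded. Deciding whether a method such as `"manual export"` is deterministic would need a rule that the source does not specify.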
Binary file added benchmark-leakage-audit-assistant/demo.gif
84 changes: 84 additions & 0 deletions benchmark-leakage-audit-assistant/demo.js
@@ -0,0 +1,84 @@
const { auditBenchmarkLeakage } = require("./index.js");

const demoProject = {
  title: "NeuroImaging Benchmark Release Candidate",
  benchmark: {
    name: "NeuroBench-26",
    items: [
      {
        id: "nb26-hidden-17",
        title: "Hidden fMRI cohort sample",
        text: "Reserved NeuroBench-26 fMRI cohort with blinded diagnostic labels."
      }
    ]
  },
  datasets: {
    train: [
      {
        id: "train-cohort-13",
        source: "NeuroBench-26 pre-release mirror",
        text: "Reserved NeuroBench-26 fMRI cohort with blinded diagnostic labels."
      },
      {
        id: "subject-0042",
        source: "lab import",
        text: "Subject 0042 resting-state network features with quality-control notes."
      }
    ],
    validation: [
      {
        id: "subject-1182",
        source: "validation import",
        text: "Subject 1182 task-state network features with adjudicated QC status."
      }
    ],
    test: [
      {
        id: "subject-0042",
        source: "holdout import",
        text: "Subject 0042 resting-state network features with quality-control notes."
      }
    ]
  },
  split: {
    method: "manual export",
    seed: "",
    manifestHash: null
  },
  experiments: [
    {
      id: "exp-neuro-7",
      usedForSelection: "test",
      notes: "Selected final model using best test AUROC after evaluating four checkpoints."
    }
  ],
  artifacts: {
    rawDataManifest: true,
    splitManifest: false,
    environmentLock: false,
    codeArchive: true,
    preregistration: false
  }
};

const audit = auditBenchmarkLeakage(demoProject);

console.log(`Benchmark leakage audit: ${audit.summary.projectTitle}`);
console.log(`Decision: ${audit.summary.releaseDecision}`);
console.log(`Reproducibility confidence: ${audit.summary.reproducibilityConfidence}`);
console.log(`Findings: ${audit.summary.findingCount}`);
console.log("");

for (const finding of audit.findings) {
  console.log(`[${finding.severity.toUpperCase()}] ${finding.title}`);
  for (const evidence of finding.evidence) {
    console.log(`  - ${evidence}`);
  }
  console.log(`  Remediation: ${finding.remediation}`);
  console.log("");
}

console.log("Reviewer tasks:");
for (const task of audit.reviewerPacket.tasks) {
  console.log(`- ${task}`);
}