GitHub - JayFarei/datafetch: A Search-as-Code adaptive retrieval system that crystallises query shape from agent usage, per-tenant, over a polymorphic document store.

     _       _        __      _       _
  __| | __ _| |_ __ _ / _| ___| |_ ___| |__
 / _` |/ _` | __/ _` | |_ / _ \ __/ __| '_ \
| (_| | (_| | || (_| |  _|  __/ || (__| | | |
 \__,_|\__,_|\__\__,_|_|  \___|\__\___|_| |_|

  your queries / your interface
  a dataset harness for coding agents

Get it going

Paste this prompt into your coding agent (Claude Code, Codex, Pi, or OpenCode) and let it bootstrap datafetch and reproduce the SkillCraft result end to end:

Set up datafetch, a dataset harness for coding agents
(https://github.com/JayFarei/datafetch), and reproduce its SkillCraft result.
Do this, reading as you go:

1. Clone the repo and read README.md top to bottom.
2. Install it: `pnpm install`.
3. Read eval/skillcraft/README.md to understand the three arms
   (skillcraft-base, skillcraft-skill, datafetch-learned) and the auth/driver
   the harness expects.
4. Run a fast smoke of the datafetch code-mode arm:
     pnpm eval:skillcraft:synthetic:live:smoke
   This drives a coding agent over a few SkillCraft tasks and crystallises
   df.lib.* helpers from accepted trajectories.
5. Inspect what was learned: open the mounted workspace's lib/ and the run
   artifacts, and explain which accepted trajectory became a typed,
   replay-gated df.lib.* call that later tasks reuse instead of re-deriving.
6. For the full 126-task suite behind the 94.4% headline, follow the
   reproducible flow in eval/skillcraft/README.md and run `pnpm eval:skillcraft`.

Then tell me what crystallised into lib/, and how warm-path reuse moved
correctness, token cost, and tool work versus the cold run.

Want a different shape than SkillCraft? Swap steps 4 and 6 for any of the harnesses listed under "Dataset shapes we've tried", or use the generic product path below (server -> attach -> add -> mount -> run -> commit) over your own Hugging Face dataset.

datafetch is a dataset harness for coding agents. It exposes a mounted dataset as a bash-shaped workspace with typed TypeScript handles, writable intent scripts, structured run artifacts, and tenant-local learned interfaces.

The rule is deliberately narrow:

The system only learns from data-molding logic that was written into the
workspace and executed by datafetch.

Agents can inspect freely. Reusable learning comes from committed visible code that returns df.answer(...) with evidence, coverage, derivation, and lineage.
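
To make the shape of that rule concrete, here is a minimal sketch of what an evidence-backed answer could carry. Only the four field names (evidence, coverage, derivation, lineage) come from the sentence above. The `AnswerEnvelope` and `EvidenceItem` types, the field types, and the `hasFullProvenance` helper are hypothetical and are not taken from df.d.ts.

```typescript
// Hypothetical types: only the four provenance field names come from the README.
interface EvidenceItem {
  source: string; // assumed: a collection or row identifier
  excerpt: string; // assumed: the supporting snippet
}

interface AnswerEnvelope<T> {
  value: T;
  evidence: EvidenceItem[];
  coverage: { scanned: number; total: number }; // assumed representation
  derivation: string[]; // ordered, human-readable steps
  lineage: string[]; // calls or scripts that produced the value
}

// Assumed check: every provenance field must be populated and self-consistent.
function hasFullProvenance<T>(a: AnswerEnvelope<T>): boolean {
  return (
    a.evidence.length > 0 &&
    a.coverage.total > 0 &&
    a.coverage.scanned <= a.coverage.total &&
    a.derivation.length > 0 &&
    a.lineage.length > 0
  );
}

// Illustrative values only; none of these identifiers exist in the repo.
const exampleAnswer: AnswerEnvelope<number> = {
  value: 42,
  evidence: [{ source: "db/traces#17", excerpt: "retry loop detected" }],
  coverage: { scanned: 120, total: 120 },
  derivation: ["filter traces by tag", "count matches"],
  lineage: ["scripts/answer.ts"],
};

const exampleIsComplete = hasFullProvenance(exampleAnswer);
```

An answer with an empty evidence list would fail this check, which mirrors the idea that free inspection alone does not produce reusable learning.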

Dataset shapes we've tried

The thesis only bites when the same dataset is queried repeatedly with reusable intent structure. To find where that holds and where it doesn't, we built harnesses across deliberately different data shapes. Honest results, including the negatives:

| Dataset | Data shape | What it probed | Outcome |
| --- | --- | --- | --- |
| SkillCraft | Synthetic tool-composition families (21 families × 6 difficulty levels, 126 tasks) fanning out over real tool APIs | Reuse rate, token amortisation, and a 7-arm governance / persistence ablation | 94.4% pass (119/126) at ~172× lower token cost vs the vanilla ceiling; +7.9pp on the hard tier. Cross-session cost amortisation falsified on the hardest fan-out arm: reuse fires, but a one-shot inline rewrite was cheaper there. |
| FinChain | Parameterised symbolic financial reasoning chains (58 topics × 5 levels) with step-aligned grading | Correctness vs the published paper baseline; substrate-ON vs substrate-OFF | Matches/exceeds the paper baseline. Pure-compute trajectories give the crystallisation gate nothing to learn → substrate delta structurally ≈ 0. |
| CRAG | Open-domain web QA across 5 domains (2,706 rows, 8 question types, tri-state grading) | Governance-under-staleness; zero-source SDK onboarding | Corpus + grader built. Shape probe found tool-only trajectories collapse to a single fan-out signature, and within-session reuse = 0: a correctness landmine, not a win. |
| FinQA | Tabular S&P 500 10-K filing QA (8,281 pairs) with compilable arithmetic gold programs | The cold `db/` → warm `lib/` arc; gold programs as the template for crystallised helpers | Seed library + original demo spine. The first proof that an accepted trajectory can become a typed, reusable `df.lib.*` call. |
| ProductFlow | 3-episode micro-eval over a live REST API (jsonplaceholder) | The full crystallise → discover → reuse loop on a real product API outside SkillCraft | ~1.7× token delta. Auto-crystallised helpers came out thinner than the model's inline rewrite; this set our adversarial baseline (inline-rewrite, no persistence). |
| OpenTraces | Private polymorphic event-log store (~11.6GB: 1,592 traces, 861k events, 13+ discriminated event types, 4 developer personas) | Correctness on a genuinely model-prior-free store; per-tenant library divergence | Corpus sealed, 200+ question pack built; spread probe passed (median ~55× amortisation surface). Current primary instrument for the correctness claim. |

Also scouted but not yet harnessed: τ³-bench (multi-turn policy/transactional agent tasks), BIRD-SQL (cross-database text-to-SQL), and FinReflectKG-MultiHop / FinAgentBench (document-grounded financial KG retrieval).
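
The amortisation argument in this section is simple arithmetic, sketched below under stated assumptions. `passRatePercent` reproduces the 119/126 headline. `breakEvenRuns` is a hypothetical cost model (a one-off crystallisation cost, a per-run warm cost, a per-run inline-rewrite cost), and the token numbers in the example are invented for illustration. They are not measurements from any harness.

```typescript
// Pass rate rounded to one decimal place.
function passRatePercent(passed: number, total: number): number {
  return Math.round((passed / total) * 1000) / 10;
}

// Hypothetical cost model: reuse pays off at the smallest n where
//   crystalliseCost + n * warmCost < n * inlineCost.
// Returns null when a warm run is not cheaper than an inline rewrite,
// in which case reuse never amortises under this model.
function breakEvenRuns(
  crystalliseCost: number,
  warmCost: number,
  inlineCost: number
): number | null {
  const savingPerRun = inlineCost - warmCost;
  if (savingPerRun <= 0) return null;
  return Math.floor(crystalliseCost / savingPerRun) + 1;
}

const headlinePassRate = passRatePercent(119, 126); // 94.4

// Invented token costs, for illustration only.
const runsToBreakEven = breakEvenRuns(9000, 100, 1000); // 11
const neverBreaksEven = breakEvenRuns(9000, 1200, 1000); // null
```

The `null` case corresponds to the negative result reported for the hardest fan-out arm, where a one-shot inline rewrite was cheaper than reuse.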

Quickstart

pnpm install
npm link            # or: pnpm link --global

datafetch server --port 8080

In another shell:

datafetch attach http://localhost:8080 --tenant demo

datafetch add https://huggingface.co/datasets/OpenTraces/opentraces-devtime --json
datafetch list --json
datafetch inspect opentraces-devtime --json

datafetch mount opentraces-devtime \
  --tenant demo \
  --intent "Find traces about debugging and produce an evidence-backed summary"

The mount command creates an intent workspace. cd into it and work like a small code project:

cat AGENTS.md
cat df.d.ts
ls db lib scripts

datafetch run scripts/scratch.ts
datafetch commit scripts/answer.ts
cat result/answer.md
cat result/validation.json

Workspace Contract

Each mounted intent workspace is a worktree-shaped environment:

AGENTS.md
CLAUDE.md -> AGENTS.md
df.d.ts
db/
lib/
scripts/
  scratch.ts
  answer.ts
  helpers.ts
tmp/runs/
result/

The directories have stable meanings:

db/ is immutable dataset context and typed collection primitives.
lib/ is the tenant-local learned-interface surface.
scripts/ is writable user space for visible intent programs.
tmp/runs/ contains notebook-style exploratory run artifacts.
result/ contains the committed answer, lineage, validation, replay test, and worktree commit history.

datafetch run is exploratory. datafetch commit is the final answer path. Only committed visible code that passes validation is eligible for learning.
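
The eligibility rule above can be read as a three-part predicate. The sketch below is an assumption-laden illustration: the `ExecutionRecord` type and its fields are hypothetical, and treating "visible code" as "a script under scripts/" is an interpretation of the directory meanings listed above, not a documented check.

```typescript
// Hypothetical record of one execution; field names are illustrative only.
interface ExecutionRecord {
  command: "run" | "commit";
  scriptPath: string; // path relative to the workspace root
  validationPassed: boolean;
}

// Assumed interpretation: "visible code" means a script under scripts/.
function isVisibleScript(scriptPath: string): boolean {
  return scriptPath.indexOf("scripts/") === 0;
}

// Only committed, visible, validated code is a candidate for learning.
function isEligibleForLearning(r: ExecutionRecord): boolean {
  return (
    r.command === "commit" &&
    isVisibleScript(r.scriptPath) &&
    r.validationPassed
  );
}

const exploratory: ExecutionRecord = {
  command: "run",
  scriptPath: "scripts/scratch.ts",
  validationPassed: true,
};
const committed: ExecutionRecord = {
  command: "commit",
  scriptPath: "scripts/answer.ts",
  validationPassed: true,
};

const exploratoryEligible = isEligibleForLearning(exploratory); // false
const committedEligible = isEligibleForLearning(committed); // true
```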

Dataset Initialization

The server owns dataset initialization. For the current prototype, supported datasets are registered from Hugging Face dataset URLs or a server whitelist. Initialization publishes the mount, samples the dataset, writes descriptors and typed handles, then creates source templates for future workspaces:

$DATAFETCH_HOME/sources/<source-id>/
  source.json
  manifest.json
  adapter-profile.json
  init-context.json
  init-agent.json
  templates/
    AGENTS.md
    CLAUDE.md
    scripts/scratch.ts
    scripts/answer.ts

The init template can be deterministic or authored through the Flue-backed datafetch_init_mount_template skill. The client agent does not need to know which path produced the template; it just receives a normal workspace.
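
As a small aid for checking an initialized source by hand, the sketch below expands the layout listed above into full paths. The file names are copied from that listing. The `sourceTemplatePaths` function itself is hypothetical and is not part of the datafetch SDK.

```typescript
// File names copied from the layout above, relative to sources/<source-id>/.
const SOURCE_FILES: string[] = [
  "source.json",
  "manifest.json",
  "adapter-profile.json",
  "init-context.json",
  "init-agent.json",
  "templates/AGENTS.md",
  "templates/CLAUDE.md",
  "templates/scripts/scratch.ts",
  "templates/scripts/answer.ts",
];

// Hypothetical helper: list the paths one would expect after initialization.
function sourceTemplatePaths(datafetchHome: string, sourceId: string): string[] {
  const home = datafetchHome.replace(/\/+$/, ""); // drop trailing slashes
  const root = home + "/sources/" + sourceId;
  return SOURCE_FILES.map((file) => root + "/" + file);
}

const expectedPaths = sourceTemplatePaths("/work/.datafetch/", "opentraces-devtime");
```

Here `expectedPaths[0]` is `/work/.datafetch/sources/opentraces-devtime/source.json`.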

CLI Surface

Server:
  datafetch server [--port 8080] [--base-dir <path>] [--datasets <file>]

Client/catalog:
  datafetch attach <server-url> --tenant <id>
  datafetch add <dataset-url> [--id <local-id>] [--json]
  datafetch list [--json]
  datafetch inspect <source-id> [--json]

Intent workspace:
  datafetch mount <source-id> --tenant <id> --intent '<intent>' [--path <dir>]
  datafetch run [scripts/scratch.ts]
  datafetch commit [scripts/answer.ts]

Discovery:
  datafetch apropos <query> [--json]
  datafetch man <df.lib.name>

Legacy/demo:
  datafetch session ...
  datafetch plan ...
  datafetch execute ...
  datafetch tsx ...
  datafetch publish <mount-id> --uri <atlas-uri> --db <db-name>
  datafetch demo [--mount finqa-2024] [--no-cache]

The default product path is server -> attach -> add/list/inspect -> mount -> run -> commit.
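
For scripted setups, the default product path can be written out as an ordered command list. The sketch below only uses commands and flags that appear in the CLI surface above. The `ProductPathOptions` type and the `productPath` and `shellQuote` helpers are hypothetical conveniences, not part of datafetch.

```typescript
interface ProductPathOptions {
  port: number;
  serverUrl: string;
  tenant: string;
  datasetUrl: string;
  sourceId: string;
  intent: string;
}

// POSIX single-quote escaping so the intent survives the shell intact.
function shellQuote(s: string): string {
  return "'" + s.replace(/'/g, "'\\''") + "'";
}

// Commands in the documented order. The server runs in its own shell, and
// run/commit are meant to be executed from inside the mounted workspace.
function productPath(o: ProductPathOptions): string[] {
  return [
    "datafetch server --port " + o.port,
    "datafetch attach " + o.serverUrl + " --tenant " + o.tenant,
    "datafetch add " + o.datasetUrl + " --json",
    "datafetch list --json",
    "datafetch inspect " + o.sourceId + " --json",
    "datafetch mount " + o.sourceId + " --tenant " + o.tenant +
      " --intent " + shellQuote(o.intent),
    "datafetch run scripts/scratch.ts",
    "datafetch commit scripts/answer.ts",
  ];
}

const quickstartCommands = productPath({
  port: 8080,
  serverUrl: "http://localhost:8080",
  tenant: "demo",
  datasetUrl: "https://huggingface.co/datasets/OpenTraces/opentraces-devtime",
  sourceId: "opentraces-devtime",
  intent: "Find traces about debugging and produce an evidence-backed summary",
});
```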

Seed Packs

Generic seed functions and skills live under:

seeds/generic/

Domain-specific demo/eval packs live under:

seeds/domains/<domain>/

By default the runtime mirrors only generic seeds into $DATAFETCH_HOME/lib/__seed__/. To expose a domain pack, pass seedDomains in code or set:

DATAFETCH_SEED_DOMAINS=finqa

The FinQA table helpers remain available for the historical demo and live acceptance scripts, but they are no longer part of every generic dataset mount.
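
A minimal sketch of how a comma-separated DATAFETCH_SEED_DOMAINS value could be turned into a list of domain names. The trimming, empty-entry skipping, and de-duplication are assumptions about sensible behaviour, not a description of what the runtime does, and `parseSeedDomains` is a hypothetical name.

```typescript
// Hypothetical parser for a comma-separated list of seed domain names.
// An unset or empty value yields no domain packs (generic seeds only).
function parseSeedDomains(raw: string | undefined): string[] {
  if (!raw) return [];
  return raw
    .split(",")
    .map((name) => name.trim())
    // keep non-empty names, first occurrence only
    .filter((name, index, all) => name.length > 0 && all.indexOf(name) === index);
}

const onlyFinqa = parseSeedDomains("finqa"); // ["finqa"]
const noDomains = parseSeedDomains(undefined); // []
```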

Test Harnesses

Fast local verification:

pnpm typecheck
pnpm test

Acceptance harnesses:

bash tests/acceptance/run-all.sh

The default acceptance run covers no-LLM/no-Atlas flows plus the public Hugging Face catalog path. Live client-agent and Atlas/FinQA loops are opt-in:

RUN_AGENT_E2E=1 ATLAS_URI='mongodb+srv://...' bash tests/acceptance/run-all.sh

The harness matrix is documented in tests/acceptance/README.md.

Telemetry For Evals

Set these during benchmark runs:

DATAFETCH_TELEMETRY=1
DATAFETCH_TELEMETRY_LABEL=<scenario-or-benchmark-id>
DATAFETCH_SEARCH_MODE=<baseline|learned|adapter-name>

Telemetry is written under:

$DATAFETCH_HOME/telemetry/events.jsonl

Each event captures the snippet phase, trajectories, call primitives, cost signals, answer status, validation, and enough labels to compare datafetch against alternative agentic search baselines.
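
To compare arms after a benchmark run, the events file can be read as JSON Lines and grouped by search mode. The sketch below assumes a simplified event shape: the `label`, `searchMode`, `tokens`, and `answerStatus` fields are hypothetical stand-ins, since the actual events.jsonl schema is not spelled out here.

```typescript
// Hypothetical event shape; the real schema may use different field names.
interface TelemetryEvent {
  label: string;
  searchMode: string;
  tokens: number;
  answerStatus: string;
}

// Parse JSON Lines text: one JSON object per non-empty line.
function parseJsonl(text: string): TelemetryEvent[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as TelemetryEvent);
}

// Mean token count per search mode, restricted to one benchmark label.
function meanTokensByMode(
  events: TelemetryEvent[],
  label: string
): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const e of events) {
    if (e.label !== label) continue;
    const t = totals[e.searchMode] || { sum: 0, count: 0 };
    t.sum += e.tokens;
    t.count += 1;
    totals[e.searchMode] = t;
  }
  const means: Record<string, number> = {};
  for (const mode of Object.keys(totals)) {
    means[mode] = totals[mode].sum / totals[mode].count;
  }
  return means;
}

// Invented sample lines, for illustration only.
const sampleLog = [
  '{"label":"demo","searchMode":"baseline","tokens":100,"answerStatus":"ok"}',
  '{"label":"demo","searchMode":"baseline","tokens":300,"answerStatus":"ok"}',
  '{"label":"demo","searchMode":"learned","tokens":50,"answerStatus":"ok"}',
  '{"label":"other","searchMode":"learned","tokens":999,"answerStatus":"ok"}',
].join("\n");

const sampleMeans = meanTokensByMode(parseJsonl(sampleLog), "demo");
```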

Environment

DATAFETCH_HOME - server/workspace state root. Defaults to <cwd>/.datafetch.
DATAFETCH_SERVER_URL - client default server URL.
DATAFETCH_SESSION - legacy snippet/session fallback.
DATAFETCH_SEED_DOMAINS - comma-separated optional seed packs.
DATAFETCH_INIT_MODEL - model for LLM-authored dataset init templates.
DATAFETCH_LLM_MODEL / DF_LLM_MODEL - fallback model for Flue agent bodies.
HF_DATASETS_SERVER_URL - override Hugging Face Dataset Viewer endpoint.
ATLAS_URI / MONGODB_URI - optional Atlas demo/eval connection string.
ATLAS_DB_NAME / MONGODB_DB_NAME - optional Atlas database override.
DATAFETCH_SKIP_ENV_FILE=1 - skip automatic .env loading.

Legacy ATLASFS_HOME and ATLASFS_SKIP_ENV_FILE are still honored for old local setups.
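
The defaults and legacy fallbacks above can be sketched as two small resolution functions. The variable names come from this section. The precedence (new name first, legacy name second, then the `<cwd>/.datafetch` default) and the treatment of `"1"` as the only truthy value are assumptions, and both function names are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// Assumed precedence: DATAFETCH_HOME, then legacy ATLASFS_HOME, then default.
function resolveDatafetchHome(env: Env, cwd: string): string {
  const explicit = env["DATAFETCH_HOME"] || env["ATLASFS_HOME"];
  if (explicit) return explicit;
  return cwd.replace(/\/+$/, "") + "/.datafetch";
}

// Assumed: either the current or the legacy flag, set to "1", skips .env loading.
function shouldSkipEnvFile(env: Env): boolean {
  return (
    env["DATAFETCH_SKIP_ENV_FILE"] === "1" ||
    env["ATLASFS_SKIP_ENV_FILE"] === "1"
  );
}

const defaultHome = resolveDatafetchHome({}, "/work/project");
const legacyHome = resolveDatafetchHome({ ATLASFS_HOME: "/old/state" }, "/work/project");
```

In a Node.js process, `process.env` and `process.cwd()` would be the natural arguments. They are left out here so the sketch stays free of runtime dependencies.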

Source Layout

The substrate (src/) is dataset-neutral. Each dataset/benchmark lives under its own eval/<dataset>/ directory and plugs into the substrate through the documented contracts (tool bridge, adapter profile, answer kit). Adding a dataset should not require a src/ change. See architecture.md § the substrate / dataset boundary.

bin/                  CLI binary shim
kb/docs/              product, runtime, learning-loop, architecture, eval docs
kb/                   knowledge base (plans, prd, background research, archive)
skills/datafetch/     installable client-agent skill
tests/                vitest unit/integration tests (substrate)
tests/acceptance/     substrate CLI/server e2e acceptance harnesses
experiments/          experiment log by episode (episodes/ + log/)

eval/                 ALL eval work (depends on src/, never the reverse)
eval/harness/         eval drivers (skillcraft/finchain/sac/crag runners)
eval/seeds/           substrate seed library: generic + domain packs
                      (runtime-loaded via locateRepoSubdir("eval/seeds/..."))
eval/tests/           eval-specific vitest suites (sac-*, crag, planner)
eval/scripts/         cross-suite eval orchestration scripts
eval/skillcraft/      SkillCraft benchmark harness (21 families x 6 levels)
eval/productFlow/     non-benchmark product-flow cross-eval
eval/finchain/        FinChain benchmark harness
                      (each: configs, prepare/runner scripts,
                       results/ — gitignored)

src/runtime/          cross-cutting substrate utilities: answer-kit emitter
                      + generic syntax-slip rewriters, tool catalog types
src/snippet/          TypeScript snippet runtime + df.* binding + tool bridge
src/observer/         trajectory gate and learned-interface authoring
src/hooks/            VFS hook registry (df.lib.<name> contract surface)
src/adapter/          dataset substrate adapters
src/bootstrap/        sample, infer, synthesize, manifest emit
src/bash/             just-bash session integration
src/cli/              CLI command implementations
src/demo/             FinQA two-question demo
src/discovery/        library search / apropos
src/flue/             Flue dispatcher and skill loading
src/sdk/              public TypeScript SDK primitives
src/server/           Hono data plane and catalog routes
src/trajectory/       call-scope and lineage recording

Local generated state stays ignored: .datafetch/, .atlasfs/, .snippet-cache/, artifacts/, dist/, and every eval/<dataset>/results/.
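
The dependency direction noted in the layout (eval/ depends on src/, never the reverse) lends itself to a simple lint-style check. The sketch below is hypothetical: it assumes import targets have already been resolved to repo-relative paths with forward slashes, and it is not a tool that exists in the repository.

```typescript
// One resolved import edge, with both paths relative to the repo root.
interface ImportEdge {
  importer: string;
  imported: string;
}

function isUnder(path: string, dir: string): boolean {
  return path.indexOf(dir + "/") === 0;
}

// A violation is any file under src/ importing something under eval/.
function findBoundaryViolations(edges: ImportEdge[]): ImportEdge[] {
  return edges.filter(
    (edge) => isUnder(edge.importer, "src") && isUnder(edge.imported, "eval")
  );
}

// Invented edges, for illustration only.
const sampleEdges: ImportEdge[] = [
  { importer: "eval/harness/run.ts", imported: "src/sdk/index.ts" }, // allowed
  { importer: "src/cli/main.ts", imported: "src/server/app.ts" }, // allowed
  { importer: "src/runtime/kit.ts", imported: "eval/seeds/generic/x.ts" }, // violation
];

const boundaryViolations = findBoundaryViolations(sampleEdges);
```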

Docs

Status

Prototype. The current useful slice is:

local server;
Hugging Face source registration;
dataset initialization templates;
intent workspace mount;
run/commit artifacts;
telemetry;
optional FinQA learned-interface demo.

Next step: run structured evals comparing normal agentic search against the dataset harness path over repeated intent families.
