Cost and accuracy of long-term graph memory in distributed LLM-based multi-agent systems

A distributed multi-agent system testbed for benchmarking five long-term conversational memory strategies under different network scenarios. See the arXiv paper for the research questions, methodology, statistical Pareto analysis, and full results. This README only covers what the repo contains and how to run it.

Backends compared

| Approach | Library | Storage |
| --- | --- | --- |
| mem0 | mem0 v2.0.2 | Qdrant |
| Graphiti | graphiti-core 0.29 | Neo4j |
| Cognee | cognee 1.0.9 | Neo4j + Qdrant |
| RAG | in-house `RagService` | Qdrant |
| Full Context | in-house `FullContextService` | (in-process) |

Loading and retrieval mirror the upstream evaluation harnesses verbatim (mem0ai/memory-benchmarks, getzep/zep-papers). Evaluation uses the LOCOMO benchmark.

Quick start

Prereqs: Docker + Docker Compose, GNU Make, an OpenAI API key, ~12 GB RAM.

cd testbed
cp .env.example .env          # set OPENAI_API_KEY=sk-...
make build
make start

make experiment-test CONV=0   # smoke test
make experiment               # full publishable sweep

On Linux/macOS, running `sudo bash install.sh` from `testbed/` installs Make and Docker.

Smoke targets

| Target | Messages | Questions | Duration |
| --- | --- | --- | --- |
| `make experiment-test-s` | 5 | 1 (cat 2) | minutes |
| `make experiment-test` | 119 | 3 per category (cats 1-4) | ~1 hour |
| `make experiment-test-l` | 199 | 3 per category (cats 1-4) | hours |

All three targets sweep both network regimes for one CONV (default 0), reuse the load across regimes (KEEP_STATE=1), and run 1 judge call per answer. To restrict a run to specific backends, pass e.g. BACKENDS="mem0 graphiti".

Make targets

| Command | What it does |
| --- | --- |
| `make build` | Build every image from scratch. Aborts if `OPENAI_API_KEY` is missing. |
| `make start` | Bring the full stack up. Langfuse public/secret keys are auto-generated on first run. |
| `make stop` | Stop containers; volumes are preserved. |
| `make clean` | Stop and drop only the memory volumes (`qdrant-data`, `neo4j-data`, `neo4j-logs`). |
| `make reset` | Stop and drop every named volume. Compose-managed bridges are torn down by `stop` and re-created on the next `start`. |
| `make experiment` | Full sweep: 10 LOCOMO convs × {unconstrained, constrained} × 5 backends × 3-judge majority. |
| `make experiment-leg` | Single `(CONV, MODE)` × backends. Knobs: `CONV MODE BACKENDS QUESTIONS MESSAGES Q_PER_TYPE QUESTION_TYPES KEEP_STATE`. |
| `make logs` / `make ps` | Tail logs / list containers. |
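
To make the size of the full sweep concrete, the sketch below enumerates the grid described for `make experiment` and shows one way a 3-judge majority could be reduced to a single verdict. It is an illustration only: the backend identifiers other than `mem0` and `graphiti`, and the helpers `sweep_legs` and `majority`, are assumptions and are not taken from the repo.

```python
from collections import Counter
from itertools import product

# Grid dimensions taken from the `make experiment` description.
CONVS = range(10)                           # 10 LOCOMO conversations
MODES = ("unconstrained", "constrained")    # two network regimes
# Assumed spellings; only "mem0" and "graphiti" appear in this README.
BACKENDS = ("mem0", "graphiti", "cognee", "rag", "full_context")


def sweep_legs():
    """Return every (conv, mode, backend) combination in the full sweep."""
    return list(product(CONVS, MODES, BACKENDS))


def majority(verdicts):
    """Return the verdict that most judges agreed on (hypothetical helper)."""
    return Counter(verdicts).most_common(1)[0][0]


legs = sweep_legs()
print(len(legs))                        # 10 * 2 * 5 = 100
print(majority([True, False, True]))    # True
```

With three judges per answer, each of the 100 backend runs issues three judge calls per question, which is why the smoke targets drop to a single judge call.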

Network partition

Three compose bridges with a single gateway:

edge-net — coordinator, ollama, litellm-edge
cloud-net — responder, memory, qdrant, neo4j, litellm-cloud
mgmt-net — benchmark, langfuse-* (observability + orchestration, never on the data plane)

toxiproxy is the sole container with a foot in both data subnets. Coordinator is the only caller that goes through it; responder↔memory and memory↔storage are direct on cloud-net.

make build auto-picks a non-colliding /16 from a candidate list (172.30, 172.40, 10.42, 192.168.220, …) and pins the chosen subnets + toxiproxy IPs into .env. Pin manually by uncommenting the block in .env.example.
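
The subnet selection can be pictured as an overlap check against networks already present on the host. The following is a minimal sketch of that idea using Python's standard `ipaddress` module, not the logic `make build` actually runs; `pick_free_subnet` and the `in_use` list are hypothetical, and only the first three prefixes from the candidate list are used.

```python
import ipaddress


def pick_free_subnet(candidates, in_use):
    """Return the first candidate network that overlaps none of `in_use`."""
    used = [ipaddress.ip_network(n) for n in in_use]
    for cand in candidates:
        net = ipaddress.ip_network(cand)
        if not any(net.overlaps(u) for u in used):
            return net
    raise RuntimeError("no free candidate subnet")


# First three prefixes from the candidate list, written as /16 networks.
candidates = ["172.30.0.0/16", "172.40.0.0/16", "10.42.0.0/16"]
# Hypothetical networks already taken on the host.
in_use = ["172.30.5.0/24", "192.168.1.0/24"]

print(pick_free_subnet(candidates, in_use))   # 172.40.0.0/16
```

Here `172.30.0.0/16` is skipped because it contains the in-use `172.30.5.0/24`, so the next candidate is chosen.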

Network fault injection

The `unconstrained` regime clears all toxics. The `constrained` regime applies `CONSTRAINED_LATENCY` / `CONSTRAINED_JITTER` / `CONSTRAINED_BANDWIDTH` (defaults 150 ms / 30 ms / 512 KB/s) to toxiproxy's memory and responder proxies, i.e. only the coordinator↔cloud flows. The benchmark service re-verifies the live toxic state mid-run and rejects with HTTP 412 on drift (`dmas/benchmark/app/toxics.py`).
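
The drift check can be thought of as comparing the toxics a regime should have against the list toxiproxy currently reports. The sketch below illustrates that comparison under stated assumptions: it is not the code in `toxics.py`, the helper names are hypothetical, and it simplifies by keying toxics on type only (ignoring stream direction). The attribute names `latency`, `jitter`, and `rate` follow Toxiproxy's latency and bandwidth toxics.

```python
# Defaults quoted in this section: 150 ms latency, 30 ms jitter, 512 KB/s.
DEFAULTS = {"latency_ms": 150, "jitter_ms": 30, "bandwidth_kb_per_s": 512}


def expected_toxics(mode, cfg=DEFAULTS):
    """Map a regime to {toxic type: attributes} in Toxiproxy's JSON shape."""
    if mode == "unconstrained":
        return {}
    if mode == "constrained":
        return {
            "latency": {"latency": cfg["latency_ms"], "jitter": cfg["jitter_ms"]},
            "bandwidth": {"rate": cfg["bandwidth_kb_per_s"]},  # rate is in KB/s
        }
    raise ValueError(f"unknown mode: {mode}")


def has_drifted(mode, live_toxics):
    """True when the live toxic list differs from what the regime expects."""
    live = {t["type"]: t["attributes"] for t in live_toxics}
    return live != expected_toxics(mode)


def status_for(mode, live_toxics):
    """HTTP status a caller might return: 412 on drift, otherwise 200."""
    return 412 if has_drifted(mode, live_toxics) else 200


# Shape of the toxic list as toxiproxy would report it for one proxy.
live = [
    {"type": "latency", "attributes": {"latency": 150, "jitter": 30}},
    {"type": "bandwidth", "attributes": {"rate": 512}},
]
print(status_for("constrained", live))     # 200
print(status_for("unconstrained", live))   # 412
```

Failing fast with 412 (Precondition Failed) keeps a run from silently recording measurements under the wrong regime.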

Network metering

Cross-boundary traffic is read at toxiproxy's edge-net veth: rx_bytes = edge→cloud (network_edge_to_cloud_bytes), tx_bytes = cloud→edge (network_cloud_to_edge_bytes). One chokepoint, no double-count, no intra-cloud inflation. CPU/RAM/disk are still summed per group=edge|cloud label (dmas/benchmark/app/cgroup_metrics.py).
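
On Linux, per-interface byte counters are exposed under `/sys/class/net/<iface>/statistics/`. The sketch below shows how two snapshots of those counters could be turned into the two metrics named above. It assumes the interface name is already known; how the benchmark locates toxiproxy's edge-net veth is not shown here, and `read_counters` and `boundary_bytes` are hypothetical helpers.

```python
from pathlib import Path


def read_counters(iface, root="/sys/class/net"):
    """Read the cumulative rx/tx byte counters for one Linux interface."""
    stats = Path(root) / iface / "statistics"
    return {
        "rx_bytes": int((stats / "rx_bytes").read_text()),
        "tx_bytes": int((stats / "tx_bytes").read_text()),
    }


def boundary_bytes(before, after):
    """Turn two counter snapshots into the two cross-boundary metrics."""
    return {
        "network_edge_to_cloud_bytes": after["rx_bytes"] - before["rx_bytes"],
        "network_cloud_to_edge_bytes": after["tx_bytes"] - before["tx_bytes"],
    }


# Illustrative snapshots taken before and after one experiment leg.
before = {"rx_bytes": 1_000, "tx_bytes": 4_000}
after = {"rx_bytes": 1_750, "tx_bytes": 9_000}
print(boundary_bytes(before, after))
# {'network_edge_to_cloud_bytes': 750, 'network_cloud_to_edge_bytes': 5000}
```

Because the counters are cumulative, only the difference between two reads is meaningful for a single leg.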

Project structure

dmas-memory/
├── paper/                # Manuscript — published on arXiv: 2601.07978
└── testbed/              # Runnable benchmark — `cd testbed` before any `make`
    ├── dmas/
    │   ├── benchmark/        # /experiment endpoint, judges, metrics, CSV writer
    │   ├── coordinator/      # /ask handler with ollama tool-calling
    │   ├── memory/           # mem0 + Graphiti + Cognee + RAG + FullContext
    │   ├── responder/        # Final-answer generator
    │   ├── shared/           # otel_init.py, litellm_usage.py, models.py
    │   ├── litellm/{config-edge,config-cloud}.yaml
    │   └── docker-compose.yml
    ├── experiments/
    │   ├── results.ipynb     # Reproduces every table and chart in the paper
    │   └── results/          # Per-experiment CSVs ({prefix}{backend}_{mode}.csv)
    ├── Makefile
    └── .env.example

Reproducing the paper

experiments/results.ipynb globs every *.csv under experiments/results/ into one DataFrame and rebuilds every table and chart in the paper end-to-end. Each chapter renders a constrained − unconstrained Δ panel underneath its bar chart so the regime impact per backend is always visible.
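
The Δ panels reduce to a per-backend difference of means between the two regimes. The sketch below shows that computation with the standard library only; the notebook itself works on a DataFrame, and the column names `backend`, `mode`, and `latency_s` are assumptions made for illustration, not the actual CSV schema. The glob is non-recursive for simplicity.

```python
import csv
import glob
import os
from collections import defaultdict


def load_rows(results_dir):
    """Read every *.csv directly under results_dir into one list of rows."""
    rows = []
    for path in sorted(glob.glob(os.path.join(results_dir, "*.csv"))):
        with open(path, newline="") as fh:
            rows.extend(csv.DictReader(fh))
    return rows


def regime_delta(rows, metric):
    """Per backend: mean(metric, constrained) - mean(metric, unconstrained)."""
    values = defaultdict(list)
    for row in rows:
        values[(row["backend"], row["mode"])].append(float(row[metric]))
    mean = {key: sum(v) / len(v) for key, v in values.items()}
    backends = sorted({backend for backend, _ in mean})
    return {
        b: mean[(b, "constrained")] - mean[(b, "unconstrained")]
        for b in backends
        if (b, "constrained") in mean and (b, "unconstrained") in mean
    }


# Hypothetical rows; real CSVs would come from load_rows("experiments/results").
rows = [
    {"backend": "mem0", "mode": "unconstrained", "latency_s": "1.0"},
    {"backend": "mem0", "mode": "constrained", "latency_s": "1.5"},
]
print(regime_delta(rows, "latency_s"))   # {'mem0': 0.5}
```

A positive Δ means the metric is higher under the constrained regime; backends missing either regime are skipped rather than reported with a partial value.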

Citation

@misc{wolff2026costaccuracylongtermmemory,
      title={Cost and accuracy of long-term memory in Distributed Multi-Agent Systems based on Large Language Models},
      author={Benedict Wolff and Jacopo Bennati},
      year={2026},
      eprint={2601.07978},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2601.07978},
}

LOCOMO benchmark dataset:

@article{maharana2024evaluating,
  title={Evaluating very long-term conversational memory of llm agents},
  author={Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei},
  journal={arXiv preprint arXiv:2402.17753},
  year={2024}
}

License

See LICENSE.txt.
