rustic-ai/dmas-memory

 
 


Cost and accuracy of long-term graph memory in distributed LLM-based multi-agent systems

A distributed multi-agent system testbed for benchmarking five long-term conversational memory strategies under different network scenarios. See the arXiv paper for the research questions, methodology, statistical Pareto analysis, and full results. This README only covers what the repo contains and how to run it.

Backends compared

| Approach | Library | Storage |
|---|---|---|
| mem0 | mem0 v2.0.2 | Qdrant |
| Graphiti | graphiti-core 0.29 | Neo4j |
| Cognee | cognee 1.0.9 | Neo4j + Qdrant |
| RAG | in-house RagService | Qdrant |
| Full Context | in-house FullContextService | (in-process) |

Loading and retrieval mirror the upstream evaluation harnesses verbatim (mem0ai/memory-benchmarks, getzep/zep-papers). Evaluation uses the LOCOMO benchmark.

Quick start

Prereqs: Docker + Docker Compose, GNU Make, an OpenAI API key, ~12 GB RAM.

cd testbed
cp .env.example .env          # set OPENAI_API_KEY=sk-...
make build
make start

make experiment-test CONV=0   # smoke test
make experiment               # full publishable sweep

On Linux/macOS, sudo bash install.sh from testbed/ installs Make + Docker.

Smoke targets

| Target | Messages | Questions | Duration |
|---|---|---|---|
| make experiment-test-s | 5 | 1 (cat 2) | minutes |
| make experiment-test | 119 | 3 × cats 1-4 | ~hour |
| make experiment-test-l | 199 | 3 × cats 1-4 | hours |

All three targets sweep both network regimes for a single conversation (CONV, default 0), reuse the memory load across regimes (KEEP_STATE=1), and run 1 judge call per answer. To restrict a run to a subset of backends, pass for example BACKENDS="mem0 graphiti".

Make targets

| Command | What it does |
|---|---|
| make build | Build every image from scratch. Aborts if OPENAI_API_KEY is missing. |
| make start | Bring the full stack up. Langfuse pk/sk auto-generated on first run. |
| make stop | Stop containers; volumes preserved. |
| make clean | Stop + drop only memory volumes (qdrant-data, neo4j-data, neo4j-logs). |
| make reset | Stop + drop every named volume. Compose-managed bridges are torn down by stop and re-created on the next start. |
| make experiment | Full sweep: 10 LOCOMO convs × {unconstrained, constrained} × 5 backends × 3-judge majority. |
| make experiment-leg | Single (CONV, MODE) × backends. Knobs: CONV MODE BACKENDS QUESTIONS MESSAGES Q_PER_TYPE QUESTION_TYPES KEEP_STATE. |
| make logs / make ps | Tail logs / list containers. |
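
To get a feel for the size of the full sweep, the sketch below enumerates the (conversation, mode, backend) legs and shows one way a 3-judge majority could be resolved. It is an illustration only: the backend slugs other than mem0 and graphiti, and the function names sweep_legs and majority, are assumptions and not taken from the Makefile or the benchmark service.

```python
from collections import Counter
from itertools import product

CONVS = range(10)                         # 10 LOCOMO conversations
MODES = ("unconstrained", "constrained")  # the two network regimes
# Assumed slugs: only "mem0" and "graphiti" appear verbatim in this README.
BACKENDS = ("mem0", "graphiti", "cognee", "rag", "full_context")


def sweep_legs():
    """Every (conv, mode, backend) combination of the full sweep."""
    return list(product(CONVS, MODES, BACKENDS))


def majority(verdicts):
    """Verdict returned by most judges; with 3 binary verdicts there are no ties."""
    return Counter(verdicts).most_common(1)[0][0]


if __name__ == "__main__":
    print(len(sweep_legs()))               # 10 x 2 x 5 = 100 legs
    print(majority([True, False, True]))   # True
```

A single leg of this product corresponds to what make experiment-leg runs for one (CONV, MODE) pair.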

Network partition

Three compose bridges with a single gateway:

  • edge-net: coordinator, ollama, litellm-edge
  • cloud-net: responder, memory, qdrant, neo4j, litellm-cloud
  • mgmt-net: benchmark, langfuse-* (observability + orchestration, never on the data plane)

toxiproxy is the sole container with a foot in both data subnets. Coordinator is the only caller that goes through it; responder↔memory and memory↔storage are direct on cloud-net.

make build auto-picks a non-colliding /16 from a candidate list (172.30, 172.40, 10.42, 192.168.220, …) and pins the chosen subnets + toxiproxy IPs into .env. Pin manually by uncommenting the block in .env.example.
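
The selection logic can be pictured as "take the first candidate that does not overlap a network already in use". The sketch below shows that idea with Python's standard ipaddress module. It is not the repo's build script: pick_subnet is a hypothetical helper, the candidate list is shortened, and how the in-use networks are discovered (for instance from Docker's network list) is left out.

```python
import ipaddress


def pick_subnet(candidates, in_use):
    """Return the first candidate network that overlaps none of the in-use ones."""
    used = [ipaddress.ip_network(n) for n in in_use]
    for cand in candidates:
        net = ipaddress.ip_network(cand)
        if not any(net.overlaps(u) for u in used):
            return net
    raise RuntimeError("no non-colliding candidate subnet found")


if __name__ == "__main__":
    # Hypothetical inputs: 172.30.0.0/16 collides with an existing /24.
    chosen = pick_subnet(
        ["172.30.0.0/16", "172.40.0.0/16", "10.42.0.0/16"],
        ["172.30.5.0/24"],
    )
    print(chosen)  # 172.40.0.0/16
```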

Network fault injection

unconstrained clears all toxics; constrained applies CONSTRAINED_LATENCY / CONSTRAINED_JITTER / CONSTRAINED_BANDWIDTH (defaults 150 ms / 30 ms / 512 KB/s) to toxiproxy's memory + responder proxies — i.e. only the coordinator↔cloud flows. The bench re-verifies live toxic state mid-run and rejects with HTTP 412 on drift (dmas/benchmark/app/toxics.py).
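
A drift check of this kind boils down to comparing the toxics a proxy currently reports with the set the selected mode should have. The sketch below assumes the live state is the list of toxic objects (each with type and attributes) that toxiproxy's HTTP API returns for a proxy, and uses toxiproxy's latency (latency, jitter in ms) and bandwidth (rate in KB/s) toxics. The helper names are hypothetical and the actual logic in dmas/benchmark/app/toxics.py may differ.

```python
def expected_toxics(mode, latency_ms=150, jitter_ms=30, bandwidth_kbps=512):
    """Toxics a proxy should carry in the given mode, keyed by toxic type."""
    if mode == "unconstrained":
        return {}
    if mode != "constrained":
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "latency": {"latency": latency_ms, "jitter": jitter_ms},
        "bandwidth": {"rate": bandwidth_kbps},
    }


def has_drifted(mode, live_toxics):
    """True if the live toxic list no longer matches what the mode expects."""
    live = {t["type"]: t["attributes"] for t in live_toxics}
    return live != expected_toxics(mode)


if __name__ == "__main__":
    live = [
        {"type": "latency", "attributes": {"latency": 150, "jitter": 30}},
        {"type": "bandwidth", "attributes": {"rate": 512}},
    ]
    print(has_drifted("constrained", live))    # False
    print(has_drifted("unconstrained", live))  # True
```

In the testbed, a positive drift result is what would lead to the HTTP 412 rejection described above.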

Network metering

Cross-boundary traffic is read at toxiproxy's edge-net veth: rx_bytes = edge→cloud (network_edge_to_cloud_bytes), tx_bytes = cloud→edge (network_cloud_to_edge_bytes). One chokepoint, no double-count, no intra-cloud inflation. CPU/RAM/disk are still summed per group=edge|cloud label (dmas/benchmark/app/cgroup_metrics.py).
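
On Linux, per-interface byte counters are exposed under /sys/class/net/<iface>/statistics/, so metering a run amounts to reading rx_bytes and tx_bytes before and after and taking the difference. The sketch below illustrates that under stated assumptions: the interface name is a placeholder, read_counters and traffic_delta are hypothetical helpers, and the repo's own metering code is not shown here.

```python
from pathlib import Path


def read_counters(iface, sysfs="/sys/class/net"):
    """Read the cumulative rx/tx byte counters of one interface."""
    stats = Path(sysfs) / iface / "statistics"
    return {
        name: int((stats / name).read_text())
        for name in ("rx_bytes", "tx_bytes")
    }


def traffic_delta(before, after):
    """Map counter differences onto the two metric names used in the results."""
    return {
        "network_edge_to_cloud_bytes": after["rx_bytes"] - before["rx_bytes"],
        "network_cloud_to_edge_bytes": after["tx_bytes"] - before["tx_bytes"],
    }


if __name__ == "__main__":
    # Hypothetical snapshots taken before and after one experiment leg.
    before = {"rx_bytes": 1_000, "tx_bytes": 5_000}
    after = {"rx_bytes": 4_500, "tx_bytes": 25_000}
    print(traffic_delta(before, after))
```

Because only one interface is sampled, each byte crossing the edge/cloud boundary is counted once.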

Project structure

dmas-memory/
├── paper/                # Manuscript — published on arXiv: 2601.07978
└── testbed/              # Runnable benchmark — `cd testbed` before any `make`
    ├── dmas/
    │   ├── benchmark/        # /experiment endpoint, judges, metrics, CSV writer
    │   ├── coordinator/      # /ask handler with ollama tool-calling
    │   ├── memory/           # mem0 + Graphiti + Cognee + RAG + FullContext
    │   ├── responder/        # Final-answer generator
    │   ├── shared/           # otel_init.py, litellm_usage.py, models.py
    │   ├── litellm/{config-edge,config-cloud}.yaml
    │   └── docker-compose.yml
    ├── experiments/
    │   ├── results.ipynb     # Reproduces every table and chart in the paper
    │   └── results/          # Per-experiment CSVs ({prefix}{backend}_{mode}.csv)
    ├── Makefile
    └── .env.example

Reproducing the paper

experiments/results.ipynb globs every *.csv under experiments/results/ into one DataFrame and rebuilds every table and chart in the paper end-to-end. Each chapter renders a constrained − unconstrained Δ panel underneath its bar chart so the regime impact per backend is always visible.
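
The Δ panel is the per-backend difference between the constrained and unconstrained means of a metric. The sketch below computes that difference with the standard library only, to stay dependency-free; the notebook itself works on a DataFrame. The column names backend, mode and latency_s are assumptions for illustration and may not match the CSV schema.

```python
from collections import defaultdict
from statistics import mean


def regime_delta(rows, metric):
    """mean(metric | constrained) - mean(metric | unconstrained), per backend."""
    values = defaultdict(lambda: defaultdict(list))
    for row in rows:
        values[row["backend"]][row["mode"]].append(float(row[metric]))
    return {
        backend: mean(by_mode["constrained"]) - mean(by_mode["unconstrained"])
        for backend, by_mode in values.items()
        # Skip backends that were run in only one regime.
        if by_mode["constrained"] and by_mode["unconstrained"]
    }


if __name__ == "__main__":
    # Hypothetical rows, shaped like csv.DictReader output.
    rows = [
        {"backend": "mem0", "mode": "unconstrained", "latency_s": "1.0"},
        {"backend": "mem0", "mode": "constrained", "latency_s": "2.5"},
    ]
    print(regime_delta(rows, "latency_s"))  # {'mem0': 1.5}
```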

Citation

@misc{wolff2026costaccuracylongtermmemory,
      title={Cost and accuracy of long-term memory in Distributed Multi-Agent Systems based on Large Language Models},
      author={Benedict Wolff and Jacopo Bennati},
      year={2026},
      eprint={2601.07978},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2601.07978},
}

LOCOMO benchmark dataset:

@article{maharana2024evaluating,
  title={Evaluating very long-term conversational memory of llm agents},
  author={Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei},
  journal={arXiv preprint arXiv:2402.17753},
  year={2024}
}

License

See LICENSE.txt.
