RagMemory gives Codex a local memory that survives across turns.
It stores your conversation locally, recalls useful context before a new prompt, and keeps a generated Obsidian mirror so you can inspect what was remembered.
- Remembers useful chat context across Codex sessions.
- Recalls relevant memory automatically through Codex hooks.
- Saves assistant replies after each turn.
- Extracts durable structured memory such as decisions, preferences, configs, code references, and open questions.
- Exports a readable Obsidian vault under .data/obsidian_memory.
- Uses decay-aware retrieval so stale memories naturally appear less often.
- Supports manual tombstone removal for wrong, private, or harmful memory.
Raw memory is not deleted during normal forgetting. Forgetting means old, unused memories rank lower during retrieval.
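One way to picture decay-aware ranking is a half-life applied to the raw relevance score. The sketch below is only an illustration: the function name, the 30-day half-life, and the use of "days since last use" are assumptions, not RagMemory's actual scoring code.

```python
def decayed_score(relevance: float, days_since_last_use: float,
                  half_life_days: float = 30.0) -> float:
    """Scale a relevance score down as a memory goes unused.

    Hypothetical formula: the score halves every `half_life_days`.
    Nothing is deleted; an old memory simply ranks lower.
    """
    decay = 0.5 ** (days_since_last_use / half_life_days)
    return relevance * decay


# Two memories with the same raw relevance: the stale one ranks lower.
fresh = decayed_score(0.8, days_since_last_use=1)
stale = decayed_score(0.8, days_since_last_use=90)
ranked = sorted([("fresh", fresh), ("stale", stale)],
                key=lambda item: item[1], reverse=True)
```

Because the decay only rescales the score, a stale memory can still surface when nothing fresher matches the query.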
From the repo root:
uv venv
uv pip install -e .
Create your local settings file:
Copy-Item .\ragmemory.example.ini .\ragmemory.local.ini
Edit ragmemory.local.ini and add your API key:
[structured_memory]
api_key = your-nvidia-api-key
model = minimaxai/minimax-m2.7
max_chars = 6000
max_tokens = 900
[llm]
structured_provider = nvidia
compact_provider = nvidia
[llm.nvidia]
api_key = your-nvidia-api-key
base_url = https://integrate.api.nvidia.com/v1
model = minimaxai/minimax-m2.7
api_style = openai_chat
[embedding]
provider = chroma_default
[compact]
enable = true
model = minimaxai/minimax-m2.7
min_chars = 1500
max_chars = 30000
max_tokens = 1200
target_ratio = 0.35
mode = background
ragmemory.local.ini is ignored by git. Do not commit it.
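Before starting the worker, it can help to confirm that no placeholder key is left in the file. The sketch below uses Python's standard configparser and the section and key names from the example above. The helper name `missing_api_keys` and the sample key value are made up for illustration, and whether RagMemory itself reads the file with configparser is an assumption.

```python
import configparser

# The example settings use placeholder values such as "your-nvidia-api-key".
PLACEHOLDER_PREFIX = "your-"


def missing_api_keys(ini_text: str) -> list[str]:
    """Return the sections whose api_key is empty or still a placeholder."""
    # interpolation=None keeps "%" characters in values from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    problems = []
    for section in parser.sections():
        if not parser.has_option(section, "api_key"):
            continue  # sections such as [embedding] carry no key
        value = parser.get(section, "api_key").strip()
        if not value or value.startswith(PLACEHOLDER_PREFIX):
            problems.append(section)
    return problems


# Hypothetical file contents; "nvapi-example" is a fake key for the demo.
sample = """
[structured_memory]
api_key = your-nvidia-api-key

[llm.nvidia]
api_key = nvapi-example
base_url = https://integrate.api.nvidia.com/v1

[embedding]
provider = chroma_default
"""
unfilled = missing_api_keys(sample)
```

To check the real file, pass the text of ragmemory.local.ini instead of `sample`.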
RagMemory can use different LLM providers for structured extraction and message compaction. Keep structured extraction on NVIDIA first, and try OpenCode Go for compaction:
[llm]
structured_provider = nvidia
compact_provider = opencode_go
[llm.opencode_go]
api_key = your-opencode-go-api-key
base_url = https://opencode.ai/zen/go/v1
model = deepseek-v4-flash
api_style = openai_chat
thinking = disabled
[compact]
enable = true
max_tokens = 4096
Smoke-test a provider without writing to the DB:
uv run python scripts/test_llm_provider.py --provider opencode_go
Only openai_chat providers are supported for now. OpenCode Go models that use
/messages need a separate adapter later.
RagMemory uses local embeddings for Chroma vector search, then combines those results with BM25 keyword search. The default is still Chroma's built-in embedding function because it is the easiest clean install.
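This README does not say how the vector and BM25 result lists are merged. Reciprocal rank fusion is one common way to combine two rankings, shown here as an assumption rather than RagMemory's actual method. The function name, the constant `k = 60`, and the message IDs are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists into one.

    Each list contributes 1 / (k + rank) for every ID it contains, so an ID
    that ranks well in both the vector and keyword lists rises to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score, highest first; break ties by ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["m3", "m1", "m7"]   # hypothetical Chroma vector results
bm25_hits = ["m1", "m9", "m3"]     # hypothetical BM25 keyword results
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Rank-based fusion avoids having to put cosine similarities and BM25 scores on the same scale.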
For better recall on a local Windows machine, try the small BGE model:
[embedding]
provider = sentence_transformers
model = BAAI/bge-small-en-v1.5
device = cpu
normalize_embeddings = true
Model tradeoffs:
all-MiniLM-L6-v2         fastest and lightest
BAAI/bge-small-en-v1.5   better recall, still CPU-friendly
BAAI/bge-m3              better multilingual recall, too heavy for many CPUs
After changing the embedding model, rebuild both chat and structured indexes:
uv run python scripts/rebuild_memory_index.py --db-path ./.data/chroma_db
RagMemory uses model-specific Chroma collection names for non-default embedding models, so old vectors are not mixed with new vectors.
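A model-specific collection name could be derived roughly as follows. The naming scheme, the function name, and the base name `chat_memory` are assumptions for illustration; the real names RagMemory generates may differ. Chroma restricts which characters a collection name may contain, so the sketch reduces the model name to lowercase letters, digits, and hyphens.

```python
import re
from typing import Optional


def collection_name(base: str, provider: str, model: Optional[str] = None) -> str:
    """Return a collection name that is unique per embedding model.

    Hypothetical scheme: the default provider keeps the plain base name,
    and any other model appends a sanitized slug of its name.
    """
    if provider == "chroma_default" or not model:
        return base
    slug = re.sub(r"[^a-z0-9]+", "-", model.lower()).strip("-")
    return f"{base}_{slug}"


default_name = collection_name("chat_memory", "chroma_default")
bge_name = collection_name("chat_memory", "sentence_transformers",
                           "BAAI/bge-small-en-v1.5")
```

Since each model writes to its own collection, switching models never compares vectors of different dimensions or embedding spaces.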
Install the Codex hooks from:
ragm_mcp/hooks/README.md
After the hooks are installed:
- UserPromptSubmit recalls memory and injects it into Codex.
- UserPromptSubmit saves your user prompt.
- Stop saves the assistant response.
- Stop refreshes the Obsidian mirror.
- Long-running structured extraction and compaction jobs are handled by the
manual worker:
uv run python scripts/run_worker.py
On Windows you can also start the worker by double-clicking:
start_worker.bat
Keep the worker running while chatting. Stop it before a large manual
compact_backfill.py run if you want to avoid duplicate API calls or NVIDIA
rate limits.
You should see injected context like this at the start of a turn:
=== RagMemory Context ===
...
=== End RagMemory Context ===
If this context takes too many tokens, tune the hook recall size in
ragmemory.local.ini:
[recall]
context_token_budget = 900
retrieve_top_k = 3
structured_top_k = 2
recent_messages = 4
include_recent = true
include_structured = true
MCP is optional when hooks are installed.
Recommended local MCP settings:
[mcp.tools]
enable_recall = false
enable_save = false
enable_tombstone = true
With this split:
- Hooks own automatic recall and save.
- MCP recall is disabled to avoid duplicate token usage.
- MCP save is disabled to avoid duplicate writes.
- MCP remains useful for memory_stats, remove_memory_preview, and remove_memory_confirm.
See the MCP details in:
ragm_mcp/README.md
Inspect recent events:
uv run python scripts/inspect_events.py --db-path ./.data/chroma_db --limit 20
Inspect structured-object add events:
uv run python scripts/inspect_events.py --event structured_object_added
Check that the Obsidian graph export stays clean:
uv run python scripts/check_obsidian_graph.py
Create an animated GIF that shows the map forming over time:
uv run python scripts/animate_obsidian_graph.py --obsidian ./.data/obsidian_memory --output ./.data/graph_animation/ragmemory-map-formed.gif
By default this renders every memory graph node, including raw message notes.
To include every Markdown note Obsidian can see, add --include-navigation. For a
smaller explainer GIF, add --exclude-messages --max-nodes 900.
If the graph shows isolated message dots, hide raw message notes that have no structured links with this Obsidian graph filter:
-["cssclasses":"memory-unlinked"]
File/path hubs are disabled by default because they can clutter the graph. Turn them on only if you want file-level nodes:
[obsidian.files]
enable = true
Export the Obsidian mirror manually:
uv run python scripts/export_obsidian.py --db-path ./.data/chroma_db --output ./.data/obsidian_memory
Open this folder in Obsidian:
.data/obsidian_memory
The normal Obsidian export keeps leaf topic hubs under topics/. If the graph
gets too dense, run the topic regroup step to add an upper layer under
topic_groups/:
uv run python scripts/regroup_topics.py --run --db-path ./.data/chroma_db
uv run python scripts/export_obsidian.py --db-path ./.data/chroma_db --output ./.data/obsidian_memory --config ragmemory.local.ini
This writes .data/chroma_db/topic_taxonomy.json. The regroup step does not
delete or rewrite the existing leaf topics; it asks the LLM to create group
notes that link to related leaf topics.
Configure the token budget in ragmemory.local.ini:
[topic_regroup]
enable = true
max_input_topics = 150
min_groups = 10
max_tokens = 6000
thinking = disabled
max_input_topics limits how many leaf-topic summaries are sent to the LLM for
grouping. Topics outside that limit stay in topics/; they are just ungrouped
for that run.
To queue the same work for the worker instead of running it immediately:
uv run python scripts/regroup_topics.py --queue --db-path ./.data/chroma_db
uv run python scripts/run_worker.py --once --db-path ./.data/chroma_db
Generate non-LLM wiki pages from the current graph:
uv run python scripts/generate_wiki.py --obsidian ./.data/obsidian_memory
Add cached LLM summaries one page at a time:
uv run python scripts/generate_wiki.py --obsidian ./.data/obsidian_memory --config ragmemory.local.ini --llm --llm-limit 1
Compacted messages may contain evidence references like:
evidence[text:874b3468c9f7]
These are deterministic pointers to exact structured evidence blocks. In the
Obsidian mirror, message notes include an Evidence References section that
links each marker to the matching structured note. Structured notes also expose
content_hash and evidence_ref in frontmatter so the hash is searchable.
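If you want to pull these markers out of compacted text yourself, a regular expression is enough. The marker shape (a kind, a colon, and 12 lowercase hex characters) is taken from the example above. Deriving the hash from a truncated SHA-256 of the evidence text is an assumption made for this sketch, as are the function names; the README does not state which hash RagMemory uses.

```python
import hashlib
import re

# Matches markers such as evidence[text:874b3468c9f7].
EVIDENCE_RE = re.compile(r"evidence\[(?P<kind>[a-z_]+):(?P<hash>[0-9a-f]{12})\]")


def find_evidence_refs(compact_text: str) -> list[tuple[str, str]]:
    """Return (kind, hash) pairs for every evidence marker in the text."""
    return [(m["kind"], m["hash"]) for m in EVIDENCE_RE.finditer(compact_text)]


def evidence_ref(kind: str, block: str) -> str:
    """Build a deterministic marker for an evidence block.

    Assumption: the hash is the first 12 hex characters of the SHA-256 of
    the stripped block, so the same text always yields the same marker.
    """
    digest = hashlib.sha256(block.strip().encode("utf-8")).hexdigest()[:12]
    return f"evidence[{kind}:{digest}]"


refs = find_evidence_refs("Chose BGE evidence[text:874b3468c9f7] for recall.")
```

The extracted hash can then be matched against the content_hash or evidence_ref frontmatter of a structured note.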
Use removal only for memory that is wrong, private, harmful, or should not be used again.
Preview recent records:
uv run python scripts/remove_memory.py --recent 20
Search for a bad record:
uv run python scripts/remove_memory.py --search "wrong remembered detail"
Preview specific message IDs:
uv run python scripts/remove_memory.py --message-ids 12,13
Confirm tombstone removal:
uv run python scripts/remove_memory.py --message-ids 12,13 --confirm
This is tombstone-only. It hides records from retrieval and moves them to
forgotten/ in the Obsidian mirror. It does not hard-delete raw storage.
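The difference between a tombstone and a hard delete can be sketched in a few lines. The `MemoryRecord` class, its `tombstoned` flag, and both helper functions are hypothetical; they model the behavior described above, not RagMemory's storage schema.

```python
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    message_id: int
    text: str
    tombstoned: bool = False  # hypothetical flag; the raw text stays stored


def tombstone(records: list[MemoryRecord], message_ids: set[int]) -> list[int]:
    """Mark matching records as tombstoned and return the IDs that changed."""
    changed = []
    for record in records:
        if record.message_id in message_ids and not record.tombstoned:
            record.tombstoned = True
            changed.append(record.message_id)
    return changed


def retrievable(records: list[MemoryRecord]) -> list[MemoryRecord]:
    """Return only the records that retrieval is allowed to see."""
    return [record for record in records if not record.tombstoned]


records = [MemoryRecord(11, "keep me"), MemoryRecord(12, "wrong detail"),
           MemoryRecord(13, "private detail")]
changed = tombstone(records, {12, 13})
visible_ids = [record.message_id for record in retrievable(records)]
```

The tombstoned records are still in `records` afterwards; they are only filtered out at retrieval time.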
Back up the active DB folder:
Copy-Item -Recurse .\.data\chroma_db .\.data\backup-chroma_db
The DB folder contains:
state.sqlite
chroma.sqlite3
structured_memory.jsonl
ledger.json
events.jsonl
Raw messages stay as the source of truth. Compaction writes a smaller
compact_text beside the raw message in SQLite:
messages.text             raw audit log
messages.compact_text     compact retrieval/export text
messages.compact_status   ok | failed | skipped_short | too_long
Retrieval and Obsidian prefer compact_text only when
compact_status = ok; otherwise they fall back to raw text.
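The fallback rule can be sketched against an in-memory SQLite table. The three column names come from the list above; the rest of the table definition, the function name, and the extra guard against an empty compact_text are assumptions for this example.

```python
import sqlite3
from typing import Optional


def retrieval_text(text: str, compact_text: Optional[str],
                   compact_status: Optional[str]) -> str:
    """Prefer compact_text only when compaction succeeded; else use raw text."""
    if compact_status == "ok" and compact_text:
        return compact_text
    return text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, text TEXT, compact_text TEXT, compact_status TEXT)"
)
conn.executemany(
    "INSERT INTO messages (text, compact_text, compact_status) VALUES (?, ?, ?)",
    [
        ("long raw message A", "short A", "ok"),
        ("long raw message B", None, "failed"),
        ("hi", None, "skipped_short"),
        ("raw message D", None, None),  # not compacted yet
    ],
)
rows = conn.execute(
    "SELECT text, compact_text, compact_status FROM messages ORDER BY id"
).fetchall()
chosen = [retrieval_text(*row) for row in rows]
conn.close()
```

Only the first row uses its compact text; every other status falls back to the raw message.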
Run the worker for new messages:
uv run python scripts/run_worker.py
Backfill old messages manually:
uv run python scripts/compact_backfill.py --limit 20
If NVIDIA returns 429 Too Many Requests, wait a minute and retry with a
smaller limit. Stop the worker during large backfills to reduce overlap.
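If you script your own backfill loop, the "wait and retry" advice can be automated with a small backoff helper. This is a generic sketch, not part of RagMemory: the `RateLimited` exception, the function name, and the doubling delay are assumptions, and the caller is expected to raise `RateLimited` when the provider answers with HTTP 429.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error the caller raises when the provider returns HTTP 429."""


def call_with_backoff(call: Callable[[], T], retries: int = 3,
                      base_delay: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, waiting 60 s, 120 s, 240 s, ... after each rate-limit error."""
    for attempt in range(retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == retries:
                raise  # out of retries; surface the error
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


# Demo with a fake provider call that fails twice, then succeeds.
delays: list[float] = []
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited()
    return "compacted"


# Passing delays.append as `sleep` records the waits instead of sleeping.
result = call_with_backoff(flaky_call, sleep=delays.append)
```

With the default `sleep`, the same call would pause for one minute and then two minutes before succeeding.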
Reasoning-heavy providers may spend output tokens on internal reasoning before returning final compact text. For those providers, raise the compact output limit:
[compact]
max_tokens = 4096
After changing compaction behavior, rebuild the retrieval index:
uv run python scripts/rebuild_memory_index.py
Run the small tracked smoke benchmark:
uv run python scripts/benchmark_retrieval.py
For a more realistic local benchmark, generate cases from your current
structured memory. This writes under .data/, so private project details stay
out of git:
uv run python scripts/make_benchmark_cases.py --db-path ./.data/chroma_db --output ./.data/bench_retrieval/real_cases.json --limit 30
Compare embedding models against the same local cases:
uv run python scripts/benchmark_retrieval.py --cases ./.data/bench_retrieval/real_cases.json --source-db ./.data/chroma_db --embedding-provider sentence_transformers --embedding-model BAAI/bge-small-en-v1.5 --bench-db ./.data/bench_retrieval/real_bge_small
uv run python scripts/benchmark_retrieval.py --cases ./.data/bench_retrieval/real_cases.json --source-db ./.data/chroma_db --embedding-provider sentence_transformers --embedding-model all-MiniLM-L6-v2 --bench-db ./.data/bench_retrieval/real_minilm
uv run python scripts/benchmark_retrieval.py --cases ./.data/bench_retrieval/real_cases.json --source-db ./.data/chroma_db --embedding-provider chroma_default --embedding-model='' --bench-db ./.data/bench_retrieval/real_chroma_default
The local benchmark reports recall@5, recall@10, MRR, rebuild time, and
query latency. Treat generated cases as a starting point; hand-written
paraphrase cases are still better for final model decisions.
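For reference, recall@k and MRR can be computed as below. The sketch assumes each benchmark case has exactly one expected memory ID, which may not match how benchmark_retrieval.py scores its cases; the function names and sample IDs are illustrative.

```python
def recall_at_k(results: list[list[str]], expected: list[str], k: int) -> float:
    """Fraction of cases whose expected ID appears in the top k results."""
    hits = sum(1 for ranked, target in zip(results, expected)
               if target in ranked[:k])
    return hits / len(expected)


def mean_reciprocal_rank(results: list[list[str]], expected: list[str]) -> float:
    """Average of 1 / rank of the expected ID; a miss contributes 0."""
    total = 0.0
    for ranked, target in zip(results, expected):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(expected)


# Three hypothetical cases: a hit at rank 1, a hit at rank 3, and a miss.
results = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
expected = ["a", "z", "missing"]

recall_2 = recall_at_k(results, expected, k=2)   # 1 of 3 cases
recall_3 = recall_at_k(results, expected, k=3)   # 2 of 3 cases
mrr = mean_reciprocal_rank(results, expected)    # (1 + 1/3 + 0) / 3
```

Recall@k only records whether the right memory made the cut, while MRR also rewards placing it near the top.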
Latest local run on 30 generated real-memory cases:
model                    recall@5  recall@10  MRR    p50 latency
BAAI/bge-small-en-v1.5   96.7%     96.7%      0.894  85.9 ms
all-MiniLM-L6-v2         90.0%     93.3%      0.818  62.8 ms
chroma_default           90.0%     93.3%      0.818  582.3 ms
This result is machine- and corpus-specific. Re-run the benchmark after major memory, embedding, chunking, or retrieval changes.
The implementation details, architecture, scripts, and tests live in:
docs/technical.md
For a copy-paste command cheat sheet, see:
docs/commands.md
