A modular Python framework for building and experimenting with self-evolving LLM agents.
This project started as a research-oriented implementation of ideas from self-evolving agent surveys. It has since been reframed as an engineering-focused agent platform roadmap: a system that can run agents, log trajectories, evolve prompts and memory, learn tools, and grow toward a serviceable, benchmarkable platform.
The current codebase already supports:
- a reusable `BaseAgent` execution loop
- pluggable environments
- episodic memory with distilled lessons
- vector-based episodic memory retrieval with a local embedder fallback
- SQLite persistence for runs, steps, and memory
- a FastAPI service layer for executing runs, benchmarks, and inspection queries
- a benchmark runner that compares agent variants and writes JSON artifacts
- a Streamlit dashboard with a lightweight control plane for runs, memory, and benchmark artifacts
- prompt optimization with an OPRO-style loop
- verbal reflection with retry
- runtime tool generation and registration
- LLM-as-judge reward scoring
- evaluation metrics and basic unit tests
In practical terms, the repo is already a working prototype for:
- running small agent tasks
- recording trajectories
- trying different evolution mechanisms
- comparing simple behaviors across tasks
It is not yet a production system, training framework, or distributed runtime.
```
src/self_evolving/
├── core/
│   ├── agent.py
│   ├── environment.py
│   └── types.py
├── evolution/
│   ├── memory/episodic.py
│   ├── prompt/opro.py
│   └── tools/learner.py
├── mechanisms/
│   ├── reflection/
│   │   ├── base.py
│   │   ├── reflexion.py
│   │   └── self_refine.py
│   └── reward/scorer.py
└── evaluation/metrics.py
```
`src/self_evolving/core/agent.py` - Handles the main loop:
- reset environment
- generate action
- receive feedback
- record trajectory
- optionally use memory and reflection
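The loop above can be sketched roughly as follows. This is a minimal illustration, not the exact `BaseAgent` API; the `Trajectory` shape and the `llm` callable are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    success: bool = False

def run_episode(env, llm, goal, max_steps=5, memory=None):
    """Minimal sketch of the reset -> act -> feedback -> record loop."""
    obs = env.reset()                         # reset environment
    traj = Trajectory()
    for _ in range(max_steps):
        context = f"Goal: {goal}\nObservation: {obs}"
        if memory:                            # optionally prepend distilled lessons
            context = "\n".join(memory) + "\n" + context
        action = llm(context)                 # generate action
        obs, reward, done = env.step(action)  # receive feedback
        traj.steps.append((action, obs, reward))  # record trajectory
        if done:
            traj.success = reward > 0
            break
    return traj
```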
`src/self_evolving/core/environment.py` - Includes:
- `SimpleQAEnvironment`
- `ToolUseEnvironment`
`src/self_evolving/evolution/memory/episodic.py` - Stores lessons distilled from past trajectories.
- Retrieval now uses vector similarity as the primary score.
- Default setup uses a lightweight local hashing embedder so the project works without extra model downloads.
- The embedder backend is pluggable, so it can later be swapped for `sentence-transformers` or an external embedding model.
`src/self_evolving/evolution/memory/embedders.py` - Includes:
- `HashingEmbedder` for zero-extra-dependency local vector retrieval
- `SentenceTransformerEmbedder` as an optional stronger local backend
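A zero-dependency hashing embedder can be surprisingly small. The sketch below is an illustration of the idea, not the exact `HashingEmbedder` implementation: each token is hashed into a fixed-size bucket vector, which is then L2-normalized so cosine similarity is a plain dot product:

```python
import hashlib
import math

def hash_embed(text: str, dim: int = 256) -> list[float]:
    """Map each token into a fixed-size vector via stable hashing, then L2-normalize."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)  # stable across runs
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; inputs are already unit-normalized."""
    return sum(x * y for x, y in zip(a, b))
```

Texts sharing tokens score higher than unrelated ones, which is enough for lesson retrieval without any model download.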
`src/self_evolving/persistence/sqlite_store.py` - Persists:
- runs
- steps
- episodic memory entries
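The persistence layer boils down to a small relational schema. A hypothetical sketch is below; the table and column names are illustrative, not the actual `sqlite_store.py` schema:

```python
import sqlite3

def init_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the minimal tables a run/step/memory store needs."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS runs (
            id INTEGER PRIMARY KEY,
            agent_id TEXT NOT NULL,
            goal TEXT,
            success INTEGER,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS steps (
            id INTEGER PRIMARY KEY,
            run_id INTEGER REFERENCES runs(id),
            idx INTEGER,
            action TEXT,
            observation TEXT,
            reward REAL
        );
        CREATE TABLE IF NOT EXISTS memory_entries (
            id INTEGER PRIMARY KEY,
            agent_id TEXT NOT NULL,
            lesson TEXT,
            embedding BLOB
        );
    """)
    return conn
```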
`src/self_evolving/service/api.py` - Exposes endpoints for:
- health
- run QA task
- list runs
- inspect run detail
- inspect agent memory
`src/self_evolving/evolution/prompt/opro.py` - Maintains prompt history and uses a meta-LLM to propose better prompts.
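An OPRO-style iteration keeps a scored prompt history and asks a meta-LLM for a candidate that should score higher. A minimal sketch, where `meta_llm` and `evaluate` are placeholder callables rather than the module's real API:

```python
def opro_step(history, meta_llm, evaluate):
    """One OPRO-style iteration: show scored prompts, request a better one, score it.

    history:  list of (prompt, score) pairs, mutated in place.
    meta_llm: callable taking a meta-prompt string, returning a candidate prompt.
    evaluate: callable scoring a prompt on the task set (higher is better).
    """
    ranked = sorted(history, key=lambda ps: ps[1])   # worst -> best ordering
    meta_prompt = "Here are prompts with their scores (higher is better):\n"
    meta_prompt += "\n".join(f"score={s:.2f}: {p}" for p, s in ranked)
    meta_prompt += "\nWrite a new prompt that scores higher."
    candidate = meta_llm(meta_prompt)
    history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda ps: ps[1])        # current best prompt
```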
`src/self_evolving/evolution/tools/learner.py` - Generates Python tool functions with the LLM, validates them, and registers them for reuse.
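Validation of generated tool code can start cheap: parse it, check the expected function is defined, and reject imports before ever executing it. This is a hedged sketch of the idea, not the learner's actual checks, and it is a weak screen rather than a sandbox:

```python
import ast

def validate_tool(source: str, expected_name: str) -> bool:
    """Check that generated source parses, defines exactly the expected function,
    and contains no import statements (a weak safety screen, not a sandbox)."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    funcs = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    if len(funcs) != 1 or funcs[0].name != expected_name:
        return False
    return not any(isinstance(n, (ast.Import, ast.ImportFrom)) for n in ast.walk(tree))

def register_tool(source: str, name: str, registry: dict) -> None:
    """Exec validated source in a fresh namespace and store the callable."""
    if not validate_tool(source, name):
        raise ValueError("tool rejected by validation")
    ns: dict = {}
    exec(source, ns)  # NOTE: exec of LLM output is unsafe without a real sandbox
    registry[name] = ns[name]
```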
`src/self_evolving/mechanisms/reflection/reflexion.py` - Adds post-failure reflection and retry behavior.
`src/self_evolving/mechanisms/reward/scorer.py` - Uses an LLM judge to turn outcomes into scalar rewards.
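The core of an LLM-as-judge scorer is prompting for a numeric rating and parsing it defensively. A minimal sketch, assuming a 0-10 rating scale and a `judge_llm` callable (both assumptions, not the scorer's real interface):

```python
import re

def judge_reward(judge_llm, goal: str, answer: str) -> float:
    """Ask a judge LLM for a 0-10 rating and normalize it to [0, 1].
    Falls back to 0.0 when no number can be parsed from the verdict."""
    verdict = judge_llm(
        f"Rate how well this answer achieves the goal on a 0-10 scale.\n"
        f"Goal: {goal}\nAnswer: {answer}\nReply with just the number."
    )
    match = re.search(r"\d+(\.\d+)?", verdict)
    if not match:
        return 0.0
    return min(max(float(match.group()), 0.0), 10.0) / 10.0
```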
`src/self_evolving/evaluation/metrics.py` - Tracks:
- success rate
- evolution gain
- stability
- adaptation speed
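These metrics reduce to simple arithmetic over run outcomes. The definitions below are plausible sketches, not necessarily the ones `metrics.py` uses:

```python
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of successful runs."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def evolution_gain(before: list[bool], after: list[bool]) -> float:
    """Improvement of a variant over its earlier self (one plausible definition)."""
    return success_rate(after) - success_rate(before)

def stability(rewards: list[float]) -> float:
    """1 minus the standard deviation of rewards; closer to 1 means steadier behavior."""
    if len(rewards) < 2:
        return 1.0
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return max(0.0, 1.0 - var ** 0.5)
```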
`src/self_evolving/evaluation/benchmark.py` - Compares:
- baseline
- memory
- reflexion
- prompt optimization
- Writes JSON artifacts for each variant plus a session summary.
`app.py` and `src/self_evolving/dashboard/data.py` - Show:
- recent runs
- run detail
- agent memory
- benchmark sessions and per-variant artifacts
This repository is best described as:
"A working single-process LLM agent experimentation framework with memory, prompt evolution, reflection, tool learning, and evaluation."
That is strong enough for:
- research prototyping
- portfolio demonstration
- algorithmic experimentation
- framework design discussions
That is not yet enough for:
- production serving
- API deployment
- persistent run management
- serious benchmarking
- secure tool execution
- distributed execution
Current engineering gaps:
- no container delivery files
- no CI workflow
- no background job execution for long-running runs and benchmarks
- no safe tool sandbox
- no multi-agent orchestration runtime
- no experiment artifact store
- no observability / tracing layer
Deliberately out of scope for now:
- multi-GPU execution
- distributed runtime
- LoRA / finetuning loop integration
Those can be added later, but the current focus is to make the repo excellent within a single-machine engineering scope first.
The immediate upgrade path is:
- add stronger tests and config loading
- add persistence for runs, memory, prompts, and tools
- upgrade memory from keyword retrieval to vector retrieval
- add a benchmark runner
- harden the FastAPI service and add async job execution
- add a lightweight dashboard and control plane
- add a safer tool sandbox
- add Docker and CI
The detailed execution breakdown is in:

Install:

```bash
git clone https://github.com/MemoryWorld/self-evolving-agents.git
cd self-evolving-agents
pip install -e ".[dev]"
cp .env.example .env
# Edit .env and set your API key
```

Run a basic agent:

```python
from dotenv import load_dotenv
load_dotenv()

from self_evolving.core.agent import BaseAgent
from self_evolving.core.environment import SimpleQAEnvironment

agent = BaseAgent()
env = SimpleQAEnvironment([("What is the capital of France?", "Paris")])
trajectory = agent.run(env, goal="What is the capital of France?")
print("Success:", trajectory.success)
```

Run the API:

```bash
uvicorn self_evolving.service.api:create_app --factory --reload
```

Create a QA run job:
```bash
curl -X POST http://127.0.0.1:8000/runs/qa \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "What is the capital of France?",
    "reference_answer": "Paris",
    "agent_id": "demo-agent",
    "use_memory": true
  }'
```

The API now returns a job record immediately. Poll the job until it reaches `completed`:
```bash
curl http://127.0.0.1:8000/jobs/<job_id>
```

Create a benchmark job:
```bash
curl -X POST http://127.0.0.1:8000/benchmarks/qa \
  -H "Content-Type: application/json" \
  -d '{
    "tasks": [
      {"goal": "What is the capital of France?", "reference_answer": "Paris"},
      {"goal": "What is 12 * 7?", "reference_answer": "84"}
    ],
    "variants": ["baseline", "memory", "reflexion"]
  }'
```

Inspect recent jobs:
```bash
curl http://127.0.0.1:8000/jobs
```

Inspect persisted runs:
```bash
curl http://127.0.0.1:8000/runs
```

Inspect stored memory for one agent:
```bash
curl http://127.0.0.1:8000/agents/demo-agent/memory
```

Run the examples:

```bash
python examples/01_basic_agent.py
python examples/02_memory_evolution.py
python examples/03_reflexion.py
python examples/04_prompt_optimization.py
python examples/05_tool_learning.py
python examples/06_benchmark_runner.py
```

Run the dashboard:

```bash
streamlit run app.py
```

Generate demo data:

```bash
python examples/07_generate_demo_data.py
```

This populates:
- SQLite runs
- agent memory entries
- benchmark JSON artifacts
Use it when you want the dashboard to have data immediately without calling a real external model.
The next practical step after the current background-job control plane is:
- persist job state and execution logs beyond the API process
That would let the system:
- survive API restarts without losing in-flight job history
- retain richer execution traces for debugging and demos
- support retries, cancellation, and queued scheduling more cleanly
- prepare the project for container deployment and CI smoke tests
In other words, the next upgrade is turning the current local control plane into a more durable single-node agent platform.
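A durable job record can start as small as one SQLite-backed table. The sketch below is a hypothetical starting point, not an existing module; the class name, schema, and status strings are all illustrative:

```python
import json
import sqlite3
import uuid

class JobStore:
    """Minimal durable job store: state lives in SQLite rather than process
    memory, so job history survives API restarts."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            " id TEXT PRIMARY KEY, kind TEXT, status TEXT, payload TEXT)"
        )

    def create(self, kind: str, payload: dict) -> str:
        """Record a new queued job and return its id."""
        job_id = str(uuid.uuid4())
        self.conn.execute(
            "INSERT INTO jobs VALUES (?, ?, 'queued', ?)",
            (job_id, kind, json.dumps(payload)),
        )
        return job_id

    def set_status(self, job_id: str, status: str) -> None:
        self.conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))

    def get(self, job_id: str) -> dict:
        row = self.conn.execute(
            "SELECT id, kind, status, payload FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
        return {"id": row[0], "kind": row[1], "status": row[2],
                "payload": json.loads(row[3])}
```

With a file-backed path instead of `:memory:`, the same table supports retries, cancellation, and queued scheduling by adding statuses rather than new infrastructure.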
Run the tests:

```bash
pytest tests/ -v
```

The target outcome for this repo is no longer just:
"Implement the survey ideas."
The target outcome is:
"Turn self-evolving agents into a portfolio-grade LLM systems project with persistent experiments, semantic memory, benchmark automation, API serving, safer tool execution, and engineering-grade delivery."
MIT