A modular Python framework for building and experimenting with self-evolving LLM agents.
This project started as a research-oriented implementation of ideas from self-evolving agent surveys. It has since been reframed as an engineering-focused agent platform roadmap: a system that can run agents, log trajectories, evolve prompts and memory, learn tools, and grow toward a serviceable, benchmarkable platform.
The current codebase already supports:
- a reusable `BaseAgent` execution loop
- pluggable environments
- episodic memory with distilled lessons
- vector-based episodic memory retrieval with a local embedder fallback
- SQLite persistence for runs, steps, and memory
- a FastAPI service layer for executing runs, benchmarks, and inspection queries
- a benchmark runner that compares agent variants and writes JSON artifacts
- a Streamlit dashboard with a lightweight control plane for runs, memory, and benchmark artifacts
- prompt optimization with an OPRO-style loop
- verbal reflection with retry
- runtime tool generation and registration
- LLM-as-judge reward scoring
- evaluation metrics and basic unit tests
In practical terms, the repo is already a working prototype for:
- running small agent tasks
- recording trajectories
- trying different evolution mechanisms
- comparing simple behaviors across tasks
It is not yet a production system, training framework, or distributed runtime.
```
src/self_evolving/
├── core/
│   ├── agent.py
│   ├── environment.py
│   └── types.py
├── evolution/
│   ├── memory/episodic.py
│   ├── prompt/opro.py
│   └── tools/learner.py
├── mechanisms/
│   ├── reflection/
│   │   ├── base.py
│   │   ├── reflexion.py
│   │   └── self_refine.py
│   └── reward/scorer.py
└── evaluation/metrics.py
```
`src/self_evolving/core/agent.py` - Handles the main loop:
- reset environment
- generate action
- receive feedback
- record trajectory
- optionally use memory and reflection
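The loop above can be sketched roughly as follows. This is a minimal illustration, not the exact `BaseAgent` API; the `Trajectory` shape and the `llm` callable are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    success: bool = False

def run_episode(env, llm, goal, max_steps=5, memory=None):
    """Minimal sketch of the reset -> act -> feedback -> record loop."""
    obs = env.reset()                         # reset environment
    traj = Trajectory()
    for _ in range(max_steps):
        context = f"Goal: {goal}\nObservation: {obs}"
        if memory:                            # optionally prepend distilled lessons
            context = "\n".join(memory) + "\n" + context
        action = llm(context)                 # generate action
        obs, reward, done = env.step(action)  # receive feedback
        traj.steps.append((action, obs, reward))  # record trajectory
        if done:
            traj.success = reward > 0
            break
    return traj
```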
`src/self_evolving/core/environment.py` - Includes:
- `SimpleQAEnvironment`
- `ToolUseEnvironment`
`src/self_evolving/evolution/memory/episodic.py` - Stores lessons distilled from past trajectories.
- Retrieval now uses vector similarity as the primary score.
- Default setup uses a lightweight local hashing embedder so the project works without extra model downloads.
- The embedder backend is pluggable, so it can later be swapped for `sentence-transformers` or an external embedding model.
`src/self_evolving/evolution/memory/embedders.py` - Includes:
- `HashingEmbedder` for zero-extra-dependency local vector retrieval
- `SentenceTransformerEmbedder` as an optional stronger local backend
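A zero-dependency hashing embedder can be surprisingly small. The sketch below is an illustration of the idea, not the exact `HashingEmbedder` implementation: each token is hashed into a fixed-size bucket vector, which is then L2-normalized so cosine similarity is a plain dot product:

```python
import hashlib
import math

def hash_embed(text: str, dim: int = 256) -> list[float]:
    """Map each token into a fixed-size vector via stable hashing, then L2-normalize."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)  # stable across runs
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; inputs are already unit-normalized."""
    return sum(x * y for x, y in zip(a, b))
```

Texts sharing tokens score higher than unrelated ones, which is enough for lesson retrieval without any model download.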
`src/self_evolving/persistence/sqlite_store.py` - Persists:
- runs
- steps
- episodic memory entries
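The persistence layer boils down to a small relational schema. A hypothetical sketch is below; the table and column names are illustrative, not the actual `sqlite_store.py` schema:

```python
import sqlite3

def init_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the minimal tables a run/step/memory store needs."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS runs (
            id INTEGER PRIMARY KEY,
            agent_id TEXT NOT NULL,
            goal TEXT,
            success INTEGER,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS steps (
            id INTEGER PRIMARY KEY,
            run_id INTEGER REFERENCES runs(id),
            idx INTEGER,
            action TEXT,
            observation TEXT,
            reward REAL
        );
        CREATE TABLE IF NOT EXISTS memory_entries (
            id INTEGER PRIMARY KEY,
            agent_id TEXT NOT NULL,
            lesson TEXT,
            embedding BLOB
        );
    """)
    return conn
```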
`src/self_evolving/service/api.py` - Exposes endpoints for:
- health
- run QA task
- list runs
- inspect run detail
- inspect agent memory
`src/self_evolving/evolution/prompt/opro.py` - Maintains prompt history and uses a meta-LLM to propose better prompts.
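An OPRO-style iteration keeps a scored prompt history and asks a meta-LLM for a candidate that should score higher. A minimal sketch, where `meta_llm` and `evaluate` are placeholder callables rather than the module's real API:

```python
def opro_step(history, meta_llm, evaluate):
    """One OPRO-style iteration: show scored prompts, request a better one, score it.

    history:  list of (prompt, score) pairs, mutated in place.
    meta_llm: callable taking a meta-prompt string, returning a candidate prompt.
    evaluate: callable scoring a prompt on the task set (higher is better).
    """
    ranked = sorted(history, key=lambda ps: ps[1])   # worst -> best ordering
    meta_prompt = "Here are prompts with their scores (higher is better):\n"
    meta_prompt += "\n".join(f"score={s:.2f}: {p}" for p, s in ranked)
    meta_prompt += "\nWrite a new prompt that scores higher."
    candidate = meta_llm(meta_prompt)
    history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda ps: ps[1])        # current best prompt
```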
`src/self_evolving/evolution/tools/learner.py` - Generates Python tool functions with the LLM, validates them, and registers them for reuse.
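Validation of generated tool code can start cheap: parse it, check the expected function is defined, and reject imports before ever executing it. This is a hedged sketch of the idea, not the learner's actual checks, and it is a weak screen rather than a sandbox:

```python
import ast

def validate_tool(source: str, expected_name: str) -> bool:
    """Check that generated source parses, defines exactly the expected function,
    and contains no import statements (a weak safety screen, not a sandbox)."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    funcs = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    if len(funcs) != 1 or funcs[0].name != expected_name:
        return False
    return not any(isinstance(n, (ast.Import, ast.ImportFrom)) for n in ast.walk(tree))

def register_tool(source: str, name: str, registry: dict) -> None:
    """Exec validated source in a fresh namespace and store the callable."""
    if not validate_tool(source, name):
        raise ValueError("tool rejected by validation")
    ns: dict = {}
    exec(source, ns)  # NOTE: exec of LLM output is unsafe without a real sandbox
    registry[name] = ns[name]
```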
`src/self_evolving/mechanisms/reflection/reflexion.py` - Adds post-failure reflection and retry behavior.
`src/self_evolving/mechanisms/reward/scorer.py` - Uses an LLM judge to turn outcomes into scalar rewards.
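The core of an LLM-as-judge scorer is prompting for a numeric rating and parsing it defensively. A minimal sketch, assuming a 0-10 rating scale and a `judge_llm` callable (both assumptions, not the scorer's real interface):

```python
import re

def judge_reward(judge_llm, goal: str, answer: str) -> float:
    """Ask a judge LLM for a 0-10 rating and normalize it to [0, 1].
    Falls back to 0.0 when no number can be parsed from the verdict."""
    verdict = judge_llm(
        f"Rate how well this answer achieves the goal on a 0-10 scale.\n"
        f"Goal: {goal}\nAnswer: {answer}\nReply with just the number."
    )
    match = re.search(r"\d+(\.\d+)?", verdict)
    if not match:
        return 0.0
    return min(max(float(match.group()), 0.0), 10.0) / 10.0
```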
`src/self_evolving/evaluation/metrics.py` - Tracks:
- success rate
- evolution gain
- stability
- adaptation speed
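These metrics reduce to simple arithmetic over run outcomes. The definitions below are plausible sketches, not necessarily the ones `metrics.py` uses:

```python
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of successful runs."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def evolution_gain(before: list[bool], after: list[bool]) -> float:
    """Improvement of a variant over its earlier self (one plausible definition)."""
    return success_rate(after) - success_rate(before)

def stability(rewards: list[float]) -> float:
    """1 minus the standard deviation of rewards; closer to 1 means steadier behavior."""
    if len(rewards) < 2:
        return 1.0
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return max(0.0, 1.0 - var ** 0.5)
```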
`src/self_evolving/evaluation/benchmark.py` - Compares:
- baseline
- memory
- reflexion
- prompt optimization
- Writes JSON artifacts for each variant plus a session summary.
`app.py` and `src/self_evolving/dashboard/data.py` - Show:
- recent runs
- run detail
- agent memory
- benchmark sessions and per-variant artifacts
This repository is best described as:
"A working single-process LLM agent experimentation framework with memory, prompt evolution, reflection, tool learning, and evaluation."
That is strong enough for:
- research prototyping
- portfolio demonstration
- algorithmic experimentation
- framework design discussions
That is not yet enough for:
- production serving
- API deployment
- persistent run management
- serious benchmarking
- secure tool execution
- distributed execution
Current engineering gaps:
- no container delivery files
- no CI workflow
- no background job execution for long-running runs and benchmarks
- no safe tool sandbox
- no multi-agent orchestration runtime
- no experiment artifact store
- no observability / tracing layer
Deliberately out of scope for now:
- multi-GPU execution
- distributed runtime
- LoRA / finetuning loop integration
Those can be added later, but the current focus is to make the repo excellent within a single-machine engineering scope first.
The immediate upgrade path is:
- add stronger tests and config loading
- add persistence for runs, memory, prompts, and tools
- upgrade memory from keyword retrieval to vector retrieval
- add a benchmark runner
- harden the FastAPI service and add async job execution
- add a lightweight dashboard and control plane
- add a safer tool sandbox
- add Docker and CI
The detailed execution breakdown is in:

Install:

```bash
git clone https://github.com/MemoryWorld/self-evolving-agents.git
cd self-evolving-agents
pip install -e ".[dev]"
cp .env.example .env
# Edit .env and set your API key
```

Run a basic agent:

```python
from dotenv import load_dotenv
load_dotenv()

from self_evolving.core.agent import BaseAgent
from self_evolving.core.environment import SimpleQAEnvironment

agent = BaseAgent()
env = SimpleQAEnvironment([("What is the capital of France?", "Paris")])
trajectory = agent.run(env, goal="What is the capital of France?")
print("Success:", trajectory.success)
```

Run the API:

```bash
uvicorn self_evolving.service.api:create_app --factory --reload
```

Create a QA run job:
```bash
curl -X POST http://127.0.0.1:8000/runs/qa \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "What is the capital of France?",
    "reference_answer": "Paris",
    "agent_id": "demo-agent",
    "use_memory": true
  }'
```

The API now returns a job record immediately. Poll the job until it reaches `completed`:
```bash
curl http://127.0.0.1:8000/jobs/<job_id>
```

Create a benchmark job:
```bash
curl -X POST http://127.0.0.1:8000/benchmarks/qa \
  -H "Content-Type: application/json" \
  -d '{
    "tasks": [
      {"goal": "What is the capital of France?", "reference_answer": "Paris"},
      {"goal": "What is 12 * 7?", "reference_answer": "84"}
    ],
    "variants": ["baseline", "memory", "reflexion"]
  }'
```

Inspect recent jobs:
```bash
curl http://127.0.0.1:8000/jobs
```

Inspect persisted runs:
```bash
curl http://127.0.0.1:8000/runs
```

Inspect stored memory for one agent:
```bash
curl http://127.0.0.1:8000/agents/demo-agent/memory
```

Run the examples:

```bash
python examples/01_basic_agent.py
python examples/02_memory_evolution.py
python examples/03_reflexion.py
python examples/04_prompt_optimization.py
python examples/05_tool_learning.py
python examples/06_benchmark_runner.py
```

Run the dashboard:

```bash
streamlit run app.py
```

Generate demo data:

```bash
python examples/07_generate_demo_data.py
```

This populates:
- SQLite runs
- agent memory entries
- benchmark JSON artifacts
Use it when you want the dashboard to have data immediately without calling a real external model.
The next practical step after the current background-job control plane is:
- persist job state and execution logs beyond the API process
That would let the system:
- survive API restarts without losing in-flight job history
- retain richer execution traces for debugging and demos
- support retries, cancellation, and queued scheduling more cleanly
- prepare the project for container deployment and CI smoke tests
In other words, the next upgrade is turning the current local control plane into a more durable single-node agent platform.
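A durable job record can start as small as one SQLite-backed table. The sketch below is a hypothetical starting point, not an existing module; the class name, schema, and status strings are all illustrative:

```python
import json
import sqlite3
import uuid

class JobStore:
    """Minimal durable job store: state lives in SQLite rather than process
    memory, so job history survives API restarts."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            " id TEXT PRIMARY KEY, kind TEXT, status TEXT, payload TEXT)"
        )

    def create(self, kind: str, payload: dict) -> str:
        """Record a new queued job and return its id."""
        job_id = str(uuid.uuid4())
        self.conn.execute(
            "INSERT INTO jobs VALUES (?, ?, 'queued', ?)",
            (job_id, kind, json.dumps(payload)),
        )
        return job_id

    def set_status(self, job_id: str, status: str) -> None:
        self.conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))

    def get(self, job_id: str) -> dict:
        row = self.conn.execute(
            "SELECT id, kind, status, payload FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
        return {"id": row[0], "kind": row[1], "status": row[2],
                "payload": json.loads(row[3])}
```

With a file-backed path instead of `:memory:`, the same table supports retries, cancellation, and queued scheduling by adding statuses rather than new infrastructure.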
Run the tests:

```bash
pytest tests/ -v
```

The target outcome for this repo is no longer just:
"Implement the survey ideas."
The target outcome is:
"Turn self-evolving agents into a portfolio-grade LLM systems project with persistent experiments, semantic memory, benchmark automation, API serving, safer tool execution, and engineering-grade delivery."
MIT