LLM-based Python code summarization with AST-aware evaluation.
This project fine-tunes small code LLMs (1-3B parameters) via LoRA to generate docstrings for Python functions, and evaluates them using an AST-aware benchmark that tests structural understanding beyond surface-level text metrics.
```
Seed Dataset (C2NL, 92k examples)
              |
              v
[convert_seed.py] --> HuggingFace Dataset
              |
              v
[expand_with_distilabel.py] --> Expanded Dataset (teacher LLM generates more examples)
              |
              v
[train_lora.py] --> LoRA-adapted Code LLM
              |
              v
[serve.py] --> FastAPI Inference Server (localhost:8000)
              |
              v
VS Code Extension (calls /generate endpoint)
```

Evaluation runs independently via the AST-aware benchmark:

```
Test Dataset + Model Predictions --> [benchmark.py] --> Metrics Report
                                          |
                   Standard (BLEU, ROUGE) + AST-aware metrics
```
- `convert_seed.py` - Converts the C2NL parallel-file dataset (`code.original` + `javadoc.original`) into HuggingFace instruction-tuning format. Applies heuristic detokenization to make code readable for LLMs.
- `expand_with_distilabel.py` - Uses distilabel to expand the seed dataset by sending code to a teacher LLM for higher-quality docstring generation.
- `train_lora.py` - LoRA fine-tuning using HuggingFace Trainer + PEFT. Supports QLoRA (4-bit quantization) for training on 1-2 A100 GPUs.
- `serve.py` - FastAPI inference server that uses the ollama API to generate docstrings. Supports multiple Qwen Coder models with model-specific configurations.
- `models.py` - Model configuration registry with sampling parameters for Qwen 2.5 Coder and Qwen3 Coder variants.
- `benchmark.py` - Benchmark runner that evaluates docstring quality using both standard and AST-aware metrics.
- `metrics/standard.py` - BLEU and ROUGE-L wrappers via HuggingFace `evaluate`.
- `metrics/ast_aware.py` - Novel metrics that parse the source code's AST and check whether generated docstrings correctly reference identifiers, control-flow patterns, and function parameters.
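The actual implementation in `metrics/ast_aware.py` is not shown here; as a minimal sketch of the idea, one such metric could parse the function with Python's `ast` module and measure what fraction of its parameters the generated docstring mentions (the function name `parameter_coverage` is illustrative, not the project's API):

```python
import ast

def parameter_coverage(code: str, docstring: str) -> float:
    """Fraction of the function's parameter names mentioned in the docstring.

    A sketch of an AST-aware metric: parse the source, collect parameter
    identifiers, and check which ones the docstring references.
    """
    tree = ast.parse(code)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    params = [a.arg for a in func.args.args]
    if not params:
        return 1.0  # nothing to cover
    mentioned = sum(1 for p in params if p in docstring)
    return mentioned / len(params)
```

A real metric would need tokenized matching (to avoid counting `x` inside another word) and handling of `*args`/`**kwargs`, but the structure — grounding the score in the parsed AST rather than n-gram overlap — is the same.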
Migrated from the original Python150k preprocessing pipeline:
- `parse_python3.py` - Converts Python source code to a JSON AST representation.
- `ast_conversion.py` - Transforms the AST with value-node splitting and DFS traversal.
- `processor_ast.py` - Text preprocessing for code, comments, and docstrings.
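The pipeline's actual JSON schema is not shown here; the kind of source-to-JSON-AST conversion `parse_python3.py` performs can be sketched with the standard library (field names below are illustrative, not the pipeline's real format):

```python
import ast
import json

def to_json_ast(source: str) -> str:
    """Serialize a Python module's AST as JSON (illustrative schema)."""
    def convert(node):
        if isinstance(node, ast.AST):
            return {"type": type(node).__name__,
                    "fields": {f: convert(v) for f, v in ast.iter_fields(node)}}
        if isinstance(node, list):
            return [convert(x) for x in node]
        return node  # leaf values: str, int, None, ...
    return json.dumps(convert(ast.parse(source)))
```

Once the tree is in JSON form, downstream steps such as value-node splitting and DFS traversal can operate on it without re-parsing the source.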
```bash
# Install dependencies
pip install -e ".[dev]"

# Convert to HuggingFace format (requires dataset access, see below)
python -m src.data.convert_seed \
    --input-dir data/raw/python-method \
    --output-dir data/processed/python-method
```

The FastAPI inference server provides HTTP endpoints for docstring generation using ollama as the backend. The server uses a system prompt stored in `src/training/prompts/system_prompt.md` to generate NumPy-style docstrings.
- Install ollama: make sure ollama is installed and running locally.
- Pull a model: download one of the supported code models:

```bash
# Qwen 2.5 Coder (dense models)
ollama pull qwen2.5-coder:32b    # Default, ~18GB Q4
ollama pull qwen2.5-coder:14b    # Mid-size, ~8GB Q4
ollama pull qwen2.5-coder:7b     # Fast, ~4GB Q4

# Qwen3 Coder (MoE model)
ollama pull qwen3-coder:30b-a3b  # Best quality, ~18GB Q4, 256K context
```
Start the FastAPI server using uvicorn.

Linux/macOS:

```bash
# Using uvicorn directly
uvicorn src.training.serve:app --host 0.0.0.0 --port 8000

# Or run the module directly
python -m src.training.serve
```

Windows (PowerShell):

```bash
uvicorn src.training.serve:app --host 0.0.0.0 --port 8000
```

The server will start on http://localhost:8000 by default.
The server can be configured using environment variables:
- `OLLAMA_URL` - Ollama API endpoint (default: `http://localhost:11434/api/chat`)
- `OLLAMA_MODEL` - Model key or Ollama model name (default: `qwen2.5-coder-32b`)
- `REQUEST_TIMEOUT` - Request timeout in seconds (default: `120.0`)
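How `serve.py` actually reads these variables is not shown; a minimal sketch of the documented defaults, assuming plain `os.getenv` lookups:

```python
import os

# Configuration with the documented defaults (variable names match the
# README; the real serve.py implementation may differ).
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434/api/chat")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "qwen2.5-coder-32b")
REQUEST_TIMEOUT = float(os.getenv("REQUEST_TIMEOUT", "120.0"))
```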
Linux/macOS:

```bash
OLLAMA_MODEL=qwen3-coder-30b uvicorn src.training.serve:app --port 8000
```

Windows (PowerShell):

```bash
$env:OLLAMA_MODEL="qwen3-coder-30b"; uvicorn src.training.serve:app --port 8000
```

Windows (CMD):

```bash
set OLLAMA_MODEL=qwen3-coder-30b && uvicorn src.training.serve:app --port 8000
```

| Model Key | Ollama Model | Architecture | Memory (Q4) | Context | Description |
|---|---|---|---|---|---|
| `qwen2.5-coder-32b` | `qwen2.5-coder:32b` | Dense | ~18GB | 32K | Default, balanced quality/speed |
| `qwen2.5-coder-14b` | `qwen2.5-coder:14b` | Dense | ~8GB | 32K | Mid-size, good performance |
| `qwen2.5-coder-7b` | `qwen2.5-coder:7b` | Dense | ~4GB | 32K | Fast inference |
| `qwen3-coder-30b` | `qwen3-coder:30b-a3b` | MoE | ~18GB | 256K | Best quality, 3.3B active params |
Each model has optimized sampling parameters:
- Qwen 2.5 Coder: temperature=0.7, top_p=0.9, top_k=40
- Qwen3 Coder: temperature=1.0, top_p=0.95, top_k=40 (per official recommendations)
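The registry in `models.py` is not reproduced here; a sketch of how such per-model sampling configuration might be organized (the `ModelConfig` dataclass and field names are illustrative, the parameter values are the ones documented above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    ollama_model: str
    temperature: float
    top_p: float
    top_k: int

# Sampling parameters as documented above; keys mirror the model-key table.
MODEL_REGISTRY = {
    "qwen2.5-coder-32b": ModelConfig("qwen2.5-coder:32b", 0.7, 0.9, 40),
    "qwen2.5-coder-14b": ModelConfig("qwen2.5-coder:14b", 0.7, 0.9, 40),
    "qwen2.5-coder-7b":  ModelConfig("qwen2.5-coder:7b", 0.7, 0.9, 40),
    "qwen3-coder-30b":   ModelConfig("qwen3-coder:30b-a3b", 1.0, 0.95, 40),
}
```

Keeping the parameters in a frozen registry like this means the server can resolve either a model key or a raw Ollama model name to one immutable configuration.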
You can select a model in two ways:

1. Environment variable (applies to all requests):

   ```bash
   OLLAMA_MODEL=qwen3-coder-30b uvicorn src.training.serve:app
   ```

2. Per-request (via API):

   ```bash
   curl -X POST http://localhost:8000/generate \
     -H "Content-Type: application/json" \
     -d '{"code": "def add(x, y): return x + y", "model": "qwen3-coder-30b"}'
   ```
Via CLI:

```bash
python scripts/run_ollama.py --list-models
```

Via API:

```bash
curl http://localhost:8000/models
```

Check if the service is healthy and ollama is accessible:

```bash
curl http://localhost:8000/health
```

Response (200 OK):
```json
{
  "status": "healthy",
  "service": "ollama",
  "active_model": "Qwen 2.5 Coder 32B",
  "ollama_model": "qwen2.5-coder:32b"
}
```

Response (503 Service Unavailable):

```json
{
  "detail": "Service unhealthy: ollama is not running or not accessible"
}
```

Generate a docstring for a Python function:
```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "code": "def add(x, y):\n    return x + y",
    "max_new_tokens": 256
  }'
```

Request Body:

- `code` (required): Python function code as a string
- `max_new_tokens` (optional): Maximum number of tokens to generate (uses the model default if not specified)
- `model` (optional): Model key or Ollama model name to use for this request
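The same request can be made from Python with only the standard library; the helper names below (`build_generate_request`, `generate_docstring`) are illustrative, not part of the project:

```python
import json
import urllib.request

def build_generate_request(code: str, max_new_tokens: int = 256,
                           base_url: str = "http://localhost:8000"):
    """Build a POST request for the /generate endpoint."""
    payload = {"code": code, "max_new_tokens": max_new_tokens}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def generate_docstring(code: str) -> str:
    """Send the request; requires the server from serve.py to be running."""
    req = build_generate_request(code)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["docstring"]
```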
Response (200 OK):

```json
{
  "docstring": "\"\"\"Compute the sum of two numbers.\n\nParameters\n----------\nx : int\n    First number.\ny : int\n    Second number.\n\nReturns\n-------\nint\n    Sum of x and y.\"\"\"",
  "model": "qwen2.5-coder:32b"
}
```

Response (500 Internal Server Error):

```json
{
  "detail": "Failed to generate docstring: <error message>"
}
```

Get available model configurations:
```bash
curl http://localhost:8000/models
```

Response (200 OK):

```json
{
  "default": "qwen2.5-coder-32b",
  "active": "qwen2.5-coder-32b",
  "models": [
    {
      "key": "qwen2.5-coder-32b",
      "name": "Qwen 2.5 Coder 32B",
      "ollama_model": "qwen2.5-coder:32b",
      "context_window": 32768,
      "architecture": "dense",
      "memory_q4": "~18GB",
      "description": "Dense 32B model, good balance of quality and speed"
    }
  ]
}
```

The CLI tool allows testing docstring generation directly:
```bash
# Use default model
python scripts/run_ollama.py --user "def add(x, y): return x + y"

# Use specific model by key
python scripts/run_ollama.py --model-key qwen3-coder-30b --user "def foo(): pass"

# Use raw Ollama model name
python scripts/run_ollama.py --model qwen2.5-coder:7b --user "def bar(): pass"

# List available models
python scripts/run_ollama.py --list-models
```

Run the test suite to verify the API endpoints:

```bash
pytest tests/test_serve.py tests/test_models.py -v
```

The seed dataset comes from the NeuralCodeSum project (ACL 2020): 92,545 Python function-docstring pairs split into train/dev/test.
The python-method dataset was previously available via a Google Drive download script (`data/raw/python-method/get_data.sh`). This script has been removed because the Google Drive link (file ID: `1XPE1txk9VI0aOT_TdqbAeI58Q8puKVl2`) is no longer accessible.
To obtain the dataset, you can:
- Contact the NeuralCodeSum authors
- Download from the original source if available at the project repository
- Use the alternative python150k dataset from ETH Zurich SRI Lab
- Original C2NL dataset: A Transformer-based Approach for Source Code Summarization
- Python150k dataset: ETH Zurich SRI Lab
- Tree Transformer: nxphi47/tree_transformer