Source Code Summarization

LLM-based Python code summarization with AST-aware evaluation.

Overview

This project fine-tunes small code LLMs (1-3B parameters) via LoRA to generate docstrings for Python functions, and evaluates them using an AST-aware benchmark that tests structural understanding beyond surface-level text metrics.

Architecture

Seed Dataset (C2NL, 92k examples)
        |
        v
[convert_seed.py] --> HuggingFace Dataset
        |
        v
[expand_with_distilabel.py] --> Expanded Dataset (teacher LLM generates more examples)
        |
        v
[train_lora.py] --> LoRA-adapted Code LLM
        |
        v
[serve.py] --> FastAPI Inference Server (localhost:8000)
        |
        v
    VS Code Extension (calls /generate endpoint)

Evaluation runs independently via the AST-aware benchmark:

Test Dataset + Model Predictions --> [benchmark.py] --> Metrics Report
                                        |
                          Standard (BLEU, ROUGE) + AST-aware metrics

Components

Data Preparation (src/data/)

  • convert_seed.py - Converts the C2NL parallel-file dataset (code.original + javadoc.original) into HuggingFace instruction-tuning format. Applies heuristic detokenization to make code readable for LLMs; see the sketch after this list.

  • expand_with_distilabel.py - Uses distilabel to expand the seed dataset by sending code to a teacher LLM for higher-quality docstring generation.
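
As a concrete illustration of the conversion step, here is a minimal sketch of what convert_seed.py does. The parallel file names come from the C2NL layout described above; the detokenization rules and the instruction template are illustrative assumptions, not the script's exact logic.

# Hypothetical sketch of convert_seed.py's core conversion.
from pathlib import Path
from datasets import Dataset

def detokenize(code: str) -> str:
    # Heuristically undo token-level spacing so the code reads naturally.
    for src, dst in [(" ( ", "("), (" )", ")"), (" . ", "."),
                     (" , ", ", "), (" :", ":")]:
        code = code.replace(src, dst)
    return code

def load_pairs(input_dir: str) -> Dataset:
    code_lines = Path(input_dir, "code.original").read_text().splitlines()
    doc_lines = Path(input_dir, "javadoc.original").read_text().splitlines()
    records = [
        {
            "instruction": "Write a docstring for the following Python function.",
            "input": detokenize(code),
            "output": doc,
        }
        for code, doc in zip(code_lines, doc_lines)
    ]
    return Dataset.from_list(records)

# load_pairs("data/raw/python-method/train").save_to_disk(
#     "data/processed/python-method/train")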

Training (src/training/)

  • train_lora.py - LoRA fine-tuning using HuggingFace Trainer + PEFT. Supports QLoRA (4-bit quantization) for training on 1-2 A100 GPUs; see the configuration sketch after this list.

  • serve.py - FastAPI inference server that uses the Ollama API to generate docstrings. Supports multiple Qwen Coder models with model-specific configurations.

  • models.py - Model configuration registry with sampling parameters for Qwen 2.5 Coder and Qwen3 Coder variants.
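
The sketch below shows what the LoRA/QLoRA setup in train_lora.py might look like with PEFT. The base model, target modules, and ranks are illustrative assumptions, not the repo's exact configuration.

# Hypothetical sketch of a QLoRA setup via HuggingFace PEFT.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-Coder-1.5B"  # assumed small base model

# QLoRA: load the frozen base model in 4-bit to cut memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb_config)

# LoRA adapters on the attention projections; only these weights train.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

The resulting model plugs into the HuggingFace Trainer as usual; only the small adapter weights are saved at the end.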

Evaluation (src/evaluation/)

  • benchmark.py - Benchmark runner that evaluates docstring quality using both standard and AST-aware metrics.

  • metrics/standard.py - BLEU and ROUGE-L wrappers via HuggingFace evaluate (see the sketch after this list).

  • metrics/ast_aware.py - Novel metrics that parse the source code's AST and check whether generated docstrings correctly reference identifiers, control-flow patterns, and function parameters.
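
The standard metrics are thin wrappers, roughly equivalent to the following (the evaluate calls are the library's real API; the sample strings are made up):

import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

preds = ["Return the sum of x and y."]
refs = ["Compute the sum of two numbers."]

print(bleu.compute(predictions=preds, references=[[r] for r in refs])["bleu"])
print(rouge.compute(predictions=preds, references=refs)["rougeL"])

For the AST-aware side, here is a toy version of one such check, parameter coverage: the fraction of function parameters the generated docstring actually mentions. The real metrics/ast_aware.py also covers identifiers and control-flow patterns; this is only a sketch of the idea.

import ast

def parameter_coverage(code: str, docstring: str) -> float:
    # Parse the source and find the first function definition.
    tree = ast.parse(code)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    params = [a.arg for a in func.args.args if a.arg != "self"]
    if not params:
        return 1.0
    # Count parameters that appear verbatim in the docstring.
    mentioned = sum(1 for p in params if p in docstring)
    return mentioned / len(params)

print(parameter_coverage("def add(x, y):\n    return x + y",
                         "Compute the sum of x and y."))  # 1.0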

AST Utilities (src/ast_utils/)

Migrated from the original Python150k preprocessing pipeline:

  • parse_python3.py - Converts Python source code to a JSON AST representation (see the sketch after this list).
  • ast_conversion.py - Transforms AST with value-node splitting and DFS traversal.
  • processor_ast.py - Text preprocessing for code, comments, and docstrings.
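
The core idea of parse_python3.py can be sketched in a few lines: walk the AST and emit a JSON-serializable tree. The node schema below is an assumption; the original follows the Python150k format.

import ast
import json

def to_json_ast(node: ast.AST) -> dict:
    out = {"type": type(node).__name__, "children": []}
    for child in ast.iter_child_nodes(node):
        out["children"].append(to_json_ast(child))
    # Keep leaf values (names, constants) so value-node splitting can use them.
    if isinstance(node, ast.Name):
        out["value"] = node.id
    elif isinstance(node, ast.Constant):
        out["value"] = repr(node.value)
    return out

tree = ast.parse("def add(x, y):\n    return x + y")
print(json.dumps(to_json_ast(tree), indent=2))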

Quick Start

# Install dependencies
pip install -e ".[dev]"

# Convert to HuggingFace format (requires dataset access, see below)
python -m src.data.convert_seed \
    --input-dir data/raw/python-method \
    --output-dir data/processed/python-method

Serving

The FastAPI inference server provides HTTP endpoints for docstring generation using ollama as the backend. The server uses a system prompt stored in src/training/prompts/system_prompt.md to generate NumPy-style docstrings.
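
In outline, the serving path looks roughly like the sketch below. The endpoint shape matches the API documented in this README; the internals (httpx usage, exact payload fields) are assumptions about the implementation.

import os
import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434/api/chat")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "qwen2.5-coder:32b")
SYSTEM_PROMPT = open("src/training/prompts/system_prompt.md").read()

app = FastAPI()

class GenerateRequest(BaseModel):
    code: str
    model: str | None = None

@app.post("/generate")
def generate(req: GenerateRequest):
    # Forward the code to ollama's chat API together with the system prompt.
    payload = {
        "model": req.model or OLLAMA_MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": req.code},
        ],
        "stream": False,
    }
    try:
        resp = httpx.post(OLLAMA_URL, json=payload, timeout=120.0)
        resp.raise_for_status()
    except httpx.HTTPError as exc:
        raise HTTPException(500, f"Failed to generate docstring: {exc}")
    return {"docstring": resp.json()["message"]["content"],
            "model": payload["model"]}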

Prerequisites

  1. Install ollama: Make sure ollama is installed and running locally
  2. Pull a model: Download one of the supported code models:
    # Qwen 2.5 Coder (dense models)
    ollama pull qwen2.5-coder:32b  # Default, ~18GB Q4
    ollama pull qwen2.5-coder:14b  # Mid-size, ~8GB Q4
    ollama pull qwen2.5-coder:7b   # Fast, ~4GB Q4
    
    # Qwen3 Coder (MoE model)
    ollama pull qwen3-coder:30b-a3b  # Best quality, ~18GB Q4, 256K context

Starting the Server

Start the FastAPI server using uvicorn:

Linux/macOS:

# Using uvicorn directly
uvicorn src.training.serve:app --host 0.0.0.0 --port 8000

# Or run the module directly
python -m src.training.serve

Windows (PowerShell):

uvicorn src.training.serve:app --host 0.0.0.0 --port 8000

The server will start on http://localhost:8000 by default.

Configuration

The server can be configured using environment variables:

  • OLLAMA_URL - Ollama API endpoint (default: http://localhost:11434/api/chat)
  • OLLAMA_MODEL - Model key or Ollama model name (default: qwen2.5-coder-32b)
  • REQUEST_TIMEOUT - Request timeout in seconds (default: 120.0)

Linux/macOS:

OLLAMA_MODEL=qwen3-coder-30b uvicorn src.training.serve:app --port 8000

Windows (PowerShell):

$env:OLLAMA_MODEL="qwen3-coder-30b"; uvicorn src.training.serve:app --port 8000

Windows (CMD):

set OLLAMA_MODEL=qwen3-coder-30b && uvicorn src.training.serve:app --port 8000

Available Models

Model Key           Ollama Model          Architecture   Memory (Q4)   Context   Description
qwen2.5-coder-32b   qwen2.5-coder:32b     Dense          ~18GB         32K       Default, balanced quality/speed
qwen2.5-coder-14b   qwen2.5-coder:14b     Dense          ~8GB          32K       Mid-size, good performance
qwen2.5-coder-7b    qwen2.5-coder:7b      Dense          ~4GB          32K       Fast inference
qwen3-coder-30b     qwen3-coder:30b-a3b   MoE            ~18GB         256K      Best quality, 3.3B active params

Each model has optimized sampling parameters (a registry sketch follows this list):

  • Qwen 2.5 Coder: temperature=0.7, top_p=0.9, top_k=40
  • Qwen3 Coder: temperature=1.0, top_p=0.95, top_k=40 (per official recommendations)
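
A plausible shape for the registry in src/training/models.py is sketched below; the field names are assumptions, with values taken from the table above.

from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    name: str
    ollama_model: str
    context_window: int
    temperature: float
    top_p: float
    top_k: int

MODELS = {
    "qwen2.5-coder-32b": ModelConfig(
        "Qwen 2.5 Coder 32B", "qwen2.5-coder:32b", 32768, 0.7, 0.9, 40),
    "qwen3-coder-30b": ModelConfig(
        "Qwen3 Coder 30B", "qwen3-coder:30b-a3b", 262144, 1.0, 0.95, 40),
}

def resolve(key: str) -> ModelConfig:
    # Accept either a registry key or a raw Ollama model name.
    if key in MODELS:
        return MODELS[key]
    return next(c for c in MODELS.values() if c.ollama_model == key)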

Model Selection

You can select a model in two ways:

  1. Environment variable (applies to all requests):

    OLLAMA_MODEL=qwen3-coder-30b uvicorn src.training.serve:app
  2. Per-request (via API):

    curl -X POST http://localhost:8000/generate \
      -H "Content-Type: application/json" \
      -d '{"code": "def add(x, y): return x + y", "model": "qwen3-coder-30b"}'

List Available Models

Via CLI:

python scripts/run_ollama.py --list-models

Via API:

curl http://localhost:8000/models

API Endpoints

Health Check

Check if the service is healthy and ollama is accessible:

curl http://localhost:8000/health

Response (200 OK):

{
  "status": "healthy",
  "service": "ollama",
  "active_model": "Qwen 2.5 Coder 32B",
  "ollama_model": "qwen2.5-coder:32b"
}

Response (503 Service Unavailable):

{
  "detail": "Service unhealthy: ollama is not running or not accessible"
}

Generate Docstring

Generate a docstring for a Python function:

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "code": "def add(x, y):\n    return x + y",
    "max_new_tokens": 256
  }'

Request Body:

  • code (required): Python function code as a string
  • max_new_tokens (optional): Maximum number of tokens to generate (uses model default if not specified)
  • model (optional): Model key or Ollama model name to use for this request

Response (200 OK):

{
  "docstring": "\"\"\"Compute the sum of two numbers.\n\nParameters\n----------\nx : int\n    First number.\ny : int\n    Second number.\n\nReturns\n-------\nint\n    Sum of x and y.\"\"\"",
  "model": "qwen2.5-coder:32b"
}

Response (500 Internal Server Error):

{
  "detail": "Failed to generate docstring: <error message>"
}
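
The same call from Python, for scripting against the server (assumes the server from the Serving section is running on localhost:8000):

import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"code": "def add(x, y):\n    return x + y", "max_new_tokens": 256},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["docstring"])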

List Models

Get available model configurations:

curl http://localhost:8000/models

Response (200 OK):

{
  "default": "qwen2.5-coder-32b",
  "active": "qwen2.5-coder-32b",
  "models": [
    {
      "key": "qwen2.5-coder-32b",
      "name": "Qwen 2.5 Coder 32B",
      "ollama_model": "qwen2.5-coder:32b",
      "context_window": 32768,
      "architecture": "dense",
      "memory_q4": "~18GB",
      "description": "Dense 32B model, good balance of quality and speed"
    }
  ]
}

CLI Tool

The CLI tool allows testing docstring generation directly:

# Use default model
python scripts/run_ollama.py --user "def add(x, y): return x + y"

# Use specific model by key
python scripts/run_ollama.py --model-key qwen3-coder-30b --user "def foo(): pass"

# Use raw Ollama model name
python scripts/run_ollama.py --model qwen2.5-coder:7b --user "def bar(): pass"

# List available models
python scripts/run_ollama.py --list-models

Testing

Run the test suite to verify the API endpoints:

pytest tests/test_serve.py tests/test_models.py -v

Dataset

The seed dataset comes from the NeuralCodeSum project (ACL 2020): 92,545 Python function-docstring pairs split into train/dev/test.

Dataset Access

The python-method dataset was previously available via a Google Drive download script (data/raw/python-method/get_data.sh). This script has been removed as the Google Drive link (file ID: 1XPE1txk9VI0aOT_TdqbAeI58Q8puKVl2) is no longer accessible.

To obtain the dataset, you can:

  1. Contact the NeuralCodeSum authors
  2. Download from the original source, if still available, via the project repository
  3. Use the alternative python150k dataset from ETH Zurich SRI Lab

Acknowledgments
