Benchmarking Diarization Models 📜Paper

An open-source evaluation platform for comparing speaker diarization systems out-of-the-box, across multiple datasets, languages, and acoustic conditions.

This repository accompanies the bachelor's thesis "Benchmarking Diarization Models" (ETH Zürich, Distributed Computing Group, 2025) and provides a unified pipeline to generate predictions and evaluate performance for any speaker diarization model — without parameter tuning or domain-specific modifications.

Why This Project?

Speaker diarization, answering "who spoke when", is a critical preprocessing step for meeting transcription, call analytics, and speech recognition. While many diarization models exist, comparing them fairly is difficult: different papers use different datasets, metrics, evaluation collars, and post-processing steps.

This platform solves that by providing:

Standardized evaluation across four datasets and five languages
Out-of-the-box testing — models are evaluated as a practitioner would deploy them
Modular architecture — add new models or datasets with minimal code changes
Reproducible results — all predictions saved as JSON for inspection and re-evaluation

Results

We evaluated five diarization systems across 196.6 hours of multilingual audio. PyannoteAI (commercial) achieved the best overall DER of 11.2%, while DiariZen provided the strongest open-source performance at 13.3% DER.

DER Performance Overview

Diarization error rate across models showing missed speech (red), false alarm (orange), and speaker confusion (blue) components. Lower is better.

Per-Language DER (%)

Model	Zho	Eng	Deu	Jpn	Spa
DiariZen	10.1	7.0	11.6	15.6	19.1
Sortformer	13.1	15.5	11.1	16.5	21.9
Sortformer v2	9.2	15.3	9.6	12.7	21.1
SF v2-streaming	9.4	14.1	9.6	12.7	21.1
pyannote	19.8	11.5	19.0	28.8	27.3
PyannoteAI	10.0	6.6	8.3	13.8	14.3

Per-Speaker Count DER (%)

Model	1 spk	2 spk	3 spk	4 spk	5+ spk
DiariZen	2.3	11.4	10.3	12.7	7.1
Sortformer	1.5	11.5	14.7	21.3	23.9
Sortformer v2	4.7	10.1	14.3	16.7	22.7
SF v2-streaming	4.7	10.4	14.1	13.2	22.7
pyannote	3.2	19.9	19.8	17.1	10.6
PyannoteAI	2.7	9.9	9.1	10.1	6.6

Key findings:

Missed speech is the dominant error mode across all specialized models
No single model wins across all languages — model choice depends on deployment scenario
Sortformer v2 achieves exceptional computational efficiency (214.3x real-time) while maintaining competitive accuracy

For full analysis, see the paper.

How to Use

1. Set Up Environments

Each model requires its own conda environment. Create them from the provided environment files and install the corresponding requirements:

# pyannote
conda env create -f environments/environment_pyannote.yml
conda activate pyannote
pip install -r requirements/requirements_pyannote.txt

# NeMo (Sortformer)
conda env create -f environments/environment_nemo.yml
conda activate nemo
pip install -r requirements/requirements_nemo.txt

# DiariZen
conda env create -f environments/environment_diarizen.yml
conda activate diarizen
pip install -r requirements/requirements_diarizen.txt

# PyannoteAI (API-based, lightweight)
conda env create -f environments/environment_pyannoteai.yml
conda activate pyannoteai
pip install -r requirements/requirements_pyannoteai.txt

For model-specific setup details, refer to the official documentation:

pyannote.audio — requires a HuggingFace token with access to pyannote/speaker-diarization-3.1
NeMo Sortformer — available via the NVIDIA NeMo toolkit
DiariZen — from BUTSpeechFIT/DiariZen
PyannoteAI — commercial API, requires an API key

2. Configure `config.yaml`

Create a config.yaml with your dataset paths and model settings:

datasets:
  paths:
    callhome: "/path/to/dataset/callhome"
    voxconverse: "/path/to/dataset/voxconverse"
    ami: "/path/to/dataset/ami"
    ali: "/path/to/dataset/ali"

models:
  pyannote:
    model_path: "pyannote/speaker-diarization-3.1"
    device: "cuda"
  nemo:
    model_path: "nvidia/diar_streaming_sortformer_4spk-v2"
    device: "cuda"
  diarizen:
    model_path: "BUT-FIT/diarizen-wavlm-large-s80-md"
    device: "cuda"
  pyannoteai:
    api_endpoint: "https://api.pyannote.ai"

paths:
  results_base: "predictions"
  evaluation_results: "evaluation_results"

evaluation:
  collar_seconds: 0.25

processing:
  override_predictions: false
  batch_size: 1

API keys are read from environment variables — never stored in config:

export HF_TOKEN="your_huggingface_token"
export PYANNOTEAI_API_KEY="your_pyannoteai_key"
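
As a rough illustration of how a script might resolve these settings at run time, the sketch below reads a key from the environment and checks the dataset paths of an already-parsed config. The helper names `require_env` and `missing_dataset_paths` are hypothetical and not part of this repository; the config is passed in as a plain dictionary so the example does not depend on any particular YAML parser.

```python
import os
from pathlib import Path


def require_env(name: str, env=None) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the pipeline")
    return value


def missing_dataset_paths(config: dict) -> list:
    """List dataset names whose configured directory does not exist on disk."""
    paths = config.get("datasets", {}).get("paths", {})
    return sorted(name for name, p in paths.items() if not Path(p).is_dir())


if __name__ == "__main__":
    # Stand-in for the parsed config.yaml (placeholder path from the example above)
    config = {"datasets": {"paths": {"callhome": "/path/to/dataset/callhome"}}}
    print("Missing dataset directories:", missing_dataset_paths(config))
    try:
        require_env("HF_TOKEN")
        print("HF_TOKEN found")
    except RuntimeError as err:
        print(err)
```

Failing early on a missing key or directory is usually cheaper than discovering it after a model has been loaded onto the GPU.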

3. Run the Pipeline

Generate predictions for a model/dataset combination (activate the correct conda environment first):

conda activate pyannote
python -m src.gen --model pyannote --dataset callhome

conda activate nemo
python -m src.gen --model nemo --dataset ami

conda activate diarizen
python -m src.gen --model diarizen --dataset voxconverse

Evaluate predictions (requires only pyannote.metrics):

python -m src.eval --model pyannote --dataset callhome
python -m src.eval --model nemo --dataset ami

Automate with run_pipeline.sh: Update the script with your conda paths and dataset/model combinations, then run:

bash run_pipeline.sh

The shell script handles activating the correct conda environment for each model before calling gen.py, then runs eval.py for all models in a single environment.
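
The same orchestration idea can be sketched in Python. This is not how run_pipeline.sh is implemented; it is an alternative that assumes `conda run -n <env>` is available, that each environment is named after its model as in step 1, and that the `pyannote` environment provides pyannote.metrics for evaluation. The `build_commands` helper and the job list are illustrative only.

```python
import shlex

# Assumption: conda environment names match the model names used in step 1
MODEL_ENVS = {"pyannote": "pyannote", "nemo": "nemo", "diarizen": "diarizen"}


def build_commands(jobs, eval_env="pyannote"):
    """Return argv lists: one gen call per job in the model's env, then one eval call per job."""
    commands = []
    for model, dataset in jobs:
        commands.append(["conda", "run", "-n", MODEL_ENVS[model],
                         "python", "-m", "src.gen", "--model", model, "--dataset", dataset])
    for model, dataset in jobs:
        commands.append(["conda", "run", "-n", eval_env,
                         "python", "-m", "src.eval", "--model", model, "--dataset", dataset])
    return commands


if __name__ == "__main__":
    jobs = [("pyannote", "callhome"), ("nemo", "ami")]
    for argv in build_commands(jobs):
        # Printed rather than executed; pass argv to subprocess.run(argv, check=True) to run it
        print(shlex.join(argv))
```

Building the commands as lists first keeps the dry run and the real run identical apart from the final call.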

4. Prediction Output Format

All predictions are saved as JSON in {results_base}/{model}/{dataset}/{split}/{audio_id}_prediction.json:

{
  "audio_id": "eng_0001",
  "dataset": "callhome",
  "split": "eng",
  "language": "eng",
  "model_used": "pyannote/speaker-diarization-3.1",
  "segments": [
    {"start": 0.0, "end": 3.456, "speaker_id": "SPEAKER_00"},
    {"start": 3.789, "end": 7.123, "speaker_id": "SPEAKER_01"}
  ],
  "timestamp": "2025-09-15T14:30:00"
}

Evaluation reports (per-file JSON metrics and text summaries) are saved to {evaluation_results}/{model}/{dataset}/{timestamp}/.
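
Because predictions are plain JSON, they can be inspected without any model environment. The following sketch builds the path pattern described above and sums the speaking time per speaker; the helper names `prediction_path` and `speech_per_speaker` are hypothetical and not taken from the repository.

```python
import json
from collections import defaultdict
from pathlib import Path


def prediction_path(results_base, model, dataset, split, audio_id) -> Path:
    """Build {results_base}/{model}/{dataset}/{split}/{audio_id}_prediction.json."""
    return Path(results_base) / model / dataset / split / f"{audio_id}_prediction.json"


def speech_per_speaker(prediction: dict) -> dict:
    """Sum segment durations (seconds) for each speaker_id."""
    totals = defaultdict(float)
    for seg in prediction["segments"]:
        totals[seg["speaker_id"]] += seg["end"] - seg["start"]
    return {spk: round(t, 3) for spk, t in totals.items()}


if __name__ == "__main__":
    # Trimmed copy of the example prediction shown above
    example = json.loads("""
    {"audio_id": "eng_0001", "dataset": "callhome", "split": "eng",
     "segments": [
       {"start": 0.0, "end": 3.456, "speaker_id": "SPEAKER_00"},
       {"start": 3.789, "end": 7.123, "speaker_id": "SPEAKER_01"}
     ]}
    """)
    print(prediction_path("predictions", "pyannote", "callhome", "eng", "eng_0001"))
    print(speech_per_speaker(example))
```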

Project Structure

.
├── src/
│   ├── data_structures.py    # Shared dataclasses (AudioDataItem, DiarizationResponse, etc.)
│   ├── data_loader.py        # Audio discovery and ground truth loading
│   ├── model_loader.py       # Lazy model imports (one per conda environment)
│   ├── gen.py                # Unified prediction generation script
│   └── eval.py               # Unified evaluation script
├── requirements/
│   ├── requirements_pyannote.txt
│   ├── requirements_nemo.txt
│   ├── requirements_diarizen.txt
│   └── requirements_pyannoteai.txt
├── environments/
│   ├── environment_pyannote.yml
│   ├── environment_nemo.yml
│   ├── environment_diarizen.yml
│   └── environment_pyannoteai.yml
├── config.yaml               # Dataset paths, model settings, evaluation parameters
├── run_pipeline.sh           # Shell orchestrator for conda environments
└── README.md

Architecture Overview

The pipeline is designed around dependency isolation: each diarization model requires its own conda environment with potentially conflicting packages. The architecture handles this through lazy imports and shell-based orchestration.

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────────┐
│ run_pipeline.sh  │────▶│      gen.py      │────▶│       eval.py        │
│  (activates the  │     │   (per conda     │     │   (single env w/     │
│   conda env)     │     │   environment)   │     │   pyannote.metrics)  │
└──────────────────┘     └────────┬─────────┘     └──────────────────────┘
                                  │
                         ┌────────┴────────┐
                         ▼                 ▼
                  ┌─────────────┐   ┌──────────────┐
                  │ DataLoader  │   │ ModelLoader  │
                  │ (std lib +  │   │ (lazy import │
                  │  librosa)   │   │  per model)  │
                  └─────────────┘   └──────────────┘

data_structures.py — All shared types. Every component converts to/from these formats.
data_loader.py — Discovers audio files and loads ground truth. Uses only standard library + librosa. No model-specific imports.
model_loader.py — Each model's imports happen inside its method, so gen.py runs in any conda environment without import errors.
gen.py — Loads audio via DataLoader, loads a model via ModelLoader, runs inference, saves predictions as JSON.
eval.py — Loads saved predictions + ground truth, computes DER/JER via pyannote.metrics, generates reports.
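
The lazy-import idea behind model_loader.py can be illustrated with a toy loader. This is a sketch of the general pattern, not the repository's code: the class name, the backend names, and the use of `json` as a stand-in for a heavy model package are all assumptions made so the example runs anywhere.

```python
import importlib


class LazyModelLoader:
    """Toy loader: each backend imports its dependency only when requested."""

    def __init__(self):
        # Map model name -> loader method; nothing heavy is imported here
        self._loaders = {
            "json_backend": self._load_json_backend,
            "missing_backend": self._load_missing_backend,
        }

    def load(self, name: str):
        if name not in self._loaders:
            raise ValueError(f"Unsupported model: {name}")
        return self._loaders[name]()

    def _load_json_backend(self):
        # Stand-in for a model package that is installed in this environment
        return importlib.import_module("json")

    def _load_missing_backend(self):
        # Stand-in for a package that is absent from the active environment
        try:
            return importlib.import_module("package_that_is_not_installed_here")
        except ImportError as e:
            raise ImportError(f"missing_backend not available. Error: {e}") from e


if __name__ == "__main__":
    loader = LazyModelLoader()          # succeeds even though one backend is missing
    print(loader.load("json_backend").dumps({"ok": True}))
    try:
        loader.load("missing_backend")  # fails only when this backend is requested
    except ImportError as err:
        print(err)
```

Since the import happens inside the method, a missing package only matters when that specific model is selected, which is what lets one script run in every conda environment.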

Extending the Platform

Adding a New Model

1. Add the model to model_loader.py — create a new lazy-import method:

# In model_loader.py

SUPPORTED_MODELS = ["pyannote", "nemo", "diarizen", "pyannoteai", "your_model"]

def _load_your_model(self) -> Any:
    """Load YourModel. All imports inside this method."""
    try:
        from your_model_package import YourPipeline
        import torch

        model_cfg = self.config.get("models", {}).get("your_model", {})
        model_path = model_cfg.get("model_path", "default/model-path")

        pipeline = YourPipeline.from_pretrained(model_path)
        if torch.cuda.is_available():
            pipeline = pipeline.to(torch.device("cuda"))

        return pipeline
    except ImportError as e:
        # Chain the original exception so the missing package stays visible in the traceback
        raise ImportError(f"your_model not available. Error: {e}") from e

2. Add a processing function in gen.py — convert the model's output to DiarizationResponse:

# In gen.py

def your_model_output_to_response(
    raw_output, item: AudioDataItem, model_path: str
) -> DiarizationResponse:
    """Convert YourModel output to the standardized format."""
    segments = []
    for seg in raw_output:
        segments.append(DiarizationSegment(
            start=round(seg.start, 3),
            end=round(seg.end, 3),
            speaker_id=str(seg.speaker),
        ))
    segments.sort(key=lambda x: x.start)

    return DiarizationResponse(
        audio_id=item.audio_id,
        dataset=item.dataset,
        split=item.split,
        language=item.language,
        model_used=model_path,
        segments=segments,
        timestamp=datetime.now().isoformat(),
    )

def process_your_model(model, audio_batch, config) -> ProcessingStats:
    """Process audio with YourModel."""
    # ... iterate over audio_batch.items, call model, convert output, save

3. Register in the dispatch maps in gen.py and eval.py:

# gen.py main()
processors = {
    "pyannote": process_pyannote,
    "nemo": process_nemo,
    "diarizen": process_diarizen,
    "your_model": process_your_model,  # Add here
}

# Also add to argparse choices
parser.add_argument("--model", choices=[..., "your_model"])

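4. Add config and environment: create a dedicated conda environment and requirements file following the naming of the existing ones (for example environments/environment_your_model.yml and requirements/requirements_your_model.txt), then add the model entry to config.yaml: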

# config.yaml
models:
  your_model:
    model_path: "org/your-model-name"
    device: "cuda"

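The evaluation logic in eval.py needs no changes, since it works on the prediction JSON files regardless of which model produced them. Only the registration from step 3 (such as the new name in the --model choices) applies there.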
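
The conversion function in step 2 relies on the shared types from data_structures.py, which are not reproduced in this README. A minimal sketch of what such dataclasses might look like is given here, with field names inferred from the prediction JSON format; the defaults and the `to_json` helper are assumptions, and the actual definitions may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List


@dataclass
class DiarizationSegment:
    start: float
    end: float
    speaker_id: str


@dataclass
class DiarizationResponse:
    audio_id: str
    dataset: str
    split: str
    language: str
    model_used: str
    segments: List[DiarizationSegment] = field(default_factory=list)
    timestamp: str = ""

    def to_json(self) -> str:
        # asdict() recurses into the nested segment dataclasses
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    response = DiarizationResponse(
        audio_id="eng_0001", dataset="callhome", split="eng", language="eng",
        model_used="org/your-model-name",
        segments=[DiarizationSegment(0.0, 3.456, "SPEAKER_00")],
        timestamp=datetime.now().isoformat(),
    )
    print(response.to_json())
```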

Adding a New Dataset

1. Add dataset loading to data_loader.py:

# In data_loader.py

DATASET_SPLITS = {
    # ...existing...
    "your_dataset": ["train", "dev", "test"],
}

DATASET_LANGUAGES = {
    # ...existing...
    "your_dataset": "eng",  # or None if language varies by split
}

def _load_your_dataset_audio(self, model_dir=None, override=False) -> AudioDataBatch:
    """
    Load YourDataset.

    Structure: dataset/your_dataset/{split}/audio/*.wav
    """
    base_dir = Path(self.dataset_paths["your_dataset"])
    items_by_split = {}

    for split in self.DATASET_SPLITS["your_dataset"]:
        audio_dir = base_dir / split / "audio"
        split_items = []

        for wav_file in sorted(audio_dir.glob("*.wav")):
            audio_id = wav_file.stem

            if self._should_skip(audio_id, "your_dataset", split, model_dir, override):
                continue
            if not self._validate_audio(str(wav_file)):
                continue

            duration = self._get_duration(str(wav_file))
            if duration is None:
                continue

            split_items.append(AudioDataItem(
                audio_id=audio_id,
                audio_path=str(wav_file),
                split=split,
                language="eng",
                duration=duration,
                dataset="your_dataset",
            ))

        if split_items:
            items_by_split[split] = split_items

    return AudioDataBatch(dataset="your_dataset", items_by_split=items_by_split)

2. Add ground truth loading:

def _load_your_dataset_gt(self, audio_id: str, split: str) -> GroundTruthAnnotation:
    """Load ground truth — adapt to your GT format (RTTM, JSON, TextGrid, etc.)."""
    gt_path = Path(self.dataset_paths["your_dataset"]) / split / "gt" / f"{audio_id}.rttm"

    segments = []
    # ... parse your GT format into GroundTruthSegment objects ...

    return GroundTruthAnnotation(
        audio_id=audio_id,
        dataset="your_dataset",
        split=split,
        language="eng",
        segments=segments,
    )
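
For ground truth stored as RTTM, the parsing step elided above could look roughly like the sketch below. The `GroundTruthSegment` fields are assumed to mirror the prediction segments and may not match the class in data_structures.py; the column positions follow the standard RTTM layout, where each SPEAKER line carries onset and duration in the fourth and fifth fields and the speaker name in the eighth.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroundTruthSegment:
    # Hypothetical field names, mirroring the prediction segments
    start: float
    end: float
    speaker_id: str


def parse_rttm(text: str) -> List[GroundTruthSegment]:
    """Parse SPEAKER lines of an RTTM file into segments sorted by start time."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # RTTM columns: type file chan onset duration ortho stype name conf slat
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset, duration = float(fields[3]), float(fields[4])
        segments.append(GroundTruthSegment(
            start=round(onset, 3),
            end=round(onset + duration, 3),
            speaker_id=fields[7],
        ))
    segments.sort(key=lambda s: s.start)
    return segments


if __name__ == "__main__":
    sample = (
        "SPEAKER abc 1 3.50 2.25 <NA> <NA> spk01 <NA> <NA>\n"
        "SPEAKER abc 1 0.00 3.00 <NA> <NA> spk00 <NA> <NA>\n"
    )
    for seg in parse_rttm(sample):
        print(seg)
```

Inside `_load_your_dataset_gt`, the file contents would be read from `gt_path` and passed to a parser of this kind.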

3. Register in the loader dispatch dicts (both load_audio and load_groundtruth in data_loader.py) and add "your_dataset" to the argparse choices in both gen.py and eval.py.

4. Add the dataset path to config.yaml:

datasets:
  paths:
    your_dataset: "/path/to/your_dataset"

Expected Dataset Layouts

Dataset	Audio files	Ground truth
CallHome	`callhome/{lang}/{lang}_DDDD.wav`	`callhome/{lang}/{lang}_metadata.json`
VoxConverse	`voxconverse/{split}/eng/*.wav`	`voxconverse/{split}/gt/{audio_id}.rttm`
AMI	`ami/{split}/audio/*.Mix-Headset.wav`	`ami/{split}/gt/{meeting_id}.json`
ALI	`ali/{split}/audio/*.wav`	`ali/{split}/gt/{gt_id}.json`

Evaluation Details

Evaluation is performed with pyannote.metrics using the following settings:

DER (Diarization Error Rate) with 0.25s collar, skip_overlap=False
JER (Jaccard Error Rate) with 0.25s collar
DER components reported individually: missed speech, false alarm, speaker confusion

All metrics are computed as rates (not absolute durations) per file, then aggregated. This ensures files of different lengths contribute proportionally.
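
To make the relationship between DER and its components concrete: DER is the sum of missed speech, false alarm, and speaker confusion durations divided by the total reference speech duration. The sketch below computes the per-file rates from hypothetical component durations. In the actual pipeline these values come from pyannote.metrics, and the function name `der_components` is an illustration, not part of eval.py.

```python
def der_components(missed: float, false_alarm: float, confusion: float, total: float) -> dict:
    """Express each error component as a rate of the total reference speech duration."""
    if total <= 0:
        raise ValueError("total reference speech must be positive")
    rates = {
        "missed_speech": missed / total,
        "false_alarm": false_alarm / total,
        "speaker_confusion": confusion / total,
    }
    # DER is the sum of the three component rates
    rates["der"] = sum(rates.values())
    return rates


if __name__ == "__main__":
    # Hypothetical durations in seconds for a single file
    rates = der_components(missed=6.0, false_alarm=2.0, confusion=4.0, total=100.0)
    for name, value in rates.items():
        print(f"{name}: {value:.1%}")
```

Because false alarms are counted against the reference duration, DER can exceed 100% on files where a model hallucinates a lot of speech.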

Citation

@article{lanzendorfer2025benchmarking,
  title={Benchmarking Diarization Models},
  author={Lanzend{\"o}rfer, Luca A and Gr{\"o}tschla, Florian and Blaser, Cesare and Wattenhofer, Roger},
  journal={arXiv preprint arXiv:2509.26177},
  year={2025}
}

License

This project is released for academic and research purposes. Please refer to the individual model licenses for usage restrictions:

pyannote.audio (MIT)
NeMo (Apache 2.0)
DiariZen (MIT)
PyannoteAI (commercial, API terms apply)
