AFDB Integration Toolkit

A comprehensive toolkit for integrating structural models into the AlphaFold Database (AFDB). This toolkit provides essential tools and workflows to prepare, validate, and format molecular structure data for seamless integration with AFDB infrastructure.

Features

ModelCIF Generation: Convert PDB files to mmCIF format with metadata integration
Binary CIF Conversion: Efficient conversion from mmCIF to Binary CIF (BCIF) format
Secondary Structure Assignment: DSSP-based secondary structure annotation
Metadata Schema Validation: Validate model and provider metadata JSONs against AFDB-defined schemas
UniProt Metadata Tooling: Streamline UniProt subset extraction and AF metadata generation (see uniprot/README.md)
Automated Workflows: Nextflow-based end-to-end processing pipelines
Production Pipeline: Standalone Python pipeline with logging, caching, resume capability, structure analysis (clash detection, interface residues), ipSAE quality scoring, and mmCIF QA metric embedding
Docker Support: Containerized execution for reproducible results
Validation Tools: Built-in testing and validation utilities

Prerequisites

Python 3.12+
Node.js 18+ (for Mol* CLI)
Docker (optional, for containerized execution)
Nextflow (optional, for workflow automation)

Installation

1. Clone the Repository

git clone https://github.com/PDBeurope/AFDB-Integration-Kit
cd AFDB-Integration-Kit

2. Install UV (Python Package Manager)

UV is used to manage Python dependencies and virtual environments.

macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows:

powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Alternative installation methods:

# Using pip
pip install uv

# Using conda
conda install -c conda-forge uv

3. Install Core Python Dependencies

Install the default dependency set from the locked project environment:

uv sync --locked --no-dev

The core install is intended for normal CLI usage, help output, metadata and schema validation, UniProt metadata tooling, ColabFold conversion, ModelCIF/PDB generation, CIF to BCIF conversion through the Mol* CLI fallback, and non-production helper scripts. It intentionally does not install the heavier production structure-analysis packages.

Contributors who need development tools and tests can install the full locked environment instead:

uv sync --locked

4. Install Mol* CLI

If you use nvm (Node Version Manager):

nvm use  # Uses the version specified in .nvmrc
npm install -g molstar

Without nvm:

npm install -g molstar

5. Install DSSP

The default run-dssp and batch-dssp commands use the external mkdssp binary, so install DSSP when using the default secondary-structure path. DSSP is also needed for Nextflow workflows.

The standalone production pipeline defaults to the built-in pydssp algorithm and does not require an external DSSP binary unless you select --dssp-algorithm mkdssp.

We use the modern DSSP implementation by the PDB-REDO team:

# Clone and build DSSP
git clone https://github.com/PDB-REDO/dssp.git
cd dssp
mkdir build
cd build
cmake ..
make
sudo make install

For detailed installation instructions, visit: https://github.com/PDB-REDO/dssp

6. Download mmCIF Dictionary (Optional)

The ModelCIF tool can optionally validate its output mmCIF files against the current ModelCIF dictionary. Validation is not required, but it is recommended when first setting up the tool.

Download the ModelCIF dictionary to your project directory:

# Download the mmCIF dictionary
curl -o mmcif_ma.dic https://raw.githubusercontent.com/ihmwg/ModelCIF/refs/heads/master/dist/mmcif_ma.dic

Note: This step is automatically handled in the Docker environment, but is required for local installations.

7. Install Production Pipeline Dependencies (Optional)

The production pipeline (scripts/production_pipeline.py) requires additional dependencies for structure analysis, DSSP algorithms, clash detection, and interface residues.

Install the project production extra into the uv environment:

uv pip install '.[production]'

This installs the production Python packages declared by the project, including biotite, pydssp, torch, and fastpdb.

Install torch_cluster separately after PyTorch is installed. Its wheel must match the installed PyTorch version and CUDA runtime. Pick the CUDA suffix from the PyTorch Geometric wheel index for your environment (cpu, cu118, cu121, cu124, cu126, cu128, etc.):

# Check the installed PyTorch build first
python -c "import torch; print(torch.__version__, torch.version.cuda)"

# Example: CPU wheel for PyTorch 2.8.0
uv pip install torch_cluster -f https://data.pyg.org/whl/torch-2.8.0+cpu.html

# Example: CUDA 12.8 wheel for PyTorch 2.8.0
uv pip install torch_cluster -f https://data.pyg.org/whl/torch-2.8.0+cu128.html

If uv pip install '.[production]' resolves a different PyTorch version, change the torch-<version>+<cuda> part of the torch_cluster URL to match that installed build. For available torch_cluster wheels, see https://data.pyg.org/whl/.
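The URL pattern above can also be assembled from the two values printed by the PyTorch check. The following is a minimal sketch, assuming the wheel index keeps the torch-<version>+<cuda>.html layout shown in the examples; the function name torch_cluster_index_url is hypothetical and not part of the toolkit.

```python
from typing import Optional


def torch_cluster_index_url(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build a PyG wheel index URL for one PyTorch build.

    torch_version: value of torch.__version__, e.g. "2.8.0" or "2.8.0+cu128"
    cuda_version:  value of torch.version.cuda, e.g. "12.8", or None for CPU builds
    """
    # Drop any local build suffix such as "+cu128" so it is not added twice.
    base = torch_version.split("+", 1)[0]
    # "12.8" becomes "cu128"; a missing CUDA runtime maps to the "cpu" wheels.
    suffix = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{base}+{suffix}.html"


# The two examples from the text above.
print(torch_cluster_index_url("2.8.0", None))
print(torch_cluster_index_url("2.8.0+cu128", "12.8"))
```

Pass the result to uv pip install torch_cluster -f <url>, and confirm that the page exists on https://data.pyg.org/whl/ before relying on it.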

Verify installation:

python -c "import torch; from torch_cluster import radius_graph; print('torch_cluster OK')"

8. Install Nextflow (Optional)

For workflow automation:

# Using curl
curl -s https://get.nextflow.io | bash

# Make executable and add to PATH
chmod +x nextflow
sudo mv nextflow /usr/local/bin/

9. Install Docker (Optional)

For containerized execution:

macOS/Windows: Download Docker Desktop from https://www.docker.com/products/docker-desktop
Linux: Follow instructions at https://docs.docker.com/engine/install/

Quick Start

Verify Installation

Verify the core Python install:

uv run main.py --help
uv run main.py list-validations

To check the optional external toolchain as well, install Mol*, DSSP, and any other workflow tools you need, then run:

uv run main.py test

This command reports missing external executables such as cif2bcif or mkdssp.
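For a quick check outside the toolkit, the same idea can be sketched with the Python standard library. This is not how main.py test is implemented; it only looks up the two executable names mentioned above on PATH, and the helper name find_missing_tools is hypothetical.

```python
import shutil

# Executable names taken from the text above; extend the tuple for other tools.
REQUIRED_TOOLS = ("cif2bcif", "mkdssp")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


missing = find_missing_tools()
print("missing:", ", ".join(missing) if missing else "none")
```

The which argument is injectable so the lookup can be replaced in tests.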

Basic Usage Example

# Generate ModelCIF
uv run main.py run-modelcif-gen \
    -p input/AF-0000000000000001-model-v1.pdb \
    -m input/AF-0000000000000001-v1.cif.json \
    -o output/AF-0000000000000001-model-v1.cif

# Convert to BCIF
uv run main.py run-cif2bcif \
    -i input/AF-0000000000000001-model-v1.cif \
    -o output/AF-0000000000000001-model-v1.bcif

# Add secondary structure annotation
uv run main.py run-dssp \
    -i input/AF-0000000000000001-model-v1.cif \
    -o output/AF-0000000000000001-model-v1.cif

Validate Example Outputs

The committed end-to-end examples under examples/ can be validated directly from the repo root. Use model-summary for committed e2e model_jsons/*.json; the canonical model schema remains reserved for full model metadata entries and AF-metadata-*-of-*.json batches.

# Summary and provider metadata JSONs
.venv/bin/python main.py run-schema-validation \
  -i examples/colabfold_monomer_e2e/model_jsons/AF-0000000300000001.json \
  -t model-summary
.venv/bin/python main.py run-schema-validation \
  -i examples/colabfold_monomer_e2e/config/provider.json \
  -t provider

# Score JSONs and confidence/PAE relationship
.venv/bin/python main.py validate-plddt-file \
  --file examples/colabfold_monomer_e2e/scores/AF-0000000300000001-confidence_v1.json
.venv/bin/python main.py validate-pae-file \
  --file examples/colabfold_monomer_e2e/scores/AF-0000000300000001-predicted_aligned_error_v1.json
.venv/bin/python main.py validate-relationships-pair \
  --plddt-file examples/colabfold_monomer_e2e/scores/AF-0000000300000001-confidence_v1.json \
  --pae-file examples/colabfold_monomer_e2e/scores/AF-0000000300000001-predicted_aligned_error_v1.json

# ModelCIF dictionary validation
gemmi validate -p -d mmcif_ma.dic \
  examples/colabfold_monomer_e2e/modelcif/AF-0000000300000001-model_v1.cif

For manual coordinate-file sanity checks, open representative PDB, ModelCIF, and BCIF files in the Mol* web viewer at https://molstar.org/viewer/. Drag and drop the files into the browser window, or use Open Files in the left panel. The structure should open correctly, no error messages should be shown in the viewer, and the structure should look structurally correct by eye. The same representative files can also be opened in ChimeraX or PyMOL; expect a clean import with no parser errors.

Usage

ColabFold conversion

Convert ColabFold score JSON + PDB to AFDB ingest JSONs (pLDDT/PAE) and optional UniProt-style manifests.

Requirements: orjson, duckdb, a chain manifest (model_entity_id,entity_id,chain_id,uniprot_ac at minimum), and a DuckDB built from the UniProt subset.

Example (per model, safer for many parallel jobs):

afdb-colabfold-convert \
  /path/to/<AC>_scores_rank_001_alphafold2_multimer_v3_model_1_seed_000.json \
  /path/to/<AC>_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb \
  --manifest /mnt/disks/data/sample/config/uniprot_afid_mapping.csv \
  --duckdb /mnt/disks/data/sample/db/uniprot_2025_04.duckdb \
  --model-entity-id AF-0000000000001201 \
  --outdir /mnt/disks/data/sample/colabfold_output/<AC>-model_v4 \
  --chain-manifest-dir /mnt/disks/data/sample/per_accession/manifests/chains \
  --model-manifest-dir /mnt/disks/data/sample/per_accession/manifests/models

Outputs:

AFDB JSONs: <model_entity_id>-confidence_v1.json and <model_entity_id>-predicted_aligned_error_v1.json in --outdir.
Per-model manifests:
- Chains: <model_entity_id>_afid_mapping.csv with pLDDT averages/fractions and local 1..N residue ranges.
- Models: <model_entity_id>_model_metadata.csv with average pLDDT and ipTM (if present in scores JSON).
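To illustrate what "pLDDT averages/fractions" can mean, here is a minimal sketch that computes a mean pLDDT and the fraction of residues in four confidence bands. The band boundaries (90, 70, 50) follow the commonly used AlphaFold confidence categories; the key names and the handling of values exactly on a boundary are assumptions, not the manifest's actual column definitions.

```python
from statistics import fmean


def summarize_plddt(scores):
    """Return the mean pLDDT and the fraction of residues per confidence band."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = {"very_high": 0, "confident": 0, "low": 0, "very_low": 0}
    for score in scores:
        if score >= 90:
            counts["very_high"] += 1
        elif score >= 70:
            counts["confident"] += 1
        elif score >= 50:
            counts["low"] += 1
        else:
            counts["very_low"] += 1
    total = len(scores)
    summary = {"mean": fmean(scores)}
    # Convert raw counts to fractions of the chain length.
    summary.update({band: count / total for band, count in counts.items()})
    return summary


print(summarize_plddt([95.0, 80.0, 60.0, 40.0]))
```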

Merge per-model manifests when needed (keep the header, append rows):

# Chain manifest (uniprot_afid_mapping.csv)
# head -q suppresses the "==> file <==" banners printed for multiple inputs;
# the second head keeps a single header line.
head -q -n1 /mnt/disks/data/sample/per_accession/manifests/chains/*_afid_mapping.csv | head -n1 \
  > /mnt/disks/data/sample/config/uniprot_afid_mapping.csv
tail -n +2 -q /mnt/disks/data/sample/per_accession/manifests/chains/*_afid_mapping.csv \
  >> /mnt/disks/data/sample/config/uniprot_afid_mapping.csv

# Model manifest (uniprot_model_metadata.csv)
head -q -n1 /mnt/disks/data/sample/per_accession/manifests/models/*_model_metadata.csv | head -n1 \
  > /mnt/disks/data/sample/config/uniprot_model_metadata.csv
tail -n +2 -q /mnt/disks/data/sample/per_accession/manifests/models/*_model_metadata.csv \
  >> /mnt/disks/data/sample/config/uniprot_model_metadata.csv

Build DuckDB (once per release) from the chain manifest and UniProt flat files:

afdb-uniprot-extract --mapping <chain_manifest.csv> -o <parquet_dir> -r 2025_04 \
  uniprot/data/uniprot_sprot.dat.gz uniprot/data/uniprot_trembl.dat.gz

afdb-uniprot-build-db --parquet-dir <parquet_dir> --db <db_path> --force

ModelCIF Generator

Converts PDB files to mmCIF format with integrated metadata.

Requirements:

Input PDB file
Metadata JSON file conforming to the schema: afdb_integration_kit/modelcif/resources/schema.json
Optional: ModelCIF dictionary (mmcif_ma.dic) if you intend to run --validate

Optional validation dictionary: Only needed when you pass --validate (or --validate "", which defaults to mmcif_ma.dic). Download it once and keep it in the project directory:

curl -o mmcif_ma.dic https://raw.githubusercontent.com/ihmwg/ModelCIF/refs/heads/master/dist/mmcif_ma.dic

Command:

uv run main.py run-modelcif-gen -p <pdb_file> -m <metadata_json> -o <output_cif>

Parameters:

-p, --pdb: Input PDB file path
-m, --metadata: Metadata JSON file path
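-o, --output: Output mmCIF file path
--validate: Optional; validate the output against a ModelCIF dictionary (see the note above for the default dictionary)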

ModelPDB Generator

Adds AFDB-specific header information from the generated mmCIF back into the PDB file (so downstream consumers get consistent metadata in both formats).

Requirements:

Input mmCIF file (from run-modelcif-gen)
Input PDB file containing ATOM coordinates
Provider metadata JSON file (provider.json) describing who generated the entry

Command:

uv run main.py run-modelpdb-gen \
    -c <input_cif> \
    -p <input_pdb> \
    -r <provider_json> \
    -o <output_pdb>

Parameters:

-c, --cif: Input mmCIF file path
-p, --pdb: Input PDB file path
-r, --provider: Provider metadata JSON path
-o, --output: Output PDB file path with enriched headers

CIF to BCIF Converter

Converts mmCIF files to Binary CIF format for efficient storage and transmission. The default backend preserves the original toolkit behavior by using the external Mol* cif2bcif command. Biotite remains optional and can be selected explicitly or used as an auto fallback.

Command:

uv run main.py run-cif2bcif -i <input_cif> -o <output_bcif>

Parameters:

-i, --input: Input mmCIF file path
-o, --output: Output BCIF file path
-b, --backend: molstar (default), biotite, or auto

DSSP Secondary Structure Assignment

Assigns secondary structure annotations based on atomic coordinates. The default uses the external DSSP binary, preserving the historical CLI behavior. Python algorithms are available as opt-in 3-state alternatives:

mkdssp (default) — external DSSP binary
pydssp — hydrogen-bond based assignment
psea — geometry-based assignment using CA coordinates
tmalign — CA-CA distance-based assignment

Command:

uv run main.py run-dssp -i <input_cif> -o <output_cif>

Parameters:

-i, --input: Input mmCIF file path
-o, --output: Output annotated mmCIF file path
-a, --algorithm: mkdssp (default), pydssp, psea, or tmalign
-d, --device: cpu (default) or cuda for PyDSSP
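The external DSSP binary reports more detailed classes than the 3-state Python alternatives. To compare the two, a detailed DSSP string can be reduced to three states. The sketch below uses one common reduction (H, G, I to helix; E, B to strand; everything else to coil); this mapping is an assumption for illustration and is not necessarily the one the toolkit applies.

```python
# One common reduction of DSSP codes to three states (assumed, see lead-in).
HELIX_CODES = set("HGI")
STRAND_CODES = set("EB")


def reduce_to_three_state(dssp: str) -> str:
    """Map a per-residue DSSP string to H (helix), E (strand), or C (coil)."""
    return "".join(
        "H" if code in HELIX_CODES else "E" if code in STRAND_CODES else "C"
        for code in dssp
    )


print(reduce_to_three_state("HHGGIIEEBTS- "))
```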

Validation Toolkit

Use these commands to sanity-check individual artifacts or entire datasets before handing results to collaborators.

Schema Validation

Validate metadata JSON files against the required JSON schemas to ensure data consistency and compliance.

Schemas:

Model: afdb_integration_kit/metadata/resources/model_schema.json for full model metadata entries and batches
Model summary: afdb_integration_kit/metadata/resources/model_summary_schema.json for e2e model_jsons/*.json and search summary documents
Collection doc: afdb_integration_kit/metadata/resources/collection_doc_schema.json for e2e chain_jsons/*.json and collection documents
Provider: afdb_integration_kit/metadata/resources/provider_schema.json

Command:

uv run main.py run-schema-validation -i <metadata_json_file> -t <type>

Parameters:

-i, --input: Path to the metadata JSON file to validate
-t, --type: Type of metadata to validate (model, model-summary, collection-doc, or provider)

Examples:

uv run main.py run-schema-validation -i model.json -t model
uv run main.py run-schema-validation -i model_summary.json -t model-summary
uv run main.py run-schema-validation -i collection_doc.json -t collection-doc
uv run main.py run-schema-validation -i provider.json -t provider

Dataset-Level Validators

Run multiple checks across an input directory (the same layout expected by the workflow):

# Run all enabled validators using defaults.yaml
uv run main.py run-validations --root input/

# Run a subset with custom config and JSON output
uv run main.py run-validations \
    --root input/ \
    --checks naming plddt pae \
    --config my-validations.yaml \
    --out reports/validation.json

run-validations respects validation/defaults.yaml but you can override settings via --config.
Use --summary, --errors-only, and --fail-on warn to tailor CLI output/exit codes.
run-naming-check provides a lightweight naming/required-file audit with simplified flags:

uv run main.py run-naming-check --root input/ --errors-only

plddt-check focuses on pLDDT JSONs (value ranges, counts, optional structure cross-checks):

uv run main.py plddt-check --root input/ --verbose

Single-File Validators

Ideal for workflow steps (e.g., Nextflow processes) that emit one artifact at a time:

# Metadata (batch or per-accession JSON)
uv run main.py validate-metadata-file --file path/to/metadata.json

# pLDDT confidence JSON
uv run main.py validate-plddt-file --file path/to/AF-...-confidence_v1.json

# PAE JSON
uv run main.py validate-pae-file --file path/to/AF-...-predicted_aligned_error_v1.json

# Check a matching pLDDT/PAE pair
uv run main.py validate-relationships-pair \
    --plddt-file path/to/AF-...-confidence_v1.json \
    --pae-file path/to/AF-...-predicted_aligned_error_v1.json

# FASTA sequences file
uv run main.py validate-sequences-file --file path/to/sequences.fasta

Each command exits with code 1 if it encounters validation errors, making them easy to embed in automated pipelines.
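A pipeline step can rely on that exit code directly. The following is a minimal sketch of a wrapper that runs one validator command and reports success as a boolean; the helper name run_validator is hypothetical, and the command in the usage note only mirrors the examples above.

```python
import subprocess
import sys


def run_validator(cmd):
    """Run one validator command; return True when it exits with code 0."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the validator's own output to help with debugging.
        sys.stderr.write(result.stdout + result.stderr)
    return result.returncode == 0
```

For example, run_validator(["uv", "run", "main.py", "validate-plddt-file", "--file", path]) returns False when the file fails validation, so a caller can stop before later stages.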

Production Pipeline

The production pipeline (scripts/production_pipeline.py) provides a standalone alternative to the Nextflow workflow with comprehensive logging, caching, and resume capability. It processes models through 16 stages (executed in this order):

1. Prepare assets – symlink PDB + meta JSON to staging
2. Validate assets – check PDB/JSON consistency
3. Convert ColabFold – produce AFDB-format confidence & PAE JSONs
4. Merge manifests – merge per-model chain/model manifests
5. Calculate ipSAE scores – interface quality metrics (ipSAE, pDockQ, LIS)
6. Analyze clashes/interfaces – VDW clashes, interface residues
7. Export model metadata – generate per-model metadata JSONs (enriched with ipSAE/clash metrics)
8. Export chain metadata – generate per-chain metadata JSONs (enriched with ipSAE metrics)
9. Combine model metadata – batch into chunked JSONs
10. Combine chain metadata – batch into chunked JSONs
11. Export ModelCIF input – prepare ModelCIF metadata from template
12. Generate ModelCIF – PDB → mmCIF with full metadata and optional QA metrics
13. DSSP – secondary structure annotation (3-state: helix/strand/coil)
14. Enrich PDB – add AFDB headers to PDB files
15. CIF → BCIF – BinaryCIF conversion
16. Cleanup – optional intermediate file cleanup (skipped by default)

Note: ipSAE and clash analysis (stages 5-6) run before metadata export (stages 7-8) so that quality metrics are available for JSON enrichment and CIF embedding.

Prerequisites: Install the production dependencies first (see Installation step 7). For clash/interface analysis, also install a torch_cluster wheel that matches your PyTorch and CUDA build.

uv pip install '.[production]'
uv pip install torch_cluster -f https://data.pyg.org/whl/torch-<torch-version>+<cuda>.html

Homodimer mode (default)

All config files are provided up front — no API calls, no manifest resolution:

python scripts/production_pipeline.py \
    --output-dir /path/to/output \
    --input-dir /path/to/input \
    --mapping-file /path/to/mapping.tsv \
    --chain-mapping /path/to/manifest.csv \
    --dataset-config /path/to/config.json \
    --provider-json /path/to/provider.json \
    --uniprot-db /path/to/uniprot.duckdb \
    --workers 30 \
    --cif-qa-metrics auto

Heterodimer mode

Enable with --heterodimers. Requires --chain-mapping and --uniprot-db. Config files (mapping TSV, dataset config, provider JSON) are auto-generated if not provided. Model IDs are derived from the chain mapping CSV.

python scripts/production_pipeline.py \
    --output-dir /path/to/output \
    --input-dir /path/to/raw_colabfold \
    --heterodimers \
    --chain-mapping /path/to/manifest.csv \
    --uniprot-db /path/to/uniprot.duckdb \
    --workers 4 \
    --cif-qa-metrics auto

The --input-dir may contain raw ColabFold outputs (long suffixes like _unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb are detected automatically).
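To show what detecting such long suffixes can look like, here is a minimal sketch that splits a raw ColabFold file name into a prefix and its rank, model, and seed fields. The regular expression is inferred from the file names quoted in this README (the relaxed variant is an additional assumption); it is not the pipeline's actual detection logic, and parse_colabfold_name is a hypothetical helper.

```python
import re

# Pattern inferred from the example names in this README (assumption).
COLABFOLD_NAME = re.compile(
    r"^(?P<prefix>.+?)_(?P<kind>unrelaxed|relaxed|scores)"
    r"_rank_(?P<rank>\d+)_(?P<preset>.+)"
    r"_model_(?P<model>\d+)_seed_(?P<seed>\d+)"
    r"\.(?P<ext>pdb|json)$"
)


def parse_colabfold_name(filename):
    """Return the parsed fields of a raw ColabFold file name, or None."""
    match = COLABFOLD_NAME.match(filename)
    if match is None:
        return None
    info = match.groupdict()
    # Numeric fields are zero-padded in the file name; store them as integers.
    for key in ("rank", "model", "seed"):
        info[key] = int(info[key])
    return info


print(parse_colabfold_name(
    "P12345_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb"
))
```

A PDB file and its scores JSON share the same prefix, rank, preset, model, and seed, which is one way to pair them.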

Key options

| Flag | Description |
| --- | --- |
| `--resume` | Resume from previous run (skip completed stages) |
| `--skip-stages stage_12,stage_13` | Skip specific stages (comma-separated) |
| `--dry-run` | Show what would be executed without running |
| `--dssp-algorithm` | Production pipeline secondary structure algorithm: `mkdssp`, `psea`, `pydssp` (production default), or `tmalign` |
| `--workers N` | Parallel workers (default: all CPUs) |
| `--pae-cutoff` / `--dist-cutoff` | ipSAE thresholds (default: 10.0 / 15.0) |
| `--clash-cutoff` / `--interface-cutoff` | Clash/interface thresholds (default: 0.4 / 8.0 Å) |
| `--analysis-batch-size N` | Batch size for clash/interface GPU analysis (default: 4) |
| `--cif-qa-metrics` | QA metrics to embed in mmCIF: `auto` (default, all metrics) or comma-separated list (e.g. `ipsae_AB,iptm_af,N_clash_backbone`) |
| `--enrichment-metrics` | ipSAE/clash metric names to include in model/chain metadata JSONs (default: all known metrics) |
| `--interface-clash-analysis` | Which analyses to run: `interface`, `backbone_clashes`, `heavy_atom_clashes` (default: all three) |
| `--modelcif-template` | Path to ModelCIF metadata template JSON (default: `uniprot/templates/colabfold_example_modelcif_metadata.json`) |
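The --interface-cutoff option is a distance threshold in Å. As a rough illustration of what a distance-based interface definition does, the sketch below marks residues of two chains as interface residues when one representative atom per residue lies within the cutoff of a residue in the other chain. This is a brute-force pure-Python sketch under assumptions: the pipeline's actual atom selection, comparison rule, and implementation are not described here, and interface_residues is a hypothetical helper.

```python
from math import dist


def interface_residues(chain_a, chain_b, cutoff=8.0):
    """Return residue numbers of each chain that lie within `cutoff` of the other.

    chain_a, chain_b: dict mapping residue number -> (x, y, z) coordinates
    of one representative atom per residue (assumed input layout).
    """
    near_a, near_b = set(), set()
    for res_a, xyz_a in chain_a.items():
        for res_b, xyz_b in chain_b.items():
            if dist(xyz_a, xyz_b) <= cutoff:
                near_a.add(res_a)
                near_b.add(res_b)
    return sorted(near_a), sorted(near_b)


chain_a = {1: (0.0, 0.0, 0.0), 2: (20.0, 0.0, 0.0)}
chain_b = {1: (5.0, 0.0, 0.0), 2: (40.0, 0.0, 0.0)}
print(interface_residues(chain_a, chain_b))
```

The nested loop scales with the product of the chain lengths, which is fine for illustration but slow for large complexes; neighbor-search approaches avoid the full pairwise scan.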

Output: Results are written to the output directory with logs in logs/, cache in .pipeline_cache.json, and a results summary in pipeline_results.json.

Run python scripts/production_pipeline.py --help for full documentation.

Prepare Inputs (Standalone)

scripts/prepare_inputs.py can also be used independently (outside the production pipeline) to prepare ColabFold outputs into the canonical layout the pipeline expects. It scans for matched PDB + scores-JSON pairs, builds config files, and symlinks inputs.

Production mode (pre-built assets, no network):

python scripts/prepare_inputs.py \
    --input-dir /data/colabfold/gpu0 \
    --output-dir /data/workdir \
    --chain-mapping /data/prebuilt_manifest.csv \
    --uniprot-db /data/uniprot.duckdb \
    --provider-id afcdb-heterodimers \
    --provider-name "AFCDB Heterodimers"

Dev mode (resolves AF-IDs from the AFCDB manifest + fetches from UniProt API):

python scripts/prepare_inputs.py \
    --input-dir ./gpu0 \
    --output-dir ./workdir \
    --build-from-api /data/afdb_toolkit_manifest_file.csv \
    --provider-id afcdb-heterodimers \
    --provider-name "AFCDB Heterodimers"

By default, scores files are symlinked as meta JSONs (zero I/O). Pass --extract-meta to parse and re-write leaner JSONs, or --copy to copy instead of symlink.

Docker Usage

The Dockerfile installs the core Python dependency set from requirements.txt, plus Mol*, DSSP, Nextflow, and the ModelCIF dictionary. It is intended for the core CLI, validation, ModelCIF/PDB, CIF/BCIF, and Nextflow workflows. It does not install the production extra or torch_cluster; build a derived image with a PyTorch/CUDA-compatible torch_cluster wheel if you need the standalone production pipeline inside Docker.

Use Prebuilt Docker Image (Recommended)

You can skip building the image locally by using the prebuilt image available on Docker Hub:

docker pull pdbegroup/afdb-integration-toolkit

Use it in the same way as the locally built image. For example:

docker run \
    -v "$PWD/input:/input" \
    -v "$PWD/output:/output" \
    -w /workspace \
    -v "$PWD:/workspace" \
    pdbegroup/afdb-integration-toolkit uv run main.py run-modelcif-gen \
        -p /input/AF-0000000000000001-model-v1.pdb \
        -m /input/AF-0000000000000001-v1.cif.json \
        -o /output/AF-0000000000000001-model-v1.cif

Build Docker Image (Optional)

If you prefer to build the image yourself:

docker build -t afdb-toolkit .

Run Tools in Docker

ModelCIF Generator:

docker run \
    -v "$PWD/input:/input" \
    -v "$PWD/output:/output" \
    -w /workspace \
    -v "$PWD:/workspace" \
    afdb-toolkit uv run main.py run-modelcif-gen \
        -p /input/AF-0000000000000001-model-v1.pdb \
        -m /input/AF-0000000000000001-v1.cif.json \
        -o /output/AF-0000000000000001-model-v1.cif

CIF to BCIF Converter:

docker run \
    -v "$PWD/input:/input" \
    -v "$PWD/output:/output" \
    -w /workspace \
    -v "$PWD:/workspace" \
    afdb-toolkit uv run main.py run-cif2bcif \
        -i /input/AF-0000000000000001-model-v1.cif \
        -o /output/AF-0000000000000001-model-v1.bcif

DSSP Processing:

docker run \
    -v "$PWD/input:/input" \
    -v "$PWD/output:/output" \
    -w /workspace \
    -v "$PWD:/workspace" \
    afdb-toolkit uv run main.py run-dssp \
        -i /input/AF-0000000000000001-model-v1.cif \
        -o /output/AF-0000000000000001-model-v1.cif

Schema Validation

Run schema validation in Docker:

docker run \
    -v "$PWD/input:/input" \
    -v "$PWD/output:/output" \
    -w /workspace \
    -v "$PWD:/workspace" \
    afdb-toolkit uv run main.py run-schema-validation -i model.json -t model

Replace model.json with the actual path to your metadata file. For provider metadata, keep the same docker run options and change the final line of the command to:

afdb-toolkit uv run main.py run-schema-validation -i provider.json -t provider

Nextflow Workflow

The Nextflow scripts live in the workflow directory. The main workflow script, workflow.nf, orchestrates the end-to-end processing of the model files (except metadata JSON validation). A second script, validate.nf, handles schema validation of model and provider metadata files.

End-to-End Processing

Run the complete workflow using the provided script:

docker run \
    -v "$PWD/nf_workspace/.nextflow:/workspace/.nextflow" \
    -v "$PWD/output:/output" \
    -v "$PWD/input:/input" \
    -w /workspace \
    -v "$PWD/nf_workspace:/workspace" \
    afdb-toolkit nextflow run /app/workflow/workflow.nf -resume

This will process all the model files in the input directory and place the output files in the output directory.

Workflow Structure

---
config:
  layout: elk
---
flowchart TD
    A[".pdb file"] --> C["ModelCIF Generator"] & J["ModelPDB Generator"]
    B["CIF metadata JSON"] --> C
    C --> D[".cif file (mmCIF)"]
    D --> E["DSSP"]
    E --> F[".cif file (mmCIF, with DSSP annotations)"]
    F --> J & G["CIF to BCIF Generator"]
    I["Provider JSON"] --> J
    J --> K[".pdb file (with AFDB headers)"]
    G --> H[".bcif file (Binary CIF)"]
    style A fill:#fff3e0
    style C fill:#f3e5f5
    style J fill:#f3e5f5
    style B fill:#e1f5fe
    style D fill:#fff3e0
    style E fill:#f3e5f5
    style F fill:#e8f5e8
    style G fill:#f3e5f5
    style K fill:#e8f5e8
    style H fill:#e8f5e8

Schema Validation

Run the schema validation workflow using the provided script. This workflow performs two tasks:

Validate Metadata: Ensures that the model metadata JSON files conform to the required schema.
Batch Processing: If validation is successful, the workflow concatenates the JSON files into a list of JSONs for further processing based on a configurable chunk size, which defaults to 100.

To adjust the chunk size, update the params.metadata_chunk_size parameter in the workflow/validate.nf script or pass it as a command-line argument when executing the workflow. For example:

--metadata_chunk_size 100

docker run \
    -v "$PWD/nf_workspace/.nextflow:/workspace/.nextflow" \
    -v "$PWD/input:/input" \
    -v "$PWD/output:/output" \
    -w /workspace \
    -v "$PWD/nf_workspace:/workspace" \
    afdb-toolkit nextflow run /app/workflow/validate.nf -resume

The output will be stored in the output/metadata directory, containing the batched validated model metadata JSON files.
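The batching step can be pictured as splitting a list of validated metadata documents into fixed-size chunks. The following is a minimal sketch of that idea only; it does not reproduce the workflow's file naming or its actual implementation, and the helper name chunk is hypothetical.

```python
def chunk(items, size=100):
    """Split `items` into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[start:start + size] for start in range(0, len(items), size)]


# 250 documents with the default chunk size give batches of 100, 100, and 50.
batches = chunk(list(range(250)))
print([len(batch) for batch in batches])
```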

Input Requirements

The Nextflow workflow requires an input list file at input/input.txt containing the entries to process. Each entry should be on a new line:

AF-0001234567890123
AF-0001234567890124
AF-0001234567890125
AF-0001234567890126

Example input.txt:

# Create the input list file
cat > input/input.txt << EOF
AF-0001234567890123
AF-0001234567890124
AF-0001234567890125
EOF
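Before launching the workflow, the list file can be checked for malformed entries. The sketch below assumes every entry is AF- followed by a 16-digit numeric ID, as in the examples above, and that blank lines can be ignored; both are assumptions rather than documented rules, and read_entry_ids is a hypothetical helper.

```python
import re

# Entry pattern inferred from the examples above (assumption).
ENTRY_ID = re.compile(r"AF-\d{16}")


def read_entry_ids(text):
    """Split the contents of input.txt into (valid_ids, invalid_lines)."""
    valid, invalid = [], []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        if ENTRY_ID.fullmatch(entry):
            valid.append(entry)
        else:
            invalid.append(entry)
    return valid, invalid


valid, invalid = read_entry_ids("AF-0001234567890123\nAF-123\n")
print(valid, invalid)
```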

Workflow Features

Resumable: Uses -resume flag to continue from previous checkpoints
Cached: Maintains state in .nextflow directory
Dependency Management: Automatically handles tool dependencies
Parallel Processing: Processes multiple files concurrently

Important Notes

Mount the .nextflow directory to preserve workflow state
Ensure proper input/output directory mounting
The workflow runs in resume mode by default

File Structure Requirements

Input Directory Structure

The toolkit expects files to be organized in a specific hierarchical structure:

input/
├── 0001/
│   ├── 2345/
│   │   ├── 6789/
│   │   │   ├── 0123/
│   │   │   │   ├── AF-0001234567890123-model-v1.pdb
│   │   │   │   └── AF-0001234567890123-v1.cif.json

Directory Structure Rules

Extract 16-digit numeric ID: From AF-0001234567890123-model-v1.pdb → 0001234567890123
Split into 4-digit segments: 0001, 2345, 6789, 0123
Create nested directories: 0001/2345/6789/0123/
Place files in final directory: Both PDB and JSON files
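The rules above can be sketched as a small helper that derives the nested directory from a file name. The function name nested_dir_for is hypothetical; the logic follows the four steps as stated.

```python
import re
from pathlib import PurePosixPath


def nested_dir_for(filename):
    """Return the 4-level directory for a file such as AF-<16 digits>-model-v1.pdb."""
    match = re.match(r"AF-(\d{16})", filename)
    if match is None:
        raise ValueError(f"no 16-digit AF identifier in {filename!r}")
    digits = match.group(1)
    # Split the 16 digits into four 4-digit segments.
    segments = [digits[start:start + 4] for start in range(0, 16, 4)]
    return PurePosixPath(*segments)


print(nested_dir_for("AF-0001234567890123-model-v1.pdb"))
```

Joining the result onto input/ gives the directory that should hold both the PDB and the JSON file for that entry.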

Output Structure

The workflow automatically creates corresponding output directories following the same structure:

output/
├── 0001/
│   ├── 2345/
│   │   ├── 6789/
│   │   │   ├── 0123/
│   │   │   │   ├── AF-0001234567890123-model-v1.cif
│   │   │   │   └── AF-0001234567890123-model-v1.bcif

Troubleshooting

Common Issues

Missing Dependencies: Run uv run main.py test to identify missing components
Permission Errors: Ensure Docker has proper access to mounted directories
File Not Found: Verify input files follow the required directory structure
Memory Issues: For large datasets, consider adjusting Docker memory limits
ModelCIF Validation Errors: Ensure mmcif_ma.dic is present in the project directory (automatically handled in Docker)
Nextflow Workflow Errors: Ensure input/input.txt exists and contains valid entry IDs

Getting Help

Check the Issues page
Validate your metadata JSON against the provided schema

License

This project is licensed under CC0 1.0 Universal; see the LICENSE file for details.

Support

For support and questions:

Issues: GitHub Issues
Email: afdbhelp@ebi.ac.uk
