A comprehensive toolkit for integrating structural models into the AlphaFold Database (AFDB). This toolkit provides essential tools and workflows to prepare, validate, and format molecular structure data for seamless integration with AFDB infrastructure.
- AFDB Integration Toolkit
- ModelCIF Generation: Convert PDB files to mmCIF format with metadata integration
- Binary CIF Conversion: Efficient conversion from mmCIF to Binary CIF (BCIF) format
- Secondary Structure Assignment: DSSP-based secondary structure annotation
- Metadata Schema Validation: Validate model and provider metadata JSONs against AFDB-defined schemas
- UniProt Metadata Tooling: Streamline UniProt subset extraction and AF metadata generation (see uniprot/README.md)
- Automated Workflows: Nextflow-based end-to-end processing pipelines
- Production Pipeline: Standalone Python pipeline with logging, caching, resume capability, structure analysis (clash detection, interface residues), ipSAE quality scoring, and mmCIF QA metric embedding
- Docker Support: Containerized execution for reproducible results
- Validation Tools: Built-in testing and validation utilities
- Python 3.12+
- Node.js 18+ (for Mol* CLI)
- Docker (optional, for containerized execution)
- Nextflow (optional, for workflow automation)
git clone https://github.com/PDBeurope/AFDB-Integration-Kit
cd AFDB-Integration-Kit
UV is used to manage Python dependencies and virtual environments.
macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
Windows:
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Alternative installation methods:
# Using pip
pip install uv
# Using conda
conda install -c conda-forge uv
Install the default dependency set from the locked project environment:
uv sync --locked --no-dev
The core install is intended for normal CLI usage, help output, metadata and schema validation, UniProt metadata tooling, ColabFold conversion, ModelCIF/PDB generation, CIF to BCIF conversion through the Mol* CLI fallback, and non-production helper scripts. It intentionally does not install the heavier production structure-analysis packages.
Contributors who need development tools and tests can install the full locked environment instead:
uv sync --locked
If you use nvm (Node Version Manager):
nvm use # Uses the version specified in .nvmrc
npm install -g molstar
Without nvm:
npm install -g molstar
The default run-dssp and batch-dssp commands use the external mkdssp
binary, so install DSSP when using the default secondary-structure path. DSSP is
also needed for Nextflow workflows.
The standalone production pipeline defaults to the built-in pydssp algorithm
and does not require an external DSSP binary unless you select
--dssp-algorithm mkdssp.
We use the modern DSSP implementation by the PDB-REDO team:
# Clone and build DSSP
git clone https://github.com/PDB-REDO/dssp.git
cd dssp
mkdir build
cd build
cmake ..
make
sudo make install
For detailed installation instructions, visit: https://github.com/PDB-REDO/dssp
The ModelCIF tool can optionally validate generated mmCIF files against the current ModelCIF dictionary. Validation is optional, but recommended when first setting up the tool.
Download the ModelCIF dictionary to your project directory:
# Download the mmCIF dictionary
curl -o mmcif_ma.dic https://raw.githubusercontent.com/ihmwg/ModelCIF/refs/heads/master/dist/mmcif_ma.dic
Note: This step is automatically handled in the Docker environment, but is required for local installations.
The production pipeline (scripts/production_pipeline.py) requires additional dependencies for structure analysis, DSSP algorithms, clash detection, and interface residues.
Install the project production extra into the uv environment:
uv pip install '.[production]'
This installs the production Python packages declared by the project, including
biotite, pydssp, torch, and fastpdb.
Install torch_cluster separately after PyTorch is installed. Its wheel must
match the installed PyTorch version and CUDA runtime. Pick the CUDA suffix
from the PyTorch Geometric wheel index for your environment (cpu, cu118,
cu121, cu124, cu126, cu128, etc.):
# Check the installed PyTorch build first
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Example: CPU wheel for PyTorch 2.8.0
uv pip install torch_cluster -f https://data.pyg.org/whl/torch-2.8.0+cpu.html
# Example: CUDA 12.8 wheel for PyTorch 2.8.0
uv pip install torch_cluster -f https://data.pyg.org/whl/torch-2.8.0+cu128.html
If uv pip install '.[production]' resolves a different PyTorch version, change
the torch-<version>+<cuda> part of the torch_cluster URL to match that
installed build. For available torch_cluster wheels, see
https://data.pyg.org/whl/.
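The URL rule above can be sketched in Python. This is a minimal illustration and not part of the toolkit: the helper name torch_cluster_index_url is hypothetical, and it assumes the build string has the form <version>+<suffix> as printed by torch.__version__, treating a string without a suffix as a CPU build. Check https://data.pyg.org/whl/ to confirm that the resulting index actually exists.

```python
def torch_cluster_index_url(torch_version: str) -> str:
    """Build a PyG wheel index URL from a PyTorch build string.

    Hypothetical helper for illustration; pass in torch.__version__.
    """
    # Split "2.8.0+cu128" into the release ("2.8.0") and build suffix ("cu128").
    release, _, suffix = torch_version.partition("+")
    # Assumption: a build string without a suffix is treated as a CPU build.
    suffix = suffix or "cpu"
    return f"https://data.pyg.org/whl/torch-{release}+{suffix}.html"


if __name__ == "__main__":
    print(torch_cluster_index_url("2.8.0+cu128"))
    print(torch_cluster_index_url("2.8.0"))
```

The printed URL can be passed to uv pip install torch_cluster -f <url>.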
Verify installation:
python -c "import torch; from torch_cluster import radius_graph; print('torch_cluster OK')"
For workflow automation:
# Using curl
curl -s https://get.nextflow.io | bash
# Make executable and add to PATH
chmod +x nextflow
sudo mv nextflow /usr/local/bin/
For containerized execution:
- macOS/Windows: Download Docker Desktop from https://www.docker.com/products/docker-desktop
- Linux: Follow instructions at https://docs.docker.com/engine/install/
Verify the core Python install:
uv run main.py --help
uv run main.py list-validations
To check the optional external toolchain as well, install Mol*, DSSP, and any other workflow tools you need, then run:
uv run main.py test
This command reports missing external executables such as cif2bcif or
mkdssp.
# Generate ModelCIF
uv run main.py run-modelcif-gen \
-p input/AF-0000000000000001-model-v1.pdb \
-m input/AF-0000000000000001-v1.cif.json \
-o output/AF-0000000000000001-model-v1.cif
# Convert to BCIF
uv run main.py run-cif2bcif \
-i input/AF-0000000000000001-model-v1.cif \
-o output/AF-0000000000000001-model-v1.bcif
# Add secondary structure annotation
uv run main.py run-dssp \
-i input/AF-0000000000000001-model-v1.cif \
-o output/AF-0000000000000001-model-v1.cif
The committed end-to-end examples under examples/
can be validated directly from the repo root. Use model-summary for committed
e2e model_jsons/*.json; the canonical model schema remains reserved for
full model metadata entries and AF-metadata-*-of-*.json batches.
# Summary and provider metadata JSONs
.venv/bin/python main.py run-schema-validation \
-i examples/colabfold_monomer_e2e/model_jsons/AF-0000000300000001.json \
-t model-summary
.venv/bin/python main.py run-schema-validation \
-i examples/colabfold_monomer_e2e/config/provider.json \
-t provider
# Score JSONs and confidence/PAE relationship
.venv/bin/python main.py validate-plddt-file \
--file examples/colabfold_monomer_e2e/scores/AF-0000000300000001-confidence_v1.json
.venv/bin/python main.py validate-pae-file \
--file examples/colabfold_monomer_e2e/scores/AF-0000000300000001-predicted_aligned_error_v1.json
.venv/bin/python main.py validate-relationships-pair \
--plddt-file examples/colabfold_monomer_e2e/scores/AF-0000000300000001-confidence_v1.json \
--pae-file examples/colabfold_monomer_e2e/scores/AF-0000000300000001-predicted_aligned_error_v1.json
# ModelCIF dictionary validation
gemmi validate -p -d mmcif_ma.dic \
examples/colabfold_monomer_e2e/modelcif/AF-0000000300000001-model_v1.cif
For manual coordinate-file sanity checks, open representative PDB, ModelCIF, and BCIF files in the Mol* web viewer at https://molstar.org/viewer/. Drag and drop the files into the browser window, or use Open Files in the left panel. The structure should open correctly, no error messages should be shown in the viewer, and the structure should look structurally correct by eye. The same representative files can also be opened in ChimeraX or PyMOL; expect a clean import with no parser errors.
Convert ColabFold score JSON + PDB to AFDB ingest JSONs (pLDDT/PAE) and optional UniProt-style manifests.
Requirements: orjson, duckdb, a chain manifest (model_entity_id,entity_id,chain_id,uniprot_ac at minimum), and a DuckDB built from the UniProt subset.
Example (per model, safer for many parallel jobs):
afdb-colabfold-convert \
/path/to/<AC>_scores_rank_001_alphafold2_multimer_v3_model_1_seed_000.json \
/path/to/<AC>_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb \
--manifest /mnt/disks/data/sample/config/uniprot_afid_mapping.csv \
--duckdb /mnt/disks/data/sample/db/uniprot_2025_04.duckdb \
--model-entity-id AF-0000000000001201 \
--outdir /mnt/disks/data/sample/colabfold_output/<AC>-model_v4 \
--chain-manifest-dir /mnt/disks/data/sample/per_accession/manifests/chains \
--model-manifest-dir /mnt/disks/data/sample/per_accession/manifests/models
Outputs:
- AFDB JSONs: <model_entity_id>-confidence_v1.json and <model_entity_id>-predicted_aligned_error_v1.json in --outdir.
- Per-model manifests:
  - Chains: <model_entity_id>_afid_mapping.csv with pLDDT averages/fractions and local 1..N residue ranges.
  - Models: <model_entity_id>_model_metadata.csv with average pLDDT and ipTM (if present in scores JSON).
Merge per-model manifests when needed (keep the header, append rows):
# Chain manifest (uniprot_afid_mapping.csv)
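# cat + head keeps exactly one header line (head alone prints a "==> file <==" banner per file)
cat /mnt/disks/data/sample/per_accession/manifests/chains/*_afid_mapping.csv | head -n1 \
> /mnt/disks/data/sample/config/uniprot_afid_mapping.csv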
tail -n +2 -q /mnt/disks/data/sample/per_accession/manifests/chains/*_afid_mapping.csv \
>> /mnt/disks/data/sample/config/uniprot_afid_mapping.csv
# Model manifest (uniprot_model_metadata.csv)
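# cat + head keeps exactly one header line (head alone prints a "==> file <==" banner per file)
cat /mnt/disks/data/sample/per_accession/manifests/models/*_model_metadata.csv | head -n1 \
> /mnt/disks/data/sample/config/uniprot_model_metadata.csv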
tail -n +2 -q /mnt/disks/data/sample/per_accession/manifests/models/*_model_metadata.csv \
>> /mnt/disks/data/sample/config/uniprot_model_metadata.csv
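The same "keep one header, append all rows" merge can be sketched in Python, which also makes it easy to detect per-model manifests whose columns disagree. This is an illustrative sketch, not a toolkit command: the function name merge_manifests is hypothetical, and it assumes every per-model manifest is a comma-separated file with a header row.

```python
import csv
from pathlib import Path


def merge_manifests(parts: list[Path], merged: Path) -> int:
    """Write one header plus all data rows from `parts` into `merged`.

    Returns the number of data rows written. Raises ValueError if a part's
    header differs from the first header seen.
    """
    header = None
    rows_written = 0
    with merged.open("w", newline="") as out:
        writer = csv.writer(out)
        for part in sorted(parts):
            with part.open(newline="") as handle:
                reader = csv.reader(handle)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # skip empty files
                if header is None:
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"header mismatch in {part}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Example call (paths are placeholders; keep `merged` outside the globbed directory):
# merge_manifests(list(Path("manifests/chains").glob("*_afid_mapping.csv")),
#                 Path("config/uniprot_afid_mapping.csv"))
```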
Build DuckDB (once per release) from the chain manifest and UniProt flat files:
afdb-uniprot-extract --mapping <chain_manifest.csv> -o <parquet_dir> -r 2025_04 \
uniprot/data/uniprot_sprot.dat.gz uniprot/data/uniprot_trembl.dat.gz
afdb-uniprot-build-db --parquet-dir <parquet_dir> --db <db_path> --force
Converts PDB files to mmCIF format with integrated metadata.
Requirements:
- Input PDB file
- Metadata JSON file conforming to the schema: afdb_integration_kit/modelcif/resources/schema.json
- Optional: ModelCIF dictionary (mmcif_ma.dic) if you intend to run --validate
Optional validation dictionary: Only needed when you pass --validate (or --validate "", which defaults to mmcif_ma.dic). Download it once and keep it in the project directory:
curl -o mmcif_ma.dic https://raw.githubusercontent.com/ihmwg/ModelCIF/refs/heads/master/dist/mmcif_ma.dic
Command:
uv run main.py run-modelcif-gen -p <pdb_file> -m <metadata_json> -o <output_cif>
Parameters:
- -p, --pdb: Input PDB file path
- -m, --metadata: Metadata JSON file path
- -o, --output: Output mmCIF file path
Adds AFDB-specific header information from the generated mmCIF back into the PDB file (so downstream consumers get consistent metadata in both formats).
Requirements:
- Input mmCIF file (from run-modelcif-gen)
- Input PDB file containing ATOM coordinates
- Provider metadata JSON file (provider.json) describing who generated the entry
Command:
uv run main.py run-modelpdb-gen \
-c <input_cif> \
-p <input_pdb> \
-r <provider_json> \
-o <output_pdb>
Parameters:
- -c, --cif: Input mmCIF file path
- -p, --pdb: Input PDB file path
- -r, --provider: Provider metadata JSON path
- -o, --output: Output PDB file path with enriched headers
Converts mmCIF files to Binary CIF format for efficient storage and transmission.
The default backend preserves the original toolkit behavior by using the
external Mol* cif2bcif command. Biotite remains optional and can be selected
explicitly or used as an auto fallback.
Command:
uv run main.py run-cif2bcif -i <input_cif> -o <output_bcif>
Parameters:
- -i, --input: Input mmCIF file path
- -o, --output: Output BCIF file path
- -b, --backend: molstar (default), biotite, or auto
Assigns secondary structure annotations based on atomic coordinates. The default uses the external DSSP binary, preserving the historical CLI behavior. Python algorithms are available as opt-in 3-state alternatives:
- mkdssp (default) — external DSSP binary
- pydssp — hydrogen-bond based assignment
- psea — geometry-based assignment using CA coordinates
- tmalign — CA-CA distance-based assignment
Command:
uv run main.py run-dssp -i <input_cif> -o <output_cif>
Parameters:
- -i, --input: Input mmCIF file path
- -o, --output: Output annotated mmCIF file path
- -a, --algorithm: mkdssp (default), pydssp, psea, or tmalign
- -d, --device: cpu (default) or cuda for PyDSSP
Use these commands to sanity-check individual artifacts or entire datasets before handing results to collaborators.
Validate metadata JSON files against the required JSON schemas to ensure data consistency and compliance.
Schemas:
- Model: afdb_integration_kit/metadata/resources/model_schema.json for full model metadata entries and batches
- Model summary: afdb_integration_kit/metadata/resources/model_summary_schema.json for e2e model_jsons/*.json and search summary documents
- Collection doc: afdb_integration_kit/metadata/resources/collection_doc_schema.json for e2e chain_jsons/*.json and collection documents
- Provider: afdb_integration_kit/metadata/resources/provider_schema.json
Command:
uv run main.py run-schema-validation -i <metadata_json_file> -t <type>
Parameters:
- -i, --input: Path to the metadata JSON file to validate
- -t, --type: Type of metadata to validate (model, model-summary, collection-doc, or provider)
Examples:
uv run main.py run-schema-validation -i model.json -t model
uv run main.py run-schema-validation -i model_summary.json -t model-summary
uv run main.py run-schema-validation -i collection_doc.json -t collection-doc
uv run main.py run-schema-validation -i provider.json -t provider
Run multiple checks across an input directory (the same layout expected by the workflow):
# Run all enabled validators using defaults.yaml
uv run main.py run-validations --root input/
# Run a subset with custom config and JSON output
uv run main.py run-validations \
--root input/ \
--checks naming plddt pae \
--config my-validations.yaml \
--out reports/validation.json
- run-validations respects validation/defaults.yaml, but you can override settings via --config.
- Use --summary, --errors-only, and --fail-on warn to tailor CLI output/exit codes.
run-naming-check provides a lightweight naming/required-file audit with simplified flags:
uv run main.py run-naming-check --root input/ --errors-only
plddt-check focuses on pLDDT JSONs (value ranges, counts, optional structure cross-checks):
uv run main.py plddt-check --root input/ --verbose
The single-file validators are ideal for workflow steps (e.g., Nextflow processes) that emit one artifact at a time:
# Metadata (batch or per-accession JSON)
uv run main.py validate-metadata-file --file path/to/metadata.json
# pLDDT confidence JSON
uv run main.py validate-plddt-file --file path/to/AF-...-confidence_v1.json
# PAE JSON
uv run main.py validate-pae-file --file path/to/AF-...-predicted_aligned_error_v1.json
# Check a matching pLDDT/PAE pair
uv run main.py validate-relationships-pair \
--plddt-file path/to/AF-...-confidence_v1.json \
--pae-file path/to/AF-...-predicted_aligned_error_v1.json
# FASTA sequences file
uv run main.py validate-sequences-file --file path/to/sequences.fasta
Each command exits with code 1 if it encounters validation errors, making them easy to embed in automated pipelines.
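Because the validators signal failure through their exit code, a pipeline step can gate on them with a few lines of Python. The sketch below is illustrative only: run_check and gate are hypothetical helper names, and the only behavior assumed is that a passing validator exits with code 0 and a failing one with a non-zero code.

```python
import subprocess
import sys


def run_check(command: list[str]) -> bool:
    """Run one validator command and report whether it passed (exit code 0)."""
    result = subprocess.run(command, check=False)
    return result.returncode == 0


def gate(commands: list[list[str]]) -> int:
    """Run every command; return 1 if any failed, else 0."""
    failures = [cmd for cmd in commands if not run_check(cmd)]
    for cmd in failures:
        print("FAILED:", " ".join(cmd), file=sys.stderr)
    return 1 if failures else 0


# Example usage (file paths are placeholders):
# checks = [
#     ["uv", "run", "main.py", "validate-plddt-file", "--file", "scores/confidence.json"],
#     ["uv", "run", "main.py", "validate-pae-file", "--file", "scores/pae.json"],
# ]
# sys.exit(gate(checks))
```

Running all checks before reporting, rather than stopping at the first failure, gives one complete list of failing artifacts per run.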
The production pipeline (scripts/production_pipeline.py) provides a standalone alternative to the Nextflow workflow with comprehensive logging, caching, and resume capability. It processes models through 16 stages (executed in this order):
- Prepare assets – symlink PDB + meta JSON to staging
- Validate assets – check PDB/JSON consistency
- Convert ColabFold – produce AFDB-format confidence & PAE JSONs
- Merge manifests – merge per-model chain/model manifests
- Calculate ipSAE scores – interface quality metrics (ipSAE, pDockQ, LIS)
- Analyze clashes/interfaces – VDW clashes, interface residues
- Export model metadata – generate per-model metadata JSONs (enriched with ipSAE/clash metrics)
- Export chain metadata – generate per-chain metadata JSONs (enriched with ipSAE metrics)
- Combine model metadata – batch into chunked JSONs
- Combine chain metadata – batch into chunked JSONs
- Export ModelCIF input – prepare ModelCIF metadata from template
- Generate ModelCIF – PDB → mmCIF with full metadata and optional QA metrics
- DSSP – secondary structure annotation (3-state: helix/strand/coil)
- Enrich PDB – add AFDB headers to PDB files
- CIF → BCIF – BinaryCIF conversion
- Cleanup – optional intermediate file cleanup (skipped by default)
Note: ipSAE and clash analysis (stages 5-6) run before metadata export (stages 7-8) so that quality metrics are available for JSON enrichment and CIF embedding.
Prerequisites: Install production dependencies first (see Installation section 7). For clash/interface analysis, also install a torch_cluster wheel that matches your PyTorch and CUDA build.
uv pip install '.[production]'
uv pip install torch_cluster -f https://data.pyg.org/whl/torch-<torch-version>+<cuda>.html
All config files are provided up front, with no API calls and no manifest resolution:
python scripts/production_pipeline.py \
--output-dir /path/to/output \
--input-dir /path/to/input \
--mapping-file /path/to/mapping.tsv \
--chain-mapping /path/to/manifest.csv \
--dataset-config /path/to/config.json \
--provider-json /path/to/provider.json \
--uniprot-db /path/to/uniprot.duckdb \
--workers 30 \
--cif-qa-metrics auto
Enable heterodimer mode with --heterodimers. It requires --chain-mapping and --uniprot-db. Config files (mapping TSV, dataset config, provider JSON) are auto-generated if not provided. Model IDs are derived from the chain mapping CSV.
python scripts/production_pipeline.py \
--output-dir /path/to/output \
--input-dir /path/to/raw_colabfold \
--heterodimers \
--chain-mapping /path/to/manifest.csv \
--uniprot-db /path/to/uniprot.duckdb \
--workers 4 \
--cif-qa-metrics auto
The --input-dir may contain raw ColabFold outputs (long suffixes like _unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb are detected automatically).
| Flag | Description |
|---|---|
| --resume | Resume from previous run (skip completed stages) |
| --skip-stages stage_12,stage_13 | Skip specific stages (comma-separated) |
| --dry-run | Show what would be executed without running |
| --dssp-algorithm | Production pipeline secondary structure algorithm: mkdssp, psea, pydssp (production default), or tmalign |
| --workers N | Parallel workers (default: all CPUs) |
| --pae-cutoff / --dist-cutoff | ipSAE thresholds (default: 10.0 / 15.0) |
| --clash-cutoff / --interface-cutoff | Clash/interface thresholds (default: 0.4 / 8.0 Å) |
| --analysis-batch-size N | Batch size for clash/interface GPU analysis (default: 4) |
| --cif-qa-metrics | QA metrics to embed in mmCIF: auto (default, all metrics) or comma-separated list (e.g. ipsae_AB,iptm_af,N_clash_backbone) |
| --enrichment-metrics | ipSAE/clash metric names to include in model/chain metadata JSONs (default: all known metrics) |
| --interface-clash-analysis | Which analyses to run: interface, backbone_clashes, heavy_atom_clashes (default: all three) |
| --modelcif-template | Path to ModelCIF metadata template JSON (default: uniprot/templates/colabfold_example_modelcif_metadata.json) |
Output: Results are written to the output directory with logs in logs/, cache in .pipeline_cache.json, and a results summary in pipeline_results.json.
Run python scripts/production_pipeline.py --help for full documentation.
scripts/prepare_inputs.py can also be used independently (outside the production pipeline) to prepare ColabFold outputs into the canonical layout the pipeline expects. It scans for matched PDB + scores-JSON pairs, builds config files, and symlinks inputs.
Production mode (pre-built assets, no network):
python scripts/prepare_inputs.py \
--input-dir /data/colabfold/gpu0 \
--output-dir /data/workdir \
--chain-mapping /data/prebuilt_manifest.csv \
--uniprot-db /data/uniprot.duckdb \
--provider-id afcdb-heterodimers \
--provider-name "AFCDB Heterodimers"
Dev mode (resolves AF-IDs from the AFCDB manifest + fetches from UniProt API):
python scripts/prepare_inputs.py \
--input-dir ./gpu0 \
--output-dir ./workdir \
--build-from-api /data/afdb_toolkit_manifest_file.csv \
--provider-id afcdb-heterodimers \
--provider-name "AFCDB Heterodimers"
By default, scores files are symlinked as meta JSONs (zero I/O). Pass --extract-meta to parse and re-write leaner JSONs, or --copy to copy instead of symlink.
The Dockerfile installs the core Python dependency set from requirements.txt,
plus Mol*, DSSP, Nextflow, and the ModelCIF dictionary. It is intended for the
core CLI, validation, ModelCIF/PDB, CIF/BCIF, and Nextflow workflows. It does
not install the production extra or torch_cluster; build a derived image
with a PyTorch/CUDA-compatible torch_cluster wheel if you need the standalone
production pipeline inside Docker.
You can skip building the image locally by using the prebuilt image available on Docker Hub:
docker pull pdbegroup/afdb-integration-toolkit
Use it in the same way as the locally built image. For example:
docker run \
-v "$PWD/input:/input" \
-v "$PWD/output:/output" \
-w /workspace \
-v "$PWD:/workspace" \
pdbegroup/afdb-integration-toolkit uv run main.py run-modelcif-gen \
-p /input/AF-0000000000000001-model-v1.pdb \
-m /input/AF-0000000000000001-v1.cif.json \
-o /output/AF-0000000000000001-model-v1.cif
If you prefer to build the image yourself:
docker build -t afdb-toolkit .
ModelCIF Generator:
docker run \
-v "$PWD/input:/input" \
-v "$PWD/output:/output" \
-w /workspace \
-v "$PWD:/workspace" \
afdb-toolkit uv run main.py run-modelcif-gen \
-p /input/AF-0000000000000001-model-v1.pdb \
-m /input/AF-0000000000000001-v1.cif.json \
-o /output/AF-0000000000000001-model-v1.cif
CIF to BCIF Converter:
docker run \
-v "$PWD/input:/input" \
-v "$PWD/output:/output" \
-w /workspace \
-v "$PWD:/workspace" \
afdb-toolkit uv run main.py run-cif2bcif \
-i /input/AF-0000000000000001-model-v1.cif \
-o /output/AF-0000000000000001-model-v1.bcif
DSSP Processing:
docker run \
-v "$PWD/input:/input" \
-v "$PWD/output:/output" \
-w /workspace \
-v "$PWD:/workspace" \
afdb-toolkit uv run main.py run-dssp \
-i /input/AF-0000000000000001-model-v1.cif \
-o /output/AF-0000000000000001-model-v1.cif
Schema Validation
Run schema validation in Docker:
docker run \
-v "$PWD/input:/input" \
-v "$PWD/output:/output" \
-w /workspace \
-v "$PWD:/workspace" \
afdb-toolkit uv run main.py run-schema-validation -i model.json -t model
Replace model.json with the actual path to your metadata file. For provider metadata:
afdb-toolkit uv run main.py run-schema-validation -i provider.json -t provider
The Nextflow scripts are placed in the workflow directory. The main workflow script is workflow.nf, which orchestrates the end-to-end processing of the model files (except metadata JSON validation). validate.nf is used for schema validation of model and provider metadata files.
Run the complete workflow using the provided script:
docker run \
-v "$PWD/nf_workspace/.nextflow:/workspace/.nextflow" \
-v "$PWD/output:/output" \
-v "$PWD/input:/input" \
-w /workspace \
-v "$PWD/nf_workspace:/workspace" \
afdb-toolkit nextflow run /app/workflow/workflow.nf -resume
This will process all the model files in the input directory and place the output files in the output directory.
---
config:
layout: elk
---
flowchart TD
A[".pdb file"] --> C["ModelCIF Generator"] & J["ModelPDB Generator"]
B["CIF metadata JSON"] --> C
C --> D[".cif file (mmCIF)"]
D --> E["DSSP"]
E --> F[".cif file (mmCIF, with DSSP annotations)"]
F --> J & G["CIF to BCIF Generator"]
I["Provider JSON"] --> J
J --> K[".pdb file (with AFDB headers)"]
G --> H[".bcif file (Binary CIF)"]
style A fill:#fff3e0
style C fill:#f3e5f5
style J fill:#f3e5f5
style B fill:#e1f5fe
style D fill:#fff3e0
style E fill:#f3e5f5
style F fill:#e8f5e8
style G fill:#f3e5f5
style K fill:#e8f5e8
style H fill:#e8f5e8
Run the schema validation workflow using the provided script. This workflow performs two tasks:
- Validate Metadata: Ensures that the model metadata JSON files conform to the required schema.
- Batch Processing: If validation is successful, the workflow concatenates the JSON files into a list of JSONs for further processing based on a configurable chunk size, which defaults to 100.
To adjust the chunk size, update the params.metadata_chunk_size parameter in the workflow/validate.nf script or pass it as a command-line argument when executing the workflow. For example:
--metadata_chunk_size 100
docker run \
-v "$PWD/nf_workspace/.nextflow:/workspace/.nextflow" \
-v "$PWD/input:/input" \
-v "$PWD/output:/output" \
-w /workspace \
-v "$PWD/nf_workspace:/workspace" \
afdb-toolkit nextflow run /app/workflow/validate.nf -resume
The output will be stored in the output/metadata directory, containing the batched validated model metadata JSON files.
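The batching step can be pictured as splitting a list of validated metadata records into JSON arrays of at most the configured chunk size. The sketch below only illustrates that idea under stated assumptions: batch_metadata is a hypothetical function, and the batch_NNNN.json file names are placeholders, not the names the workflow actually writes.

```python
import json
from pathlib import Path


def batch_metadata(records: list[dict], out_dir: Path, chunk_size: int = 100) -> list[Path]:
    """Write `records` as JSON arrays of at most `chunk_size` entries each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, start in enumerate(range(0, len(records), chunk_size), start=1):
        chunk = records[start:start + chunk_size]
        # Hypothetical file naming; the workflow's real output names may differ.
        path = out_dir / f"batch_{index:04d}.json"
        path.write_text(json.dumps(chunk, indent=2))
        written.append(path)
    return written
```

With the default chunk size of 100, 250 validated records would produce three files holding 100, 100, and 50 entries.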
The Nextflow workflow requires an input list file at input/input.txt containing the entries to process. Each entry should be on a new line:
AF-0001234567890123
AF-0001234567890124
AF-0001234567890125
AF-0001234567890126
Example input.txt:
# Create the input list file
cat > input/input.txt << EOF
AF-0001234567890123
AF-0001234567890124
AF-0001234567890125
EOF
- Resumable: Uses -resume flag to continue from previous checkpoints
- Cached: Maintains state in .nextflow directory
- Dependency Management: Automatically handles tool dependencies
- Parallel Processing: Processes multiple files concurrently
- Mount the .nextflow directory to preserve workflow state
- Ensure proper input/output directory mounting
- The workflow runs in resume mode by default
The toolkit expects files to be organized in a specific hierarchical structure:
input/
├── 0001/
│ ├── 2345/
│ │ ├── 6789/
│ │ │ ├── 0123/
│ │ │ │ ├── AF-0001234567890123-model-v1.pdb
│ │ │ │ └── AF-0001234567890123-v1.cif.json
- Extract 16-digit numeric ID: From AF-0001234567890123-model-v1.pdb → 0001234567890123
- Split into 4-digit segments: 0001, 2345, 6789, 0123
- Create nested directories: 0001/2345/6789/0123/
- Place files in final directory: Both PDB and JSON files
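The four steps above can be sketched as a small Python helper for staging files by hand. It is an illustration, not toolkit code: nested_dir is a hypothetical name, and the pattern assumes file names start with AF- followed by exactly 16 digits, as in the examples in this README.

```python
import re
from pathlib import Path

# Assumption: entry names start with "AF-" followed by exactly 16 digits.
_ID_PATTERN = re.compile(r"^AF-(\d{16})(?!\d)")


def nested_dir(root: Path, filename: str) -> Path:
    """Return the nested directory for an AFDB entry file or entry ID."""
    match = _ID_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"no 16-digit AF identifier in {filename!r}")
    digits = match.group(1)
    # Split the 16-digit ID into four 4-digit path segments.
    segments = [digits[i:i + 4] for i in range(0, 16, 4)]
    return root.joinpath(*segments)


if __name__ == "__main__":
    target = nested_dir(Path("input"), "AF-0001234567890123-model-v1.pdb")
    print(target)
```

Calling target.mkdir(parents=True, exist_ok=True) before copying the PDB and JSON files creates the directory tree if it does not exist yet.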
The workflow automatically creates corresponding output directories following the same structure:
output/
├── 0001/
│ ├── 2345/
│ │ ├── 6789/
│ │ │ ├── 0123/
│ │ │ │ ├── AF-0001234567890123-model-v1.cif
│ │ │ │ └── AF-0001234567890123-model-v1.bcif
- Missing Dependencies: Run uv run main.py test to identify missing components
- Permission Errors: Ensure Docker has proper access to mounted directories
- File Not Found: Verify input files follow the required directory structure
- Memory Issues: For large datasets, consider adjusting Docker memory limits
- ModelCIF Validation Errors: Ensure mmcif_ma.dic is present in the project directory (automatically handled in Docker)
- Nextflow Workflow Errors: Ensure input/input.txt exists and contains valid entry IDs
- Check the Issues page
- Validate your metadata JSON against the provided schema
This project is licensed under CC0 1.0 Universal; see the LICENSE file for details.
For support and questions:
- Issues: GitHub Issues
- Email: afdbhelp@ebi.ac.uk