ImmunoStruct

ImmunoStruct enables multimodal deep learning for immunogenicity prediction


In case you don't have access to Nature, here are the main paper and the supplementary materials.

Table of Contents
  1. News
  2. About The Project
  3. Citation
  4. Getting Started
  5. Usage
  6. Troubleshooting
  7. License
  8. Contact

News

Media coverage is available in English and Chinese (news articles and wiki entries).

☐ TODO: create and release an end-to-end tool.
✅ Feb 20, 2026: The datasets and model weights are now open-sourced on huggingface. See instructions.
✅ Dec 31, 2025: Published in Nature Machine Intelligence.
✅ Dec 04, 2025: Informally presented at NeurIPS 2025 (not a formal submission, so no dual-submission concern).
✅ Aug 18, 2025: Received the Colton Innovation Fund from the Colton Center for Autoimmunity at Yale University.
✅ May 06, 2025: Submitted to Nature Machine Intelligence.
✅ Nov 05, 2024: Presented at MoML@MIT 2024 (non-archival abstract & poster).
✅ Nov 01, 2024: Preprint released.

About The Project

ImmunoStruct Architecture

ImmunoStruct is a multimodal deep learning framework that integrates sequence, structural, and biochemical information to predict multi-allele class-I peptide-MHC immunogenicity. By leveraging multimodal data from 26,049 peptide-MHCs and jointly modeling sequence and structure, ImmunoStruct significantly improves immunogenicity prediction performance for both infectious disease epitopes and cancer neoepitopes.
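
For intuition only, here is a minimal late-fusion sketch of the multimodal idea (combining per-sample sequence, structure, and biochemical-property embeddings). It is not the actual ImmunoStruct architecture; all module names and dimensions are placeholders.

    # Illustrative sketch only, not the ImmunoStruct model.
    # Each modality is assumed to be pre-encoded into a fixed-size embedding.
    import torch
    import torch.nn as nn

    class ToyMultimodalHead(nn.Module):
        def __init__(self, seq_dim=128, struct_dim=128, prop_dim=32, hidden=256):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Linear(seq_dim + struct_dim + prop_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),  # immunogenicity logit
            )

        def forward(self, seq_emb, struct_emb, prop_feat):
            # Concatenate the three modality embeddings and classify.
            return self.fuse(torch.cat([seq_emb, struct_emb, prop_feat], dim=-1))

    # Usage with random tensors standing in for encoder outputs.
    head = ToyMultimodalHead()
    logit = head(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 32))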

(back to top)

Key Features

  • Multimodal Integration: Combines peptide-MHC protein sequence, structure, and biochemical properties
  • Novel Cancer-Wildtype Contrastive Learning: Enhances specificity for cancer neoepitope detection
  • Enhanced Interpretability: Provides insights into the substructural basis of immunogenicity
Contrastive Learning Approach Visualizations
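
For intuition, a generic margin-based contrastive objective between a cancer neoepitope embedding and its paired wildtype embedding might look like the sketch below. The exact formulation used by ImmunoStruct may differ; see the paper and the training scripts for the actual loss.

    # Illustrative sketch of a cancer-vs-wildtype contrastive term; not the exact ImmunoStruct loss.
    import torch
    import torch.nn.functional as F

    def toy_cancer_wildtype_contrastive(z_cancer, z_wildtype, margin=1.0):
        # Encourage each cancer embedding to sit at least `margin` away from its wildtype pair,
        # so mutation-specific differences remain separable in the embedding space.
        dist = F.pairwise_distance(z_cancer, z_wildtype)
        return F.relu(margin - dist).mean()

    loss = toy_cancer_wildtype_contrastive(torch.randn(8, 64), torch.randn(8, 64))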

(back to top)

Citation

If you use ImmunoStruct in your research, please cite our paper:

BibTeX:

@article{givechian2026immunostruct,
  title={ImmunoStruct enables multimodal deep learning for immunogenicity prediction},
  author={Givechian, Kevin Bijan and Rocha, Jo{\~a}o Felipe and Liu, Chen and Yang, Edward and Tyagi, Sidharth and Greene, Kerrie and Ying, Rex and Caron, Etienne and Iwasaki, Akiko and Krishnaswamy, Smita},
  journal={Nature Machine Intelligence},
  volume={8},
  pages={70--83},
  year={2026},
  publisher={Nature Publishing Group UK London}
}

Nature format:
Givechian, K.B., Rocha, J.F., Liu, C. et al. ImmunoStruct enables multimodal deep learning for immunogenicity prediction. Nat Mach Intell 8, 70–83 (2026). https://doi.org/10.1038/s42256-025-01163-y

(back to top)

Getting Started

To get ImmunoStruct up and running locally, follow these steps.

Prerequisites

Before installation, ensure you have:

  • CUDA-compatible GPU (recommended)
  • Conda package manager
  • Weights & Biases account for experiment tracking

Installation

  1. Clone the repository

    git clone https://github.com/KrishnaswamyLab/ImmunoStruct.git
    cd ImmunoStruct
  2. Create conda environment and install dependencies

    conda create --name immuno python=3.8 -c anaconda -c conda-forge -y
    conda activate immuno
    conda install cudatoolkit=11.2 wandb pydantic -c conda-forge -y
    conda install scikit-image pillow matplotlib seaborn tqdm -c anaconda -y
    python -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
    python -m pip install dgl -f https://data.dgl.ai/wheels/torch-2.1/cu118/repo.html
    python -m pip install torchdata==0.7.1
    python -m pip install torch-scatter==2.1.2+pt21cu118 torch-sparse==0.6.18+pt21cu118 torch-cluster==1.6.3+pt21cu118 torch-spline-conv==1.2.2+pt21cu118 torch_geometric==2.5.3 numpy==1.21.1 -f https://data.pyg.org/whl/torch-2.1.2+cu118.html
    python -m pip install jax==0.2.25 jaxlib==0.1.69+cuda111 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
    python -m pip install "alphafold-colabfold==2.0.0" "colabfold==1.2.0" "dm-haiku==0.0.4"
    python -m pip install "biopython==1.78"
    python -m pip install graphein[extras]
    python -m pip install lifelines
    python -m pip install huggingface_hub
    python -m pip install ipykernel
    python -m pip install ipywidgets

    The following steps might be necessary if you encounter problems running inference. They fix package incompatibilities that we resolved manually (a helper sketch follows the list):

    • Go to /path/to/environment/lib/python3.8/site-packages/jaxlib/xla_client.py: change np.object to object.
    • Go to /path/to/environment/lib/python3.8/site-packages/alphafold/common/residue_constants.py: change np.int to np.int32.
    • Go to /path/to/environment/lib/python3.8/site-packages/alphafold/data/templates.py: change np.object to object.
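
    If you prefer to apply these edits programmatically, the small helper below is one way to do it (illustrative; adjust the environment path to your own installation before running):

    # Illustrative helper: apply the three compatibility patches above in one go.
    # Adjust ENV to your own conda environment path.
    import re
    from pathlib import Path

    ENV = Path("/path/to/environment/lib/python3.8/site-packages")
    patches = [
        (ENV / "jaxlib/xla_client.py", r"\bnp\.object\b", "object"),
        (ENV / "alphafold/common/residue_constants.py", r"\bnp\.int\b", "np.int32"),
        (ENV / "alphafold/data/templates.py", r"\bnp\.object\b", "object"),
    ]
    for path, pattern, replacement in patches:
        # Word-boundary regexes avoid corrupting names like np.int32 or np.object_.
        path.write_text(re.sub(pattern, replacement, path.read_text()))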
  3. Create another conda environment for obtaining MSAs locally. This is only relevant if you want to run your own protein folding.

    conda create --name local_msa python=3.10 -c anaconda -c conda-forge -y
    conda activate local_msa
    conda install cudatoolkit=11.2 wandb pydantic -c conda-forge -y
    conda install scikit-image pillow matplotlib seaborn tqdm -c anaconda -y
    python -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
    pip install colabfold[alphafold]==1.5.5
    pip install jax==0.4.23 jaxlib==0.4.23+cuda11.cudnn86 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
    conda install -c bioconda mmseqs2 -y

(back to top)

Usage

Data Preparation

  1. Download the dataset from huggingface.

    conda activate immuno
    cd ./data/
    hf download ChenLiu1996/ImmunoStruct --repo-type dataset --local-dir ./
  2. Move the pre-trained model weights.

    mkdir ../results/
    mv IEDB_model_seed1.pt ../results/
    mv CEDAR_model_seed2.pt ../results/
  3. Make sure the following files are in the data folder:

    • ImmunoStruct_IEDB_data.csv
    • ImmunoStruct_CEDAR_data_cancer.csv
    • ImmunoStruct_CEDAR_data_wildtype.csv
    • ImmunoStruct_clinical_data.csv
    • ImmunoStruct_clinical_data_survival.csv
    • HLA_allele_sequences.csv
  4. Unzip the graph structure PyTorch files.

    unzip graph_pyg_IEDB.zip
    unzip graph_pyg_CEDAR_cancer.zip
    unzip graph_pyg_CEDAR_wildtype.zip
    unzip graph_pyg_clinical.zip

    Now the following folders should be under the data folder:

    • graph_pyg_IEDB
    • graph_pyg_CEDAR_cancer
    • graph_pyg_CEDAR_wildtype
    • graph_pyg_clinical
  5. If you want to customize the graph-building logic, the structure PDB files produced by AlphaFold2 are also available through the same huggingface download command (see the sketch after this list). Unzip the corresponding zip files and you will have the following folders.

    • alphafold2_pdb_IEDB
    • alphafold2_pdb_CEDAR_cancer
    • alphafold2_pdb_CEDAR_wildtype
    • alphafold2_pdb_clinical
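
If you want to inspect or re-process the provided PDB files with your own graph-building logic, a minimal sketch using Biopython (installed in the immuno environment) is shown below; the file path is hypothetical.

    # Illustrative: read one of the unzipped AlphaFold2 PDB files with Biopython.
    # Replace the path with an actual file from the alphafold2_pdb_* folders.
    from Bio.PDB import PDBParser

    parser = PDBParser(QUIET=True)
    structure = parser.get_structure("pmhc", "alphafold2_pdb_IEDB/example.pdb")
    for chain in structure[0]:
        print(chain.id, len(list(chain.get_residues())))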

AlphaFold2 Structure Data

We provide the structure data encoded as PyTorch Geometric (PyG) graphs on huggingface; simply follow the instructions in the Data Preparation section above. You can skip the rest of this section if you are not planning to fold your own data.
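
If you just want to inspect the provided graphs, a minimal sketch is below. It assumes each .pt file stores a torch_geometric Data object saved with torch.save; the file name is hypothetical.

    # Illustrative: load and inspect one of the provided PyG graph files.
    import torch

    graph = torch.load("data/graph_pyg_IEDB/example.pt")  # replace with a real file name
    print(graph)                              # e.g. Data(x=[...], edge_index=[2, ...], ...)
    print(graph.num_nodes, graph.num_edges)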

How the PyG graphs are generated


The PyG graphs are generated by a preprocessing pipeline under immunostruct/preprocessing: MSA preparation (steps 1-3), AlphaFold2 structure prediction (step 4), and PDB organization plus PyG graph construction (steps 5-6). The generation scripts are available in case you ever need to run some or all of them.

  • Option 1 is easier to run, but it is slow and rate-limited.
  • Option 2 involves more steps, but it is better suited to larger datasets (>2000 sequences).
  1. Option 1: Using the online MSA server (slow, rate-limited, not recommended for >2000 sequences). Start from the ImmunoStruct root folder.

    # [CPU] Step 1-3. Prepare MSA for AlphaFold.
    # Download colabfold and REMEMBER where it is downloaded to. Likely default to `~/.cache/colabfold`.
    python -m colabfold.download
    
    # Prepare MSA.
    conda activate immuno
    cd immunostruct/preprocessing
    python step1-3_server_sequence_to_msa.py \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --output-dir ../../data/pdb_files/IEDB/ \
        --tmp-dir /tmp/ \
        --allele-col-name allele \
        --peptide-col-name peptide
    
    python step1-3_server_sequence_to_msa.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --output-dir ../../data/pdb_files/CEDAR_cancer/ \
        --tmp-dir /tmp/ \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    python step1-3_server_sequence_to_msa.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --output-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --tmp-dir /tmp/ \
        --allele-col-name allele \
        --peptide-col-name wt_pep
    
    python step1-3_server_sequence_to_msa.py \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --output-dir ../../data/pdb_files/clinical/ \
        --tmp-dir /tmp/ \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    # [GPU] Step 4. AlphaFold2.
    # start/end help run multiple jobs in parallel.
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --output-dir ../../data/pdb_files/IEDB/ \
        --start 0 --end 24540 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name peptide
    
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --output-dir ../../data/pdb_files/CEDAR_cancer/ \
        --start 0 --end 2801 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --output-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --start 0 --end 2801 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name wt_pep
    
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --output-dir ../../data/pdb_files/clinical/ \
        --start 0 --end 29485 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    # [CPU] Step 5. Moving and renaming the structure data in PDB files.
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/IEDB/ \
        --output-dir ../../data/alphafold2_pdb_IEDB/
    
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/CEDAR_cancer/ \
        --output-dir ../../data/alphafold2_pdb_CEDAR_cancer/
    
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --output-dir ../../data/alphafold2_pdb_CEDAR_wildtype/
    
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/clinical/ \
        --output-dir ../../data/alphafold2_pdb_clinical/
    
    # [CPU] Step 6. Generating PyG graphs (structures in PDB files to structures in PyTorch .pt files).
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_IEDB/ \
        --output-dir ../../data/graph_pyg_IEDB/
    
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_CEDAR_cancer/ \
        --output-dir ../../data/graph_pyg_CEDAR_cancer/
    
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_CEDAR_wildtype/ \
        --output-dir ../../data/graph_pyg_CEDAR_wildtype/
    
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_clinical/ \
        --output-dir ../../data/graph_pyg_clinical/
  2. Option 2: Performing MSA locally (the approach we used). Start from the ImmunoStruct root folder.

    # [CPU] Step 1-3. Prepare MSA for AlphaFold.
    # Download colabfold and REMEMBER where it is downloaded to.
    python -m colabfold.download
    
    # Download the MSA database locally.
    mkdir ./database_msa/
    cd ./database_msa/
    wget https://wwwuser.gwdg.de/~compbiol/colabfold/uniref30_2302.tar.gz
    tar -xzvf uniref30_2302.tar.gz
    mmseqs tsv2exprofiledb uniref30_2302 uniref30_2302_db
    wget https://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz
    tar -xzvf colabfold_envdb_202108.tar.gz
    mmseqs tsv2exprofiledb colabfold_envdb_202108 colabfold_envdb_202108_db
    cd ..
    
    # Prepare MSA.
    conda activate local_msa
    cd immunostruct/preprocessing
    python step1_local_sequence_to_fasta.py \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --output-fasta ../../data/fasta/ImmunoStruct_IEDB_data.fasta \
        --allele-col-name allele \
        --peptide-col-name peptide
    python step2_local_fasta_to_a3m.py \
        --input-fasta ../../data/fasta/ImmunoStruct_IEDB_data.fasta \
        --msa-database-dir ../../database_msa/ \
        --output-dir ../../data/a3m/IEDB/
    python step3_local_a3m_to_msa.py \
        --input-dir ../../data/a3m/IEDB/ \
        --output-dir ../../data/pdb_files/IEDB/ \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --allele-col-name allele \
        --peptide-col-name peptide
    
    python step1_local_sequence_to_fasta.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --output-fasta ../../data/fasta/ImmunoStruct_CEDAR_data_cancer.fasta \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    python step2_local_fasta_to_a3m.py \
        --input-fasta ../../data/fasta/ImmunoStruct_CEDAR_data_cancer.fasta \
        --msa-database-dir ../../database_msa/ \
        --output-dir ../../data/a3m/CEDAR_cancer/
    python step3_local_a3m_to_msa.py \
        --input-dir ../../data/a3m/CEDAR_cancer/ \
        --output-dir ../../data/pdb_files/CEDAR_cancer/ \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    python step1_local_sequence_to_fasta.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --output-fasta ../../data/fasta/ImmunoStruct_CEDAR_data_wildtype.fasta \
        --allele-col-name allele \
        --peptide-col-name wt_pep
    python step2_local_fasta_to_a3m.py \
        --input-fasta ../../data/fasta/ImmunoStruct_CEDAR_data_wildtype.fasta \
        --msa-database-dir ../../database_msa/ \
        --output-dir ../../data/a3m/CEDAR_wildtype/
    python step3_local_a3m_to_msa.py \
        --input-dir ../../data/a3m/CEDAR_wildtype/ \
        --output-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --allele-col-name allele \
        --peptide-col-name wt_pep
    
    python step1_local_sequence_to_fasta.py \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --output-fasta ../../data/fasta/ImmunoStruct_clinical_data.fasta \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    python step2_local_fasta_to_a3m.py \
        --input-fasta ../../data/fasta/ImmunoStruct_clinical_data.fasta \
        --msa-database-dir ../../database_msa/ \
        --output-dir ../../data/a3m/clinical/
    python step3_local_a3m_to_msa.py \
        --input-dir ../../data/a3m/clinical/ \
        --output-dir ../../data/pdb_files/clinical/ \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    # [GPU] Step 4. AlphaFold2.
    # start/end help run multiple jobs in parallel.
    conda deactivate
    conda activate immuno
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --output-dir ../../data/pdb_files/IEDB/ \
        --start 0 --end 24540 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name peptide
    
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --output-dir ../../data/pdb_files/CEDAR_cancer/ \
        --start 0 --end 2801 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --output-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --start 0 --end 2801 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name wt_pep
    
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --output-dir ../../data/pdb_files/clinical/ \
        --start 0 --end 29485 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    # [CPU] Step 5. Moving and renaming the structure data in PDB files.
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/IEDB/ \
        --output-dir ../../data/alphafold2_pdb_IEDB/
    
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/CEDAR_cancer/ \
        --output-dir ../../data/alphafold2_pdb_CEDAR_cancer/
    
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --output-dir ../../data/alphafold2_pdb_CEDAR_wildtype/
    
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/clinical/ \
        --output-dir ../../data/alphafold2_pdb_clinical/
    
    # [CPU] Step 6. Generating PyG graphs (structures in PDB files to structures in PyTorch .pt files).
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_IEDB/ \
        --output-dir ../../data/graph_pyg_IEDB/
    
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_CEDAR_cancer/ \
        --output-dir ../../data/graph_pyg_CEDAR_cancer/
    
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_CEDAR_wildtype/ \
        --output-dir ../../data/graph_pyg_CEDAR_wildtype/
    
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_clinical/ \
        --output-dir ../../data/graph_pyg_clinical/

Training and Testing

  1. Activate the environment

    conda activate immuno
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
  2. Set up Weights & Biases

    Create a project on Weights & Biases whose name matches the project name used by the training scripts.
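
    The snippet below is an optional sanity check that your W&B credentials and project are set up (the project name here is a placeholder; use whatever name the training script expects):

    # Optional sanity check for the Weights & Biases setup.
    import wandb

    wandb.login()  # or run `wandb login` once in the shell
    run = wandb.init(project="your_project_name", entity="YOUR_WANDB_USERNAME")
    run.finish()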

  3. Run Experiments

    NOTE: these scripts are deprecated; see immunostruct/old_scripts.

    # Sequence + structure + biochemical property + multimodal multihead attention
    python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --sequence-loss --model HybridModelv2 --wandb-username YOUR_WANDB_USERNAME
    
    # Sequence + structure + biochemical property
    python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --sequence-loss --model HybridModel --wandb-username YOUR_WANDB_USERNAME
    
    # Sequence + biochemical property
    python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --sequence-loss --model SequenceFpModel --wandb-username YOUR_WANDB_USERNAME
    
    # Sequence-only model
    python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --sequence-loss --model SequenceModel --wandb-username YOUR_WANDB_USERNAME
    
    # Structure-only model
    python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --model StructureModel --wandb-username YOUR_WANDB_USERNAME
  4. Our main experiments

    These are example commands for training ImmunoStruct.

    # IEDB training
    python train_IEDB_wFT.py --model HybridModelv2 --sequence-loss --full-sequence --seed 1 --wandb-username immunoteam
    
    # CEDAR training
    python train_CEDAR_wFT.py --model HybridModelv2_Comparative --sequence-loss --full-sequence --comparative --use-wt-for-downstream --seed 1 --wandb-username immunoteam

    For running inference using the models we provide:

    # IEDB inference
    python infer_IEDB_or_CEDAR.py --infer_dataset IEDB --model HybridModelv2 --model-path ../results/IEDB_model_seed1.pt --full-sequence --seed 1
    
    # CEDAR inference
    python infer_IEDB_or_CEDAR.py --infer_dataset CEDAR --model HybridModel_Comparative --model-path ../results/CEDAR_model_seed2.pt --full-sequence --seed 2
    
    # Clinical inference
    python infer_clinical_only.py --model HybridModel_Comparative --model-path ../results/CEDAR_model_seed2.pt --full-sequence

(back to top)

Troubleshooting

Common Issues

GLIBCXX Error

ImportError: $some_path/libstdc++.so.6: version 'GLIBCXX_3.4.29' not found

Solution: add your conda environment's lib directory to LD_LIBRARY_PATH:

conda activate immuno
echo $CONDA_PREFIX
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

CUDA Compatibility Issues

  • Ensure your CUDA version matches the PyTorch installation
  • Verify GPU availability with torch.cuda.is_available()
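
A quick way to check both at once, from within the immuno environment:

# Print the PyTorch build, its CUDA version, and whether a GPU is visible.
import torch

print(torch.__version__)          # e.g. 2.1.2+cu118
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # True if a compatible GPU and driver are found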

Memory Issues

  • Reduce batch size in training scripts
  • Use gradient checkpointing for large models
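
If you go the gradient-checkpointing route, the generic PyTorch pattern looks like the sketch below (illustrative; it is not already wired into the training scripts):

# Trade compute for memory: recompute a block's activations during backward.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 256))
x = torch.randn(32, 256, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # activations are recomputed in backward
y.sum().backward()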

Wandb Authentication

  • Login to Wandb: wandb login
  • Ensure project names match between script and Wandb dashboard

(back to top)

License

Distributed under the Yale License. See LICENSE.txt for more information.

(back to top)

Contact

Krishnaswamy Lab - @KrishnaswamyLab

Project Link: https://github.com/KrishnaswamyLab/ImmunoStruct

(back to top)