In case you don't have access to Nature, here are the main paper and the supplementary materials.
☐ TODO: create and release an end-to-end tool.
✅ Feb 20, 2026: The datasets and model weights are now open-sourced on huggingface. See instructions.
✅ Dec 31, 2025: Published in Nature Machine Intelligence.
✅ Dec 04, 2025: Informally presented at NeurIPS 2025 (did not submit, no dual-submission concern).
✅ Aug 18, 2025: Received the Colton Innovation Fund from Colton Center for Autoimmunity at Yale University.
✅ May 06, 2025: Submitted to Nature Machine Intelligence.
✅ Nov 05, 2024: Presented at MoML@MIT 2024 (non-archival abstract & poster).
✅ Nov 01, 2024: Preprint released.
ImmunoStruct is a multimodal deep learning framework that integrates sequence, structural, and biochemical information to predict multi-allele class-I peptide-MHC immunogenicity. By leveraging multimodal data from 26,049 peptide-MHCs and jointly modeling sequence and structure, ImmunoStruct significantly improves immunogenicity prediction performance for both infectious disease epitopes and cancer neoepitopes.
- Multimodal Integration: Combines peptide-MHC protein sequence, structure, and biochemical properties
- Novel Cancer-Wildtype Contrastive Learning: Enhances specificity for cancer neoepitope detection
- Enhanced Interpretability: Provides insights into the substructural basis of immunogenicity
If you use ImmunoStruct in your research, please cite our paper:
BibTeX:
@article{givechian2026immunostruct,
title={ImmunoStruct enables multimodal deep learning for immunogenicity prediction},
author={Givechian, Kevin Bijan and Rocha, Jo{\~a}o Felipe and Liu, Chen and Yang, Edward and Tyagi, Sidharth and Greene, Kerrie and Ying, Rex and Caron, Etienne and Iwasaki, Akiko and Krishnaswamy, Smita},
journal={Nature Machine Intelligence},
volume={8},
pages={70--83},
year={2026},
publisher={Nature Publishing Group UK London}
}
Nature format:
Givechian, K.B., Rocha, J.F., Liu, C. et al. ImmunoStruct enables multimodal deep learning for immunogenicity prediction. Nat Mach Intell 8, 70–83 (2026). https://doi.org/10.1038/s42256-025-01163-y
To get ImmunoStruct up and running locally, follow these steps.
Before installation, ensure you have:
- CUDA-compatible GPU (recommended)
- Conda package manager
- Weights & Biases account for experiment tracking
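As a quick sanity check of the command-line prerequisites above, a minimal sketch along these lines could be used. The helper name `check_prerequisites` is hypothetical and not part of ImmunoStruct, and it assumes that the presence of `nvidia-smi` on `PATH` is a reasonable proxy for an installed NVIDIA driver. The Weights & Biases account cannot be checked this way.

```python
import shutil


def check_prerequisites(tools=("conda", "nvidia-smi")):
    """Map each command-line tool to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in check_prerequisites().items():
    status = f"found at {path}" if path else "NOT FOUND on PATH"
    print(f"{tool}: {status}")
```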
- Clone the repository

      git clone https://github.com/KrishnaswamyLab/ImmunoStruct.git
      cd ImmunoStruct

- Create conda environment and install dependencies

      conda create --name immuno python=3.8 -c anaconda -c conda-forge -y
      conda activate immuno
      conda install cudatoolkit=11.2 wandb pydantic -c conda-forge -y
      conda install scikit-image pillow matplotlib seaborn tqdm -c anaconda -y
      python -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
      python -m pip install dgl -f https://data.dgl.ai/wheels/torch-2.1/cu118/repo.html
      python -m pip install torchdata==0.7.1
      python -m pip install torch-scatter==2.1.2+pt21cu118 torch-sparse==0.6.18+pt21cu118 torch-cluster==1.6.3+pt21cu118 torch-spline-conv==1.2.2+pt21cu118 torch_geometric==2.5.3 numpy==1.21.1 -f https://data.pyg.org/whl/torch-2.1.2+cu118.html
      python -m pip install jax==0.2.25 jaxlib==0.1.69+cuda111 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
      python -m pip install "alphafold-colabfold==2.0.0" "colabfold==1.2.0" "dm-haiku==0.0.4"
      python -m pip install "biopython==1.78"
      python -m pip install graphein[extras]
      python -m pip install lifelines
      python -m pip install huggingface_hub
      python -m pip install ipykernel
      python -m pip install ipywidgets
The following steps might be necessary if you encounter problems running inference. They work around package incompatibilities that we resolved manually:
- In `/path/to/environment/lib/python3.8/site-packages/jaxlib/xla_client.py`, change `np.object` to `object`.
- In `/path/to/environment/lib/python3.8/site-packages/alphafold/common/residue_constants.py`, change `np.int` to `np.int32`.
- In `/path/to/environment/lib/python3.8/site-packages/alphafold/data/templates.py`, change `np.object` to `object`.
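The manual edits above could also be scripted. The following is a minimal sketch, not part of the repository: the names `PATCHES`, `replace_alias`, and `patch_site_packages` are hypothetical, and it assumes that a whole-word textual replacement is what the manual fix amounts to. Because it rewrites every whole-word occurrence (including any in comments or strings), back up the files before running it.

```python
import re
from pathlib import Path

# (path relative to site-packages, deprecated NumPy alias, replacement),
# taken from the three manual fixes listed above.
PATCHES = [
    ("jaxlib/xla_client.py", "np.object", "object"),
    ("alphafold/common/residue_constants.py", "np.int", "np.int32"),
    ("alphafold/data/templates.py", "np.object", "object"),
]


def replace_alias(source: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of `old`, leaving e.g. `np.int32` or `np.object_` alone."""
    pattern = r"\b" + re.escape(old) + r"\b"
    return re.sub(pattern, lambda _match: new, source)


def patch_site_packages(site_packages: Path) -> None:
    """Apply every patch in PATCHES to the files under `site_packages`."""
    for rel_path, old, new in PATCHES:
        target = site_packages / rel_path
        text = target.read_text()
        patched = replace_alias(text, old, new)
        if patched != text:
            target.write_text(patched)
```

A call would look like `patch_site_packages(Path("/path/to/environment/lib/python3.8/site-packages"))`. The word-boundary match makes the replacement safe to run twice, since `np.int32` is not matched again.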
- Create and build another environment for obtaining MSAs locally. Only relevant if you want to run your own protein folding.

      conda create --name local_msa python=3.10 -c anaconda -c conda-forge -y
      conda activate local_msa
      conda install cudatoolkit=11.2 wandb pydantic -c conda-forge -y
      conda install scikit-image pillow matplotlib seaborn tqdm -c anaconda -y
      python -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
      pip install colabfold[alphafold]==1.5.5
      pip install jax==0.4.23 jaxlib==0.4.23+cuda11.cudnn86 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
      conda install -c bioconda mmseqs2 -y
- Download the dataset from huggingface.

      conda activate immuno
      cd ./data/
      hf download ChenLiu1996/ImmunoStruct --repo-type dataset --local-dir ./

- Move the pre-trained model weights.

      mkdir ../results/
      mv IEDB_model_seed1.pt ../results/
      mv CEDAR_model_seed2.pt ../results/
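The download and move steps above can also be done from Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the paths are given relative to the repository root rather than to `./data/`, and `snapshot_download` from `huggingface_hub` is used as a stand-in for the `hf download` command.

```python
import shutil
from pathlib import Path

# Weight files named in the move step above.
WEIGHT_FILES = ("IEDB_model_seed1.pt", "CEDAR_model_seed2.pt")


def download_dataset(data_dir: str = "./data") -> None:
    """Fetch the dataset repository into `data_dir` (requires network access)."""
    # Imported lazily so move_weights() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="ChenLiu1996/ImmunoStruct",
        repo_type="dataset",
        local_dir=data_dir,
    )


def move_weights(data_dir: str = "./data", results_dir: str = "./results") -> list:
    """Move any downloaded weight files into `results_dir`; return the names moved."""
    results = Path(results_dir)
    results.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in WEIGHT_FILES:
        src = Path(data_dir) / name
        if src.exists():
            shutil.move(str(src), str(results / name))
            moved.append(name)
    return moved
```

Calling `download_dataset()` followed by `move_weights()` from the repository root should leave the weights where the inference commands expect them (`../results/` relative to the script folder).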
- Make sure the following files are in the `data` folder:
  - `ImmunoStruct_IEDB_data.csv`
  - `ImmunoStruct_CEDAR_data_cancer.csv`
  - `ImmunoStruct_CEDAR_data_wildtype.csv`
  - `ImmunoStruct_clinical_data.csv`
  - `ImmunoStruct_clinical_data_survival.csv`
  - `HLA_allele_sequences.csv`
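A short script can confirm that the expected CSV files are present. This is an illustrative sketch only; `EXPECTED_FILES` and `missing_data_files` are hypothetical names, and the default path assumes the script is run from the repository root.

```python
from pathlib import Path

# File names copied from the list above.
EXPECTED_FILES = (
    "ImmunoStruct_IEDB_data.csv",
    "ImmunoStruct_CEDAR_data_cancer.csv",
    "ImmunoStruct_CEDAR_data_wildtype.csv",
    "ImmunoStruct_clinical_data.csv",
    "ImmunoStruct_clinical_data_survival.csv",
    "HLA_allele_sequences.csv",
)


def missing_data_files(data_dir: str = "./data") -> list:
    """Return the expected file names that are not present in `data_dir`."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_data_files()` means every file was found.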
- Unzip the graph structure PyTorch files.

      unzip graph_pyg_IEDB.zip
      unzip graph_pyg_CEDAR_cancer.zip
      unzip graph_pyg_CEDAR_wildtype.zip
      unzip graph_pyg_clinical.zip

  Now the following folders should be under the `data` folder:
  - `graph_pyg_IEDB`
  - `graph_pyg_CEDAR_cancer`
  - `graph_pyg_CEDAR_wildtype`
  - `graph_pyg_clinical`
- If you want to customize the graph-building logic, the structure PDB files produced by AlphaFold2 are included in the same huggingface download. Unzip the corresponding zip files and you will have the following folders:
  - `alphafold2_pdb_IEDB`
  - `alphafold2_pdb_CEDAR_cancer`
  - `alphafold2_pdb_CEDAR_wildtype`
  - `alphafold2_pdb_clinical`
We have provided the structure data encoded as PyTorch Geometric (PyG) graphs on huggingface. You only need to follow the instructions in the previous Data Preparation section. You can skip this section if you do not plan to fold your own data.
How the PyG graphs are generated
The PyG graphs are generated using a three-step process under immunostruct/preprocessing. The generation scripts are available in case you ever need to run some or all of them.
- Option 1 is easy to perform, but it's slow and rate-limited.
- Option 2 involves more steps, but it is better suited to larger datasets (>2000 sequences).
- Option 1: Using the online MSA server (slow, rate-limited, not recommended for >2000 sequences). Starting at the `ImmunoStruct` root folder.

      # [CPU] Step 1-3. Prepare MSA for AlphaFold.
      # Download colabfold and REMEMBER where it is downloaded to. Likely default to `~/.cache/colabfold`.
      python -m colabfold.download

      # Prepare MSA.
      conda activate immuno
      cd immunostruct/preprocessing

      python step1-3_server_sequence_to_msa.py \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --output-dir ../../data/pdb_files/IEDB/ \
        --tmp-dir /tmp/ \
        --allele-col-name allele \
        --peptide-col-name peptide

      python step1-3_server_sequence_to_msa.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --output-dir ../../data/pdb_files/CEDAR_cancer/ \
        --tmp-dir /tmp/ \
        --allele-col-name allele \
        --peptide-col-name mut_pep

      python step1-3_server_sequence_to_msa.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --output-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --tmp-dir /tmp/ \
        --allele-col-name allele \
        --peptide-col-name wt_pep

      python step1-3_server_sequence_to_msa.py \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --output-dir ../../data/pdb_files/clinical/ \
        --tmp-dir /tmp/ \
        --allele-col-name allele \
        --peptide-col-name mut_pep

      # [GPU] Step 4. AlphaFold2.
      # start/end help run multiple jobs in parallel.
      python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --output-dir ../../data/pdb_files/IEDB/ \
        --start 0 --end 24540 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name peptide

      python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --output-dir ../../data/pdb_files/CEDAR_cancer/ \
        --start 0 --end 2801 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name mut_pep

      python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --output-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --start 0 --end 2801 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name wt_pep

      python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --output-dir ../../data/pdb_files/clinical/ \
        --start 0 --end 29485 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name mut_pep

      # [CPU] Step 5. Moving and renaming the structure data in PDB files.
      python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/IEDB/ \
        --output-dir ../../data/alphafold2_pdb_IEDB/

      python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/CEDAR_cancer/ \
        --output-dir ../../data/alphafold2_pdb_CEDAR_cancer/

      python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --output-dir ../../data/alphafold2_pdb_CEDAR_wildtype/

      python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/clinical/ \
        --output-dir ../../data/alphafold2_pdb_clinical/

      # [CPU] Step 6. Generating PyG graphs (structures in PDB files to structures in PyTorch .pt files).
      python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_IEDB/ \
        --output-dir ../../data/graph_pyg_IEDB/

      python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_CEDAR_cancer/ \
        --output-dir ../../data/graph_pyg_CEDAR_cancer/

      python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_CEDAR_wildtype/ \
        --output-dir ../../data/graph_pyg_CEDAR_wildtype/

      python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_clinical/ \
        --output-dir ../../data/graph_pyg_clinical/
- Option 2: Performing MSA locally (what we did). Starting at the `ImmunoStruct` root folder.

      # [CPU] Step 1-3. Prepare MSA for AlphaFold.
      # Download colabfold and REMEMBER where it is downloaded to.
      python -m colabfold.download

      # Download the MSA database locally.
      mkdir ./database_msa/
      cd ./database_msa/
      wget https://wwwuser.gwdg.de/~compbiol/colabfold/uniref30_2302.tar.gz
      tar -xzvf uniref30_2302.tar.gz
      mmseqs tsv2exprofiledb uniref30_2302 uniref30_2302_db
      wget https://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz
      tar -xzvf colabfold_envdb_202108.tar.gz
      mmseqs tsv2exprofiledb colabfold_envdb_202108 colabfold_envdb_202108_db
      cd ..

      # Prepare MSA.
      conda activate local_msa
      cd immunostruct/preprocessing

      python step1_local_sequence_to_fasta.py \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --output-fasta ../../data/fasta/ImmunoStruct_IEDB_data.fasta \
        --allele-col-name allele \
        --peptide-col-name peptide
      python step2_local_fasta_to_a3m.py \
        --input-fasta ../../data/fasta/ImmunoStruct_IEDB_data.fasta \
        --msa-database-dir ../../database_msa/ \
        --output-dir ../../data/a3m/IEDB/
      python step3_local_a3m_to_msa.py \
        --input-dir ../../data/a3m/IEDB/ \
        --output-dir ../../data/pdb_files/IEDB/ \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --allele-col-name allele \
        --peptide-col-name peptide

      python step1_local_sequence_to_fasta.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --output-fasta ../../data/fasta/ImmunoStruct_CEDAR_data_cancer.fasta \
        --allele-col-name allele \
        --peptide-col-name mut_pep
      python step2_local_fasta_to_a3m.py \
        --input-fasta ../../data/fasta/ImmunoStruct_CEDAR_data_cancer.fasta \
        --msa-database-dir ../../database_msa/ \
        --output-dir ../../data/a3m/CEDAR_cancer/
      python step3_local_a3m_to_msa.py \
        --input-dir ../../data/a3m/CEDAR_cancer/ \
        --output-dir ../../data/pdb_files/CEDAR_cancer/ \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --allele-col-name allele \
        --peptide-col-name mut_pep

      python step1_local_sequence_to_fasta.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --output-fasta ../../data/fasta/ImmunoStruct_CEDAR_data_wildtype.fasta \
        --allele-col-name allele \
        --peptide-col-name wt_pep
      python step2_local_fasta_to_a3m.py \
        --input-fasta ../../data/fasta/ImmunoStruct_CEDAR_data_wildtype.fasta \
        --msa-database-dir ../../database_msa/ \
        --output-dir ../../data/a3m/CEDAR_wildtype/
      python step3_local_a3m_to_msa.py \
        --input-dir ../../data/a3m/CEDAR_wildtype/ \
        --output-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --allele-col-name allele \
        --peptide-col-name wt_pep

      python step1_local_sequence_to_fasta.py \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --output-fasta ../../data/fasta/ImmunoStruct_clinical_data.fasta \
        --allele-col-name allele \
        --peptide-col-name mut_pep
      python step2_local_fasta_to_a3m.py \
        --input-fasta ../../data/fasta/ImmunoStruct_clinical_data.fasta \
        --msa-database-dir ../../database_msa/ \
        --output-dir ../../data/a3m/clinical/
      python step3_local_a3m_to_msa.py \
        --input-dir ../../data/a3m/clinical/ \
        --output-dir ../../data/pdb_files/clinical/ \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --allele-col-name allele \
        --peptide-col-name mut_pep

      # [GPU] Step 4. AlphaFold2.
      # start/end help run multiple jobs in parallel.
      conda deactivate
      conda activate immuno

      python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --output-dir ../../data/pdb_files/IEDB/ \
        --start 0 --end 24540 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name peptide

      python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --output-dir ../../data/pdb_files/CEDAR_cancer/ \
        --start 0 --end 2801 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name mut_pep

      python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --output-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --start 0 --end 2801 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name wt_pep

      python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --output-dir ../../data/pdb_files/clinical/ \
        --start 0 --end 29485 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name mut_pep

      # [CPU] Step 5. Moving and renaming the structure data in PDB files.
      python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/IEDB/ \
        --output-dir ../../data/alphafold2_pdb_IEDB/

      python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/CEDAR_cancer/ \
        --output-dir ../../data/alphafold2_pdb_CEDAR_cancer/

      python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --output-dir ../../data/alphafold2_pdb_CEDAR_wildtype/

      python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/clinical/ \
        --output-dir ../../data/alphafold2_pdb_clinical/

      # [CPU] Step 6. Generating PyG graphs (structures in PDB files to structures in PyTorch .pt files).
      python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_IEDB/ \
        --output-dir ../../data/graph_pyg_IEDB/

      python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_CEDAR_cancer/ \
        --output-dir ../../data/graph_pyg_CEDAR_cancer/

      python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_CEDAR_wildtype/ \
        --output-dir ../../data/graph_pyg_CEDAR_wildtype/

      python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_clinical/ \
        --output-dir ../../data/graph_pyg_clinical/
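Since `--start` and `--end` in Step 4 exist to split the folding work across parallel jobs, the ranges can be generated rather than computed by hand. The sketch below is illustrative only: `chunk_ranges` and `step4_commands` are hypothetical helpers, and it assumes `--end` is exclusive (half-open ranges), which should be checked against `step4_msa_to_pdb.py` before use.

```python
def chunk_ranges(total: int, n_jobs: int) -> list:
    """Split [0, total) into up to n_jobs contiguous (start, end) ranges of near-equal size."""
    base, extra = divmod(total, n_jobs)
    ranges, start = [], 0
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)
        if size == 0:
            continue  # more jobs than rows: skip empty ranges
        ranges.append((start, start + size))
        start += size
    return ranges


def step4_commands(input_csv, output_dir, total, n_jobs, params_loc,
                   allele_col="allele", peptide_col="peptide"):
    """Build one Step 4 command line per range, using the flags shown above."""
    return [
        "python step4_msa_to_pdb.py"
        f" --input-csv {input_csv} --output-dir {output_dir}"
        f" --start {start} --end {end}"  # ASSUMPTION: --end is exclusive
        f" --params-loc {params_loc}"
        f" --allele-col-name {allele_col} --peptide-col-name {peptide_col}"
        for start, end in chunk_ranges(total, n_jobs)
    ]


for cmd in step4_commands("../../data/ImmunoStruct_IEDB_data.csv",
                          "../../data/pdb_files/IEDB/",
                          total=24540, n_jobs=4,
                          params_loc="/path/to/colabfold"):
    print(cmd)
```

Each printed line can then be submitted as a separate GPU job.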
- Activate the environment

      conda activate immuno
      export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
- Set up Weights & Biases

  Create a project on Weights & Biases matching your project name.
- Run Experiments

  NOTE: these are already deprecated. See `immunostruct/old_scripts`.

      # Sequence + structure + biochemical property + multimodal multihead attention
      python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --sequence-loss --model HybridModelv2 --wandb-username YOUR_WANDB_USERNAME

      # Sequence + structure + biochemical property
      python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --sequence-loss --model HybridModel --wandb-username YOUR_WANDB_USERNAME

      # Sequence + biochemical property
      python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --sequence-loss --model SequenceFpModel --wandb-username YOUR_WANDB_USERNAME

      # Sequence-only model
      python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --sequence-loss --model SequenceModel --wandb-username YOUR_WANDB_USERNAME

      # Structure-only model
      python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --model StructureModel --wandb-username YOUR_WANDB_USERNAME
- Our main experiments

  These are examples for training ImmunoStruct.

      # IEDB training
      python train_IEDB_wFT.py --model HybridModelv2 --sequence-loss --full-sequence --seed 1 --wandb-username immunoteam

      # CEDAR training
      python train_CEDAR_wFT.py --model HybridModelv2_Comparative --sequence-loss --full-sequence --comparative --use-wt-for-downstream --seed 1 --wandb-username immunoteam

  For running inference using the models we provide:

      # IEDB inference
      python infer_IEDB_or_CEDAR.py --infer_dataset IEDB --model HybridModelv2 --model-path ../results/IEDB_model_seed1.pt --full-sequence --seed 1

      # CEDAR inference
      python infer_IEDB_or_CEDAR.py --infer_dataset CEDAR --model HybridModel_Comparative --model-path ../results/CEDAR_model_seed2.pt --full-sequence --seed 2

      # Clinical inference
      python infer_clinical_only.py --model HybridModel_Comparative --model-path ../results/CEDAR_model_seed2.pt --full-sequence
GLIBCXX Error
ImportError: $some_path/libstdc++.so.6: version 'GLIBCXX_3.4.29' not found
Solution: Add your conda environment path to LD_LIBRARY_PATH:
conda activate immuno
echo $CONDA_PREFIX
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

CUDA Compatibility Issues
- Ensure your CUDA version matches the PyTorch installation
- Verify GPU availability with `torch.cuda.is_available()`
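Both checks above can be combined in one short report. This is a minimal sketch, with `cuda_report` as a hypothetical helper name; it only reads values that PyTorch exposes (`torch.version.cuda` is the CUDA version the wheel was built for, and is `None` for CPU-only builds), and it degrades gracefully if PyTorch is not installed.

```python
def cuda_report() -> dict:
    """Summarize the installed PyTorch build and whether a GPU is visible to it."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # e.g. "11.8" for the cu118 wheels
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If `built_for_cuda` is set but `cuda_available` is `False`, a mismatch between the NVIDIA driver and the installed wheel is a likely cause.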
Memory Issues
- Reduce batch size in training scripts
- Use gradient checkpointing for large models
Wandb Authentication
- Login to Wandb: `wandb login`
- Ensure project names match between script and Wandb dashboard
Distributed under the Yale License. See LICENSE.txt for more information.
Krishnaswamy Lab - @KrishnaswamyLab
Project Link: https://github.com/KrishnaswamyLab/ImmunoStruct



