ImmunoStruct

ImmunoStruct enables multimodal deep learning for immunogenicity prediction


In case you don't have access to Nature, here are the main paper and the supplementary materials.

Table of Contents
  1. News
  2. About The Project
  3. Citation
  4. Getting Started
  5. Usage
  6. Troubleshooting
  7. License
  8. Contact

News

Media coverage is available in English and Chinese (news articles and wiki entries).

☐ TODO: create and release an end-to-end tool.
✅ Feb 20, 2026: The datasets and model weights are now open-sourced on huggingface. See instructions.
✅ Dec 31, 2025: Published in Nature Machine Intelligence.
✅ Dec 04, 2025: Informally presented at NeurIPS 2025 (not a formal submission, so no dual-submission concern).
✅ Aug 18, 2025: Received the Colton Innovation Fund from the Colton Center for Autoimmunity at Yale University.
✅ May 06, 2025: Submitted to Nature Machine Intelligence.
✅ Nov 05, 2024: Presented at MoML@MIT 2024 (non-archival abstract & poster).
✅ Nov 01, 2024: Preprint released.

About The Project

ImmunoStruct Architecture

ImmunoStruct is a multimodal deep learning framework that integrates sequence, structural, and biochemical information to predict multi-allele class-I peptide-MHC immunogenicity. By leveraging multimodal data from 26,049 peptide-MHCs and jointly modeling sequence and structure, ImmunoStruct significantly improves immunogenicity prediction performance for both infectious disease epitopes and cancer neoepitopes.
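
For intuition only, here is a minimal late-fusion sketch of the multimodal idea (combining per-sample sequence, structure, and biochemical-property embeddings). It is not the actual ImmunoStruct architecture; all module names and dimensions are placeholders.

    # Illustrative sketch only, not the ImmunoStruct model.
    # Each modality is assumed to be pre-encoded into a fixed-size embedding.
    import torch
    import torch.nn as nn

    class ToyMultimodalHead(nn.Module):
        def __init__(self, seq_dim=128, struct_dim=128, prop_dim=32, hidden=256):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Linear(seq_dim + struct_dim + prop_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),  # immunogenicity logit
            )

        def forward(self, seq_emb, struct_emb, prop_feat):
            # Concatenate the three modality embeddings and classify.
            return self.fuse(torch.cat([seq_emb, struct_emb, prop_feat], dim=-1))

    # Usage with random tensors standing in for encoder outputs.
    head = ToyMultimodalHead()
    logit = head(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 32))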

(back to top)

Key Features

  • Multimodal Integration: Combines peptide-MHC protein sequence, structure, and biochemical properties
  • Novel Cancer-Wildtype Contrastive Learning: Enhances specificity for cancer neoepitope detection
  • Enhanced Interpretability: Provides insights into the substructural basis of immunogenicity
Contrastive Learning Approach Visualizations
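
For intuition, a generic margin-based contrastive objective between a cancer neoepitope embedding and its paired wildtype embedding might look like the sketch below. The exact formulation used by ImmunoStruct may differ; see the paper and the training scripts for the actual loss.

    # Illustrative sketch of a cancer-vs-wildtype contrastive term; not the exact ImmunoStruct loss.
    import torch
    import torch.nn.functional as F

    def toy_cancer_wildtype_contrastive(z_cancer, z_wildtype, margin=1.0):
        # Encourage each cancer embedding to sit at least `margin` away from its wildtype pair,
        # so mutation-specific differences remain separable in the embedding space.
        dist = F.pairwise_distance(z_cancer, z_wildtype)
        return F.relu(margin - dist).mean()

    loss = toy_cancer_wildtype_contrastive(torch.randn(8, 64), torch.randn(8, 64))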

(back to top)

Citation

If you use ImmunoStruct in your research, please cite our paper:

BibTeX:

@article{givechian2026immunostruct,
  title={ImmunoStruct enables multimodal deep learning for immunogenicity prediction},
  author={Givechian, Kevin Bijan and Rocha, Jo{\~a}o Felipe and Liu, Chen and Yang, Edward and Tyagi, Sidharth and Greene, Kerrie and Ying, Rex and Caron, Etienne and Iwasaki, Akiko and Krishnaswamy, Smita},
  journal={Nature Machine Intelligence},
  volume={8},
  pages={70--83},
  year={2026},
  publisher={Nature Publishing Group UK London}
}

Nature format:
Givechian, K.B., Rocha, J.F., Liu, C. et al. ImmunoStruct enables multimodal deep learning for immunogenicity prediction. Nat Mach Intell 8, 70–83 (2026). https://doi.org/10.1038/s42256-025-01163-y

(back to top)

Getting Started

To get ImmunoStruct up and running locally, follow these steps.

Prerequisites

Before installation, ensure you have:

  • CUDA-compatible GPU (recommended)
  • Conda package manager
  • Weights & Biases account for experiment tracking

Installation

  1. Clone the repository

    git clone https://github.com/KrishnaswamyLab/ImmunoStruct.git
    cd ImmunoStruct
  2. Create conda environment and install dependencies

    conda create --name immuno python=3.8 -c anaconda -c conda-forge -y
    conda activate immuno
    conda install cudatoolkit=11.2 wandb pydantic -c conda-forge -y
    conda install scikit-image pillow matplotlib seaborn tqdm -c anaconda -y
    python -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
    python -m pip install dgl -f https://data.dgl.ai/wheels/torch-2.1/cu118/repo.html
    python -m pip install torchdata==0.7.1
    python -m pip install torch-scatter==2.1.2+pt21cu118 torch-sparse==0.6.18+pt21cu118 torch-cluster==1.6.3+pt21cu118 torch-spline-conv==1.2.2+pt21cu118 torch_geometric==2.5.3 numpy==1.21.1 -f https://data.pyg.org/whl/torch-2.1.2+cu118.html
    python -m pip install jax==0.2.25 jaxlib==0.1.69+cuda111 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
    python -m pip install "alphafold-colabfold==2.0.0" "colabfold==1.2.0" "dm-haiku==0.0.4"
    python -m pip install "biopython==1.78"
    python -m pip install graphein[extras]
    python -m pip install lifelines
    python -m pip install huggingface_hub
    python -m pip install ipykernel
    python -m pip install ipywidgets

    The following steps might be necessary if you encounter problems running inference. They fix package incompatibilities that we resolved manually (a helper sketch follows the list):

    • Go to /path/to/environment/lib/python3.8/site-packages/jaxlib/xla_client.py: change np.object to object.
    • Go to /path/to/environment/lib/python3.8/site-packages/alphafold/common/residue_constants.py: change np.int to np.int32.
    • Go to /path/to/environment/lib/python3.8/site-packages/alphafold/data/templates.py: change np.object to object.
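
    If you prefer to apply these edits programmatically, the small helper below is one way to do it (illustrative; adjust the environment path to your own installation before running):

    # Illustrative helper: apply the three compatibility patches above in one go.
    # Adjust ENV to your own conda environment path.
    import re
    from pathlib import Path

    ENV = Path("/path/to/environment/lib/python3.8/site-packages")
    patches = [
        (ENV / "jaxlib/xla_client.py", r"\bnp\.object\b", "object"),
        (ENV / "alphafold/common/residue_constants.py", r"\bnp\.int\b", "np.int32"),
        (ENV / "alphafold/data/templates.py", r"\bnp\.object\b", "object"),
    ]
    for path, pattern, replacement in patches:
        # Word-boundary regexes avoid corrupting names like np.int32 or np.object_.
        path.write_text(re.sub(pattern, replacement, path.read_text()))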
  3. Create another conda environment for obtaining MSAs locally. This is only relevant if you want to run your own protein folding.

    conda create --name local_msa python=3.10 -c anaconda -c conda-forge -y
    conda activate local_msa
    conda install cudatoolkit=11.2 wandb pydantic -c conda-forge -y
    conda install scikit-image pillow matplotlib seaborn tqdm -c anaconda -y
    python -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
    pip install colabfold[alphafold]==1.5.5
    pip install jax==0.4.23 jaxlib==0.4.23+cuda11.cudnn86 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
    conda install -c bioconda mmseqs2 -y

(back to top)

Usage

Data Preparation

  1. Download the dataset from huggingface.

    conda activate immuno
    cd ./data/
    hf download ChenLiu1996/ImmunoStruct --repo-type dataset --local-dir ./
  2. Move the pre-trained model weights.

    mkdir ../results/
    mv IEDB_model_seed1.pt ../results/
    mv CEDAR_model_seed2.pt ../results/
  3. Make sure the following files are in the data folder:

    • ImmunoStruct_IEDB_data.csv
    • ImmunoStruct_CEDAR_data_cancer.csv
    • ImmunoStruct_CEDAR_data_wildtype.csv
    • ImmunoStruct_clinical_data.csv
    • ImmunoStruct_clinical_data_survival.csv
    • HLA_allele_sequences.csv
  4. Unzip the graph structure PyTorch files.

    unzip graph_pyg_IEDB.zip
    unzip graph_pyg_CEDAR_cancer.zip
    unzip graph_pyg_CEDAR_wildtype.zip
    unzip graph_pyg_clinical.zip

    Now the following folders should be under the data folder:

    • graph_pyg_IEDB
    • graph_pyg_CEDAR_cancer
    • graph_pyg_CEDAR_wildtype
    • graph_pyg_clinical
  5. If you want to customize the graph-building logic, the structure PDB files produced by AlphaFold2 are also available through the same huggingface download command (see the sketch after this list). Unzip the corresponding zip files and you will have the following folders.

    • alphafold2_pdb_IEDB
    • alphafold2_pdb_CEDAR_cancer
    • alphafold2_pdb_CEDAR_wildtype
    • alphafold2_pdb_clinical
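
If you want to inspect or re-process the provided PDB files with your own graph-building logic, a minimal sketch using Biopython (installed in the immuno environment) is shown below; the file path is hypothetical.

    # Illustrative: read one of the unzipped AlphaFold2 PDB files with Biopython.
    # Replace the path with an actual file from the alphafold2_pdb_* folders.
    from Bio.PDB import PDBParser

    parser = PDBParser(QUIET=True)
    structure = parser.get_structure("pmhc", "alphafold2_pdb_IEDB/example.pdb")
    for chain in structure[0]:
        print(chain.id, len(list(chain.get_residues())))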

AlphaFold2 Structure Data

We provide the structure data encoded as PyTorch Geometric (PyG) graphs on huggingface; simply follow the instructions in the Data Preparation section above. You can skip the rest of this section if you are not planning to fold your own data.
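
If you just want to inspect the provided graphs, a minimal sketch is below. It assumes each .pt file stores a torch_geometric Data object saved with torch.save; the file name is hypothetical.

    # Illustrative: load and inspect one of the provided PyG graph files.
    import torch

    graph = torch.load("data/graph_pyg_IEDB/example.pt")  # replace with a real file name
    print(graph)                              # e.g. Data(x=[...], edge_index=[2, ...], ...)
    print(graph.num_nodes, graph.num_edges)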

How the PyG graphs are generated


The PyG graphs are generated by a preprocessing pipeline under immunostruct/preprocessing: MSA preparation (steps 1-3), AlphaFold2 structure prediction (step 4), and PDB organization plus PyG graph construction (steps 5-6). The generation scripts are available in case you ever need to run some or all of them.

  • Option 1 is easier to run, but it is slow and rate-limited.
  • Option 2 involves more steps, but it is better suited to larger datasets (>2000 sequences).
  1. Option 1: Using the online MSA server (slow, rate-limited, not recommended for >2000 sequences). Start from the ImmunoStruct root folder.

    # [CPU] Step 1-3. Prepare MSA for AlphaFold.
    # Download colabfold and REMEMBER where it is downloaded to. Likely default to `~/.cache/colabfold`.
    python -m colabfold.download
    
    # Prepare MSA.
    conda activate immuno
    cd immunostruct/preprocessing
    python step1-3_server_sequence_to_msa.py \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --output-dir ../../data/pdb_files/IEDB/ \
        --tmp-dir /tmp/ \
        --allele-col-name allele \
        --peptide-col-name peptide
    
    python step1-3_server_sequence_to_msa.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --output-dir ../../data/pdb_files/CEDAR_cancer/ \
        --tmp-dir /tmp/ \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    python step1-3_server_sequence_to_msa.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --output-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --tmp-dir /tmp/ \
        --allele-col-name allele \
        --peptide-col-name wt_pep
    
    python step1-3_server_sequence_to_msa.py \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --output-dir ../../data/pdb_files/clinical/ \
        --tmp-dir /tmp/ \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    # [GPU] Step 4. AlphaFold2.
    # start/end help run multiple jobs in parallel.
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --output-dir ../../data/pdb_files/IEDB/ \
        --start 0 --end 24540 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name peptide
    
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --output-dir ../../data/pdb_files/CEDAR_cancer/ \
        --start 0 --end 2801 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --output-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --start 0 --end 2801 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name wt_pep
    
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --output-dir ../../data/pdb_files/clinical/ \
        --start 0 --end 29485 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    # [CPU] Step 5. Moving and renaming the structure data in PDB files.
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/IEDB/ \
        --output-dir ../../data/alphafold2_pdb_IEDB/
    
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/CEDAR_cancer/ \
        --output-dir ../../data/alphafold2_pdb_CEDAR_cancer/
    
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --output-dir ../../data/alphafold2_pdb_CEDAR_wildtype/
    
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/clinical/ \
        --output-dir ../../data/alphafold2_pdb_clinical/
    
    # [CPU] Step 6. Generating PyG graphs (structures in PDB files to structures in PyTorch .pt files).
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_IEDB/ \
        --output-dir ../../data/graph_pyg_IEDB/
    
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_CEDAR_cancer/ \
        --output-dir ../../data/graph_pyg_CEDAR_cancer/
    
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_CEDAR_wildtype/ \
        --output-dir ../../data/graph_pyg_CEDAR_wildtype/
    
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_clinical/ \
        --output-dir ../../data/graph_pyg_clinical/
  2. Option 2: Performing MSA locally (the approach we used). Start from the ImmunoStruct root folder.

    # [CPU] Step 1-3. Prepare MSA for AlphaFold.
    # Download colabfold and REMEMBER where it is downloaded to.
    python -m colabfold.download
    
    # Download the MSA database locally.
    mkdir ./database_msa/
    cd ./database_msa/
    wget https://wwwuser.gwdg.de/~compbiol/colabfold/uniref30_2302.tar.gz
    tar -xzvf uniref30_2302.tar.gz
    mmseqs tsv2exprofiledb uniref30_2302 uniref30_2302_db
    wget https://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz
    tar -xzvf colabfold_envdb_202108.tar.gz
    mmseqs tsv2exprofiledb colabfold_envdb_202108 colabfold_envdb_202108_db
    cd ..
    
    # Prepare MSA.
    conda activate local_msa
    cd immunostruct/preprocessing
    python step1_local_sequence_to_fasta.py \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --output-fasta ../../data/fasta/ImmunoStruct_IEDB_data.fasta \
        --allele-col-name allele \
        --peptide-col-name peptide
    python step2_local_fasta_to_a3m.py \
        --input-fasta ../../data/fasta/ImmunoStruct_IEDB_data.fasta \
        --msa-database-dir ../../database_msa/ \
        --output-dir ../../data/a3m/IEDB/
    python step3_local_a3m_to_msa.py \
        --input-dir ../../data/a3m/IEDB/ \
        --output-dir ../../data/pdb_files/IEDB/ \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --allele-col-name allele \
        --peptide-col-name peptide
    
    python step1_local_sequence_to_fasta.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --output-fasta ../../data/fasta/ImmunoStruct_CEDAR_data_cancer.fasta \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    python step2_local_fasta_to_a3m.py \
        --input-fasta ../../data/fasta/ImmunoStruct_CEDAR_data_cancer.fasta \
        --msa-database-dir ../../database_msa/ \
        --output-dir ../../data/a3m/CEDAR_cancer/
    python step3_local_a3m_to_msa.py \
        --input-dir ../../data/a3m/CEDAR_cancer/ \
        --output-dir ../../data/pdb_files/CEDAR_cancer/ \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    python step1_local_sequence_to_fasta.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --output-fasta ../../data/fasta/ImmunoStruct_CEDAR_data_wildtype.fasta \
        --allele-col-name allele \
        --peptide-col-name wt_pep
    python step2_local_fasta_to_a3m.py \
        --input-fasta ../../data/fasta/ImmunoStruct_CEDAR_data_wildtype.fasta \
        --msa-database-dir ../../database_msa/ \
        --output-dir ../../data/a3m/CEDAR_wildtype/
    python step3_local_a3m_to_msa.py \
        --input-dir ../../data/a3m/CEDAR_wildtype/ \
        --output-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --allele-col-name allele \
        --peptide-col-name wt_pep
    
    python step1_local_sequence_to_fasta.py \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --output-fasta ../../data/fasta/ImmunoStruct_clinical_data.fasta \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    python step2_local_fasta_to_a3m.py \
        --input-fasta ../../data/fasta/ImmunoStruct_clinical_data.fasta \
        --msa-database-dir ../../database_msa/ \
        --output-dir ../../data/a3m/clinical/
    python step3_local_a3m_to_msa.py \
        --input-dir ../../data/a3m/clinical/ \
        --output-dir ../../data/pdb_files/clinical/ \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    # [GPU] Step 4. AlphaFold2.
    # start/end help run multiple jobs in parallel.
    conda deactivate
    conda activate immuno
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_IEDB_data.csv \
        --output-dir ../../data/pdb_files/IEDB/ \
        --start 0 --end 24540 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name peptide
    
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_cancer.csv \
        --output-dir ../../data/pdb_files/CEDAR_cancer/ \
        --start 0 --end 2801 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_CEDAR_data_wildtype.csv \
        --output-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --start 0 --end 2801 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name wt_pep
    
    python step4_msa_to_pdb.py \
        --input-csv ../../data/ImmunoStruct_clinical_data.csv \
        --output-dir ../../data/pdb_files/clinical/ \
        --start 0 --end 29485 \
        --params-loc /path/to/colabfold \
        --allele-col-name allele \
        --peptide-col-name mut_pep
    
    # [CPU] Step 5. Moving and renaming the structure data in PDB files.
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/IEDB/ \
        --output-dir ../../data/alphafold2_pdb_IEDB/
    
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/CEDAR_cancer/ \
        --output-dir ../../data/alphafold2_pdb_CEDAR_cancer/
    
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/CEDAR_wildtype/ \
        --output-dir ../../data/alphafold2_pdb_CEDAR_wildtype/
    
    python step5_rename_pdb.py \
        --input-dir ../../data/pdb_files/clinical/ \
        --output-dir ../../data/alphafold2_pdb_clinical/
    
    # [CPU] Step 6. Generating PyG graphs (structures in PDB files to structures in PyTorch .pt files).
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_IEDB/ \
        --output-dir ../../data/graph_pyg_IEDB/
    
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_CEDAR_cancer/ \
        --output-dir ../../data/graph_pyg_CEDAR_cancer/
    
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_CEDAR_wildtype/ \
        --output-dir ../../data/graph_pyg_CEDAR_wildtype/
    
    python step6_pdb_to_pyg.py \
        --input-dir ../../data/alphafold2_pdb_clinical/ \
        --output-dir ../../data/graph_pyg_clinical/

Training and Testing

  1. Activate the environment

    conda activate immuno
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
  2. Set up Weights & Biases

    Create a project on Weights & Biases whose name matches the project name used by the training scripts.
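
    The snippet below is an optional sanity check that your W&B credentials and project are set up (the project name here is a placeholder; use whatever name the training script expects):

    # Optional sanity check for the Weights & Biases setup.
    import wandb

    wandb.login()  # or run `wandb login` once in the shell
    run = wandb.init(project="your_project_name", entity="YOUR_WANDB_USERNAME")
    run.finish()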

  3. Run Experiments

    NOTE: these scripts are deprecated; see immunostruct/old_scripts.

    # Sequence + structure + biochemical property + multimodal multihead attention
    python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --sequence-loss --model HybridModelv2 --wandb-username YOUR_WANDB_USERNAME
    
    # Sequence + structure + biochemical property
    python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --sequence-loss --model HybridModel --wandb-username YOUR_WANDB_USERNAME
    
    # Sequence + biochemical property
    python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --sequence-loss --model SequenceFpModel --wandb-username YOUR_WANDB_USERNAME
    
    # Sequence-only model
    python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --sequence-loss --model SequenceModel --wandb-username YOUR_WANDB_USERNAME
    
    # Structure-only model
    python train_PropIEDB_PropCancer_ImmunoCancer.py --full-sequence --model StructureModel --wandb-username YOUR_WANDB_USERNAME
  4. Our main experiments

    These are example commands for training ImmunoStruct.

    # IEDB training
    python train_IEDB_wFT.py --model HybridModelv2 --sequence-loss --full-sequence --seed 1 --wandb-username immunoteam
    
    # CEDAR training
    python train_CEDAR_wFT.py --model HybridModelv2_Comparative --sequence-loss --full-sequence --comparative --use-wt-for-downstream --seed 1 --wandb-username immunoteam

    For running inference using the models we provide:

    # IEDB inference
    python infer_IEDB_or_CEDAR.py --infer_dataset IEDB --model HybridModelv2 --model-path ../results/IEDB_model_seed1.pt --full-sequence --seed 1
    
    # CEDAR inference
    python infer_IEDB_or_CEDAR.py --infer_dataset CEDAR --model HybridModel_Comparative --model-path ../results/CEDAR_model_seed2.pt --full-sequence --seed 2
    
    # Clinical inference
    python infer_clinical_only.py --model HybridModel_Comparative --model-path ../results/CEDAR_model_seed2.pt --full-sequence

(back to top)

Troubleshooting

Common Issues

GLIBCXX Error

ImportError: $some_path/libstdc++.so.6: version 'GLIBCXX_3.4.29' not found

Solution: add your conda environment's lib directory to LD_LIBRARY_PATH:

conda activate immuno
echo $CONDA_PREFIX
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

CUDA Compatibility Issues

  • Ensure your CUDA version matches the PyTorch installation
  • Verify GPU availability with torch.cuda.is_available()
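
A quick way to check both at once, from within the immuno environment:

# Print the PyTorch build, its CUDA version, and whether a GPU is visible.
import torch

print(torch.__version__)          # e.g. 2.1.2+cu118
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # True if a compatible GPU and driver are found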

Memory Issues

  • Reduce batch size in training scripts
  • Use gradient checkpointing for large models
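
If you go the gradient-checkpointing route, the generic PyTorch pattern looks like the sketch below (illustrative; it is not already wired into the training scripts):

# Trade compute for memory: recompute a block's activations during backward.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 256))
x = torch.randn(32, 256, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # activations are recomputed in backward
y.sum().backward()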

Wandb Authentication

  • Login to Wandb: wandb login
  • Ensure project names match between script and Wandb dashboard

(back to top)

License

Distributed under the Yale License. See LICENSE.txt for more information.

(back to top)

Contact

Krishnaswamy Lab - @KrishnaswamyLab

Project Link: https://github.com/KrishnaswamyLab/ImmunoStruct

(back to top)