
MVCBench: A Multimodal Benchmark for Drug-induced Virtual Cell Phenotypes


A systematic benchmark for evaluating molecular and gene representations in predicting drug-induced multimodal virtual cell phenotypes.

Project Page | Dataset | Manuscript available upon request | Preprint coming soon


Overview

MVCBench is a benchmarking framework for studying how representation choices shape the prediction of drug-induced cellular phenotypes across transcriptional and morphological modalities. It systematically evaluates 24 representation methods spanning 12 drug molecular representations and 12 gene representation methods using nearly 1.1 million drug-induced profiles collected from large-scale transcriptomic and high-content imaging resources.

MVCBench Overview

Figure 1. Overview of MVCBench. The benchmark spans transcriptomic, morphological, and multimodal prediction settings, covering large-scale paired profiles, diverse representation models, and progressive evaluation stages from single-modality prediction to multimodal virtual cell construction.

Key Findings

  • Advanced molecular representations are highly beneficial for predicting drug-induced morphological phenotypes, where 3D-aware and deep learning-based encoders consistently outperform classical molecular fingerprints. By contrast, their gains for transcriptomic response prediction are much smaller, suggesting that chemical structure alone may be insufficient to fully explain gene expression responses.

  • For transcriptomic prediction, task-specific gene representations show clearer advantages than general-purpose foundation models. This indicates that alignment between representation learning objectives and perturbation-response tasks remains critical, even as single-cell foundation models continue to improve.

  • Multimodal integration consistently improves predictive performance over single-modality training. Beyond benchmark scores, MVCBench provides practical guidance for designing multimodal virtual cell systems, including the value of modality-aware optimization and task-dependent fusion strategies.

🧬 Benchmark Zoo

We evaluate widely used drug molecular representation methods and gene representation methods (single-cell foundation models, scFMs).

🧪 Molecule Representation Methods

| Model | Paper | Code |
|-------|-------|------|
| KPGT | Nat. Commun. 2023 | GitHub |
| InfoAlign | ICLR 2025 | GitHub |
| GeminiMol | Adv. Sci. 2024 | GitHub |
| Ouroboros | Adv. Sci. 2026 | GitHub |
| Mole-BERT | ICLR 2023 | GitHub |
| ChemBERTa2 | arXiv 2022 | GitHub |
| MolT5 | EMNLP 2022 | GitHub |
| Chemprop | JCIM 2024 | GitHub |
| MolCLR | Nat. Mach. Intell. 2022 | GitHub |
| UniMol | ICLR 2023 | GitHub |
| UniMol2 | NeurIPS 2024 | GitHub |

🧬 Gene Representation Methods (scFMs)

| Model | Paper | Code |
|-------|-------|------|
| Geneformer | Nature 2023 | HuggingFace |
| tGPT | bioRxiv 2022 | GitHub |
| UCE | bioRxiv 2023 | GitHub |
| scBERT | Nat. Mach. Intell. 2022 | GitHub |
| CellPLM | ICLR 2024 | GitHub |
| OpenBioMed | arXiv 2023 | GitHub |
| scGPT | Nat. Methods 2024 | GitHub |
| scFoundation | Nat. Methods 2024 | GitHub |
| SCimilarity | Nature 2025 | GitHub |
| Cell2Sentence | ICML 2023 | GitHub |
| STATE | bioRxiv 2025 | GitHub |

💾 Datasets

MVCBench leverages over one million paired observations across transcriptomic and morphological modalities.

Gene Expression

  • LINCS2020, CIGS, and Tahoe-100M

Cell Morphology

  • cpg0016 & cpg0003 (Cell Painting Gallery) - AWS Registry

Multimodal (Paired)

  • CDRP-BBBC047-Bray & CDRPBIO-BBBC036-Bray - Available via the Project Page.

The preprocessed dataset used in this paper is available at MVCBench HuggingFace.


🧩 Embedding Extraction

MVCBench provides a unified and easy-to-use interface to extract embeddings using state-of-the-art foundation models.

Molecular Embeddings (e.g., UniMol2)

Extract molecular embeddings from drug structures (e.g., SMILES strings); please refer to Get_Molecular_Embedding.ipynb.

Gene Embeddings (e.g., STATE)

Extract single-cell representations from raw gene expression profiles; please refer to Get_STATE_Embedding.ipynb.

# Assumes `inferer` is an initialized STATE inference object
# (see https://github.com/ArcInstitute/state)
inferer.encode_adata(
    input_file,
    output_file,
    emb_key=embed_key,
    dataset_name=dataset_name,
    gene_column=gene_column,
)

🚀 Getting Started

Installation

# Clone the repository
git clone https://github.com/QSong-github/MVCBench.git
cd MVCBench

# Create a virtual environment
conda create -n mvcbench python=3.11
conda activate mvcbench

# Install dependencies
pip install -r requirements.txt

Data organization

MVCBench expects benchmark datasets and precomputed molecular representations to be available under a data root directory. By default, the code looks for data under ./data. You can also set a custom location by defining the environment variable VCBENCH_DATA_ROOT.

export VCBENCH_DATA_ROOT=/path/to/MVCBench_data
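The lookup can be pictured as a simple environment-variable fallback; the helper below is illustrative only (the actual path resolution lives in src/configs.py):

```python
import os
from pathlib import Path

# Illustrative helper, not the repository's API: resolve the data root
# from VCBENCH_DATA_ROOT, falling back to ./data when the variable is unset.
def resolve_data_root() -> Path:
    return Path(os.environ.get("VCBENCH_DATA_ROOT", "./data"))

os.environ["VCBENCH_DATA_ROOT"] = "/tmp/MVCBench_data"
print(resolve_data_root())  # prints /tmp/MVCBench_data
```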

The current codebase uses dataset paths defined in src/configs.py. In practice, the following resource groups are expected:

  • transcriptomic datasets such as LINCS2020, CIGS, and Tahoe-100M
  • morphology datasets such as BBBC036, BBBC047, and cpg0016
  • paired multimodal datasets for MVC experiments
  • precomputed molecular embeddings stored under Molecular_representations/...
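One plausible on-disk layout for the resource groups above; the exact directory names come from src/configs.py and may differ (the paired-data folder name here is illustrative):

```
data/
├── LINCS2020/
├── CIGS/
├── Tahoe-100M/
├── BBBC036/
├── BBBC047/
├── cpg0016/
├── MVC_paired/                 # paired multimodal datasets (name illustrative)
└── Molecular_representations/
    └── ...
```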

The preprocessed benchmark release is available at MVCBench on Hugging Face.

Main training scripts

The repository currently provides three main training entry points:

  • train_gene.py for drug-to-gene-expression prediction
  • train_image.py for drug-to-morphology prediction
  • train_mvc.py for multimodal virtual cell modeling

Batch scripts for running benchmark sweeps are also provided in scripts/run_gene_batch.sh, scripts/run_image_batch.sh, and scripts/run_mvc_batch.sh.

Example commands

Run a gene expression benchmark:

python3 train_gene.py \
  --dataset_name LINCS \
  --molecule_feature ECFP4 \
  --gene_encoder_type Default \
  --split_data_type smiles_split \
  --n_epochs 2 \
  --batch_size 1024

Run a morphology benchmark:

python3 train_image.py \
  --dataset_name cpg0016 \
  --molecule_feature ECFP4 \
  --image_encoder_type Default \
  --split_data_type smiles_split \
  --n_epochs 2 \
  --batch_size 1024

Run a multimodal virtual cell benchmark:

python3 train_mvc.py \
  --dataset_name MVC_BBBC047 \
  --molecule_feature ECFP4 \
  --split_data_type smiles_split \
  --n_epochs 2 \
  --batch_size 1024
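The `smiles_split` option presumably holds out whole compounds rather than individual samples, so the same drug never appears in both train and test. A minimal illustration of such a compound-level split (the hash-based grouping here is illustrative, not the repository's implementation):

```python
import hashlib

def split_by_smiles(records, test_frac=0.2):
    """Assign records to train/test by hashing their SMILES string,
    so every profile of a given compound lands in the same split."""
    train, test = [], []
    for rec in records:
        h = int(hashlib.md5(rec["smiles"].encode()).hexdigest(), 16)
        (test if (h % 100) < test_frac * 100 else train).append(rec)
    return train, test

records = [{"smiles": s, "y": i} for i, s in enumerate(["CCO", "CCO", "c1ccccc1"])]
train, test = split_by_smiles(records)
# By construction, both "CCO" profiles end up in the same split.
```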

Outputs

Training outputs are written under the results/ directory by default. Depending on the task, the code will save:

  • model checkpoints such as best_model.pt
  • per-sample evaluation tables in CSV format
  • predicted profiles in HDF5 format when prediction export is enabled
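The per-sample CSV tables can be aggregated with the standard library; the column names used below (`sample_id`, `pearson_r`) are hypothetical, so check the actual headers in your results/ directory:

```python
import csv
import io
import statistics

# Hypothetical per-sample evaluation table; real column names may differ.
csv_text = """sample_id,pearson_r
s1,0.42
s2,0.58
s3,0.50
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
mean_r = statistics.mean(float(r["pearson_r"]) for r in rows)
print(f"mean Pearson r over {len(rows)} samples: {mean_r:.2f}")
# prints: mean Pearson r over 3 samples: 0.50
```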

πŸ–ŠοΈ Citation

If you find MVCBench useful for your research, please cite our paper:

@article{li2026mvcbench,
  title={MVCBench: A Multimodal Benchmark for Drug-induced Virtual Cell Phenotypes},
  author={Li, Bo and Wang, Qing and Wang, Shihang and Zhang, Bob and Peng, Yuzhong and Zeng, Pinxian and Liu, Chengliang and Li, Mengran and Tang, Ziyang and Yao, Xiaojun and Deng, Chuxia and Song, Qianqian},
  journal={bioRxiv},
  year={2026},
  publisher={Cold Spring Harbor Laboratory}
}

📧 Contact

For any questions or inquiries, please open an issue on GitHub or contact the authors.
