A systematic benchmark for evaluating molecular and gene representations in predicting drug-induced multimodal virtual cell phenotypes.
Project Page | Dataset | Manuscript available upon request | Preprint coming soon
MVCBench is a benchmarking framework for studying how representation choices shape the prediction of drug-induced cellular phenotypes across transcriptional and morphological modalities. It systematically evaluates 24 representation methods (12 drug molecular representations and 12 gene representations) using nearly 1.1 million drug-induced profiles collected from large-scale transcriptomic and high-content imaging resources.
Figure 1. Overview of MVCBench. The benchmark spans transcriptomic, morphological, and multimodal prediction settings, covering large-scale paired profiles, diverse representation models, and progressive evaluation stages from single-modality prediction to multimodal virtual cell construction.
- Advanced molecular representations are highly beneficial for predicting drug-induced morphological phenotypes, where 3D-aware and deep learning-based encoders consistently outperform classical molecular fingerprints. By contrast, their gains for transcriptomic response prediction are much smaller, suggesting that chemical structure alone may be insufficient to fully explain gene expression responses.
- For transcriptomic prediction, task-specific gene representations show clearer advantages than general-purpose foundation models. This indicates that alignment between representation learning objectives and perturbation-response tasks remains critical, even as single-cell foundation models continue to improve.
- Multimodal integration consistently improves predictive performance over single-modality training. Beyond benchmark scores, MVCBench provides practical guidance for designing multimodal virtual cell systems, including the value of modality-aware optimization and task-dependent fusion strategies.
We evaluate widely used Drug Molecular Representation methods and Gene Representation methods (Single-cell Foundation Models).
| Model | Paper | Code | Stars |
|---|---|---|---|
| KPGT | Nat. Commun. 2023 | GitHub | |
| InfoAlign | ICLR 2025 | GitHub | |
| GeminiMol | Adv. Sci. 2024 | GitHub | |
| Ouroboros | Adv. Sci. 2026 | GitHub | |
| Mole-BERT | ICLR 2023 | GitHub | |
| ChemBERTa2 | arXiv 2022 | GitHub | |
| MolT5 | EMNLP 2022 | GitHub | |
| Chemprop | JCIM 2024 | GitHub | |
| MolCLR | Nat. Mach. Intell. 2022 | GitHub | |
| UniMol | ICLR 2023 | GitHub | |
| UniMol2 | NeurIPS 2024 | GitHub | |
| Model | Paper | Code | Stars |
|---|---|---|---|
| Geneformer | Nature 2023 | HuggingFace | 281 likes |
| tGPT | bioRxiv 2022 | GitHub | |
| UCE | bioRxiv 2023 | GitHub | |
| scBERT | Nat. Mach. Intell. 2022 | GitHub | |
| CellPLM | ICLR 2024 | GitHub | |
| OpenBioMed | arXiv 2023 | GitHub | |
| scGPT | Nat. Methods 2024 | GitHub | |
| scFoundation | Nat. Methods 2024 | GitHub | |
| SCimilarity | Nature 2025 | GitHub | |
| Cell2Sentence | ICML 2023 | GitHub | |
| STATE | bioRxiv 2025 | GitHub | |
MVCBench leverages over one million paired observations across transcriptomic and morphological landscapes.
- [CIGS] (Nat. Methods 2025) - Dataset Link
- [Tahoe-100M] (bioRxiv 2025) - HuggingFace
- [LINCS 2020] - Clue.io
- [cpg0016 & cpg0003] (Cell Painting Gallery) - AWS Registry
- CDRP-BBBC047-Bray & CDRPBIO-BBBC036-Bray - Available via the Project.
The preprocessed dataset used in this paper is available at MVCBench HuggingFace.
MVCBench provides a unified and easy-to-use interface to extract embeddings using state-of-the-art foundation models.
Extract drug molecular representations from SMILES strings; please refer to `examples/Get_Molecular_Embedding.ipynb`.
Extract single-cell representations from raw gene expression profiles with the STATE model; please refer to `examples/Get_STATE_Embedding.ipynb`.
```python
inferer.encode_adata(  # https://github.com/ArcInstitute/state
    input_file,
    output_file,
    emb_key=embed_key,
    dataset_name=dataset_name,
    gene_column=gene_column,
)
```

```shell
# Clone the repository
git clone https://github.com/QSong-github/MVCBench.git
cd MVCBench

# Create a virtual environment
conda create -n mvcbench python=3.11
conda activate mvcbench

# Install dependencies
pip install -r requirements.txt
```
MVCBench expects benchmark datasets and precomputed molecular representations to be available under a data root directory. By default, the code looks for data under `./data`. You can also set a custom location by defining the environment variable `VCBENCH_DATA_ROOT`:

```shell
export VCBENCH_DATA_ROOT=/path/to/MVCBench_data
```

The current codebase uses dataset paths defined in `src/configs.py`. In practice, the following resource groups are expected:
- transcriptomic datasets such as `LINCS2020`, `CIGS`, and `Tahoe-100M`
- morphology datasets such as `BBBC036`, `BBBC047`, and `cpg0016`
- paired multimodal datasets for MVC experiments
- precomputed molecular embeddings stored under `Molecular_representations/...`
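As a minimal sketch of how the data root could be resolved and checked before training, the snippet below reads `VCBENCH_DATA_ROOT` with a fallback to `./data`. The directory names are illustrative, taken from the resource groups listed above; the authoritative path mappings live in `src/configs.py`.

```python
import os
from pathlib import Path

# Resolve the data root: VCBENCH_DATA_ROOT if set, otherwise ./data
data_root = Path(os.environ.get("VCBENCH_DATA_ROOT", "./data"))

# Hypothetical directory names based on the resource groups listed above;
# check src/configs.py for the actual layout used by the code.
expected = [
    "LINCS2020", "CIGS", "Tahoe-100M",   # transcriptomic
    "BBBC036", "BBBC047", "cpg0016",     # morphology
    "Molecular_representations",         # precomputed embeddings
]

missing = [name for name in expected if not (data_root / name).exists()]
if missing:
    print(f"Missing under {data_root}: {', '.join(missing)}")
```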
The preprocessed benchmark release is available at MVCBench on Hugging Face.
The repository currently provides three main training entry points:
- `train_gene.py` for drug-to-gene-expression prediction
- `train_image.py` for drug-to-morphology prediction
- `train_mvc.py` for multimodal virtual cell modeling
Batch scripts for running benchmark sweeps are also provided in `scripts/run_gene_batch.sh`, `scripts/run_image_batch.sh`, and `scripts/run_mvc_batch.sh`.
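The batch scripts loop over representation and dataset choices. As a rough dry-run sketch of what such a sweep assembles (the dataset and feature names below are illustrative; the provided scripts are the authoritative entry point):

```python
from itertools import product

# Illustrative sweep axes; real sweeps are defined in scripts/run_*_batch.sh
datasets = ["LINCS"]
molecule_features = ["ECFP4", "KPGT", "UniMol"]

commands = []
for dataset, feature in product(datasets, molecule_features):
    commands.append(
        "python3 train_gene.py"
        f" --dataset_name {dataset}"
        f" --molecule_feature {feature}"
        " --split_data_type smiles_split"
        " --n_epochs 2 --batch_size 1024"
    )

for cmd in commands:
    print(cmd)  # dry run; pass cmd.split() to subprocess.run to execute
```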
Run a gene expression benchmark:

```shell
python3 train_gene.py \
  --dataset_name LINCS \
  --molecule_feature ECFP4 \
  --gene_encoder_type Default \
  --split_data_type smiles_split \
  --n_epochs 2 \
  --batch_size 1024
```

Run a morphology benchmark:
```shell
python3 train_image.py \
  --dataset_name cpg0016 \
  --molecule_feature ECFP4 \
  --image_encoder_type Default \
  --split_data_type smiles_split \
  --n_epochs 2 \
  --batch_size 1024
```

Run a multimodal virtual cell benchmark:
```shell
python3 train_mvc.py \
  --dataset_name MVC_BBBC047 \
  --molecule_feature ECFP4 \
  --split_data_type smiles_split \
  --n_epochs 2 \
  --batch_size 1024
```

Training outputs are written under the `results/` directory by default. Depending on the task, the code will save:
- model checkpoints such as `best_model.pt`
- per-sample evaluation tables in CSV format
- predicted profiles in HDF5 format when prediction export is enabled
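A common follow-up is aggregating the per-sample CSV into per-representation summaries. The sketch below computes mean scores grouped by feature; the column names (`molecule_feature`, `pearson`) are assumptions, so check the actual header of the CSV written under `results/` for your run.

```python
import csv
import statistics
from io import StringIO

# Stand-in for a per-sample evaluation CSV; column names are assumed,
# not taken from the actual MVCBench output schema.
sample_csv = StringIO(
    "molecule_feature,pearson\n"
    "ECFP4,0.41\n"
    "ECFP4,0.39\n"
    "UniMol,0.52\n"
)

# Group per-sample scores by representation, then average each group
by_feature = {}
for row in csv.DictReader(sample_csv):
    by_feature.setdefault(row["molecule_feature"], []).append(float(row["pearson"]))

means = {feat: statistics.mean(vals) for feat, vals in by_feature.items()}
print(means)
```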
- Example notebooks for representation extraction are available in `examples/Get_Molecular_Embedding.ipynb` and `examples/Get_STATE_Embedding.ipynb`.
- Dataset names, file mappings, and embedding filenames are configured in `src/configs.py`.
If you find MVCBench useful for your research, please cite our paper:
```bibtex
@article{li2026mvcbench,
  title={MVCBench: A Multimodal Benchmark for Drug-induced Virtual Cell Phenotypes},
  author={Li, Bo and Wang, Qing and Wang, Shihang and Zhang, Bob and Peng, Yuzhong and Zeng, Pinxian and Liu, Chengliang and Li, Mengran and Tang, Ziyang and Yao, Xiaojun and Deng, Chuxia and Song, Qianqian},
  journal={bioRxiv},
  year={2026},
  publisher={Cold Spring Harbor Laboratory}
}
```
For any questions or inquiries, please open an issue or contact:
- Bo Li: Boom985426@gmail.com
- Bob Zhang: bobzhang@um.edu.mo
- Qianqian Song: qsong1@ufl.edu
