A systematic benchmark for evaluating molecular and gene representations in predicting drug-induced multimodal virtual cell phenotypes.
Project Page | Dataset | Manuscript available upon request | Preprint coming soon
MVCBench is a benchmarking framework for studying how representation choices shape the prediction of drug-induced cellular phenotypes across transcriptional and morphological modalities. It systematically evaluates 24 representation methods (12 drug molecular representations and 12 gene representations) using nearly 1.1 million drug-induced profiles collected from large-scale transcriptomic and high-content imaging resources.
Figure 1. Overview of MVCBench. The benchmark spans transcriptomic, morphological, and multimodal prediction settings, covering large-scale paired profiles, diverse representation models, and progressive evaluation stages from single-modality prediction to multimodal virtual cell construction.
- Advanced molecular representations are highly beneficial for predicting drug-induced morphological phenotypes, where 3D-aware and deep learning-based encoders consistently outperform classical molecular fingerprints. By contrast, their gains for transcriptomic response prediction are much smaller, suggesting that chemical structure alone may be insufficient to fully explain gene expression responses.
- For transcriptomic prediction, task-specific gene representations show clearer advantages than general-purpose foundation models. This indicates that alignment between representation learning objectives and perturbation-response tasks remains critical, even as single-cell foundation models continue to improve.
- Multimodal integration consistently improves predictive performance over single-modality training. Beyond benchmark scores, MVCBench provides practical guidance for designing multimodal virtual cell systems, including the value of modality-aware optimization and task-dependent fusion strategies.
We evaluate widely used Drug Molecular Representation methods and Gene Representation methods (Single-cell Foundation Models).
| Model | Paper | Code | Stars |
|---|---|---|---|
| KPGT | Nat. Commun. 2023 | GitHub | |
| InfoAlign | ICLR 2025 | GitHub | |
| GeminiMol | Adv. Sci. 2024 | GitHub | |
| Ouroboros | Adv. Sci. 2026 | GitHub | |
| Mole-BERT | ICLR 2023 | GitHub | |
| ChemBERTa2 | arXiv 2022 | GitHub | |
| MolT5 | EMNLP 2022 | GitHub | |
| Chemprop | JCIM 2024 | GitHub | |
| MolCLR | Nat. Mach. Intell. 2022 | GitHub | |
| UniMol | ICLR 2023 | GitHub | |
| UniMol2 | NeurIPS 2024 | GitHub | |
| Model | Paper | Code | Stars |
|---|---|---|---|
| Geneformer | Nature 2023 | HuggingFace | 281 likes |
| tGPT | bioRxiv 2022 | GitHub | |
| UCE | bioRxiv 2023 | GitHub | |
| scBERT | Nat. Mach. Intell. 2022 | GitHub | |
| CellPLM | ICLR 2024 | GitHub | |
| OpenBioMed | arXiv 2023 | GitHub | |
| scGPT | Nat. Methods 2024 | GitHub | |
| scFoundation | Nat. Methods 2024 | GitHub | |
| SCimilarity | Nature 2025 | GitHub | |
| Cell2Sentence | ICML 2023 | GitHub | |
| STATE | bioRxiv 2025 | GitHub | |
MVCBench leverages over one million paired observations across transcriptomic and morphological landscapes.
- [CIGS] (Nat. Methods 2025) - Dataset Link
- [Tahoe-100M] (bioRxiv 2025) - HuggingFace
- [LINCS 2020] - Clue.io
- [cpg0016 & cpg0003] (Cell Painting Gallery) - AWS Registry
- CDRP-BBBC047-Bray & CDRPBIO-BBBC036-Bray - Available via the Project.
The preprocessed dataset used in this paper is available at MVCBench HuggingFace.
MVCBench provides a unified and easy-to-use interface to extract embeddings using state-of-the-art foundation models.
Extract drug molecular representations from SMILES strings; please refer to `examples/Get_Molecular_Embedding.ipynb`.
Extract single-cell representations from raw gene expression profiles with the STATE model; please refer to `examples/Get_STATE_Embedding.ipynb`.
```python
inferer.encode_adata(  # https://github.com/ArcInstitute/state
    input_file,
    output_file,
    emb_key=embed_key,
    dataset_name=dataset_name,
    gene_column=gene_column,
)
```

```shell
# Clone the repository
git clone https://github.com/QSong-github/MVCBench.git
cd MVCBench

# Create a virtual environment
conda create -n mvcbench python=3.11
conda activate mvcbench

# Install dependencies
pip install -r requirements.txt
```
MVCBench expects benchmark datasets and precomputed molecular representations to be available under a data root directory. By default, the code looks for data under `./data`. You can also set a custom location by defining the environment variable `VCBENCH_DATA_ROOT`:

```shell
export VCBENCH_DATA_ROOT=/path/to/MVCBench_data
```

The current codebase uses dataset paths defined in `src/configs.py`. In practice, the following resource groups are expected:
- transcriptomic datasets such as `LINCS2020`, `CIGS`, and `Tahoe-100M`
- morphology datasets such as `BBBC036`, `BBBC047`, and `cpg0016`
- paired multimodal datasets for MVC experiments
- precomputed molecular embeddings stored under `Molecular_representations/...`
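As a minimal sketch of how the data root could be resolved and checked before training, the snippet below reads `VCBENCH_DATA_ROOT` with a fallback to `./data`. The directory names are illustrative, taken from the resource groups listed above; the authoritative path mappings live in `src/configs.py`.

```python
import os
from pathlib import Path

# Resolve the data root: VCBENCH_DATA_ROOT if set, otherwise ./data
data_root = Path(os.environ.get("VCBENCH_DATA_ROOT", "./data"))

# Hypothetical directory names based on the resource groups listed above;
# check src/configs.py for the actual layout used by the code.
expected = [
    "LINCS2020", "CIGS", "Tahoe-100M",   # transcriptomic
    "BBBC036", "BBBC047", "cpg0016",     # morphology
    "Molecular_representations",         # precomputed embeddings
]

missing = [name for name in expected if not (data_root / name).exists()]
if missing:
    print(f"Missing under {data_root}: {', '.join(missing)}")
```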
The preprocessed benchmark release is available at MVCBench on Hugging Face.
The repository currently provides three main training entry points:
- `train_gene.py` for drug-to-gene-expression prediction
- `train_image.py` for drug-to-morphology prediction
- `train_mvc.py` for multimodal virtual cell modeling
Batch scripts for running benchmark sweeps are also provided in `scripts/run_gene_batch.sh`, `scripts/run_image_batch.sh`, and `scripts/run_mvc_batch.sh`.
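The batch scripts loop over representation and dataset choices. As a rough dry-run sketch of what such a sweep assembles (the dataset and feature names below are illustrative; the provided scripts are the authoritative entry point):

```python
from itertools import product

# Illustrative sweep axes; real sweeps are defined in scripts/run_*_batch.sh
datasets = ["LINCS"]
molecule_features = ["ECFP4", "KPGT", "UniMol"]

commands = []
for dataset, feature in product(datasets, molecule_features):
    commands.append(
        "python3 train_gene.py"
        f" --dataset_name {dataset}"
        f" --molecule_feature {feature}"
        " --split_data_type smiles_split"
        " --n_epochs 2 --batch_size 1024"
    )

for cmd in commands:
    print(cmd)  # dry run; pass cmd.split() to subprocess.run to execute
```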
Run a gene expression benchmark:

```shell
python3 train_gene.py \
  --dataset_name LINCS \
  --molecule_feature ECFP4 \
  --gene_encoder_type Default \
  --split_data_type smiles_split \
  --n_epochs 2 \
  --batch_size 1024
```

Run a morphology benchmark:
```shell
python3 train_image.py \
  --dataset_name cpg0016 \
  --molecule_feature ECFP4 \
  --image_encoder_type Default \
  --split_data_type smiles_split \
  --n_epochs 2 \
  --batch_size 1024
```

Run a multimodal virtual cell benchmark:
```shell
python3 train_mvc.py \
  --dataset_name MVC_BBBC047 \
  --molecule_feature ECFP4 \
  --split_data_type smiles_split \
  --n_epochs 2 \
  --batch_size 1024
```

Training outputs are written under the `results/` directory by default. Depending on the task, the code will save:
- model checkpoints such as `best_model.pt`
- per-sample evaluation tables in CSV format
- predicted profiles in HDF5 format when prediction export is enabled
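A common follow-up is aggregating the per-sample CSV into per-representation summaries. The sketch below computes mean scores grouped by feature; the column names (`molecule_feature`, `pearson`) are assumptions, so check the actual header of the CSV written under `results/` for your run.

```python
import csv
import statistics
from io import StringIO

# Stand-in for a per-sample evaluation CSV; column names are assumed,
# not taken from the actual MVCBench output schema.
sample_csv = StringIO(
    "molecule_feature,pearson\n"
    "ECFP4,0.41\n"
    "ECFP4,0.39\n"
    "UniMol,0.52\n"
)

# Group per-sample scores by representation, then average each group
by_feature = {}
for row in csv.DictReader(sample_csv):
    by_feature.setdefault(row["molecule_feature"], []).append(float(row["pearson"]))

means = {feat: statistics.mean(vals) for feat, vals in by_feature.items()}
print(means)
```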
- Example notebooks for representation extraction are available in `examples/Get_Molecular_Embedding.ipynb` and `examples/Get_STATE_Embedding.ipynb`.
- Dataset names, file mappings, and embedding filenames are configured in `src/configs.py`.
If you find MVCBench useful for your research, please cite our paper:
```bibtex
@article{li2026mvcbench,
  title={MVCBench: A Multimodal Benchmark for Drug-induced Virtual Cell Phenotypes},
  author={Li, Bo and Wang, Qing and Wang, Shihang and Zhang, Bob and Peng, Yuzhong and Zeng, Pinxian and Liu, Chengliang and Li, Mengran and Tang, Ziyang and Yao, Xiaojun and Deng, Chuxia and Song, Qianqian},
  journal={bioRxiv},
  year={2026},
  publisher={Cold Spring Harbor Laboratory}
}
```
For any questions or inquiries, please open an issue or contact:
- Bo Li: Boom985426@gmail.com
- Bob Zhang: bobzhang@um.edu.mo
- Qianqian Song: qsong1@ufl.edu
