CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook

Visual AI Lab, The University of Hong Kong

*Equal contribution †Corresponding author

Target modalities are partially aligned with bridging modalities via codebooks, resulting in a shared space. Unique features from both bridging and target modalities are preserved in a specific space. Compositional VQ uses a combination of multiple low-dimensional codevectors to reconstruct a complete embedding.

📣 Updates

[May 18, 2026] Paper Release via ArXiv
[May 16, 2026] Code Initial Release

✨ Overview

Multimodal representation alignment is crucial for large language models and robotics. Traditional methods often struggle with cross-modal information discrepancies and data scarcity, resulting in suboptimal alignment spaces that neglect modality-unique features.

We introduce CodeBind, a novel framework that optimizes multimodal representation spaces using a codebook design that combines modality-shared and modality-specific codebooks.

Unlike conventional hard alignment approaches, CodeBind decomposes features into:

Shared Components: Ensuring semantic consistency across modalities.
Specific Components: Preserving modality-unique details.

This approach employs a compositional vector quantization scheme, where a shared codebook bridges modality gaps, and modality-specific codebooks mitigate representation bias by preventing dominant modalities from overshadowing others. Validated across nine modalities (text, image, video, audio, depth, thermal, tactile, 3D point cloud, EEG), CodeBind achieves state-of-the-art performance in multimodal classification and retrieval tasks.
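To make the compositional vector quantization idea concrete, here is a minimal, standard-library-only sketch. It is not the CodeBind implementation: the function names, the random codebooks, and the choice to treat the first half of an embedding as the shared part and the second half as the specific part are all assumptions made for illustration. It only shows how several low-dimensional codevectors can be combined to reconstruct a full embedding, with one shared codebook set and one specific set per modality.

```python
import random


def nearest_code(chunk, codebook):
    """Return the index of the codevector closest to `chunk` (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], chunk)),
    )


def compositional_quantize(embedding, codebooks):
    """Quantize each low-dimensional chunk with its own codebook, then concatenate."""
    chunk_dim, remainder = divmod(len(embedding), len(codebooks))
    if remainder:
        raise ValueError("embedding length must be divisible by the number of codebooks")
    indices, quantized = [], []
    for group, codebook in enumerate(codebooks):
        chunk = embedding[group * chunk_dim:(group + 1) * chunk_dim]
        idx = nearest_code(chunk, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


def random_codebooks(num_groups, num_codes, chunk_dim, rng):
    """Hypothetical stand-in for learned codebooks: random Gaussian codevectors."""
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(chunk_dim)] for _ in range(num_codes)]
        for _ in range(num_groups)
    ]


rng = random.Random(0)
# One shared codebook set used by every modality, plus one specific set per modality.
shared_codebooks = random_codebooks(num_groups=2, num_codes=8, chunk_dim=2, rng=rng)
specific_codebooks = {
    "audio": random_codebooks(num_groups=2, num_codes=8, chunk_dim=2, rng=rng),
    "depth": random_codebooks(num_groups=2, num_codes=8, chunk_dim=2, rng=rng),
}


def encode(embedding, modality):
    """Assumption: first half of the embedding is the shared part, second half the specific part."""
    half = len(embedding) // 2
    shared_idx, shared_q = compositional_quantize(embedding[:half], shared_codebooks)
    specific_idx, specific_q = compositional_quantize(
        embedding[half:], specific_codebooks[modality]
    )
    return {
        "shared_indices": shared_idx,
        "specific_indices": specific_idx,
        "quantized": shared_q + specific_q,
    }


audio_embedding = [rng.gauss(0.0, 1.0) for _ in range(8)]
result = encode(audio_embedding, "audio")
print(result["shared_indices"], result["specific_indices"])
```

Because each chunk picks its code independently, G codebooks of K entries can represent K^G distinct embeddings while storing only G x K low-dimensional codevectors, which is the usual motivation for a compositional scheme.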

📝 TODOs

Release the training code
Release CodeBind-IB checkpoints
Release applications code

🔨 Installation

First, clone the repository and install the required packages.

git clone https://github.com/Visual-AI/codebind.git
cd codebind
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
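
After installation, a quick sanity check can confirm that the packages from the conda command are visible to the active environment. The script below is a small sketch that uses only the Python standard library; the helper name `check_packages` is our own and is not part of this repository.

```python
import importlib.util


def check_packages(names):
    """Map each package name to True if it can be found in the current environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Packages installed by the conda command above.
status = check_packages(["torch", "torchvision", "torchaudio"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If any package is reported as missing, make sure the conda environment used for installation is the one currently activated.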

📚 Quick Start

You can use CodeBind to extract and compare features across modalities. An example snippet is provided below:

# TBD
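
The official snippet is not yet available. In the meantime, the comparison step itself can be illustrated independently of CodeBind: once embeddings from different modalities live in one space, cross-modal retrieval usually reduces to ranking candidates by cosine similarity. The sketch below uses hand-written placeholder vectors and hypothetical file names; it does not call any CodeBind API, and the function names are our own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Sort candidate names by similarity to the query, most similar first."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Placeholder embeddings; in practice these would come from the modality encoders.
text_embedding = [1.0, 0.0, 0.0]
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 1.0, 0.2],
}

ranking = rank_candidates(text_embedding, image_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```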

📦 Datasets

Please refer to Doc/DATASETS.md for dataset preparation.

🧩 Model Zoo

Please refer to Doc/MODEL_ZOO.md for details on available CodeBind checkpoints.

🚀 Training & Inference

Please refer to Doc/TRAINING.md for details on CodeBind training scripts for different modalities.

🙏 Acknowledgements

This repository builds upon the invaluable contributions of the open-source community. We extend our sincere appreciation to the following projects for their foundational work:

📜 Citation

If you find this repository useful, please consider giving it a star ⭐ and citing our paper:

@article{chen2026codebind,
    title={CodeBind: Decoupled Representation Learning for Multimodal Alignment
    with Unified Compositional Codebook},
    author={Zeyu Chen and Jie Li and Kai Han},
    journal={arXiv preprint arXiv:2605.18257},
    year={2026}
}
