dMoE: dLLMs with Learnable Block Experts
Sicheng Feng1, Zigeng Chen1, Gongfan Fang1, Xinyin Ma1, Xinchao Wang1,*
1National University of Singapore, Singapore
*Corresponding author: xinchao@nus.edu.sg
- [4.26.2026]: Code and model are released. ArXiv submission is on hold; paper preview available here.
- Learnable Block Experts: Introduces block-level MoE routing into dLLMs, drastically compressing the number of unique experts activated across diffusion steps and directly targeting memory footprint reduction.
- Reduced MoE Bandwidth: By constraining expert activation at the block level, dMoE significantly reduces memory bandwidth consumed by expert weight loading during the block diffusion process.
- Improved Efficiency-Accuracy Trade-off: dMoE achieves competitive performance on reasoning and general benchmarks while reducing unnecessary computation through adaptive expert activation.
- Plug-and-play on LLaDA-2.0: Built directly on top of LLaDA-2.0-mini without architectural changes, enabling straightforward extension to other masked dLLMs.
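To make the bandwidth point concrete, here is a back-of-envelope sketch that estimates how many bytes of expert weights one MoE layer has to read, given the number of unique experts activated. Everything in it is an assumption for illustration: the function name `expert_weight_bytes`, the three-matrix gated-MLP expert layout, and the layer sizes and expert counts are placeholders, not values taken from LLaDA-2.0-mini or dMoE.

```python
def expert_weight_bytes(unique_experts: int,
                        hidden_size: int,
                        intermediate_size: int,
                        bytes_per_param: int = 2) -> int:
    """Rough number of expert-weight bytes read by one MoE layer.

    Assumption: each expert is a gated MLP with three projection
    matrices (gate, up, down), each hidden_size x intermediate_size,
    stored in bf16 (2 bytes per parameter) unless overridden.
    """
    params_per_expert = 3 * hidden_size * intermediate_size
    return unique_experts * params_per_expert * bytes_per_param


# Hypothetical layer shape; NOT the real model configuration.
HIDDEN, INTERMEDIATE = 2048, 512

# Hypothetical unique-expert counts for one block of tokens.
token_level = expert_weight_bytes(128, HIDDEN, INTERMEDIATE)
block_level = expert_weight_bytes(16, HIDDEN, INTERMEDIATE)

print(f"token-level routing: {token_level / 2**20:.0f} MiB per layer")
print(f"block-level routing: {block_level / 2**20:.0f} MiB per layer")
print(f"reduction factor:    {token_level / block_level:.1f}x")
```

With these placeholder numbers the estimate drops from 768 MiB to 96 MiB per layer. The point is only that the weight traffic scales linearly with the number of unique experts a block touches; real savings depend on the actual model shape and routing behavior.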
- Introduction
- Model and Datasets
- Quick Start
- Installation
- Training
- Evaluation
- Acknowledgement
- Citation
We present dMoE, a framework that introduces Learnable Block Experts into diffusion large language models (dLLMs).
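This README does not spell out the routing rule, so the following is only a minimal sketch of the general idea of block-level expert constraints, not dMoE's actual algorithm. It compares plain per-token top-k routing with a variant where all tokens in a block choose from a small shared pool, and counts the unique experts each one touches. The function names, the summed-score pool heuristic, and all sizes are hypothetical; the learnable router in dMoE would take the place of the heuristic.

```python
import random


def top_k(scores, k, allowed=None):
    """Indices of the k highest scores, optionally restricted to `allowed`."""
    candidates = range(len(scores)) if allowed is None else allowed
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:k]


def token_level_routing(block_scores, k):
    """Each token independently picks its own top-k experts."""
    return [top_k(scores, k) for scores in block_scores]


def block_level_routing(block_scores, k, block_budget):
    """Tokens pick top-k experts from a pool shared by the whole block.

    Assumption: the pool is the `block_budget` experts with the highest
    score summed over the block. This heuristic is a stand-in for a
    learned block-level router.
    """
    num_experts = len(block_scores[0])
    summed = [sum(tok[e] for tok in block_scores) for e in range(num_experts)]
    pool = top_k(summed, block_budget)
    return [top_k(scores, k, allowed=pool) for scores in block_scores]


def unique_experts(assignments):
    """Number of distinct experts used by any token in the block."""
    return len({e for token_experts in assignments for e in token_experts})


random.seed(0)
# Illustrative sizes only; not the model's real configuration.
BLOCK_LENGTH, NUM_EXPERTS, TOP_K, BLOCK_BUDGET = 32, 64, 4, 8

# Random stand-in for router logits: one row of expert scores per token.
scores = [[random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
          for _ in range(BLOCK_LENGTH)]

per_token = token_level_routing(scores, TOP_K)
per_block = block_level_routing(scores, TOP_K, BLOCK_BUDGET)

print("unique experts, token-level:", unique_experts(per_token))
print("unique experts, block-level:", unique_experts(per_block))
```

By construction the block-level variant can never touch more than `BLOCK_BUDGET` experts per block, whereas the token-level count grows with the block length. The `unique_experts_count` value returned by `model.generate` in the Quick Start presumably reports a related statistic for the real model.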
| Model | Description | Source Model | Link |
|---|---|---|---|
| dMoE-16B | General-purpose dLLM with learnable block experts | LLaDA-2.0-mini | Hugging Face |
git clone https://github.com/fscdc/dMoE.git
cd dMoE
conda create -n dmoe python==3.12
conda activate dmoe
pip install -r ./evaluations/requirements.txt

import torch
from transformers import AutoTokenizer
from evaluations.models.modeling_llada2_moe_be_adaptive import LLaDA2MoeModelLM
MODEL_NAME = "FSCCS/dMoE-16B"
device = "cuda:0"
model = LLaDA2MoeModelLM.from_pretrained(
    MODEL_NAME, trust_remote_code=True, torch_dtype=torch.bfloat16
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
prompt = "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?" + "\nLet's think step by step\n"
messages = [[{"role": "user", "content": prompt}]]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt", padding_side="left")
input_ids = inputs["input_ids"].to(device)
with torch.no_grad():
    out, unique_experts_count = model.generate(
        input_ids,
        steps=32,
        gen_length=2048,
        block_length=32,
        temperature=0.0,
        eos_early_stop=True,
    )
generated = out[:, input_ids.shape[1]:]
result = tokenizer.batch_decode(generated, skip_special_tokens=True)
print("Output:", result[0])
print("Unique experts count:", unique_experts_count)

Clone the dMoE repository:

git clone https://github.com/fscdc/dmoe.git --recursive
cd dmoe

Install the training environment (based on dFactory):
cd training
conda create -n dmoe-training python==3.12
conda activate dmoe-training
pip install -e VeOmni/

Install the evaluation environment:
cd evaluations
conda create -n dmoe python==3.12
conda activate dmoe
pip install -r requirements.txt

Our training pipeline is based on dFactory.

cd training

Download the base model and convert it to the merged-expert format required for training:
# Download the original LLaDA-2.0-mini weights
python scripts/download_hf_model.py \
    --repo_id inclusionAI/LLaDA2.0-mini \
    --local_dir /path/to/separate_expert_model

# Convert to merged format for training
python scripts/moe_convertor.py \
    --input-path /path/to/separate_expert_model \
    --output-path /path/to/merged4moe/LLaDA2.0-mini \
    --mode merge

# gsm8k as an example
python scripts/build_gsm8k_dataset.py

Edit configs/moe/llada2_mini_adaptive.yaml:
model:
  model_path: "/path/to/merged4moe/LLaDA2.0-mini"
  tokenizer_path: "/path/to/merged4moe/LLaDA2.0-mini"
data:
  train_path: "/your/data/path"
train:
  output_dir: "/your/output/path"

PYTHONPATH=$(pwd)/VeOmni:$PYTHONPATH sh train.sh tasks/train_llada2_bd.py configs/moe/llada2_mini_adaptive.yaml
# Or
bash scripts_moe/train_adaptive.sh

After training, convert the checkpoint back to the standard MoE format:
python scripts/moe_convertor.py \
    --input-path ./logs/llada2_mini_dmoe/checkpoints/global_step_XXX/hf_ckpt/ \
    --output-path /path/to/output/llada2_mini_dmoe \
    --mode split

cd evaluations

Set --model-name to your local model path in scripts_moe/run_be_adaptive.sh, then run:

bash scripts_moe/run_be_adaptive.sh

This evaluates on four benchmarks:
- GSM8K
- MATH500
- MMLU
- ARC-C
bash scripts_moe/run_origin.sh

We sincerely thank Huawei for their support and contribution to the research and development of this model and algorithm. Furthermore, our code builds on dFactory. We acknowledge these great works for laying the groundwork that made our approach possible.

If our research assists your work, please give us a star or cite us using:
@article{feng2026dmoe,
title={dMoE: dLLMs with Learnable Block Experts},
author={Feng, Sicheng and Chen, Zigeng and Fang, Gongfan and Ma, Xinyin and Wang, Xinchao},
journal={arXiv preprint arXiv:TODO},
year={2026}
}