Merge dev: fused MoE kernels, Qwen3 support, model improvements by drunkcoding · Pull Request #75 · EfficientMoE/MoE-Infinity

drunkcoding · 2026-02-26T11:11:42Z

Summary

Fused MoE MLP kernel (CUTLASS BF16): 3-GEMM fused path (gate→up w/ SiLU-mul epilogue→down) with small-M and large-K tile dispatch; correctness and performance benchmarks confirm parity with Torch-native reference
New CUDA benchmark tests: 11 test files in tests/cuda/ covering fused MLP, topk-softmax, masked-select, and activation kernels
Qwen3 model support and improvements to DeepSeek-V2/V3, Grok, Mixtral, and NLLBMoE
Router kernel implementation
C++ core updates: memory allocator, parallel dispatcher, prefetch scheduler, and AIO layer improvements
CI fixes: CUDA extension guard for CPU-only builds, codespell statics allowlist, Claude Code Review GitHub Actions workflow

Test plan

CUDA tests pass: cd tests/cuda && cmake -B build && cmake --build build -j$(nproc)
C++ unit tests pass: cd tests/cpp/unittest/queues && cmake -B build && cmake --build build -j$(nproc) && ctest --test-dir build -V
Pre-commit hooks green: pre-commit run --all-files
Inference smoke test: CUDA_VISIBLE_DEVICES=0 python examples/interface_example.py --model_name_or_path deepseek-ai/DeepSeek-V2-Lite-Chat --offload_dir <path>
GitHub Actions: Formatting + Build Test CI green on this PR

🤖 Generated with Claude Code

* add forward and call * fix a bug

* add back expert parallel by id hash * add grok ep * fix mistral typo * accom cuda copy bug * sync after compute * fix:sync to make sure that input is ready --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: luzhan <513964121@qq.com>

* add override QuantLinear (#29) Co-authored-by: xly <leyang.xue@ed.ac.uk> * use torch streampool * format * working deepspeed backend * fix: revert apply_rotary_pos_emb in deepseek * fix busy waiting * fix deepseek flashattn * add deepseek v3 * format and fix multigpu deepseek bug * with device caching allocator * add on-demand lock cache --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: lausannel <513964121@qq.com>

* Fix: Undefined Symbol Compilation Error (#37) * reformat code vllm style * add threadsafe queues * fix compilation error --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * Refactor code for better performance (#38) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: add pre commit format ci (#40) * ci: add pre commit format ci * fix: add requirements for linting * fix: format code before merge * fix: update local clang format version * Chore: rename organization name & optimize CI (#41) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: fix not a git repository in CI (#43) * CI: fix missing sudo in apt install (#44) * CI: fix missing sudo (#45) * CI: revert os matrix in CI (#46) * CI: add missing apt update after installing deb file (#47) * Doc: Update README example to DeepSeek and Suppress Warning (#49) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: do not build test if document update (#52) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version * update readme conda env and ignore doc update in build and release * fix wildcard --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * feat: Introduce Local Server for OpenAI-Compatible APIs (#4) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * feat: set parameter to device before serving (#56) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * Chore(deps): Bump pyarrow from 12.0.0 to 14.0.1 (#69) Bumps [pyarrow](https://github.com/apache/arrow) from 12.0.0 to 14.0.1. - [Release notes](https://github.com/apache/arrow/releases) - [Commits](apache/arrow@go/v12.0.0...go/v14.0.1) --- updated-dependencies: - dependency-name: pyarrow dependency-version: 14.0.1 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com> Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com> Co-authored-by: Yao <fuyao3860@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix: Undefined Symbol Compilation Error (#37) * reformat code vllm style * add threadsafe queues * fix compilation error --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * Refactor code for better performance (#38) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: add pre commit format ci (#40) * ci: add pre commit format ci * fix: add requirements for linting * fix: format code before merge * fix: update local clang format version * Chore: rename organization name & optimize CI (#41) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: fix not a git repository in CI (#43) * CI: fix missing sudo in apt install (#44) * CI: fix missing sudo (#45) * CI: revert os matrix in CI (#46) * CI: add missing apt update after installing deb file (#47) * Doc: Update README example to DeepSeek and Suppress Warning (#49) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: do not build test if document update (#52) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version * update readme conda env and ignore doc update in build and release * fix wildcard --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * feat: Introduce Local Server for OpenAI-Compatible APIs (#4) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * feat: set parameter to device before serving (#56) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * Initial plan * Add mypy lint hook Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> * Configure mypy settings Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> * Adjust mypy scope Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> * Scope mypy checks Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> --------- Co-authored-by: Leyang Xue <s2062808@ed.ac.uk> Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com> Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com> Co-authored-by: Yao <fuyao3860@gmail.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>

* add openai api support * add test scripts, update readme, update api * Fix: Undefined Symbol Compilation Error (#37) * reformat code vllm style * add threadsafe queues * fix compilation error --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * Refactor code for better performance (#38) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: add pre commit format ci (#40) * ci: add pre commit format ci * fix: add requirements for linting * fix: format code before merge * fix: update local clang format version * Chore: rename organization name & optimize CI (#41) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: fix not a git repository in CI (#43) * CI: fix missing sudo in apt install (#44) * CI: fix missing sudo (#45) * CI: revert os matrix in CI (#46) * CI: add missing apt update after installing deb file (#47) * Doc: Update README example to DeepSeek and Suppress Warning (#49) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: do not build test if document update (#52) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version * update readme conda env and ignore doc update in build and release * fix wildcard --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * feat: Introduce Local Server for OpenAI-Compatible APIs (#4) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish * feat: set parameter to device before serving (#56) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * add max length in gen openai * fix cache race condition * all param init at host * add qwen3 * ubuntu lts and build * pre-commit ubuntu version * router weights update overlap * rename deepseek_v2 and reduce torch kernel launch * fix import * fix build and fix bug * fix citation linebreak * fix typo * fix dtype size * remove comments * fix example * pr update init * remove comment and unify deepseek preroute * feat: Merge kernels from vLLM and FlashInfer (#63) * new allocator * add kernel compilation * stable topk --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> --------- Co-authored-by: Yao <fuyao3860@gmail.com> Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com> Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>

* "Claude PR Assistant workflow" * "Claude Code Review workflow"

* add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish * add max length in gen openai * fix cache race condition * all param init at host * add docker and sllm style read * wrap docker and test coverage * test * Clean up symlinks: remove unused op_builder, core/core, and move test_io to extensions * Replace core/kernel directory with symlink to extensions/kernel * seperations * remove ops dependency * Add CUTLASS fused MoE FFN kernel and supporting infrastructure - Add extensions/kernel/fused_moe_mlp.cu/h: BF16 CUTLASS 3-GEMM fused path (gate → up w/ SiLU-mul epilogue → down) with small-M and large-K tile dispatch - Add tests/cuda/test_fused_mlp_cutlass.cu: BF16 CUTLASS vs Torch-native benchmark - Integrate fused kernel into core/parallel/expert_module.cpp via ForwardHelper() - Update core/model/fused_mlp.cu/h and extensions/kernel/epilogue_utils.h - Improve core/utils: cache.h, lockfree_queue.h, simple_object_pool.h - Update tests/cuda/CMakeLists.txt with KERNEL_SRC pattern for CUTLASS tests - Update CLAUDE.md docs and setup.py build config Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Add prefill-decode collocation benchmark with throughput analysis Benchmarks five attention colocation strategies for serving decode and prefill requests on the same GPU time-slice: 0 serial — sequential on default stream 1 varlen-fused — single flash_attn_varlen_func (continuous batching) 2 dual-stream — two CUDA streams, no SM partition 3 green-ctx-sm — SM-partitioned green contexts (CUDA ≥ 12.4) 4 green-ctx-sm-wq — SM + work-queue balanced scope (CUDA 13.1+) Throughput analysis includes: - Separate decode-only / prefill-only baselines with TFLOPS and tok/s - Ideal-overlap bound (perfect concurrency = max(dec, pre)) - Per-mode: TFLOPS, decode tok/s, prefill tok/s, overlap efficiency - Generation-projection table: decode overhead and Δ vs serial per mode CUDA 13.1 green context API notes (driver 590.x): - CUdevResourceDesc is a pointer typedef (c_void_p), not a struct - cuGreenCtxStreamCreate requires CU_STREAM_NON_BLOCKING flag - CU_DEV_RESOURCE_TYPE_WORKQUEUE_CONFIG = 1000; configure sharingScope to CU_WORKQUEUE_SCOPE_GREEN_CTX_BALANCED for WQ isolation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Bump pydantic and transformers to resolve Dependabot alerts - pydantic 1.10.12 → 1.10.13: fixes ReDoS in email validation (GHSA-mr82-8j83-vxmv) - transformers 4.51.3 → 4.53.0: fixes 14 alerts including 3 HIGH RCE (GHSA-wrfc-pvp9-mr9g, GHSA-hxxf-235m-72v3, GHSA-qxrp-vhvm-j765) and 11 MEDIUM/LOW ReDoS vulnerabilities - Remove torch==2.3.1 pin (managed by conda env / base image) - Add flash-attn to requirements Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix build * Make readme_example.py testable via --help Add argparse to readme_example.py so that model-loading code runs only after parse_args(), allowing `--help` to exit 0 without a GPU or model. Replace the AST-only test_readme_example_syntax with test_readme_example_help, which mirrors the existing test_interface_example_help pattern and is verified passing in Docker. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix format CI and build-test CI - requirements.txt: sort flash-attn alphabetically (between fastapi and hjson) so requirements-txt-fixer pre-commit hook passes - build-test.yml: replace Ubuntu 20.04 CUDA container (Python 3.8, broken PyTorch wheel) with actions/setup-python Python 3.10 + CPU-only torch; switch from full wheel build to sdist-only (--no-isolation) to avoid CUTLASS dependency and 20+ min compile time Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * green ctx bench * Fix CI: guard CUDA extensions behind cuda_available, add statics to codespell ignore - setup.py: only build CUDAExtension when torch.version.cuda is set; the build-test CI installs CPU-only torch and lacks CUDA_HOME, causing CUDAExtension to abort with OSError - .pre-commit-config.yaml: add 'statics' to codespell ignore-words-list; the term is valid C++ (module-level static variables) but was flagged as a misspelling of 'statistics' Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * tests update --------- Co-authored-by: Yao <fuyao3860@gmail.com> Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR merges a large dev branch focused on faster MoE inference (CUTLASS fused kernels + routing), adds Qwen3 MoE support, and modernizes the build/test/CI pipeline around new C++/CUDA core refactors.

Changes:

Adds CUTLASS BF16 fused MoE MLP path + multiple CUDA benchmark/test binaries under tests/cuda/.
Introduces Qwen3 MoE integration and refactors expert dispatch to a new dispatch_local(...)/wait_dispatch_local() flow.
Overhauls build tooling (new top-level CMake, setup.py CUDA-extension gating, Docker verification, CI workflow updates) and adds/extends C++ unit tests.

Reviewed changes

Copilot reviewed 131 out of 161 changed files in this pull request and generated 20 comments.

Show a summary per file

File	Description
tests/cuda/test_single_gemm_tiled.cu	New CUTLASS MMA tiled GEMM test binary
tests/cuda/test_fused_mlp_cutlass.cu	Benchmark fused CUTLASS MoE MLP vs Torch reference
tests/cuda/test_fused_mlp.cu	CUTLASS back-to-back GEMM test harness
tests/cuda/test_expert_fusion_v2.cu	Experimental fused dual-GEMM + SiLU*mul kernel
tests/cuda/test_autotune_blocksize.cu	Tile/block-size autotuning demo
tests/cuda/CMakeLists.txt	Adds CUDA test/benchmark build targets
tests/cpp/unittest/utils/test_simple_object_pool.cpp	New unit tests for SimpleObjectPool
tests/cpp/unittest/utils/test_lfu_cache.cpp	New unit tests documenting LFUCache behavior
tests/cpp/unittest/utils/CMakeLists.txt	Adds utils unit test executables
tests/cpp/unittest/queues/test_lockfree_queue.cpp	Adds edge/concurrency tests for queue
tests/cpp/unittest/queues/CMakeLists.txt	Adds queue unit test targets
setup.py	Replaces op_builder flow with CUDAExtensions for `_store`/`_engine`
pyproject.toml	Adds ruff-format config and mypy config scoping
op_builder/prefetch.py	Removes old DeepSpeed-derived op builder
op_builder/all_ops.py	Removes op builder registry
op_builder/init.py	Removes op_builder package init logic
moe_infinity/utils/hf_config.py	Adds Qwen3 architecture parsing support
moe_infinity/utils/config.py	Updates ArcherConfig typing + defaults
moe_infinity/runtime/model_offload.py	Switches to `_store` extension + adds Qwen3 wiring
moe_infinity/runtime/compile.py	Adds TorchScript expert compilation helper
moe_infinity/ops/op_builder	Removes source-tree alias file
moe_infinity/ops/core	Removes source-tree alias file
moe_infinity/models/switch_transformers.py	Switches to async dispatch/wait execution path
moe_infinity/models/qwen.py	Adds Qwen3MoEBlock implementation
moe_infinity/models/nllb_moe.py	Switches to async dispatch/wait execution path
moe_infinity/models/modeling_deepseek_v2/modeling_deepseek.py	Tweaks generation cache handling
moe_infinity/models/mixtral.py	Switches to async dispatch/wait execution path
moe_infinity/models/grok.py	Switches to async dispatch/wait execution path
moe_infinity/models/deepseek.py	Reworks DeepSeek MoE gate routing and dispatch
moe_infinity/models/init.py	Exposes Qwen3MoEBlock
moe_infinity/kernel/router.py	Adds Triton routing kernels (softmax/topk)
moe_infinity/kernel/init.py	Exports router launchers
moe_infinity/entrypoints/openai/protocol.py	Maps OpenAI `max_tokens` to HF `max_new_tokens`
moe_infinity/entrypoints/big_modeling.py	Updates DeepSeek rotary hook wiring
moe_infinity/distributed/expert_executor.py	Refactors dispatch API to pass weights + wait method
moe_infinity/common/constants.py	Adds Qwen3 mapping + DeepSeek v2 import path update
extensions/test_io/CMakeLists.txt	Adds test I/O Python module build via CMake
extensions/kernel/utils.h	Adds kernel utilities + arch detection helpers
extensions/kernel/ops.h	Declares activation/topk_softmax kernel APIs
extensions/kernel/grouped_threadblock_swizzle.h	Adds grouped swizzle support for grouped GEMMs
extensions/kernel/fused_moe_mlp.h	Declares CUTLASS fused MoE MLP API
extensions/kernel/epilogue_utils.h	Adds CUTLASS epilogue helpers (SiLU + mul)
extensions/kernel/dispatch_utils.h	Adds dispatch macros for floating types
extensions/kernel/common_device.h	Adds device activation utility implementations
extensions/kernel/b2b_gemm_grouped_problem_visitor.h	Adds grouped B2B GEMM visitor implementation
extensions/kernel/activation_kernels.cu	Adds CUDA activation kernels for gating/activations
extensions/CMakeLists.txt	Adds CMake-based extension build targets
examples/readme_example.py	Makes example configurable and updates default checkpoint
examples/interface_example.py	Updates dataset/prompting and supports Qwen3
docker/verify_build.py	Adds build outputs verification script
docker/Dockerfile	Adds reproducible build/test container with CUTLASS
core/utils/threadsafe_queue.h	Adds NotifyAll to unblock waiters
core/utils/simple_object_pool.h	Adds vector include for getMany support
core/utils/logger.h	Adds vector/enum logging helpers + conditional macros
core/utils/lockfree_queue.h	Replaces lock-free impl with mutex-backed queue
core/utils/cuda_utils.h	Adds CUTLASS error helper + GPU timer
core/utils/cuda_utils.cpp	Handles null pointer in IsDevicePointer
core/utils/cache.h	Fixes LFUCache invalidation/reset + adds touch tracking
core/python/py_archer_prefetch.cpp	Exposes init_moe_layer/topk_softmax + new dispatcher APIs
core/python/expert_gemm.cu	Adds cuBLAS-based expert fused MLP prototype
core/prefetch/archer_prefetch_handle.cpp	Adds MOE_IO_THREADS support to tensor handle
core/parallel/expert_dispatcher.h	Refactors dispatcher inputs/outputs and adds notify/start
core/model/moe.cpp	Adds InitMoELayer/TopKSoftmax bridge
core/model/model_topology.h	Adds SetModuleMemoryFromDisk_Views declaration
core/model/model_topology.cpp	Adds pipelined disk->host->GPU path + view-based setup
core/model/fused_mlp.h	Adds legacy wrapper API header
core/model/fused_mlp.cu	Implements legacy wrapper calling fused CUTLASS path
core/memory/torch_caching_allocator.h	Adds Torch allocator replacement scaffolding
core/memory/torch_caching_allocator.cpp	Instantiates allocator replacer
core/memory/shared_memory.h	Adds shared memory helpers + deleter
core/memory/shared_memory.cpp	Implements shared memory helpers
core/memory/pinned_memory_pool.h	Adds pinned pool abstraction
core/memory/pinned_memory_pool.cpp	Implements pinned pool with register/unregister
core/memory/device_caching_allocator.cpp	Uses CUDA_CHECK for cudaMalloc retry
core/memory/caching_allocator_bk.h	Adds legacy/backup caching allocator header
core/memory/caching_allocator.h	Adds unified caching allocator + Torch hooks
core/core	Removes core path alias file
core/common/types.h	Adds enum/string helpers + utilities
core/common/sync.h	Adds Futex utility class
core/common/pytorch.h	Adds dtype conversion helpers + blob helpers
core/common/generator.h	Adds ID/UUID/time helpers
core/common/context.h	Adds global Context config
core/common/constant.h	Adds size constants
core/aio/archer_tensor_handle.h	Adds partition sizing + IO thread configuration
core/aio/archer_tensor_handle.cpp	Implements partitioned storage rollover
core/aio/archer_prio_aio_handle.h	Adds pinned pool + exit notifications + thread count
core/aio/archer_prio_aio_handle.cpp	Adds pinned-buffer chunked writes + scheduler CV
core/aio/archer_aio_threadpool.h	Adds round-robin counter
core/aio/archer_aio_threadpool.cpp	Replaces rand() scheduling with round-robin
core/aio/archer_aio_thread.h	Adds atomics + CVs for thread coordination
core/aio/archer_aio_thread.cpp	Reworks thread loop to use CVs (no busy-wait)
core/CMakeLists.txt	Adds archer_core static lib build definition
RELEASE.md	Updates build instructions (drops BUILD_OPS)
README.md	Updates benchmark table + citation formatting
MANIFEST.in	Updates packaging includes (kernel headers/cuda)
CMakeLists.txt	Adds top-level CMake build entrypoint
CITATIONS.md	Adds author to citations
.pre-commit-config.yaml	Adds statics allowlist + mypy hook
.github/workflows/publish.yml	Updates runner OS + wheel build command
.github/workflows/publish-test.yml	Updates runner OS + wheel build command
.github/workflows/pre-commit-format.yml	Updates runner OS
.github/workflows/claude.yml	Adds Claude workflow automation
.github/workflows/claude-code-review.yml	Adds Claude Code Review workflow
.github/workflows/build-test.yml	Switches to CPU-only sdist build in CI
.dockerignore	Adds docker ignore patterns

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+@triton.jit
+def fused_softmax_topk_kernel(
+    hidden_ptr,  # [B, H]
+    weight_ptr,  # [H, E]
+    routing_mask_ptr,  # [B, E] (bool)
+    routing_weight_ptr,  # [B, E] (float16)
+    B: tl.constexpr,
+    H: tl.constexpr,
+    E: tl.constexpr,
+    TOPK: tl.constexpr,
+    BLOCK_E: tl.constexpr,
+    normalize_topk: tl.constexpr,  # New!
+):


+    logits += tl.load(bias_ptr + off_e, mask=off_e < E, other=0.0)
+


+
+    logits = tl.full([BLOCK_E], -float("inf"), dtype=tl.float32)
+
+    for h in range(H):
+        h_val = tl.load(hidden_ptr + batch_id * H + h)
+        w_ptr = weight_ptr + off_e * H + h
+        valid = off_e < E
+        w_val = tl.load(w_ptr, mask=valid, other=0.0)
+        logits = tl.where(valid, logits + h_val * w_val, logits)
+


+    # Softmax
+    max_logit = tl.max(logits, axis=0)
+    logits = logits - max_logit
+    exp_logits = tl.exp(logits)
+    sum_exp = tl.sum(exp_logits, axis=0)
+    probs = exp_logits / sum_exp


+
+    for i in range(BLOCK_E):
+        if i < E:
+            p = tl.load(probs + batch_id * E + i)


+PinnedMemoryPool::~PinnedMemoryPool() {
+  for (auto* ptr : all_chunks_) {
+    cudaHostUnregister(ptr);
+    free(ptr);
+  }
+}


+except ImportError as exc:
+    raise ImportError(
+        "moe_infinity._store extension is required. Install with CUDA enabled."
+    ) from exc
+


+    so_path = "moe_infinity/_store.cpython-311-x86_64-linux-gnu.so"
+    if os.path.exists(so_path):


+endmacro()
+
+# Add include directories and link for CUTLASS
+set(CUTLASS_DIR $ENV{HOME}/cutlass)


+set(CUDA_ARCHITECTURES 86)
+
+# set nvcc flags
+set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -g -G -lineinfo -rdynamic -O3 -gencode arch=compute_86,code=sm_86 -Xcompiler -fopenmp")


Copilot

Pull request overview

This PR integrates fused CUTLASS MoE MLP kernels and router support, adds Qwen3 MoE model support, and refactors the build/test tooling (CMake, setup.py, CI, Docker) to support the updated C++/CUDA core and new benchmarks/tests.

Changes:

Adds/updates CUDA kernels and benchmarks (fused MoE MLP + router/topk-softmax/activations) and related build plumbing.
Adds Qwen3 MoE integration and updates several model wrappers to use the new expert dispatch interface.
Refactors core I/O/AIO, memory pools, dispatch infrastructure, and introduces new C++ unit tests + CI/Docker verification.

Reviewed changes

Copilot reviewed 131 out of 161 changed files in this pull request and generated 14 comments.

Show a summary per file

File	Description
tests/cuda/test_single_gemm_tiled.cu	Adds a CUTLASS tiled GEMM test program.
tests/cuda/test_fused_mlp_cutlass.cu	Adds Torch-vs-CUTLASS fused MLP benchmark and correctness check.
tests/cuda/test_fused_mlp.cu	Adds CUTLASS back-to-back GEMM test driver.
tests/cuda/test_expert_fusion_v2.cu	Adds an experimental fully fused expert GEMM kernel test.
tests/cuda/test_autotune_blocksize.cu	Adds a tile/blocksize autotuning example binary.
tests/cuda/CMakeLists.txt	Adds CUDA tests build configuration and Torch/CUTLASS wiring.
tests/cpp/unittest/utils/test_simple_object_pool.cpp	Adds gtest coverage for SimpleObjectPool.
tests/cpp/unittest/utils/test_lfu_cache.cpp	Adds gtest coverage for LFUCache behavior and reset/eviction.
tests/cpp/unittest/utils/CMakeLists.txt	Adds CMake targets for utils unit tests.
tests/cpp/unittest/queues/test_lockfree_queue.cpp	Adds edge-case and concurrency tests for queue implementation.
tests/cpp/unittest/queues/CMakeLists.txt	Adds CMake targets for queue unit tests.
setup.py	Refactors extension build to new _store/_engine CUDAExtensions + CUTLASS includes.
pyproject.toml	Adds ruff format settings and mypy config.
op_builder/prefetch.py	Removes legacy DeepSpeed-derived op builder file.
op_builder/all_ops.py	Removes legacy op discovery/build system.
op_builder/init.py	Removes legacy op-builder package init/closures.
moe_infinity/utils/hf_config.py	Adds Qwen3 parsing support and typing improvements.
moe_infinity/utils/config.py	Adjusts ArcherConfig types/defaults (trace_path optional, threads default).
moe_infinity/runtime/model_offload.py	Switches to loading new _store extension; adds Qwen3 patching and new dispatch path.
moe_infinity/runtime/compile.py	Adds TorchScript compilation helper for expert modules.
moe_infinity/ops/op_builder	Removes symlink-like path entry.
moe_infinity/ops/core	Removes symlink-like path entry.
moe_infinity/models/switch_transformers.py	Migrates dispatch to new executor API (dispatch_local + wait).
moe_infinity/models/qwen.py	Adds Qwen3MoEBlock implementation using executor + fused routing.
moe_infinity/models/nllb_moe.py	Migrates dispatch to new executor API and output handling.
moe_infinity/models/modeling_deepseek_v2/modeling_deepseek.py	Alters generation cache max length handling.
moe_infinity/models/mixtral.py	Migrates dispatch to new executor API and reshaping/casting.
moe_infinity/models/grok.py	Migrates dispatch to new executor API and casting.
moe_infinity/models/deepseek.py	Reworks DeepSeek MoE gating/routing and executor dispatch integration.
moe_infinity/models/init.py	Exports Qwen3MoEBlock.
moe_infinity/kernel/router.py	Adds Triton router kernels (softmax + fused softmax/topk).
moe_infinity/kernel/init.py	Exposes router kernel launch helpers.
moe_infinity/entrypoints/openai/protocol.py	Maps OpenAI max_tokens to HF max_new_tokens; raises default max_tokens.
moe_infinity/entrypoints/big_modeling.py	Updates deepseek module patching to deepseek_v2/deepseek_v3 paths.
moe_infinity/distributed/expert_executor.py	Changes executor API to pass weights; adds wait_dispatch_local.
moe_infinity/common/constants.py	Adds Qwen3 model mapping and updates deepseek import path.
extensions/test_io/CMakeLists.txt	Adds CMake to build/install a test IO Python module.
extensions/kernel/utils.h	Adds CUDA/CUTLASS arch detection helpers and tile calculator.
extensions/kernel/ops.h	Adds kernel function declarations (activations, topk_softmax).
extensions/kernel/grouped_threadblock_swizzle.h	Adds CUTLASS grouped swizzle/visitor for grouped GEMMs.
extensions/kernel/fused_moe_mlp.h	Adds fused CUTLASS MoE MLP interface header.
extensions/kernel/epilogue_utils.h	Adds custom CUTLASS epilogue utilities for fused SiLU+mul.
extensions/kernel/dispatch_utils.h	Adds ATen dispatch helper macros.
extensions/kernel/common_device.h	Adds device activation helpers (relu/silu/gelu) + warp activation.
extensions/kernel/b2b_gemm_grouped_problem_visitor.h	Adds grouped problem visitor for B2B GEMMs.
extensions/kernel/activation_kernels.cu	Adds CUDA activation/gating kernels and PyTorch bindings.
extensions/CMakeLists.txt	Adds CMake build/install for prefetch_op and test_io modules.
examples/readme_example.py	Makes README example configurable via CLI and updates default model.
examples/interface_example.py	Reworks dataset/prompting and moves MoE init earlier; adds Qwen3 handling.
docker/verify_build.py	Adds build verification script for produced shared libs and refactor artifacts.
docker/Dockerfile	Adds full build/test Dockerfile including CUTLASS build and extension verification.
core/utils/threadsafe_queue.h	Adds NotifyAll to wake all waiters.
core/utils/simple_object_pool.h	Adds include for new APIs/tests.
core/utils/logger.h	Adds LogStream streaming for vector and enum classes + conditional warn/fatal macros.
core/utils/lockfree_queue.h	Replaces lock-free queue with mutex-backed queue implementation (API rename).
core/utils/cuda_utils.h	Adds CUTLASS error checking and a GPU timer utility.
core/utils/cuda_utils.cpp	Adds nullptr handling to IsDevicePointer.
core/utils/cache.h	Fixes LFU cache iterator invalidation and reset behavior; adds touch counter.
core/python/py_archer_prefetch.cpp	Extends Python bindings with init_moe_layer/topk_softmax and wait_hidden_states.
core/python/expert_gemm.cu	Adds cuBLAS-based expert fused MLP kernels and PyBind module.
core/prefetch/archer_prefetch_handle.cpp	Adds MOE_IO_THREADS support and passes IO thread count to tensor handle.
core/parallel/expert_dispatcher.h	Refactors dispatcher inputs/outputs, threading/queues, adds notify hooks and JIT path.
core/model/moe.cpp	Adds MoELayer init/topk-softmax entrypoints for Python.
core/model/model_topology.h	Adds SetModuleMemoryFromDisk_Views declaration.
core/model/model_topology.cpp	Adds pipelined disk→host→GPU path and view-only host tensor setup.
core/model/fused_mlp.h	Adds legacy wrapper declaration pointing at new fused_moe_ffn_into path.
core/model/fused_mlp.cu	Implements legacy wrapper delegating to fused_moe_ffn_into.
core/memory/torch_caching_allocator.h	Adds Torch allocator integration to use caching allocator on load.
core/memory/torch_caching_allocator.cpp	Instantiates allocator replacement singleton.
core/memory/shared_memory.h	Adds shared memory attach/detach helpers and RAII deleter.
core/memory/shared_memory.cpp	Implements shared memory mmap attach/detach functions.
core/memory/pinned_memory_pool.h	Adds pinned host memory pool interface.
core/memory/pinned_memory_pool.cpp	Implements pinned pool allocation and acquire/release.
core/memory/device_caching_allocator.cpp	Improves cudaMalloc error handling via CUDA_CHECK.
core/memory/caching_allocator_bk.h	Adds backup templated caching allocator (CUDA/UM/host variants).
core/memory/caching_allocator.h	Adds unified caching allocator API incl. SHM/PIN/CUDA and Torch hooks.
core/core	Removes path entry file.
core/common/types.h	Adds enum/string helpers, constexpr utilities, and pointer helpers.
core/common/sync.h	Adds a Futex wrapper utility.
core/common/pytorch.h	Adds blob→tensor helpers and dtype conversion utilities.
core/common/generator.h	Adds UUID/time/id generators.
core/common/context.h	Adds global Context container for config values.
core/common/constant.h	Adds size constants and warp size constant.
core/aio/archer_tensor_handle.h	Adds partition size constant and IO-thread configurable ctor.
core/aio/archer_tensor_handle.cpp	Adds storage partition rollover and passes IO thread config to prio handle.
core/aio/archer_prio_aio_handle.h	Refactors prio AIO to use atomic exit + thread pool + pinned pool.
core/aio/archer_prio_aio_handle.cpp	Implements multi-thread IO scheduling, pinned pool writes, clean shutdown, notifications.
core/aio/archer_aio_threadpool.h	Adds round-robin enqueue counter.
core/aio/archer_aio_threadpool.cpp	Uses round-robin instead of rand for thread selection.
core/aio/archer_aio_thread.h	Adds atomics + condition variables for proper waiting.
core/aio/archer_aio_thread.cpp	Implements cv-based worker loop and done notification (removes busy-wait).
core/CMakeLists.txt	Adds archer_core static library build with CUDA sources and linking.
RELEASE.md	Updates release steps to remove BUILD_OPS usage.
README.md	Updates benchmark table and citation formatting/authors.
MANIFEST.in	Changes packaging includes (drops op_builder, adds extensions/kernel sources).
CMakeLists.txt	Adds top-level CMake build for core + extensions, Torch/CUTLASS discovery.
CITATIONS.md	Adds an author to citations.
.pre-commit-config.yaml	Adds mypy hook and updates codespell allowlist.
.github/workflows/publish.yml	Updates runner OS and removes BUILD_OPS from build step.
.github/workflows/publish-test.yml	Updates runner OS and removes BUILD_OPS from build step.
.github/workflows/pre-commit-format.yml	Updates runner OS.
.github/workflows/claude.yml	Adds Claude Code workflow.
.github/workflows/claude-code-review.yml	Adds Claude Code Review workflow.
.github/workflows/build-test.yml	Updates CI to build sdist on ubuntu-22.04 with CPU torch.
.dockerignore	Adds docker ignore rules for build artifacts and large files.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

drunkcoding · 2026-03-16T20:42:52Z

+
+template <typename T>
+struct DetectedArch<T, std::enable_if_t<(__CUDA_ARCH__ > 0)>> {
+  using SM = DetectedArchT<__CUDA_ARCH__>;
+};


Updated in local branch: wrapped the device-only specialization with #ifdef __CUDA_ARCH__ so host compilation no longer evaluates __CUDA_ARCH__ expressions.

drunkcoding · 2026-03-16T20:42:54Z

+
+#include <cstdint>
+#include <mutex>
+#include <unordered_map>


Fixed in local branch by adding missing <string> and <stdexcept> includes directly in core/common/context.h.

drunkcoding · 2026-03-16T20:42:55Z

+      return !io_queue_high_.empty() || !io_queue_low_.empty() || time_to_exit_;
+    });
+    if (time_to_exit_) {


Fixed in local branch: switched time_to_exit_ checks to time_to_exit_.load() in both the wait predicate and immediate exit branch.

drunkcoding · 2026-03-16T20:42:56Z

+def launch_fused_softmax_topk(hidden_states, weight, bias, top_k):
+    B, H = hidden_states.shape
+    E = weight.shape[1]
+    dtype = hidden_states.dtype
+
+    routing_mask = torch.zeros(
+        (B, E), dtype=torch.bool, device=hidden_states.device
+    )
+    routing_weight = torch.zeros(
+        (B, E), dtype=dtype, device=hidden_states.device
+    )
+
+    BLOCK_E = 32  # Must divide E
+
+    fused_softmax_topk_kernel[(B,)](
+        hidden_states,
+        weight,
+        bias,
+        routing_mask,
+        routing_weight,
+        B=B,
+        H=H,
+        E=E,
+        TOPK=top_k,
+        BLOCK_E=BLOCK_E,
+        normalize_topk=True,
+    )


Adjusted in local branch: BLOCK_E is now triton.next_power_of_2(E) and invalid expert lanes are masked to -inf, so routing is no longer capped at a fixed 32-expert block.

drunkcoding · 2026-03-16T20:42:57Z

+        (B, E), dtype=dtype, device=hidden_states.device
+    )
+
+    BLOCK_E = 128


Addressed in the same local router update: removed the hard-coded BLOCK_E=128 path and switched to triton.next_power_of_2(E) to avoid the constant cap.

drunkcoding · 2026-03-16T20:43:03Z

+cmake_minimum_required(VERSION 3.10)
+project(LockFreeQueueTests)
+
+add_compile_definitions(_GLIBCXX_USE_CXX11_ABI=0) # needed for torch backward compatibility


Updated in local branch: removed hard-coded _GLIBCXX_USE_CXX11_ABI=0 and switched to Torch-provided compile flags via add_compile_options(${TORCH_CXX_FLAGS}).

drunkcoding · 2026-03-16T20:44:19Z

+// Known bugs found during review:
+//
+//  BUG-1 (use-after-erase in get()):
+//    `get()` stores an iterator `node = keyNodeMap[key]`, calls
+//    `freqListMap[freq].erase(node)` which invalidates `node`, then reads
+//    `node->freq`, `node->value` and pushes `*node` into another list.
+//    Dereferencing an erased list iterator is undefined behaviour.
+//    The tests below document the *intended* correct behaviour; they may
+//    crash or produce wrong results with AddressSanitizer enabled.
+//
+//  BUG-2 (concurrent modification in reset()):
+//    `reset()` iterates over `freqListMap` while simultaneously calling
+//    `freqListMap[1].push_back(node)` which can insert a new key=1 entry and
+//    invalidate the range-for iterator.


Not addressed in the current local diff for this round. I will keep this thread open for a follow-up cleanup of stale test commentary.

Addressed locally: the stale "known bugs" wording in tests/cpp/unittest/utils/test_lfu_cache.cpp has been updated to regression-coverage framing, so the comment no longer implies the bug is still expected.

drunkcoding · 2026-03-16T20:44:20Z

+      int ret = syscall(SYS_futex, &value_, FUTEX_WAIT, expected, nullptr,
+                        nullptr, 0);


Not addressed in the current local diff for this round. I will track this futex typing/portability concern in a separate follow-up patch.

Addressed locally in core/common/sync.h: Futex is now constrained to futex-compatible 32-bit integral storage via static_asserts, and futex syscalls use explicit futex-address/expected-value casts (reinterpret_cast<int*>(&value_), static_cast(...)).

drunkcoding · 2026-03-16T20:44:21Z

+void* OpenSharedMemory(const char* name, size_t size);
+void CloseSharedMemory(void* ptr, size_t size);
+std::tuple<void*, int> AttachSharedMemory(const char* name, size_t size);
+void DetachSharedMemory(void* ptr, int fd, size_t size);


Not addressed in the current local diff for this round. I will include the explicit <tuple> include in a follow-up header hygiene pass.

Addressed locally: added explicit #include in core/memory/shared_memory.h for the std::tuple<void*, int> AttachSharedMemory declaration.

drunkcoding · 2026-03-16T20:44:22Z

+void* OpenSharedMemory(const char* name, size_t size) {
+  int shm_fd = shm_open(name, O_RDWR, 0666);
+  LOG_FATAL_IF(shm_fd == -1,
+               "shm_open failed. name: {}, size: {}; errno: {}, message: {}",
+               name, size, errno, strerror(errno));


Not addressed in the current local diff for this round. I will handle these missing standard includes in a follow-up source hygiene patch.

Addressed locally: added explicit #include <errno.h>, #include <fcntl.h>, and #include <string.h> in core/memory/shared_memory.cpp to remove transitive-include reliance for errno/O_RDWR/strerror usage.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 131 out of 161 changed files in this pull request and generated 13 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-19T19:09:49Z

+    # Top-k selection (insertion sort)
+    top_vals = tl.full([TOPK], -float("inf"), dtype=tl.float32)
+    top_idxs = tl.full([TOPK], -1, dtype=tl.int32)
+
+    for i in range(BLOCK_E):
+        p = probs[i]
+        idx = i
+
+        # insert into sorted list
+        for j in range(TOPK):
+            if p > top_vals[j]:
+                for k in range(TOPK - 1, j, -1):
+                    top_vals[k] = top_vals[k - 1]
+                    top_idxs[k] = top_idxs[k - 1]
+                top_vals[j] = p
+                top_idxs[j] = idx
+                break


This uses Python if / break control flow inside a @triton.jit kernel based on runtime values (p > top_vals[j]), which Triton does not support. Rewrite top-k selection using vectorized tl.where-based logic (e.g., repeated tl.max with masking for TOPK steps, or sorting via supported primitives), so the kernel can compile and execute correctly.

Copilot · 2026-03-19T19:09:49Z

+    for h in range(H):
+        h_val = tl.load(hidden_ptr + batch_id * H + h)
+        w_ptr = weight_ptr + off_e * H + h
+        valid = off_e < E
+        w_val = tl.load(w_ptr, mask=valid, other=0.0)
+        logits = tl.where(valid, logits + h_val * w_val, logits)
+


The kernel computes logits via a fully unrolled scalar loop over H (and later over BLOCK_E), which will explode compile time and generate very large kernels for typical MoE hidden sizes (e.g., 2k–8k). Consider blocking over H (e.g., tl.arange tiles + partial reductions) or using a matmul-style approach (or call into an existing optimized kernel) to keep compile time and runtime reasonable.

Copilot · 2026-03-19T19:09:49Z

+import torch.nn.functional as F
+from transformers.models.qwen3_moe.modeling_qwen3_moe import Qwen3MoeMLP
+
+import moe_infinity._store as prefetch_lib


Importing moe_infinity._store at module import time will raise ImportError (and break importing moe_infinity.models.qwen) in CPU-only installs or builds where the CUDA extension is not present. Since the block already uses a runtime-injected self.lib, defer importing _store (or remove this import entirely) and load it lazily through the same _load_prefetch_lib() mechanism used elsewhere.

Copilot · 2026-03-19T19:09:50Z

+        self.expert_executor.dispatch_local(
+            self.layer_id, hidden_states, router_mask, combining_weights
        )
-        for output, _, idx, _ in results:
-            token_indices = router_mask[..., idx].bool()
-            weights = combining_weights[..., idx]
-            # print(router_mask.shape, combining_weights.shape, hidden_states.shape, flush=True)
-            # print(output.shape, weights.shape, token_indices.shape, next_states.shape, flush=True)
-            # print(output.shape, weights[token_indices].shape, next_states[token_indices].shape, flush=True)
-            next_states[token_indices] += torch.einsum(
-                "b,be->be", weights[token_indices], output.to(weights.device)
-            )
+        next_states = self.expert_executor.wait_dispatch_local()


The previous logic initialized next_states = torch.zeros_like(hidden_states) and then selectively added expert outputs, so treating 0 as a sentinel for 'unrouted' tokens was valid. Now next_states is produced by the executor and may legitimately contain zeros for routed tokens, which would be incorrectly overwritten by the source hidden_states. Use an explicit token participation mask (e.g., derived from router_mask) to merge routed vs unrouted tokens, or ensure the executor returns hidden_states for unrouted tokens internally.

Copilot · 2026-03-19T19:09:50Z

        next_states[next_states == 0] = hidden_states[next_states == 0]
-        hidden_states = next_states
+        hidden_states = next_states.to(hidden_states.dtype)


The previous logic initialized next_states = torch.zeros_like(hidden_states) and then selectively added expert outputs, so treating 0 as a sentinel for 'unrouted' tokens was valid. Now next_states is produced by the executor and may legitimately contain zeros for routed tokens, which would be incorrectly overwritten by the source hidden_states. Use an explicit token participation mask (e.g., derived from router_mask) to merge routed vs unrouted tokens, or ensure the executor returns hidden_states for unrouted tokens internally.

Copilot · 2026-03-19T19:09:51Z

+  fused_gemm_kernel<<<grid, block, shared_mem_size, stream>>>(
+      X.device_data(), Wg.device_data(), Wu.device_data(), C.device_data(), B,
+      No, K);


cutlass::HostTensor typically requires explicit sync_device() before using device_data() to ensure device allocations are populated. As written, X/Wg/Wu are filled on host but never synced to device before launching the kernel, so the kernel can read uninitialized device buffers. Call X.sync_device(); Wg.sync_device(); Wu.sync_device(); (and ensure C is allocated on device as needed) before the launch.

Copilot · 2026-03-19T19:09:51Z

@@ -0,0 +1,45 @@
+#pragma once
+
+#include <torch/torch.h>


This header uses memcpy but does not include <cstring> (or another header guaranteeing the declaration). Add #include <cstring> to avoid compilation failures on stricter toolchains.

Copilot · 2026-03-19T19:09:51Z

+  void copy_data(void* dest, const void* src, size_t count) const override {
+    LOG_DEBUG("Copy data from {:p} to {:p}, size: {}", src, dest, count);
+    memcpy(dest, src, count);
+  }


This header uses memcpy but does not include <cstring> (or another header guaranteeing the declaration). Add #include <cstring> to avoid compilation failures on stricter toolchains.

Copilot · 2026-03-19T19:09:52Z

+#pragma once
+
+#include <linux/futex.h>
+#include <sys/syscall.h>
+#include <unistd.h>


This header unconditionally includes Linux-only futex headers, which will fail to build on non-Linux platforms. If cross-platform builds are expected, guard this implementation with #if defined(__linux__) and provide a fallback (e.g., condition_variable-based) or move this file behind a Linux-only build option.

Copilot · 2026-03-19T19:09:52Z

+constexpr int64_t MB = KB * KB;
+constexpr int64_t GB = KB * KB * KB;
+
+constexpr int kWrapSize = 32;


The constant name appears to be a typo: kWrapSize should likely be kWarpSize (CUDA warp size). Renaming avoids confusion and improves searchability.

…cientMoE#75) * update table format * improve table clarity * init code commit * doc: add flashattention installation guide and change toc * feat: remove libaio dependency * remove spdlog dependency * misc: remove unused code and dependencies * misc: remove commented-out code and unused imports * fix: cuda oom due to safe tensors open * remove gcc-12 requirement * gptq disable exllama * fix: key error in offload set * add forward and call (EfficientMoE#7) * add forward and call * fix a bug * feat: support grok-1 model * update API note and install * Feature/expert parallel (EfficientMoE#9) * add back expert parallel by id hash * add grok ep * fix mistral typo * accom cuda copy bug * sync after compute * fix:sync to make sure that input is ready --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: luzhan <513964121@qq.com> * fix tokenizer in example * Xly/deepseek (EfficientMoE#34) * add override QuantLinear (EfficientMoE#29) Co-authored-by: xly <leyang.xue@ed.ac.uk> * use torch streampool * format * working deepspeed backend * fix: revert apply_rotary_pos_emb in deepseek * fix busy waiting * fix deepseek flashattn * add deepseek v3 * format and fix multigpu deepseek bug * with device caching allocator * add on-demand lock cache --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: lausannel <513964121@qq.com> * Upstream (EfficientMoE#72) * Fix: Undefined Symbol Compilation Error (EfficientMoE#37) * reformat code vllm style * add threadsafe queues * fix compilation error --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * Refactor code for better performance (EfficientMoE#38) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: add pre commit format ci (EfficientMoE#40) * ci: add pre commit format ci * fix: add requirements for linting * fix: format code before merge * fix: update local clang format version * Chore: rename organization name & optimize CI (EfficientMoE#41) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: fix not a git repository in CI (EfficientMoE#43) * CI: fix missing sudo in apt install (EfficientMoE#44) * CI: fix missing sudo (EfficientMoE#45) * CI: revert os matrix in CI (EfficientMoE#46) * CI: add missing apt update after installing deb file (EfficientMoE#47) * Doc: Update README example to DeepSeek and Suppress Warning (EfficientMoE#49) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: do not build test if document update (EfficientMoE#52) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version * update readme conda env and ignore doc update in build and release * fix wildcard --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * feat: Introduce Local Server for OpenAI-Compatible APIs (EfficientMoE#4) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * feat: set parameter to device before serving (EfficientMoE#56) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * Chore(deps): Bump pyarrow from 12.0.0 to 14.0.1 (EfficientMoE#69) Bumps [pyarrow](https://github.com/apache/arrow) from 12.0.0 to 14.0.1. - [Release notes](https://github.com/apache/arrow/releases) - [Commits](apache/arrow@go/v12.0.0...go/v14.0.1) --- updated-dependencies: - dependency-name: pyarrow dependency-version: 14.0.1 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com> Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com> Co-authored-by: Yao <fuyao3860@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Audit repository for stale code indicators (EfficientMoE#71) * Fix: Undefined Symbol Compilation Error (EfficientMoE#37) * reformat code vllm style * add threadsafe queues * fix compilation error --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * Refactor code for better performance (EfficientMoE#38) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: add pre commit format ci (EfficientMoE#40) * ci: add pre commit format ci * fix: add requirements for linting * fix: format code before merge * fix: update local clang format version * Chore: rename organization name & optimize CI (EfficientMoE#41) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: fix not a git repository in CI (EfficientMoE#43) * CI: fix missing sudo in apt install (EfficientMoE#44) * CI: fix missing sudo (EfficientMoE#45) * CI: revert os matrix in CI (EfficientMoE#46) * CI: add missing apt update after installing deb file (EfficientMoE#47) * Doc: Update README example to DeepSeek and Suppress Warning (EfficientMoE#49) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: do not build test if document update (EfficientMoE#52) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version * update readme conda env and ignore doc update in build and release * fix wildcard --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * feat: Introduce Local Server for OpenAI-Compatible APIs (EfficientMoE#4) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * feat: set parameter to device before serving (EfficientMoE#56) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * Initial plan * Add mypy lint hook Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> * Configure mypy settings Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> * Adjust mypy scope Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> * Scope mypy checks Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> --------- Co-authored-by: Leyang Xue <s2062808@ed.ac.uk> Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com> Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com> Co-authored-by: Yao <fuyao3860@gmail.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> * feat: performance improvement and Qwen3 support (EfficientMoE#60) * add openai api support * add test scripts, update readme, update api * Fix: Undefined Symbol Compilation Error (EfficientMoE#37) * reformat code vllm style * add threadsafe queues * fix compilation error --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * Refactor code for better performance (EfficientMoE#38) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: add pre commit format ci (EfficientMoE#40) * ci: add pre commit format ci * fix: add requirements for linting * fix: format code before merge * fix: update local clang format version * Chore: rename organization name & optimize CI (EfficientMoE#41) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: fix not a git repository in CI (EfficientMoE#43) * CI: fix missing sudo in apt install (EfficientMoE#44) * CI: fix missing sudo (EfficientMoE#45) * CI: revert os matrix in CI (EfficientMoE#46) * CI: add missing apt update after installing deb file (EfficientMoE#47) * Doc: Update README example to DeepSeek and Suppress Warning (EfficientMoE#49) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: do not build test if document update (EfficientMoE#52) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version * update readme conda env and ignore doc update in build and release * fix wildcard --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * feat: Introduce Local Server for OpenAI-Compatible APIs (EfficientMoE#4) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish * feat: set parameter to device before serving (EfficientMoE#56) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * add max length in gen openai * fix cache race condition * all param init at host * add qwen3 * ubuntu lts and build * pre-commit ubuntu version * router weights update overlap * rename deepseek_v2 and reduce torch kernel launch * fix import * fix build and fix bug * fix citation linebreak * fix typo * fix dtype size * remove comments * fix example * pr update init * remove comment and unify deepseek preroute * feat: Merge kernels from vLLM and FlashInfer (EfficientMoE#63) * new allocator * add kernel compilation * stable topk --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> --------- Co-authored-by: Yao <fuyao3860@gmail.com> Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com> Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com> * Add Claude Code GitHub Workflow (EfficientMoE#73) * "Claude PR Assistant workflow" * "Claude Code Review workflow" * Xly/code clean (EfficientMoE#74) * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish * add max length in gen openai * fix cache race condition * all param init at host * add docker and sllm style read * wrap docker and test coverage * test * Clean up symlinks: remove unused op_builder, core/core, and move test_io to extensions * Replace core/kernel directory with symlink to extensions/kernel * seperations * remove ops dependency * Add CUTLASS fused MoE FFN kernel and supporting infrastructure - Add extensions/kernel/fused_moe_mlp.cu/h: BF16 CUTLASS 3-GEMM fused path (gate → up w/ SiLU-mul epilogue → down) with small-M and large-K tile dispatch - Add tests/cuda/test_fused_mlp_cutlass.cu: BF16 CUTLASS vs Torch-native benchmark - Integrate fused kernel into core/parallel/expert_module.cpp via ForwardHelper() - Update core/model/fused_mlp.cu/h and extensions/kernel/epilogue_utils.h - Improve core/utils: cache.h, lockfree_queue.h, simple_object_pool.h - Update tests/cuda/CMakeLists.txt with KERNEL_SRC pattern for CUTLASS tests - Update CLAUDE.md docs and setup.py build config Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Add prefill-decode collocation benchmark with throughput analysis Benchmarks five attention colocation strategies for serving decode and prefill requests on the same GPU time-slice: 0 serial — sequential on default stream 1 varlen-fused — single flash_attn_varlen_func (continuous batching) 2 dual-stream — two CUDA streams, no SM partition 3 green-ctx-sm — SM-partitioned green contexts (CUDA ≥ 12.4) 4 green-ctx-sm-wq — SM + work-queue balanced scope (CUDA 13.1+) Throughput analysis includes: - Separate decode-only / prefill-only baselines with TFLOPS and tok/s - Ideal-overlap bound (perfect concurrency = max(dec, pre)) - Per-mode: TFLOPS, decode tok/s, prefill tok/s, overlap efficiency - Generation-projection table: decode overhead and Δ vs serial per mode CUDA 13.1 green context API notes (driver 590.x): - CUdevResourceDesc is a pointer typedef (c_void_p), not a struct - cuGreenCtxStreamCreate requires CU_STREAM_NON_BLOCKING flag - CU_DEV_RESOURCE_TYPE_WORKQUEUE_CONFIG = 1000; configure sharingScope to CU_WORKQUEUE_SCOPE_GREEN_CTX_BALANCED for WQ isolation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Bump pydantic and transformers to resolve Dependabot alerts - pydantic 1.10.12 → 1.10.13: fixes ReDoS in email validation (GHSA-mr82-8j83-vxmv) - transformers 4.51.3 → 4.53.0: fixes 14 alerts including 3 HIGH RCE (GHSA-wrfc-pvp9-mr9g, GHSA-hxxf-235m-72v3, GHSA-qxrp-vhvm-j765) and 11 MEDIUM/LOW ReDoS vulnerabilities - Remove torch==2.3.1 pin (managed by conda env / base image) - Add flash-attn to requirements Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix build * Make readme_example.py testable via --help Add argparse to readme_example.py so that model-loading code runs only after parse_args(), allowing `--help` to exit 0 without a GPU or model. Replace the AST-only test_readme_example_syntax with test_readme_example_help, which mirrors the existing test_interface_example_help pattern and is verified passing in Docker. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix format CI and build-test CI - requirements.txt: sort flash-attn alphabetically (between fastapi and hjson) so requirements-txt-fixer pre-commit hook passes - build-test.yml: replace Ubuntu 20.04 CUDA container (Python 3.8, broken PyTorch wheel) with actions/setup-python Python 3.10 + CPU-only torch; switch from full wheel build to sdist-only (--no-isolation) to avoid CUTLASS dependency and 20+ min compile time Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * green ctx bench * Fix CI: guard CUDA extensions behind cuda_available, add statics to codespell ignore - setup.py: only build CUDAExtension when torch.version.cuda is set; the build-test CI installs CPU-only torch and lacks CUDA_HOME, causing CUDAExtension to abort with OSError - .pre-commit-config.yaml: add 'statics' to codespell ignore-words-list; the term is valid C++ (module-level static variables) but was flagged as a misspelling of 'statistics' Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * tests update --------- Co-authored-by: Yao <fuyao3860@gmail.com> Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * remove claude * format * resolve review * resolve reviews --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com> Co-authored-by: lausannel <513964121@qq.com> Co-authored-by: Yao Fu <fuyao3860@gmail.com> Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

xly and others added 30 commits February 27, 2024 14:22

update table format

000f22a

improve table clarity

c871b41

init code commit

9cd8e99

doc: add flashattention installation guide and change toc

4b7a3e4

feat: remove libaio dependency

77a2ec7

remove spdlog dependency

14f15c0

misc: remove unused code and dependencies

9de6f9c

misc: remove commented-out code and unused imports

b4ddf62

fix: cuda oom due to safe tensors open

7e15ce8

remove gcc-12 requirement

356c216

Merge branch 'dev' of github.com:TorchMoE/MoE-Infinity into dev

78f0c97

gptq disable exllama

970b8fb

fix: key error in offload set

669ea70

add forward and call (#7)

14d95c2

* add forward and call * fix a bug

feat: support grok-1 model

94dc366

Merge branch 'dev' of github.com:TorchMoE/MoE-Infinity into dev

80de8b5

update API note and install

17bfd7d

Merge branch 'main' into dev

2dcde43

Merge branch 'main' into dev

30676fa

Feature/expert parallel (#9)

08ded21

* add back expert parallel by id hash * add grok ep * fix mistral typo * accom cuda copy bug * sync after compute * fix:sync to make sure that input is ready --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: luzhan <513964121@qq.com>

Merge branch 'main' into dev

218f152

fix tokenizer in example

7aade14

Merge branch 'main' into dev

48554f0

Merge remote-tracking branch 'origin/main' into dev

48bb3bc

Merge branch 'main' into dev

c8bbb3e

Add Claude Code GitHub Workflow (#73)

94e9e59

* "Claude PR Assistant workflow" * "Claude Code Review workflow"

drunkcoding and others added 4 commits February 26, 2026 11:08

Merge branch 'main' into dev

012d3fd

remove claude

f6a12c0

format

501fc8b

drunkcoding requested a review from Copilot March 15, 2026 22:16

Copilot AI reviewed Mar 15, 2026

View reviewed changes

resolve review

04b4a88

drunkcoding requested a review from Copilot March 16, 2026 18:26

Copilot AI reviewed Mar 16, 2026

View reviewed changes

resolve reviews

4272e88

drunkcoding requested a review from Copilot March 19, 2026 18:06

Copilot AI reviewed Mar 19, 2026

View reviewed changes

drunkcoding requested a review from Copilot March 19, 2026 19:05

Copilot AI reviewed Mar 19, 2026

View reviewed changes

drunkcoding merged commit 9e42c9d into main Mar 19, 2026
3 checks passed

		logits += tl.load(bias_ptr + off_e, mask=off_e < E, other=0.0)

		so_path = "moe_infinity/_store.cpython-311-x86_64-linux-gnu.so"
		if os.path.exists(so_path):

		int ret = syscall(SYS_futex, &value_, FUTEX_WAIT, expected, nullptr,
		nullptr, 0);

Conversation

drunkcoding commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

drunkcoding commented Feb 26, 2026 •

edited

Loading