Skip to content

Merge dev: fused MoE kernels, Qwen3 support, model improvements#75

Merged
drunkcoding merged 36 commits intomainfrom
dev
Mar 19, 2026
Merged

Merge dev: fused MoE kernels, Qwen3 support, model improvements#75
drunkcoding merged 36 commits intomainfrom
dev

Conversation

@drunkcoding
Copy link
Copy Markdown
Contributor

@drunkcoding drunkcoding commented Feb 26, 2026

Summary

  • Fused MoE MLP kernel (CUTLASS BF16): 3-GEMM fused path (gate→up w/ SiLU-mul epilogue→down) with small-M and large-K tile dispatch; correctness and performance benchmarks confirm parity with Torch-native reference
  • New CUDA benchmark tests: 11 test files in tests/cuda/ covering fused MLP, topk-softmax, masked-select, and activation kernels
  • Qwen3 model support and improvements to DeepSeek-V2/V3, Grok, Mixtral, and NLLBMoE
  • Router kernel implementation
  • C++ core updates: memory allocator, parallel dispatcher, prefetch scheduler, and AIO layer improvements
  • CI fixes: CUDA extension guard for CPU-only builds, codespell statics allowlist, Claude Code Review GitHub Actions workflow

Test plan

  • CUDA tests pass: cd tests/cuda && cmake -B build && cmake --build build -j$(nproc)
  • C++ unit tests pass: cd tests/cpp/unittest/queues && cmake -B build && cmake --build build -j$(nproc) && ctest --test-dir build -V
  • Pre-commit hooks green: pre-commit run --all-files
  • Inference smoke test: CUDA_VISIBLE_DEVICES=0 python examples/interface_example.py --model_name_or_path deepseek-ai/DeepSeek-V2-Lite-Chat --offload_dir <path>
  • GitHub Actions: Formatting + Build Test CI green on this PR

🤖 Generated with Claude Code

xly and others added 30 commits February 27, 2024 14:22
* add forward and call
* fix a bug
* add back expert parallel by id hash

* add grok ep

* fix mistral typo

* accom cuda copy bug

* sync after compute

* fix:sync to make sure that input is ready

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: luzhan <513964121@qq.com>
* add override QuantLinear (#29)

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* use torch streampool

* format

* working deepspeed backend

* fix: revert apply_rotary_pos_emb in deepseek

* fix busy waiting

* fix deepseek flashattn

* add deepseek v3

* format and fix multigpu deepseek bug

* with device caching allocator

* add on-demand lock cache

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: lausannel <513964121@qq.com>
* Fix: Undefined Symbol Compilation Error (#37)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* Refactor code for better performance (#38)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: add pre commit format ci (#40)

* ci: add pre commit format ci

* fix: add requirements for linting

* fix: format code before merge

* fix: update local clang format version

* Chore: rename organization name & optimize CI (#41)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: fix not a git repository in CI (#43)

* CI: fix missing sudo in apt install (#44)

* CI: fix missing sudo (#45)

* CI: revert os matrix in CI (#46)

* CI: add missing apt update after installing deb file (#47)

* Doc: Update README example to DeepSeek and Suppress Warning (#49)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

* update readme example to deepseek and supress warning

* format

* revert CI changes to main version

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: do not build test if document update (#52)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

* update readme example to deepseek and supress warning

* format

* revert CI changes to main version

* update readme conda env and ignore doc update in build and release

* fix wildcard

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* feat: Introduce Local Server for OpenAI-Compatible APIs  (#4)

* update table format

* improve table clarity

* init code commit

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>

* feat: set parameter to device before serving (#56)

* update table format

* improve table clarity

* init code commit

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

* fix gen broken

* update readme links

* cancel concurrent job

* set dense node to device

* sparse node set cpu

* remove OS def

* use update to date clang-format

* fix setuptools version

* fix setuptools version for python 3.8

* keep single cuda version in publish

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>

* Chore(deps): Bump pyarrow from 12.0.0 to 14.0.1 (#69)

Bumps [pyarrow](https://github.com/apache/arrow) from 12.0.0 to 14.0.1.
- [Release notes](https://github.com/apache/arrow/releases)
- [Commits](apache/arrow@go/v12.0.0...go/v14.0.1)

---
updated-dependencies:
- dependency-name: pyarrow
  dependency-version: 14.0.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix: Undefined Symbol Compilation Error (#37)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* Refactor code for better performance (#38)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: add pre commit format ci (#40)

* ci: add pre commit format ci

* fix: add requirements for linting

* fix: format code before merge

* fix: update local clang format version

* Chore: rename organization name & optimize CI (#41)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: fix not a git repository in CI (#43)

* CI: fix missing sudo in apt install (#44)

* CI: fix missing sudo (#45)

* CI: revert os matrix in CI (#46)

* CI: add missing apt update after installing deb file (#47)

* Doc: Update README example to DeepSeek and Suppress Warning (#49)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

* update readme example to deepseek and supress warning

* format

* revert CI changes to main version

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: do not build test if document update (#52)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

* update readme example to deepseek and supress warning

* format

* revert CI changes to main version

* update readme conda env and ignore doc update in build and release

* fix wildcard

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* feat: Introduce Local Server for OpenAI-Compatible APIs  (#4)

* update table format

* improve table clarity

* init code commit

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>

* feat: set parameter to device before serving (#56)

* update table format

* improve table clarity

* init code commit

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

* fix gen broken

* update readme links

* cancel concurrent job

* set dense node to device

* sparse node set cpu

* remove OS def

* use update to date clang-format

* fix setuptools version

* fix setuptools version for python 3.8

* keep single cuda version in publish

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>

* Initial plan

* Add mypy lint hook

Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>

* Configure mypy settings

Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>

* Adjust mypy scope

Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>

* Scope mypy checks

Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>

---------

Co-authored-by: Leyang Xue <s2062808@ed.ac.uk>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>
* add openai api support

* add test scripts, update readme, update api

* Fix: Undefined Symbol Compilation Error (#37)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* Refactor code for better performance (#38)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: add pre commit format ci (#40)

* ci: add pre commit format ci

* fix: add requirements for linting

* fix: format code before merge

* fix: update local clang format version

* Chore: rename organization name & optimize CI (#41)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: fix not a git repository in CI (#43)

* CI: fix missing sudo in apt install (#44)

* CI: fix missing sudo (#45)

* CI: revert os matrix in CI (#46)

* CI: add missing apt update after installing deb file (#47)

* Doc: Update README example to DeepSeek and Suppress Warning (#49)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

* update readme example to deepseek and supress warning

* format

* revert CI changes to main version

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: do not build test if document update (#52)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

* update readme example to deepseek and supress warning

* format

* revert CI changes to main version

* update readme conda env and ignore doc update in build and release

* fix wildcard

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

* feat: Introduce Local Server for OpenAI-Compatible APIs  (#4)

* update table format

* improve table clarity

* init code commit

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>

* fix gen broken

* update readme links

* cancel concurrent job

* set dense node to device

* sparse node set cpu

* remove OS def

* use update to date clang-format

* fix setuptools version

* fix setuptools version for python 3.8

* keep single cuda version in publish

* feat: set parameter to device before serving (#56)

* update table format

* improve table clarity

* init code commit

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

* fix gen broken

* update readme links

* cancel concurrent job

* set dense node to device

* sparse node set cpu

* remove OS def

* use update to date clang-format

* fix setuptools version

* fix setuptools version for python 3.8

* keep single cuda version in publish

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>

* add max length in gen openai

* fix cache race condition

* all param init at host

* add qwen3

* ubuntu lts and build

* pre-commit ubuntu version

* router weights update overlap

* rename deepseek_v2 and reduce torch kernel launch

* fix import

* fix build and fix bug

* fix citation linebreak

* fix typo

* fix dtype size

* remove comments

* fix example

* pr update init

* remove comment and unify deepseek preroute

* feat: Merge kernels from vLLM and FlashInfer (#63)

* new allocator

* add kernel compilation

* stable topk

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

---------

Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
* "Claude PR Assistant workflow"

* "Claude Code Review workflow"
drunkcoding and others added 4 commits February 26, 2026 11:08
* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

* fix gen broken

* update readme links

* cancel concurrent job

* set dense node to device

* sparse node set cpu

* remove OS def

* use update to date clang-format

* fix setuptools version

* fix setuptools version for python 3.8

* keep single cuda version in publish

* add max length in gen openai

* fix cache race condition

* all param init at host

* add docker and sllm style read

* wrap docker and test coverage

* test

* Clean up symlinks: remove unused op_builder, core/core, and move test_io to extensions

* Replace core/kernel directory with symlink to extensions/kernel

* seperations

* remove ops dependency

* Add CUTLASS fused MoE FFN kernel and supporting infrastructure

- Add extensions/kernel/fused_moe_mlp.cu/h: BF16 CUTLASS 3-GEMM fused path
  (gate → up w/ SiLU-mul epilogue → down) with small-M and large-K tile dispatch
- Add tests/cuda/test_fused_mlp_cutlass.cu: BF16 CUTLASS vs Torch-native benchmark
- Integrate fused kernel into core/parallel/expert_module.cpp via ForwardHelper()
- Update core/model/fused_mlp.cu/h and extensions/kernel/epilogue_utils.h
- Improve core/utils: cache.h, lockfree_queue.h, simple_object_pool.h
- Update tests/cuda/CMakeLists.txt with KERNEL_SRC pattern for CUTLASS tests
- Update CLAUDE.md docs and setup.py build config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add prefill-decode collocation benchmark with throughput analysis

Benchmarks five attention colocation strategies for serving decode and
prefill requests on the same GPU time-slice:
  0  serial          — sequential on default stream
  1  varlen-fused    — single flash_attn_varlen_func (continuous batching)
  2  dual-stream     — two CUDA streams, no SM partition
  3  green-ctx-sm    — SM-partitioned green contexts (CUDA ≥ 12.4)
  4  green-ctx-sm-wq — SM + work-queue balanced scope (CUDA 13.1+)

Throughput analysis includes:
- Separate decode-only / prefill-only baselines with TFLOPS and tok/s
- Ideal-overlap bound (perfect concurrency = max(dec, pre))
- Per-mode: TFLOPS, decode tok/s, prefill tok/s, overlap efficiency
- Generation-projection table: decode overhead and Δ vs serial per mode

CUDA 13.1 green context API notes (driver 590.x):
- CUdevResourceDesc is a pointer typedef (c_void_p), not a struct
- cuGreenCtxStreamCreate requires CU_STREAM_NON_BLOCKING flag
- CU_DEV_RESOURCE_TYPE_WORKQUEUE_CONFIG = 1000; configure sharingScope
  to CU_WORKQUEUE_SCOPE_GREEN_CTX_BALANCED for WQ isolation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Bump pydantic and transformers to resolve Dependabot alerts

- pydantic 1.10.12 → 1.10.13: fixes ReDoS in email validation (GHSA-mr82-8j83-vxmv)
- transformers 4.51.3 → 4.53.0: fixes 14 alerts including 3 HIGH RCE
  (GHSA-wrfc-pvp9-mr9g, GHSA-hxxf-235m-72v3, GHSA-qxrp-vhvm-j765)
  and 11 MEDIUM/LOW ReDoS vulnerabilities
- Remove torch==2.3.1 pin (managed by conda env / base image)
- Add flash-attn to requirements

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix build

* Make readme_example.py testable via --help

Add argparse to readme_example.py so that model-loading code runs only
after parse_args(), allowing `--help` to exit 0 without a GPU or model.
Replace the AST-only test_readme_example_syntax with test_readme_example_help,
which mirrors the existing test_interface_example_help pattern and is verified
passing in Docker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix format CI and build-test CI

- requirements.txt: sort flash-attn alphabetically (between fastapi and hjson)
  so requirements-txt-fixer pre-commit hook passes
- build-test.yml: replace Ubuntu 20.04 CUDA container (Python 3.8, broken
  PyTorch wheel) with actions/setup-python Python 3.10 + CPU-only torch;
  switch from full wheel build to sdist-only (--no-isolation) to avoid
  CUTLASS dependency and 20+ min compile time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* green ctx bench

* Fix CI: guard CUDA extensions behind cuda_available, add statics to codespell ignore

- setup.py: only build CUDAExtension when torch.version.cuda is set;
  the build-test CI installs CPU-only torch and lacks CUDA_HOME, causing
  CUDAExtension to abort with OSError
- .pre-commit-config.yaml: add 'statics' to codespell ignore-words-list;
  the term is valid C++ (module-level static variables) but was flagged
  as a misspelling of 'statistics'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* tests update

---------

Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@drunkcoding drunkcoding requested a review from Copilot March 15, 2026 22:16
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR merges a large dev branch focused on faster MoE inference (CUTLASS fused kernels + routing), adds Qwen3 MoE support, and modernizes the build/test/CI pipeline around new C++/CUDA core refactors.

Changes:

  • Adds CUTLASS BF16 fused MoE MLP path + multiple CUDA benchmark/test binaries under tests/cuda/.
  • Introduces Qwen3 MoE integration and refactors expert dispatch to a new dispatch_local(...)/wait_dispatch_local() flow.
  • Overhauls build tooling (new top-level CMake, setup.py CUDA-extension gating, Docker verification, CI workflow updates) and adds/extends C++ unit tests.

Reviewed changes

Copilot reviewed 131 out of 161 changed files in this pull request and generated 20 comments.

Show a summary per file
File Description
tests/cuda/test_single_gemm_tiled.cu New CUTLASS MMA tiled GEMM test binary
tests/cuda/test_fused_mlp_cutlass.cu Benchmark fused CUTLASS MoE MLP vs Torch reference
tests/cuda/test_fused_mlp.cu CUTLASS back-to-back GEMM test harness
tests/cuda/test_expert_fusion_v2.cu Experimental fused dual-GEMM + SiLU*mul kernel
tests/cuda/test_autotune_blocksize.cu Tile/block-size autotuning demo
tests/cuda/CMakeLists.txt Adds CUDA test/benchmark build targets
tests/cpp/unittest/utils/test_simple_object_pool.cpp New unit tests for SimpleObjectPool
tests/cpp/unittest/utils/test_lfu_cache.cpp New unit tests documenting LFUCache behavior
tests/cpp/unittest/utils/CMakeLists.txt Adds utils unit test executables
tests/cpp/unittest/queues/test_lockfree_queue.cpp Adds edge/concurrency tests for queue
tests/cpp/unittest/queues/CMakeLists.txt Adds queue unit test targets
setup.py Replaces op_builder flow with CUDAExtensions for _store/_engine
pyproject.toml Adds ruff-format config and mypy config scoping
op_builder/prefetch.py Removes old DeepSpeed-derived op builder
op_builder/all_ops.py Removes op builder registry
op_builder/init.py Removes op_builder package init logic
moe_infinity/utils/hf_config.py Adds Qwen3 architecture parsing support
moe_infinity/utils/config.py Updates ArcherConfig typing + defaults
moe_infinity/runtime/model_offload.py Switches to _store extension + adds Qwen3 wiring
moe_infinity/runtime/compile.py Adds TorchScript expert compilation helper
moe_infinity/ops/op_builder Removes source-tree alias file
moe_infinity/ops/core Removes source-tree alias file
moe_infinity/models/switch_transformers.py Switches to async dispatch/wait execution path
moe_infinity/models/qwen.py Adds Qwen3MoEBlock implementation
moe_infinity/models/nllb_moe.py Switches to async dispatch/wait execution path
moe_infinity/models/modeling_deepseek_v2/modeling_deepseek.py Tweaks generation cache handling
moe_infinity/models/mixtral.py Switches to async dispatch/wait execution path
moe_infinity/models/grok.py Switches to async dispatch/wait execution path
moe_infinity/models/deepseek.py Reworks DeepSeek MoE gate routing and dispatch
moe_infinity/models/init.py Exposes Qwen3MoEBlock
moe_infinity/kernel/router.py Adds Triton routing kernels (softmax/topk)
moe_infinity/kernel/init.py Exports router launchers
moe_infinity/entrypoints/openai/protocol.py Maps OpenAI max_tokens to HF max_new_tokens
moe_infinity/entrypoints/big_modeling.py Updates DeepSeek rotary hook wiring
moe_infinity/distributed/expert_executor.py Refactors dispatch API to pass weights + wait method
moe_infinity/common/constants.py Adds Qwen3 mapping + DeepSeek v2 import path update
extensions/test_io/CMakeLists.txt Adds test I/O Python module build via CMake
extensions/kernel/utils.h Adds kernel utilities + arch detection helpers
extensions/kernel/ops.h Declares activation/topk_softmax kernel APIs
extensions/kernel/grouped_threadblock_swizzle.h Adds grouped swizzle support for grouped GEMMs
extensions/kernel/fused_moe_mlp.h Declares CUTLASS fused MoE MLP API
extensions/kernel/epilogue_utils.h Adds CUTLASS epilogue helpers (SiLU + mul)
extensions/kernel/dispatch_utils.h Adds dispatch macros for floating types
extensions/kernel/common_device.h Adds device activation utility implementations
extensions/kernel/b2b_gemm_grouped_problem_visitor.h Adds grouped B2B GEMM visitor implementation
extensions/kernel/activation_kernels.cu Adds CUDA activation kernels for gating/activations
extensions/CMakeLists.txt Adds CMake-based extension build targets
examples/readme_example.py Makes example configurable and updates default checkpoint
examples/interface_example.py Updates dataset/prompting and supports Qwen3
docker/verify_build.py Adds build outputs verification script
docker/Dockerfile Adds reproducible build/test container with CUTLASS
core/utils/threadsafe_queue.h Adds NotifyAll to unblock waiters
core/utils/simple_object_pool.h Adds vector include for getMany support
core/utils/logger.h Adds vector/enum logging helpers + conditional macros
core/utils/lockfree_queue.h Replaces lock-free impl with mutex-backed queue
core/utils/cuda_utils.h Adds CUTLASS error helper + GPU timer
core/utils/cuda_utils.cpp Handles null pointer in IsDevicePointer
core/utils/cache.h Fixes LFUCache invalidation/reset + adds touch tracking
core/python/py_archer_prefetch.cpp Exposes init_moe_layer/topk_softmax + new dispatcher APIs
core/python/expert_gemm.cu Adds cuBLAS-based expert fused MLP prototype
core/prefetch/archer_prefetch_handle.cpp Adds MOE_IO_THREADS support to tensor handle
core/parallel/expert_dispatcher.h Refactors dispatcher inputs/outputs and adds notify/start
core/model/moe.cpp Adds InitMoELayer/TopKSoftmax bridge
core/model/model_topology.h Adds SetModuleMemoryFromDisk_Views declaration
core/model/model_topology.cpp Adds pipelined disk->host->GPU path + view-based setup
core/model/fused_mlp.h Adds legacy wrapper API header
core/model/fused_mlp.cu Implements legacy wrapper calling fused CUTLASS path
core/memory/torch_caching_allocator.h Adds Torch allocator replacement scaffolding
core/memory/torch_caching_allocator.cpp Instantiates allocator replacer
core/memory/shared_memory.h Adds shared memory helpers + deleter
core/memory/shared_memory.cpp Implements shared memory helpers
core/memory/pinned_memory_pool.h Adds pinned pool abstraction
core/memory/pinned_memory_pool.cpp Implements pinned pool with register/unregister
core/memory/device_caching_allocator.cpp Uses CUDA_CHECK for cudaMalloc retry
core/memory/caching_allocator_bk.h Adds legacy/backup caching allocator header
core/memory/caching_allocator.h Adds unified caching allocator + Torch hooks
core/core Removes core path alias file
core/common/types.h Adds enum/string helpers + utilities
core/common/sync.h Adds Futex utility class
core/common/pytorch.h Adds dtype conversion helpers + blob helpers
core/common/generator.h Adds ID/UUID/time helpers
core/common/context.h Adds global Context config
core/common/constant.h Adds size constants
core/aio/archer_tensor_handle.h Adds partition sizing + IO thread configuration
core/aio/archer_tensor_handle.cpp Implements partitioned storage rollover
core/aio/archer_prio_aio_handle.h Adds pinned pool + exit notifications + thread count
core/aio/archer_prio_aio_handle.cpp Adds pinned-buffer chunked writes + scheduler CV
core/aio/archer_aio_threadpool.h Adds round-robin counter
core/aio/archer_aio_threadpool.cpp Replaces rand() scheduling with round-robin
core/aio/archer_aio_thread.h Adds atomics + CVs for thread coordination
core/aio/archer_aio_thread.cpp Reworks thread loop to use CVs (no busy-wait)
core/CMakeLists.txt Adds archer_core static lib build definition
RELEASE.md Updates build instructions (drops BUILD_OPS)
README.md Updates benchmark table + citation formatting
MANIFEST.in Updates packaging includes (kernel headers/cuda)
CMakeLists.txt Adds top-level CMake build entrypoint
CITATIONS.md Adds author to citations
.pre-commit-config.yaml Adds statics allowlist + mypy hook
.github/workflows/publish.yml Updates runner OS + wheel build command
.github/workflows/publish-test.yml Updates runner OS + wheel build command
.github/workflows/pre-commit-format.yml Updates runner OS
.github/workflows/claude.yml Adds Claude workflow automation
.github/workflows/claude-code-review.yml Adds Claude Code Review workflow
.github/workflows/build-test.yml Switches to CPU-only sdist build in CI
.dockerignore Adds docker ignore patterns

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +42 to +54
@triton.jit
def fused_softmax_topk_kernel(
hidden_ptr, # [B, H]
weight_ptr, # [H, E]
routing_mask_ptr, # [B, E] (bool)
routing_weight_ptr, # [B, E] (float16)
B: tl.constexpr,
H: tl.constexpr,
E: tl.constexpr,
TOPK: tl.constexpr,
BLOCK_E: tl.constexpr,
normalize_topk: tl.constexpr, # New!
):
Comment on lines +67 to +68
logits += tl.load(bias_ptr + off_e, mask=off_e < E, other=0.0)

Comment on lines +155 to +164

logits = tl.full([BLOCK_E], -float("inf"), dtype=tl.float32)

for h in range(H):
h_val = tl.load(hidden_ptr + batch_id * H + h)
w_ptr = weight_ptr + off_e * H + h
valid = off_e < E
w_val = tl.load(w_ptr, mask=valid, other=0.0)
logits = tl.where(valid, logits + h_val * w_val, logits)

Comment on lines +165 to +170
# Softmax
max_logit = tl.max(logits, axis=0)
logits = logits - max_logit
exp_logits = tl.exp(logits)
sum_exp = tl.sum(exp_logits, axis=0)
probs = exp_logits / sum_exp
Comment thread moe_infinity/kernel/router.py Outdated

for i in range(BLOCK_E):
if i < E:
p = tl.load(probs + batch_id * E + i)
Comment on lines +35 to +40
PinnedMemoryPool::~PinnedMemoryPool() {
for (auto* ptr : all_chunks_) {
cudaHostUnregister(ptr);
free(ptr);
}
}
Comment thread moe_infinity/runtime/model_offload.py Outdated
Comment on lines +64 to +68
except ImportError as exc:
raise ImportError(
"moe_infinity._store extension is required. Install with CUDA enabled."
) from exc

Comment thread docker/verify_build.py Outdated
Comment on lines +45 to +46
so_path = "moe_infinity/_store.cpython-311-x86_64-linux-gnu.so"
if os.path.exists(so_path):
Comment thread tests/cuda/CMakeLists.txt Outdated
endmacro()

# Add include directories and link for CUTLASS
set(CUTLASS_DIR $ENV{HOME}/cutlass)
Comment thread tests/cuda/CMakeLists.txt Outdated
Comment on lines +121 to +124
set(CUDA_ARCHITECTURES 86)

# set nvcc flags
set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -g -G -lineinfo -rdynamic -O3 -gencode arch=compute_86,code=sm_86 -Xcompiler -fopenmp")
@drunkcoding drunkcoding requested a review from Copilot March 16, 2026 18:26
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR integrates fused CUTLASS MoE MLP kernels and router support, adds Qwen3 MoE model support, and refactors the build/test tooling (CMake, setup.py, CI, Docker) to support the updated C++/CUDA core and new benchmarks/tests.

Changes:

  • Adds/updates CUDA kernels and benchmarks (fused MoE MLP + router/topk-softmax/activations) and related build plumbing.
  • Adds Qwen3 MoE integration and updates several model wrappers to use the new expert dispatch interface.
  • Refactors core I/O/AIO, memory pools, dispatch infrastructure, and introduces new C++ unit tests + CI/Docker verification.

Reviewed changes

Copilot reviewed 131 out of 161 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
tests/cuda/test_single_gemm_tiled.cu Adds a CUTLASS tiled GEMM test program.
tests/cuda/test_fused_mlp_cutlass.cu Adds Torch-vs-CUTLASS fused MLP benchmark and correctness check.
tests/cuda/test_fused_mlp.cu Adds CUTLASS back-to-back GEMM test driver.
tests/cuda/test_expert_fusion_v2.cu Adds an experimental fully fused expert GEMM kernel test.
tests/cuda/test_autotune_blocksize.cu Adds a tile/blocksize autotuning example binary.
tests/cuda/CMakeLists.txt Adds CUDA tests build configuration and Torch/CUTLASS wiring.
tests/cpp/unittest/utils/test_simple_object_pool.cpp Adds gtest coverage for SimpleObjectPool.
tests/cpp/unittest/utils/test_lfu_cache.cpp Adds gtest coverage for LFUCache behavior and reset/eviction.
tests/cpp/unittest/utils/CMakeLists.txt Adds CMake targets for utils unit tests.
tests/cpp/unittest/queues/test_lockfree_queue.cpp Adds edge-case and concurrency tests for queue implementation.
tests/cpp/unittest/queues/CMakeLists.txt Adds CMake targets for queue unit tests.
setup.py Refactors extension build to new _store/_engine CUDAExtensions + CUTLASS includes.
pyproject.toml Adds ruff format settings and mypy config.
op_builder/prefetch.py Removes legacy DeepSpeed-derived op builder file.
op_builder/all_ops.py Removes legacy op discovery/build system.
op_builder/init.py Removes legacy op-builder package init/closures.
moe_infinity/utils/hf_config.py Adds Qwen3 parsing support and typing improvements.
moe_infinity/utils/config.py Adjusts ArcherConfig types/defaults (trace_path optional, threads default).
moe_infinity/runtime/model_offload.py Switches to loading new _store extension; adds Qwen3 patching and new dispatch path.
moe_infinity/runtime/compile.py Adds TorchScript compilation helper for expert modules.
moe_infinity/ops/op_builder Removes symlink-like path entry.
moe_infinity/ops/core Removes symlink-like path entry.
moe_infinity/models/switch_transformers.py Migrates dispatch to new executor API (dispatch_local + wait).
moe_infinity/models/qwen.py Adds Qwen3MoEBlock implementation using executor + fused routing.
moe_infinity/models/nllb_moe.py Migrates dispatch to new executor API and output handling.
moe_infinity/models/modeling_deepseek_v2/modeling_deepseek.py Alters generation cache max length handling.
moe_infinity/models/mixtral.py Migrates dispatch to new executor API and reshaping/casting.
moe_infinity/models/grok.py Migrates dispatch to new executor API and casting.
moe_infinity/models/deepseek.py Reworks DeepSeek MoE gating/routing and executor dispatch integration.
moe_infinity/models/init.py Exports Qwen3MoEBlock.
moe_infinity/kernel/router.py Adds Triton router kernels (softmax + fused softmax/topk).
moe_infinity/kernel/init.py Exposes router kernel launch helpers.
moe_infinity/entrypoints/openai/protocol.py Maps OpenAI max_tokens to HF max_new_tokens; raises default max_tokens.
moe_infinity/entrypoints/big_modeling.py Updates deepseek module patching to deepseek_v2/deepseek_v3 paths.
moe_infinity/distributed/expert_executor.py Changes executor API to pass weights; adds wait_dispatch_local.
moe_infinity/common/constants.py Adds Qwen3 model mapping and updates deepseek import path.
extensions/test_io/CMakeLists.txt Adds CMake to build/install a test IO Python module.
extensions/kernel/utils.h Adds CUDA/CUTLASS arch detection helpers and tile calculator.
extensions/kernel/ops.h Adds kernel function declarations (activations, topk_softmax).
extensions/kernel/grouped_threadblock_swizzle.h Adds CUTLASS grouped swizzle/visitor for grouped GEMMs.
extensions/kernel/fused_moe_mlp.h Adds fused CUTLASS MoE MLP interface header.
extensions/kernel/epilogue_utils.h Adds custom CUTLASS epilogue utilities for fused SiLU+mul.
extensions/kernel/dispatch_utils.h Adds ATen dispatch helper macros.
extensions/kernel/common_device.h Adds device activation helpers (relu/silu/gelu) + warp activation.
extensions/kernel/b2b_gemm_grouped_problem_visitor.h Adds grouped problem visitor for B2B GEMMs.
extensions/kernel/activation_kernels.cu Adds CUDA activation/gating kernels and PyTorch bindings.
extensions/CMakeLists.txt Adds CMake build/install for prefetch_op and test_io modules.
examples/readme_example.py Makes README example configurable via CLI and updates default model.
examples/interface_example.py Reworks dataset/prompting and moves MoE init earlier; adds Qwen3 handling.
docker/verify_build.py Adds build verification script for produced shared libs and refactor artifacts.
docker/Dockerfile Adds full build/test Dockerfile including CUTLASS build and extension verification.
core/utils/threadsafe_queue.h Adds NotifyAll to wake all waiters.
core/utils/simple_object_pool.h Adds include for new APIs/tests.
core/utils/logger.h Adds LogStream streaming for vector and enum classes + conditional warn/fatal macros.
core/utils/lockfree_queue.h Replaces lock-free queue with mutex-backed queue implementation (API rename).
core/utils/cuda_utils.h Adds CUTLASS error checking and a GPU timer utility.
core/utils/cuda_utils.cpp Adds nullptr handling to IsDevicePointer.
core/utils/cache.h Fixes LFU cache iterator invalidation and reset behavior; adds touch counter.
core/python/py_archer_prefetch.cpp Extends Python bindings with init_moe_layer/topk_softmax and wait_hidden_states.
core/python/expert_gemm.cu Adds cuBLAS-based expert fused MLP kernels and PyBind module.
core/prefetch/archer_prefetch_handle.cpp Adds MOE_IO_THREADS support and passes IO thread count to tensor handle.
core/parallel/expert_dispatcher.h Refactors dispatcher inputs/outputs, threading/queues, adds notify hooks and JIT path.
core/model/moe.cpp Adds MoELayer init/topk-softmax entrypoints for Python.
core/model/model_topology.h Adds SetModuleMemoryFromDisk_Views declaration.
core/model/model_topology.cpp Adds pipelined disk→host→GPU path and view-only host tensor setup.
core/model/fused_mlp.h Adds legacy wrapper declaration pointing at new fused_moe_ffn_into path.
core/model/fused_mlp.cu Implements legacy wrapper delegating to fused_moe_ffn_into.
core/memory/torch_caching_allocator.h Adds Torch allocator integration to use caching allocator on load.
core/memory/torch_caching_allocator.cpp Instantiates allocator replacement singleton.
core/memory/shared_memory.h Adds shared memory attach/detach helpers and RAII deleter.
core/memory/shared_memory.cpp Implements shared memory mmap attach/detach functions.
core/memory/pinned_memory_pool.h Adds pinned host memory pool interface.
core/memory/pinned_memory_pool.cpp Implements pinned pool allocation and acquire/release.
core/memory/device_caching_allocator.cpp Improves cudaMalloc error handling via CUDA_CHECK.
core/memory/caching_allocator_bk.h Adds backup templated caching allocator (CUDA/UM/host variants).
core/memory/caching_allocator.h Adds unified caching allocator API incl. SHM/PIN/CUDA and Torch hooks.
core/core Removes path entry file.
core/common/types.h Adds enum/string helpers, constexpr utilities, and pointer helpers.
core/common/sync.h Adds a Futex wrapper utility.
core/common/pytorch.h Adds blob→tensor helpers and dtype conversion utilities.
core/common/generator.h Adds UUID/time/id generators.
core/common/context.h Adds global Context container for config values.
core/common/constant.h Adds size constants and warp size constant.
core/aio/archer_tensor_handle.h Adds partition size constant and IO-thread configurable ctor.
core/aio/archer_tensor_handle.cpp Adds storage partition rollover and passes IO thread config to prio handle.
core/aio/archer_prio_aio_handle.h Refactors prio AIO to use atomic exit + thread pool + pinned pool.
core/aio/archer_prio_aio_handle.cpp Implements multi-thread IO scheduling, pinned pool writes, clean shutdown, notifications.
core/aio/archer_aio_threadpool.h Adds round-robin enqueue counter.
core/aio/archer_aio_threadpool.cpp Uses round-robin instead of rand for thread selection.
core/aio/archer_aio_thread.h Adds atomics + condition variables for proper waiting.
core/aio/archer_aio_thread.cpp Implements cv-based worker loop and done notification (removes busy-wait).
core/CMakeLists.txt Adds archer_core static library build with CUDA sources and linking.
RELEASE.md Updates release steps to remove BUILD_OPS usage.
README.md Updates benchmark table and citation formatting/authors.
MANIFEST.in Changes packaging includes (drops op_builder, adds extensions/kernel sources).
CMakeLists.txt Adds top-level CMake build for core + extensions, Torch/CUTLASS discovery.
CITATIONS.md Adds an author to citations.
.pre-commit-config.yaml Adds mypy hook and updates codespell allowlist.
.github/workflows/publish.yml Updates runner OS and removes BUILD_OPS from build step.
.github/workflows/publish-test.yml Updates runner OS and removes BUILD_OPS from build step.
.github/workflows/pre-commit-format.yml Updates runner OS.
.github/workflows/claude.yml Adds Claude Code workflow.
.github/workflows/claude-code-review.yml Adds Claude Code Review workflow.
.github/workflows/build-test.yml Updates CI to build sdist on ubuntu-22.04 with CPU torch.
.dockerignore Adds docker ignore rules for build artifacts and large files.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread extensions/kernel/utils.h
Comment on lines +24 to +28

template <typename T>
struct DetectedArch<T, std::enable_if_t<(__CUDA_ARCH__ > 0)>> {
using SM = DetectedArchT<__CUDA_ARCH__>;
};
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in local branch: wrapped the device-only specialization with #ifdef __CUDA_ARCH__ so host compilation no longer evaluates __CUDA_ARCH__ expressions.

Comment thread core/common/context.h

#include <cstdint>
#include <mutex>
#include <unordered_map>
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in local branch by adding missing <string> and <stdexcept> includes directly in core/common/context.h.

Comment thread core/aio/archer_prio_aio_handle.cpp Outdated
Comment on lines +193 to +195
return !io_queue_high_.empty() || !io_queue_low_.empty() || time_to_exit_;
});
if (time_to_exit_) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in local branch: switched time_to_exit_ checks to time_to_exit_.load() in both the wait predicate and immediate exit branch.

Comment on lines +111 to +137
def launch_fused_softmax_topk(hidden_states, weight, bias, top_k):
B, H = hidden_states.shape
E = weight.shape[1]
dtype = hidden_states.dtype

routing_mask = torch.zeros(
(B, E), dtype=torch.bool, device=hidden_states.device
)
routing_weight = torch.zeros(
(B, E), dtype=dtype, device=hidden_states.device
)

BLOCK_E = 32 # Must divide E

fused_softmax_topk_kernel[(B,)](
hidden_states,
weight,
bias,
routing_mask,
routing_weight,
B=B,
H=H,
E=E,
TOPK=top_k,
BLOCK_E=BLOCK_E,
normalize_topk=True,
)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted in local branch: BLOCK_E is now triton.next_power_of_2(E) and invalid expert lanes are masked to -inf, so routing is no longer capped at a fixed 32-expert block.

Comment thread moe_infinity/kernel/router.py Outdated
(B, E), dtype=dtype, device=hidden_states.device
)

BLOCK_E = 128
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the same local router update: removed the hard-coded BLOCK_E=128 path and switched to triton.next_power_of_2(E) to avoid the constant cap.

Comment thread tests/cuda/CMakeLists.txt Outdated
cmake_minimum_required(VERSION 3.10)
project(LockFreeQueueTests)

add_compile_definitions(_GLIBCXX_USE_CXX11_ABI=0) # needed for torch backward compatibility
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in local branch: removed hard-coded _GLIBCXX_USE_CXX11_ABI=0 and switched to Torch-provided compile flags via add_compile_options(${TORCH_CXX_FLAGS}).

Comment on lines +6 to +19
// Known bugs found during review:
//
// BUG-1 (use-after-erase in get()):
// `get()` stores an iterator `node = keyNodeMap[key]`, calls
// `freqListMap[freq].erase(node)` which invalidates `node`, then reads
// `node->freq`, `node->value` and pushes `*node` into another list.
// Dereferencing an erased list iterator is undefined behaviour.
// The tests below document the *intended* correct behaviour; they may
// crash or produce wrong results with AddressSanitizer enabled.
//
// BUG-2 (concurrent modification in reset()):
// `reset()` iterates over `freqListMap` while simultaneously calling
// `freqListMap[1].push_back(node)` which can insert a new key=1 entry and
// invalidate the range-for iterator.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not addressed in the current local diff for this round. I will keep this thread open for a follow-up cleanup of stale test commentary.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed locally: the stale "known bugs" wording in tests/cpp/unittest/utils/test_lfu_cache.cpp has been updated to regression-coverage framing, so the comment no longer implies the bug is still expected.

Comment thread core/common/sync.h Outdated
Comment on lines +20 to +21
int ret = syscall(SYS_futex, &value_, FUTEX_WAIT, expected, nullptr,
nullptr, 0);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not addressed in the current local diff for this round. I will track this futex typing/portability concern in a separate follow-up patch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed locally in core/common/sync.h: Futex is now constrained to futex-compatible 32-bit integral storage via static_asserts, and futex syscalls use explicit futex-address/expected-value casts (reinterpret_cast<int*>(&value_), static_cast(...)).

Comment on lines +24 to +27
void* OpenSharedMemory(const char* name, size_t size);
void CloseSharedMemory(void* ptr, size_t size);
std::tuple<void*, int> AttachSharedMemory(const char* name, size_t size);
void DetachSharedMemory(void* ptr, int fd, size_t size);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not addressed in the current local diff for this round. I will include the explicit <tuple> include in a follow-up header hygiene pass.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed locally: added explicit #include in core/memory/shared_memory.h for the std::tuple<void*, int> AttachSharedMemory declaration.

Comment on lines +3 to +7
void* OpenSharedMemory(const char* name, size_t size) {
int shm_fd = shm_open(name, O_RDWR, 0666);
LOG_FATAL_IF(shm_fd == -1,
"shm_open failed. name: {}, size: {}; errno: {}, message: {}",
name, size, errno, strerror(errno));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not addressed in the current local diff for this round. I will handle these missing standard includes in a follow-up source hygiene patch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed locally: added explicit #include <errno.h>, #include <fcntl.h>, and #include <string.h> in core/memory/shared_memory.cpp to remove transitive-include reliance for errno/O_RDWR/strerror usage.

@drunkcoding drunkcoding requested a review from Copilot March 19, 2026 18:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@drunkcoding drunkcoding requested a review from Copilot March 19, 2026 19:05
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 131 out of 161 changed files in this pull request and generated 13 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +78 to +94
# Top-k selection (insertion sort)
top_vals = tl.full([TOPK], -float("inf"), dtype=tl.float32)
top_idxs = tl.full([TOPK], -1, dtype=tl.int32)

for i in range(BLOCK_E):
p = probs[i]
idx = i

# insert into sorted list
for j in range(TOPK):
if p > top_vals[j]:
for k in range(TOPK - 1, j, -1):
top_vals[k] = top_vals[k - 1]
top_idxs[k] = top_idxs[k - 1]
top_vals[j] = p
top_idxs[j] = idx
break
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses Python if / break control flow inside a @triton.jit kernel based on runtime values (p > top_vals[j]), which Triton does not support. Rewrite top-k selection using vectorized tl.where-based logic (e.g., repeated tl.max with masking for TOPK steps, or sorting via supported primitives), so the kernel can compile and execute correctly.

Copilot uses AI. Check for mistakes.
Comment on lines +161 to +167
for h in range(H):
h_val = tl.load(hidden_ptr + batch_id * H + h)
w_ptr = weight_ptr + off_e * H + h
valid = off_e < E
w_val = tl.load(w_ptr, mask=valid, other=0.0)
logits = tl.where(valid, logits + h_val * w_val, logits)

Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kernel computes logits via a fully unrolled scalar loop over H (and later over BLOCK_E), which will explode compile time and generate very large kernels for typical MoE hidden sizes (e.g., 2k–8k). Consider blocking over H (e.g., tl.arange tiles + partial reductions) or using a matmul-style approach (or call into an existing optimized kernel) to keep compile time and runtime reasonable.

Copilot uses AI. Check for mistakes.
import torch.nn.functional as F
from transformers.models.qwen3_moe.modeling_qwen3_moe import Qwen3MoeMLP

import moe_infinity._store as prefetch_lib
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Importing moe_infinity._store at module import time will raise ImportError (and break importing moe_infinity.models.qwen) in CPU-only installs or builds where the CUDA extension is not present. Since the block already uses a runtime-injected self.lib, defer importing _store (or remove this import entirely) and load it lazily through the same _load_prefetch_lib() mechanism used elsewhere.

Copilot uses AI. Check for mistakes.
Comment on lines +80 to +83
self.expert_executor.dispatch_local(
self.layer_id, hidden_states, router_mask, combining_weights
)
for output, _, idx, _ in results:
token_indices = router_mask[..., idx].bool()
weights = combining_weights[..., idx]
# print(router_mask.shape, combining_weights.shape, hidden_states.shape, flush=True)
# print(output.shape, weights.shape, token_indices.shape, next_states.shape, flush=True)
# print(output.shape, weights[token_indices].shape, next_states[token_indices].shape, flush=True)
next_states[token_indices] += torch.einsum(
"b,be->be", weights[token_indices], output.to(weights.device)
)
next_states = self.expert_executor.wait_dispatch_local()
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous logic initialized next_states = torch.zeros_like(hidden_states) and then selectively added expert outputs, so treating 0 as a sentinel for 'unrouted' tokens was valid. Now next_states is produced by the executor and may legitimately contain zeros for routed tokens, which would be incorrectly overwritten by the source hidden_states. Use an explicit token participation mask (e.g., derived from router_mask) to merge routed vs unrouted tokens, or ensure the executor returns hidden_states for unrouted tokens internally.

Copilot uses AI. Check for mistakes.
Comment on lines 109 to +110
next_states[next_states == 0] = hidden_states[next_states == 0]
hidden_states = next_states
hidden_states = next_states.to(hidden_states.dtype)
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous logic initialized next_states = torch.zeros_like(hidden_states) and then selectively added expert outputs, so treating 0 as a sentinel for 'unrouted' tokens was valid. Now next_states is produced by the executor and may legitimately contain zeros for routed tokens, which would be incorrectly overwritten by the source hidden_states. Use an explicit token participation mask (e.g., derived from router_mask) to merge routed vs unrouted tokens, or ensure the executor returns hidden_states for unrouted tokens internally.

Copilot uses AI. Check for mistakes.
Comment on lines +173 to +175
fused_gemm_kernel<<<grid, block, shared_mem_size, stream>>>(
X.device_data(), Wg.device_data(), Wu.device_data(), C.device_data(), B,
No, K);
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cutlass::HostTensor typically requires explicit sync_device() before using device_data() to ensure device allocations are populated. As written, X/Wg/Wu are filled on host but never synced to device before launching the kernel, so the kernel can read uninitialized device buffers. Call X.sync_device(); Wg.sync_device(); Wu.sync_device(); (and ensure C is allocated on device as needed) before the launch.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,45 @@
#pragma once

#include <torch/torch.h>
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This header uses memcpy but does not include <cstring> (or another header guaranteeing the declaration). Add #include <cstring> to avoid compilation failures on stricter toolchains.

Copilot uses AI. Check for mistakes.
Comment on lines +14 to +17
void copy_data(void* dest, const void* src, size_t count) const override {
LOG_DEBUG("Copy data from {:p} to {:p}, size: {}", src, dest, count);
memcpy(dest, src, count);
}
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This header uses memcpy but does not include <cstring> (or another header guaranteeing the declaration). Add #include <cstring> to avoid compilation failures on stricter toolchains.

Copilot uses AI. Check for mistakes.
Comment thread core/common/sync.h
Comment on lines +1 to +5
#pragma once

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This header unconditionally includes Linux-only futex headers, which will fail to build on non-Linux platforms. If cross-platform builds are expected, guard this implementation with #if defined(__linux__) and provide a fallback (e.g., condition_variable-based) or move this file behind a Linux-only build option.

Copilot uses AI. Check for mistakes.
Comment thread core/common/constant.h
constexpr int64_t MB = KB * KB;
constexpr int64_t GB = KB * KB * KB;

constexpr int kWrapSize = 32;
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constant name appears to be a typo: kWrapSize should likely be kWarpSize (CUDA warp size). Renaming avoids confusion and improves searchability.

Copilot uses AI. Check for mistakes.
@drunkcoding drunkcoding merged commit 9e42c9d into main Mar 19, 2026
3 checks passed
seanlinmt pushed a commit to seanlinmt/MoE-Infinity that referenced this pull request Apr 2, 2026
…cientMoE#75)

* update table format

* improve table clarity

* init code commit

* doc: add flashattention installation guide and change toc

* feat: remove libaio dependency

* remove spdlog dependency

* misc: remove unused code and dependencies

* misc: remove commented-out code and unused imports

* fix: cuda oom due to safe tensors open

* remove gcc-12 requirement

* gptq disable exllama

* fix: key error in offload set

* add forward and call (EfficientMoE#7)

* add forward and call
* fix a bug

* feat: support grok-1 model

* update API note and install

* Feature/expert parallel (EfficientMoE#9)

* add back expert parallel by id hash

* add grok ep

* fix mistral typo

* accom cuda copy bug

* sync after compute

* fix:sync to make sure that input is ready

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: luzhan <513964121@qq.com>

* fix tokenizer in example

* Xly/deepseek (EfficientMoE#34)

* add override QuantLinear (EfficientMoE#29)

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* use torch streampool

* format

* working deepspeed backend

* fix: revert apply_rotary_pos_emb in deepseek

* fix busy waiting

* fix deepseek flashattn

* add deepseek v3

* format and fix multigpu deepseek bug

* with device caching allocator

* add on-demand lock cache

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: lausannel <513964121@qq.com>

* Upstream (EfficientMoE#72)

* Fix: Undefined Symbol Compilation Error (EfficientMoE#37)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* Refactor code for better performance (EfficientMoE#38)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: add pre commit format ci (EfficientMoE#40)

* ci: add pre commit format ci

* fix: add requirements for linting

* fix: format code before merge

* fix: update local clang format version

* Chore: rename organization name & optimize CI (EfficientMoE#41)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: fix not a git repository in CI (EfficientMoE#43)

* CI: fix missing sudo in apt install (EfficientMoE#44)

* CI: fix missing sudo (EfficientMoE#45)

* CI: revert os matrix in CI (EfficientMoE#46)

* CI: add missing apt update after installing deb file (EfficientMoE#47)

* Doc: Update README example to DeepSeek and Suppress Warning (EfficientMoE#49)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

* update readme example to deepseek and supress warning

* format

* revert CI changes to main version

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: do not build test if document update (EfficientMoE#52)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

* update readme example to deepseek and supress warning

* format

* revert CI changes to main version

* update readme conda env and ignore doc update in build and release

* fix wildcard

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* feat: Introduce Local Server for OpenAI-Compatible APIs  (EfficientMoE#4)

* update table format

* improve table clarity

* init code commit

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>

* feat: set parameter to device before serving (EfficientMoE#56)

* update table format

* improve table clarity

* init code commit

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

* fix gen broken

* update readme links

* cancel concurrent job

* set dense node to device

* sparse node set cpu

* remove OS def

* use update to date clang-format

* fix setuptools version

* fix setuptools version for python 3.8

* keep single cuda version in publish

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>

* Chore(deps): Bump pyarrow from 12.0.0 to 14.0.1 (EfficientMoE#69)

Bumps [pyarrow](https://github.com/apache/arrow) from 12.0.0 to 14.0.1.
- [Release notes](https://github.com/apache/arrow/releases)
- [Commits](apache/arrow@go/v12.0.0...go/v14.0.1)

---
updated-dependencies:
- dependency-name: pyarrow
  dependency-version: 14.0.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Audit repository for stale code indicators (EfficientMoE#71)

* Fix: Undefined Symbol Compilation Error (EfficientMoE#37)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* Refactor code for better performance (EfficientMoE#38)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: add pre commit format ci (EfficientMoE#40)

* ci: add pre commit format ci

* fix: add requirements for linting

* fix: format code before merge

* fix: update local clang format version

* Chore: rename organization name & optimize CI (EfficientMoE#41)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: fix not a git repository in CI (EfficientMoE#43)

* CI: fix missing sudo in apt install (EfficientMoE#44)

* CI: fix missing sudo (EfficientMoE#45)

* CI: revert os matrix in CI (EfficientMoE#46)

* CI: add missing apt update after installing deb file (EfficientMoE#47)

* Doc: Update README example to DeepSeek and Suppress Warning (EfficientMoE#49)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

* update readme example to deepseek and supress warning

* format

* revert CI changes to main version

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: do not build test if document update (EfficientMoE#52)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

* update readme example to deepseek and supress warning

* format

* revert CI changes to main version

* update readme conda env and ignore doc update in build and release

* fix wildcard

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* feat: Introduce Local Server for OpenAI-Compatible APIs  (EfficientMoE#4)

* update table format

* improve table clarity

* init code commit

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>

* feat: set parameter to device before serving (EfficientMoE#56)

* update table format

* improve table clarity

* init code commit

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

* fix gen broken

* update readme links

* cancel concurrent job

* set dense node to device

* sparse node set cpu

* remove OS def

* use update to date clang-format

* fix setuptools version

* fix setuptools version for python 3.8

* keep single cuda version in publish

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>

* Initial plan

* Add mypy lint hook

Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>

* Configure mypy settings

Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>

* Adjust mypy scope

Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>

* Scope mypy checks

Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>

---------

Co-authored-by: Leyang Xue <s2062808@ed.ac.uk>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>

* feat: performance improvement and Qwen3 support (EfficientMoE#60)

* add openai api support

* add test scripts, update readme, update api

* Fix: Undefined Symbol Compilation Error (EfficientMoE#37)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* Refactor code for better performance (EfficientMoE#38)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: add pre commit format ci (EfficientMoE#40)

* ci: add pre commit format ci

* fix: add requirements for linting

* fix: format code before merge

* fix: update local clang format version

* Chore: rename organization name & optimize CI (EfficientMoE#41)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: fix not a git repository in CI (EfficientMoE#43)

* CI: fix missing sudo in apt install (EfficientMoE#44)

* CI: fix missing sudo (EfficientMoE#45)

* CI: revert os matrix in CI (EfficientMoE#46)

* CI: add missing apt update after installing deb file (EfficientMoE#47)

* Doc: Update README example to DeepSeek and Suppress Warning (EfficientMoE#49)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

* update readme example to deepseek and supress warning

* format

* revert CI changes to main version

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* CI: do not build test if document update (EfficientMoE#52)

* reformat code vllm style

* add threadsafe queues

* fix compilation error

* split files and remove queuing

* performance improvement

* remove error dependency

* add try lock return check

* fix header dependency

* fix hard coded number

* update CI using cuda docker image

* repo consistency

* pr template fix

* format doc

* delete gpu option, add --no-install-recommends

* add cuda matrix and remove cuda full package install

* remove publish container

* change team name to efficient moe

* update readme example to deepseek and supress warning

* format

* revert CI changes to main version

* update readme conda env and ignore doc update in build and release

* fix wildcard

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

* feat: Introduce Local Server for OpenAI-Compatible APIs  (EfficientMoE#4)

* update table format

* improve table clarity

* init code commit

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>

* fix gen broken

* update readme links

* cancel concurrent job

* set dense node to device

* sparse node set cpu

* remove OS def

* use update to date clang-format

* fix setuptools version

* fix setuptools version for python 3.8

* keep single cuda version in publish

* feat: set parameter to device before serving (EfficientMoE#56)

* update table format

* improve table clarity

* init code commit

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

* fix gen broken

* update readme links

* cancel concurrent job

* set dense node to device

* sparse node set cpu

* remove OS def

* use update to date clang-format

* fix setuptools version

* fix setuptools version for python 3.8

* keep single cuda version in publish

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>

* add max length in gen openai

* fix cache race condition

* all param init at host

* add qwen3

* ubuntu lts and build

* pre-commit ubuntu version

* router weights update overlap

* rename deepseek_v2 and reduce torch kernel launch

* fix import

* fix build and fix bug

* fix citation linebreak

* fix typo

* fix dtype size

* remove comments

* fix example

* pr update init

* remove comment and unify deepseek preroute

* feat: Merge kernels from vLLM and FlashInfer (EfficientMoE#63)

* new allocator

* add kernel compilation

* stable topk

---------

Co-authored-by: xly <leyang.xue@ed.ac.uk>

---------

Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>

* Add Claude Code GitHub Workflow (EfficientMoE#73)

* "Claude PR Assistant workflow"

* "Claude Code Review workflow"

* Xly/code clean (EfficientMoE#74)

* add openai api support

* add test scripts, update readme, update api

* format and change to deepseek in example

* fix format

* remove unused files

* fix api server token id device

* fix gen broken

* update readme links

* cancel concurrent job

* set dense node to device

* sparse node set cpu

* remove OS def

* use update to date clang-format

* fix setuptools version

* fix setuptools version for python 3.8

* keep single cuda version in publish

* add max length in gen openai

* fix cache race condition

* all param init at host

* add docker and sllm style read

* wrap docker and test coverage

* test

* Clean up symlinks: remove unused op_builder, core/core, and move test_io to extensions

* Replace core/kernel directory with symlink to extensions/kernel

* seperations

* remove ops dependency

* Add CUTLASS fused MoE FFN kernel and supporting infrastructure

- Add extensions/kernel/fused_moe_mlp.cu/h: BF16 CUTLASS 3-GEMM fused path
  (gate → up w/ SiLU-mul epilogue → down) with small-M and large-K tile dispatch
- Add tests/cuda/test_fused_mlp_cutlass.cu: BF16 CUTLASS vs Torch-native benchmark
- Integrate fused kernel into core/parallel/expert_module.cpp via ForwardHelper()
- Update core/model/fused_mlp.cu/h and extensions/kernel/epilogue_utils.h
- Improve core/utils: cache.h, lockfree_queue.h, simple_object_pool.h
- Update tests/cuda/CMakeLists.txt with KERNEL_SRC pattern for CUTLASS tests
- Update CLAUDE.md docs and setup.py build config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add prefill-decode collocation benchmark with throughput analysis

Benchmarks five attention colocation strategies for serving decode and
prefill requests on the same GPU time-slice:
  0  serial          — sequential on default stream
  1  varlen-fused    — single flash_attn_varlen_func (continuous batching)
  2  dual-stream     — two CUDA streams, no SM partition
  3  green-ctx-sm    — SM-partitioned green contexts (CUDA ≥ 12.4)
  4  green-ctx-sm-wq — SM + work-queue balanced scope (CUDA 13.1+)

Throughput analysis includes:
- Separate decode-only / prefill-only baselines with TFLOPS and tok/s
- Ideal-overlap bound (perfect concurrency = max(dec, pre))
- Per-mode: TFLOPS, decode tok/s, prefill tok/s, overlap efficiency
- Generation-projection table: decode overhead and Δ vs serial per mode

CUDA 13.1 green context API notes (driver 590.x):
- CUdevResourceDesc is a pointer typedef (c_void_p), not a struct
- cuGreenCtxStreamCreate requires CU_STREAM_NON_BLOCKING flag
- CU_DEV_RESOURCE_TYPE_WORKQUEUE_CONFIG = 1000; configure sharingScope
  to CU_WORKQUEUE_SCOPE_GREEN_CTX_BALANCED for WQ isolation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Bump pydantic and transformers to resolve Dependabot alerts

- pydantic 1.10.12 → 1.10.13: fixes ReDoS in email validation (GHSA-mr82-8j83-vxmv)
- transformers 4.51.3 → 4.53.0: fixes 14 alerts including 3 HIGH RCE
  (GHSA-wrfc-pvp9-mr9g, GHSA-hxxf-235m-72v3, GHSA-qxrp-vhvm-j765)
  and 11 MEDIUM/LOW ReDoS vulnerabilities
- Remove torch==2.3.1 pin (managed by conda env / base image)
- Add flash-attn to requirements

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix build

* Make readme_example.py testable via --help

Add argparse to readme_example.py so that model-loading code runs only
after parse_args(), allowing `--help` to exit 0 without a GPU or model.
Replace the AST-only test_readme_example_syntax with test_readme_example_help,
which mirrors the existing test_interface_example_help pattern and is verified
passing in Docker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix format CI and build-test CI

- requirements.txt: sort flash-attn alphabetically (between fastapi and hjson)
  so requirements-txt-fixer pre-commit hook passes
- build-test.yml: replace Ubuntu 20.04 CUDA container (Python 3.8, broken
  PyTorch wheel) with actions/setup-python Python 3.10 + CPU-only torch;
  switch from full wheel build to sdist-only (--no-isolation) to avoid
  CUTLASS dependency and 20+ min compile time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* green ctx bench

* Fix CI: guard CUDA extensions behind cuda_available, add statics to codespell ignore

- setup.py: only build CUDAExtension when torch.version.cuda is set;
  the build-test CI installs CPU-only torch and lacks CUDA_HOME, causing
  CUDAExtension to abort with OSError
- .pre-commit-config.yaml: add 'statics' to codespell ignore-words-list;
  the term is valid C++ (module-level static variables) but was flagged
  as a misspelling of 'statistics'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* tests update

---------

Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* remove claude

* format

* resolve review

* resolve reviews

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: lausannel <513964121@qq.com>
Co-authored-by: Yao Fu <fuyao3860@gmail.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants