support ds4 by WANDY666 · Pull Request #1355 · ModelTC/LightLLM

WANDY666 · 2026-06-15T13:32:30Z

No description provided.

Root cause of the historical cudagraph accuracy drop (gsm8k 0.96 -> 0.74, coherent-but-runaway generations; same 0.75 the pre-v5 fullslot_decode experiments worked around): _capture_decode warms up via copy.copy(infer_state), which SHARES decode_att_state. FlashMLASchedMeta is lazily planned at the first kernel call and written back onto that shared state, so the warmup pass locks a schedule planned for the dummy batch (seq=2); the capture pass then binds those stale scheduler tensors and every replay runs real requests with a tile schedule planned for near-empty kv (systematically under-read attention).

Fix: reset_sched_meta_for_capture() hook on the nsa decode att state, invoked in both capture paths after warmup, so planning happens INSIDE the captured region and re-plans on every replay from live tensors.

Validation (tp4, H200, prompt cache on): batch-1 greedy decode is now character-identical to eager; per-layer probe shows embed+swa layers bitwise equal under replay, benign rounding-class deltas only in compress layers, argmax unchanged. gsm8k 100q/128: cold 0.960/111s, warm 0.960/23.3s 100% hits (eager: 0.95-0.97, cold 141s / warm 50s). Batch-1 decode 20.4ms/token vs 142ms eager. 41/41 unit tests green. Codex review GO (incl. overlap-path symmetry).

launch.sh: drop --disable_cudagraph, derive PYTHONPATH from the script dir (hardcoded tree path made a worktree launch silently serve main-tree code).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
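The sharing bug described above can be reproduced in miniature. The sketch below is a toy model, not LightLLM code: `SchedMeta`, `InferState`, `decode_step`, and `capture` are hypothetical stand-ins, and the real FlashMLA scheduler metadata is far richer. It only illustrates how a shallow `copy.copy` lets a warmup pass lock a lazily planned schedule, and how a reset after warmup avoids that. It does not model re-planning on every graph replay.

```python
import copy


class SchedMeta:
    """Hypothetical stand-in for a lazily planned attention schedule."""

    def __init__(self):
        self.planned_for = None  # kv length the schedule was planned for

    def plan_if_needed(self, kv_len):
        # Lazy planning: only the first call plans; later calls reuse the result.
        if self.planned_for is None:
            self.planned_for = kv_len
        return self.planned_for

    def reset(self):
        # Forget the plan so the next call plans again from live inputs.
        self.planned_for = None


class InferState:
    """Hypothetical inference state holding a nested attention state."""

    def __init__(self, kv_len):
        self.kv_len = kv_len
        self.att_state = SchedMeta()


def decode_step(state):
    # Returns the kv length the schedule in use was planned for.
    return state.att_state.plan_if_needed(state.kv_len)


def capture(real_state, reset_after_warmup):
    # Shallow copy: warm.att_state is the SAME object as real_state.att_state.
    warm = copy.copy(real_state)
    warm.kv_len = 2  # dummy warmup batch
    decode_step(warm)  # warmup plans for kv_len=2 on the shared state
    if reset_after_warmup:
        real_state.att_state.reset()
    return decode_step(real_state)  # schedule seen by the "captured" pass


stale = capture(InferState(kv_len=4096), reset_after_warmup=False)
fresh = capture(InferState(kv_len=4096), reset_after_warmup=True)
print(stale, fresh)  # 2 4096
```

Without the reset, the "captured" pass inherits the schedule planned for the dummy batch; with it, planning sees the real kv length.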

Graph-sandwich prefill (graphs capture dense ops only; attention/compressor run eagerly between segments) was already in-tree; enabling it exposed that HOLD-pad rows read the racing HOLD slot, making their hiddens nondeterministic and perturbing real rows via MoE expert batching (ulp-level, amplified ~1.9x/layer).

Fix: zero the pad rows' attention output.

The residual greedy-trajectory divergence vs eager equals the fp4 marlin MoE kernel's own run-to-run reduction-order noise (eager-vs-eager control: 0/4 match) and is accepted statistically: gsm8k 100q cold 0.980/115.5s, warm 0.960/25.9s (eager-baseline parity); batch-1 TTFT 1.86x at 46 tokens.
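As a rough illustration of the pad-row fix, the sketch below zeroes the attention output of padding rows so their values no longer depend on whatever the shared slot happened to contain. It uses plain Python lists for self-containment; `zero_pad_rows`, `att_out`, and `is_pad` are hypothetical names, and the actual change presumably operates on GPU tensors.

```python
def zero_pad_rows(att_out, is_pad):
    """Return a copy of att_out with every padding row replaced by zeros."""
    return [
        [0.0] * len(row) if pad else list(row)
        for row, pad in zip(att_out, is_pad)
    ]


# Two real rows followed by one pad row holding arbitrary (racy) values.
att_out = [[0.5, -1.25], [3.0, 7.0], [9.9, 9.9]]
is_pad = [False, False, True]

cleaned = zero_pad_rows(att_out, is_pad)
print(cleaned)  # [[0.5, -1.25], [3.0, 7.0], [0.0, 0.0]]
```

Real rows pass through unchanged; only the pad row becomes deterministic.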

gemini-code-assist

Code Review

This pull request introduces comprehensive support for serving the DeepSeek-V4-Flash model in LightLLM. Key additions include a custom memory manager (DeepseekV4MemoryManager) and request manager (DeepseekV4ReqManager) to handle packed page-slab storage, sliding window attention (SWA) slot allocation, and compression slot preparation. It also implements the NsaFlashMlaFp8SparseAttBackend for FlashMLA sparse attention, adds FuseMoeMXFP4 for MXFP4 quantized MoE weights, and updates the Radix cache to support page-aligned prefix matching and SWA page reclamation. The review feedback highlights a runtime AttributeError in mxfp4_impl.py due to incorrect attribute access on self instead of self.moe_weight, and suggests parameterizing hardcoded absolute paths in launch.sh to enhance portability.

lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/impl/mxfp4_impl.py (41)

The attribute n_routed_experts is accessed on self (which is an instance of FuseMoeMXFP4), but it is actually defined on the weight object self.moe_weight (an instance of FusedMoeWeight). Accessing it directly on self will raise an AttributeError at runtime. Please use self.moe_weight.n_routed_experts instead.

            global_num_experts=self.moe_weight.n_routed_experts,
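A minimal sketch of the failure mode the reviewer describes, assuming the attribute is set only on the weight object. The two classes below are simplified toy stand-ins that borrow the names from the review; they are not the real LightLLM classes, and the method names are hypothetical.

```python
class FusedMoeWeight:
    """Toy stand-in: owns the expert count."""

    def __init__(self, n_routed_experts):
        self.n_routed_experts = n_routed_experts


class FuseMoeMXFP4:
    """Toy stand-in: holds a reference to the weight object."""

    def __init__(self, moe_weight):
        self.moe_weight = moe_weight

    def num_experts_buggy(self):
        # n_routed_experts is not defined on this class, so this raises.
        return self.n_routed_experts

    def num_experts_fixed(self):
        # Read the attribute from the object that actually defines it.
        return self.moe_weight.n_routed_experts


impl = FuseMoeMXFP4(FusedMoeWeight(n_routed_experts=8))

try:
    impl.num_experts_buggy()
    raised = False
except AttributeError:
    raised = True

print(raised, impl.num_experts_fixed())  # True 8
```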

launch.sh (45-47)

The paths /data/wanzihao/sglang/python and /data/models/DeepSeek-V4-Flash are hardcoded absolute paths specific to a user's environment. This makes the launch script non-portable for other users or environments. Consider using environment variables with these paths as defaults to improve portability and usability.

PYTHONPATH="${REPO_DIR}":"${SGLANG_PATH:-/data/wanzihao/sglang/python}" \
python -m lightllm.server.api_server \
  --model_dir "${MODEL_DIR:-/data/models/DeepSeek-V4-Flash}" \

WANDY666 and others added 22 commits June 3, 2026 09:20

one pass

7c7bd61

Optimization

d790ad2

add prompt cache

a161244

support cudagraph

61eed87

refact tokenizer

19866d0

add statement

29c6082

format

ffafdbf

pass gsm8k but need review

e8009cb

fix

b3b8123

fix rope

6002866

fix profile

c09dc6a

support fp8

c07e38c

optimize

ff71706

fix

d7dd6e0

compress infer

3a5dcdc

add c128 to mem_manager

d76450f

refact

07d2308

opt

d4dcd8a

opt

62c16d5

delete launch.sh

69824d0

gemini-code-assist Bot reviewed Jun 15, 2026

WANDY666 added 7 commits June 15, 2026 14:15

fix

df70ecb

restore

1ad981d

support parser

7b17bb5

fix

6837abd

Merge branch 'main' of https://github.com/ModelTC/LightLLM into suppo…

e1376fe

…rt_ds4

add c4 paged indexes

02a24ce

fix chunk_size and page_size

52a1528

WANDY666 added 2 commits June 18, 2026 03:08

add sglang third_party

0dbc90b

fix tpsp

e8c49d1

WANDY666 wants to merge 31 commits into main from support_ds4
