support ds4 #1355
Root cause of the historical cudagraph accuracy drop (gsm8k 0.96 -> 0.74, coherent-but-runaway generations; same 0.75 the pre-v5 fullslot_decode experiments worked around): _capture_decode warms up via copy.copy(infer_state), which SHARES decode_att_state. FlashMLASchedMeta is lazily planned at the first kernel call and written back onto that shared state, so the warmup pass locks a schedule planned for the dummy batch (seq=2); the capture pass then binds those stale scheduler tensors and every replay runs real requests with a tile schedule planned for near-empty kv (systematically under-read attention).

Fix: reset_sched_meta_for_capture() hook on the nsa decode att state, invoked in both capture paths after warmup, so planning happens INSIDE the captured region and re-plans on every replay from live tensors.

Validation (tp4, H200, prompt cache on): batch-1 greedy decode is now character-identical to eager; per-layer probe shows embed+swa layers bitwise equal under replay, benign rounding-class deltas only in compress layers, argmax unchanged. gsm8k 100q/128: cold 0.960/111s, warm 0.960/23.3s 100% hits (eager: 0.95-0.97, cold 141s / warm 50s). Batch-1 decode 20.4ms/token vs 142ms eager. 41/41 unit tests green. Codex review GO (incl. overlap-path symmetry).

launch.sh: drop --disable_cudagraph, derive PYTHONPATH from the script dir (hardcoded tree path made a worktree launch silently serve main-tree code).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
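The aliasing described in this commit message can be reproduced in a few lines of plain Python. The sketch below is a toy model only: it does not touch CUDA graphs or FlashMLA, and the classes `SchedMeta`, `AttState`, `InferState` and the helper `warmup_then_run` are hypothetical stand-ins, not the LightLLM types. It assumes only what the message states: the warmup state is made with `copy.copy`, the schedule is planned lazily on first use, and a reset hook clears it after warmup.

```python
import copy


class SchedMeta:
    """Hypothetical stand-in for a lazily planned scheduler object."""

    def __init__(self):
        self.planned_for = None  # kv length the schedule was planned for

    def plan_if_needed(self, kv_len):
        # Plan only on the first call; later calls reuse the cached plan.
        if self.planned_for is None:
            self.planned_for = kv_len
        return self.planned_for


class AttState:
    """Hypothetical attention state that owns the scheduler metadata."""

    def __init__(self):
        self.sched_meta = SchedMeta()

    def reset_sched_meta_for_capture(self):
        # Drop the cached plan so the next kernel call plans again.
        self.sched_meta = SchedMeta()


class InferState:
    def __init__(self, kv_len):
        self.kv_len = kv_len
        self.decode_att_state = AttState()


def run_attention(state):
    # Returns the kv length the schedule in use was planned for.
    return state.decode_att_state.sched_meta.plan_if_needed(state.kv_len)


def warmup_then_run(infer_state, reset_after_warmup):
    warm = copy.copy(infer_state)  # shallow copy: decode_att_state is shared
    warm.kv_len = 2                # dummy warmup batch
    run_attention(warm)            # plans for kv_len=2 on the SHARED state
    if reset_after_warmup:
        infer_state.decode_att_state.reset_sched_meta_for_capture()
    return run_attention(infer_state)


stale = warmup_then_run(InferState(kv_len=4096), reset_after_warmup=False)
fresh = warmup_then_run(InferState(kv_len=4096), reset_after_warmup=True)
print(stale, fresh)
```

Without the reset, the real request reuses the plan made for the dummy batch; with it, the plan is rebuilt from the live length. A `copy.deepcopy` of the state would also avoid the sharing in this toy, but the commit chose a reset hook so that planning lands inside the captured region.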
Graph-sandwich prefill (graphs capture dense ops only; attention/compressor run eagerly between segments) was already in-tree; enabling it exposed that HOLD-pad rows read the racing HOLD slot, making their hiddens nondeterministic and perturbing real rows via MoE expert batching (ulp-level, amplified ~1.9x/layer). Zero the pad rows' attention output. Residual greedy-trajectory divergence vs eager equals the fp4 marlin MoE kernel's own run-to-run reduction-order noise (eager-vs-eager control: 0/4 match), accepted statistically: gsm8k 100q cold 0.980/115.5s warm 0.960/25.9s (eager-baseline parity); batch-1 TTFT 1.86x at 46 tokens.
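The pad-row fix can be pictured as a simple mask over the attention output. The following is a minimal sketch on nested Python lists, assuming a per-row boolean pad flag; the names `zero_pad_rows`, `attn_out` and `is_pad` are illustrative and do not come from the patch, which presumably operates on GPU tensors.

```python
def zero_pad_rows(attn_out, is_pad):
    """Return a copy of attn_out in which every padded row is all zeros."""
    return [
        [0.0] * len(row) if pad else list(row)
        for row, pad in zip(attn_out, is_pad)
    ]


# Row 1 is a pad row whose contents are whatever the racing slot held.
attn_out = [[0.5, -1.25], [3.0, 7.0], [0.1, 0.2]]
is_pad = [False, True, False]

cleaned = zero_pad_rows(attn_out, is_pad)
print(cleaned)
```

Because the pad rows now carry a fixed value, whatever they contribute downstream is the same on every run, which is the property the commit relies on.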
Code Review
This pull request introduces comprehensive support for serving the DeepSeek-V4-Flash model in LightLLM. Key additions include a custom memory manager (DeepseekV4MemoryManager) and request manager (DeepseekV4ReqManager) to handle packed page-slab storage, sliding window attention (SWA) slot allocation, and compression slot preparation. It also implements the NsaFlashMlaFp8SparseAttBackend for FlashMLA sparse attention, adds FuseMoeMXFP4 for MXFP4 quantized MoE weights, and updates the Radix cache to support page-aligned prefix matching and SWA page reclamation. The review feedback highlights a runtime AttributeError in mxfp4_impl.py due to incorrect attribute access on self instead of self.moe_weight, and suggests parameterizing hardcoded absolute paths in launch.sh to enhance portability.
I am having trouble creating individual review comments.
lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/impl/mxfp4_impl.py (41)
The attribute n_routed_experts is accessed on self (which is an instance of FuseMoeMXFP4), but it is actually defined on the weight object self.moe_weight (an instance of FusedMoeWeight). Accessing it directly on self will raise an AttributeError at runtime. Please use self.moe_weight.n_routed_experts instead.
global_num_experts=self.moe_weight.n_routed_experts,
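The failure mode the reviewer describes is ordinary Python attribute lookup on the wrong object. A minimal sketch follows, using stub classes (`FusedMoeWeightStub` and `FuseMoeImplStub` are hypothetical and only mirror the relationship described in the comment, not the real LightLLM classes):

```python
class FusedMoeWeightStub:
    """Stub for the weight object that actually holds n_routed_experts."""

    def __init__(self, n_routed_experts):
        self.n_routed_experts = n_routed_experts


class FuseMoeImplStub:
    """Stub for the impl object that only holds a reference to the weight."""

    def __init__(self, moe_weight):
        self.moe_weight = moe_weight

    def experts_buggy(self):
        # The impl object has no such attribute, so this raises.
        return self.n_routed_experts

    def experts_fixed(self):
        # Go through the weight object, as the review suggests.
        return self.moe_weight.n_routed_experts


impl = FuseMoeImplStub(FusedMoeWeightStub(n_routed_experts=8))

try:
    impl.experts_buggy()
    raised = False
except AttributeError:
    raised = True

fixed_count = impl.experts_fixed()
print(raised, fixed_count)
```

The error only surfaces when the offending line executes, so a code path that is not exercised by tests can carry it unnoticed.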
launch.sh (45-47)
The paths /data/wanzihao/sglang/python and /data/models/DeepSeek-V4-Flash are hardcoded absolute paths specific to a user's environment. This makes the launch script non-portable for other users or environments. Consider using environment variables with these paths as defaults to improve portability and usability.
PYTHONPATH="${REPO_DIR}":"${SGLANG_PATH:-/data/wanzihao/sglang/python}" \
python -m lightllm.server.api_server \
--model_dir "${MODEL_DIR:-/data/models/DeepSeek-V4-Flash}" \
No description provided.