support ds4#1355

Open
WANDY666 wants to merge 31 commits into main from support_ds4

@WANDY666

Contributor

No description provided.

WANDY666 and others added 22 commits June 3, 2026 09:20
Root cause of the historical cudagraph accuracy drop (gsm8k 0.96 -> 0.74,
coherent-but-runaway generations; same 0.75 the pre-v5 fullslot_decode
experiments worked around): _capture_decode warms up via copy.copy(infer_state),
which SHARES decode_att_state. FlashMLASchedMeta is lazily planned at the first
kernel call and written back onto that shared state, so the warmup pass locks a
schedule planned for the dummy batch (seq=2); the capture pass then binds those
stale scheduler tensors and every replay runs real requests with a tile schedule
planned for near-empty kv (systematically under-read attention).

Fix: reset_sched_meta_for_capture() hook on the nsa decode att state, invoked in
both capture paths after warmup, so planning happens INSIDE the captured region
and re-plans on every replay from live tensors.
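The stale-schedule mechanism described above can be reproduced in a few lines of plain Python. This is a toy sketch, not LightLLM code: `AttState`, `InferState`, `run_kernel`, and `capture` are hypothetical stand-ins, and the "schedule" is just a dict. Only the `copy.copy` sharing behavior and the idea of a reset hook called after warmup come from the commit message.

```python
import copy


class AttState:
    """Hypothetical stand-in for the decode attention state."""

    def __init__(self):
        self.sched_meta = None  # planned lazily on the first kernel call

    def reset_sched_meta_for_capture(self):
        # Drop any cached plan so the next kernel call plans again.
        self.sched_meta = None


class InferState:
    """Hypothetical stand-in for infer_state."""

    def __init__(self, seq_len):
        self.seq_len = seq_len
        self.decode_att_state = AttState()


def run_kernel(state):
    """Plan lazily, then report the kv length the schedule was planned for."""
    att = state.decode_att_state
    if att.sched_meta is None:
        att.sched_meta = {"planned_for_seq": state.seq_len}
    return att.sched_meta["planned_for_seq"]


def capture(real_seq_len, reset_after_warmup):
    state = InferState(real_seq_len)
    warmup = copy.copy(state)   # shallow copy: decode_att_state is shared
    warmup.seq_len = 2          # dummy warmup batch
    run_kernel(warmup)          # plans for seq=2, cached on the shared state
    if reset_after_warmup:
        state.decode_att_state.reset_sched_meta_for_capture()
    return run_kernel(state)    # stands in for the capture pass


stale = capture(real_seq_len=4096, reset_after_warmup=False)
fixed = capture(real_seq_len=4096, reset_after_warmup=True)
print(stale, fixed)
```

Without the reset, the capture pass reuses the plan made for the dummy batch. With it, planning runs again on the real state. The toy does not model CUDA graph replay itself, so it says nothing about re-planning from live tensors on each replay.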

Validation (tp4, H200, prompt cache on): batch-1 greedy decode is now
character-identical to eager; per-layer probe shows embed+swa layers bitwise
equal under replay, benign rounding-class deltas only in compress layers,
argmax unchanged. gsm8k 100q/128: cold 0.960/111s, warm 0.960/23.3s 100% hits
(eager: 0.95-0.97, cold 141s / warm 50s). Batch-1 decode 20.4ms/token vs 142ms
eager. 41/41 unit tests green. Codex review GO (incl. overlap-path symmetry).

launch.sh: drop --disable_cudagraph, derive PYTHONPATH from the script dir
(hardcoded tree path made a worktree launch silently serve main-tree code).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Graph-sandwich prefill (graphs capture dense ops only; attention/compressor
run eagerly between segments) was already in-tree; enabling it exposed that
HOLD-pad rows read the racing HOLD slot, making their hiddens nondeterministic
and perturbing real rows via MoE expert batching (ulp-level, amplified
~1.9x/layer). Zero the pad rows' attention output.
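As a rough illustration of "zero the pad rows' attention output", the sketch below masks padded rows of a small nested-list matrix. It assumes a per-row boolean `is_pad` flag and the helper name `zero_pad_rows`, both of which are hypothetical; the actual change would operate on GPU tensors and may identify pad rows differently.

```python
def zero_pad_rows(att_out, is_pad):
    """Return a copy of att_out with every padded row replaced by zeros."""
    return [
        [0.0] * len(row) if pad else list(row)
        for row, pad in zip(att_out, is_pad)
    ]


# Row 1 is a pad row whose contents are whatever the shared slot held.
att_out = [[0.5, -1.25], [3.0, 7.0], [0.1, 0.2]]
is_pad = [False, True, False]

cleaned = zero_pad_rows(att_out, is_pad)
print(cleaned)
```

Once the pad rows are forced to a fixed value, whatever they read from the racing slot no longer reaches later layers, which is the property the commit relies on.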

Residual greedy-trajectory divergence vs eager equals the fp4 marlin MoE
kernel's own run-to-run reduction-order noise (eager-vs-eager control: 0/4
match), accepted statistically: gsm8k 100q cold 0.980/115.5s warm 0.960/25.9s
(eager-baseline parity); batch-1 TTFT 1.86x at 46 tokens.
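Reduction-order noise of this kind comes from floating-point addition not being associative. A minimal double-precision example, unrelated to the marlin kernel itself:

```python
# The same three values summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(abs(left_to_right - right_to_left))
```

The two sums differ only in the last bits, but a greedy decoder can amplify such a difference into a different token whenever two logits are nearly tied. That is why the comparison against eager is statistical rather than trajectory-exact.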

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces comprehensive support for serving the DeepSeek-V4-Flash model in LightLLM. Key additions include a custom memory manager (DeepseekV4MemoryManager) and request manager (DeepseekV4ReqManager) to handle packed page-slab storage, sliding window attention (SWA) slot allocation, and compression slot preparation. It also implements the NsaFlashMlaFp8SparseAttBackend for FlashMLA sparse attention, adds FuseMoeMXFP4 for MXFP4 quantized MoE weights, and updates the Radix cache to support page-aligned prefix matching and SWA page reclamation. The review feedback highlights a runtime AttributeError in mxfp4_impl.py due to incorrect attribute access on self instead of self.moe_weight, and suggests parameterizing hardcoded absolute paths in launch.sh to enhance portability.

I am having trouble creating individual review comments, so my feedback is listed below.

lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/impl/mxfp4_impl.py (41)

high

The attribute n_routed_experts is accessed on self (which is an instance of FuseMoeMXFP4), but it is actually defined on the weight object self.moe_weight (an instance of FusedMoeWeight). Accessing it directly on self will raise an AttributeError at runtime. Please use self.moe_weight.n_routed_experts instead.

            global_num_experts=self.moe_weight.n_routed_experts,
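The failure described here can be reproduced with two minimal stand-in classes. They reuse the names from the review but are hypothetical: the real `FuseMoeMXFP4` and `FusedMoeWeight` have different constructors and many more members, and the method names below are invented for the illustration.

```python
class FusedMoeWeight:
    """Hypothetical minimal stand-in for the weight object."""

    def __init__(self, n_routed_experts):
        self.n_routed_experts = n_routed_experts


class FuseMoeMXFP4:
    """Hypothetical minimal stand-in for the MoE implementation."""

    def __init__(self, moe_weight):
        self.moe_weight = moe_weight

    def num_experts_buggy(self):
        # n_routed_experts is not defined on self, so this raises.
        return self.n_routed_experts

    def num_experts_fixed(self):
        # Read the attribute from the weight object that owns it.
        return self.moe_weight.n_routed_experts


impl = FuseMoeMXFP4(FusedMoeWeight(n_routed_experts=8))

try:
    impl.num_experts_buggy()
    raised = False
except AttributeError:
    raised = True

fixed_value = impl.num_experts_fixed()
print(raised, fixed_value)
```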

launch.sh (45-47)

medium

The paths /data/wanzihao/sglang/python and /data/models/DeepSeek-V4-Flash are hardcoded absolute paths specific to a user's environment. This makes the launch script non-portable for other users or environments. Consider using environment variables with these paths as defaults to improve portability and usability.

PYTHONPATH="${REPO_DIR}":"${SGLANG_PATH:-/data/wanzihao/sglang/python}" \
python -m lightllm.server.api_server \
  --model_dir "${MODEL_DIR:-/data/models/DeepSeek-V4-Flash}" \
