mtp: make speculation disable skip draft work#206
Open
abhay wants to merge 1 commit into
DS4_MTP_SPEC_DISABLE is meant to leave generation on the ordinary greedy path. It already bypassed speculative acceptance, but the session still ran MTP draft probes after each token and one-shot CLI generation still selected the session sampling path whenever MTP was configured. Honor the flag at those remaining decision points so disabled speculation does not pay draft-path work. Keep DS4_MTP_PROBE independent so diagnostic draft logging can still be requested explicitly.
Summary
DS4_MTP_SPEC_DISABLE should leave greedy generation on the ordinary non-speculative path. Before this change, the flag bypassed speculative acceptance, but two pieces of speculative plumbing still ran:

- one-shot CLI generation still selected the session sampling path whenever MTP was configured
- ds4_session_eval() still prepared MTP draft tokens after each decode step

This patch honors the flag at those remaining decision points. DS4_MTP_PROBE remains independent, so explicit draft-probe diagnostics still work without enabling speculative generation.
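The two decision points described above can be sketched as small predicates. This is only an illustration of the intended gating, not the code in this patch: the `mtp_flags` struct and all three function names are hypothetical, and the assumption that the CLI path depends only on whether speculation is active is mine.

```c
#include <stdbool.h>

/* Hypothetical flag bundle; the real ds4 structures are not shown here. */
typedef struct {
    bool mtp_configured; /* an MTP model was supplied, e.g. via --mtp */
    bool spec_disabled;  /* DS4_MTP_SPEC_DISABLE is set */
    bool probe_enabled;  /* DS4_MTP_PROBE is set */
} mtp_flags;

/* Speculation is active only when MTP is configured and not disabled. */
bool mtp_speculation_active(mtp_flags f) {
    return f.mtp_configured && !f.spec_disabled;
}

/* Session eval: prepare draft tokens after a decode step only if
 * speculation will consume them or a probe was explicitly requested. */
bool mtp_should_prepare_draft(mtp_flags f) {
    return f.mtp_configured &&
           (mtp_speculation_active(f) || f.probe_enabled);
}

/* One-shot CLI (assumed rule): take the MTP session path only when
 * speculation is active, otherwise stay on the ordinary greedy path. */
bool cli_use_mtp_session_path(mtp_flags f) {
    return mtp_speculation_active(f);
}
```

With MTP configured and only DS4_MTP_SPEC_DISABLE set, both gates return false, so no draft work runs. Setting DS4_MTP_PROBE as well re-enables draft preparation without re-enabling the speculative CLI path.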
Why This Is A Correctness Fix
The disable flag previously meant "do not accept speculative tokens" rather
than "do not run speculative work." That made the disabled mode surprising and
kept extra MTP work in a path meant to be ordinary greedy generation. After this
change, the flag has the expected behavior: configured MTP support can remain
loaded, but speculative generation work is skipped unless speculation or probe
diagnostics are explicitly enabled.
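To make the resulting modes concrete, the following sketch maps the two environment variables to one of three behaviors. It is an assumption-laden illustration: `env_flag_enabled`, `mtp_mode`, and `mtp_mode_from_flags` are hypothetical names, and the rule that a flag counts as set when its value is non-empty and not "0" is assumed, since the PR text does not say how ds4 parses these variables.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef enum {
    MTP_MODE_PLAIN_GREEDY, /* no speculative work at all */
    MTP_MODE_PROBE_ONLY,   /* draft probes for diagnostics, no acceptance */
    MTP_MODE_SPECULATIVE   /* full speculative generation */
} mtp_mode;

/* Assumed convention: set, non-empty, and not "0" means enabled. */
bool env_flag_enabled(const char *name) {
    const char *v = getenv(name);
    return v != NULL && v[0] != '\0' && strcmp(v, "0") != 0;
}

/* Classify the mode from already-parsed flags. */
mtp_mode mtp_mode_from_flags(bool mtp_loaded, bool spec_disabled,
                             bool probe_enabled) {
    if (!mtp_loaded) return MTP_MODE_PLAIN_GREEDY;
    if (!spec_disabled) return MTP_MODE_SPECULATIVE;
    return probe_enabled ? MTP_MODE_PROBE_ONLY : MTP_MODE_PLAIN_GREEDY;
}

/* Convenience wrapper that reads the two variables named in this PR. */
mtp_mode mtp_mode_from_env(bool mtp_loaded) {
    return mtp_mode_from_flags(mtp_loaded,
                               env_flag_enabled("DS4_MTP_SPEC_DISABLE"),
                               env_flag_enabled("DS4_MTP_PROBE"));
}
```

Under this reading, the pre-patch behavior with the disable flag set resembled the probe-only mode even when no probe was requested. After the patch, it corresponds to plain greedy generation.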
Testing
Machine/backend/model:

- DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
- DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf

Commands run on this branch:

    make clean
    make
    make test

make passed. make test ran with Metal access. The long-context, tool-call-quality, metal-kernels, and server tracks passed. The logprob-vectors track failed on long_memory_archive with 7 failures.

I checked the same logprob-vector track on main. It failed the same long_memory_archive vector on main with the same 7 failures, so this branch does not appear to introduce that regression. That pre-existing vector failure is addressed separately in #204, which runs the official logprob-vector check through the quality path.
Targeted Disabled-MTP Smoke Check
I also compared the exact affected mode against main:

    env DS4_MTP_SPEC_DISABLE=1 ./ds4 \
      -m /Users/abhay/Dev/ds4/ds4flash.gguf \
      --mtp /Users/abhay/Dev/ds4/gguf/DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf \
      --mtp-draft 2 \
      --temp 0 \
      --nothink \
      --ctx 1024 \
      -n 32 \
      -p 'Write one concise sentence about testing.'

Both main and this branch produced the same greedy output. Generation speed in this narrow smoke check was:

- main: 35.67 t/s

This is not a broad throughput claim. It only shows that the disabled-MTP mode now avoids the extra draft-path overhead this patch removes.