feat(scheduler): ratio-based decode reservation, chunked prefill and retract preemption #7038
Open
Foriv wants to merge 14 commits into PaddlePaddle:develop from
Conversation
…ned token-ratio mechanism
- Remove can_relax_prefill_strategy, reserve_output_block_num and the fixed block reservation system
- Add dynamic current_new_token_ratio with init/decay/preemption-reset (SGLang-aligned)
- Prefill threshold: use current chunk blocks only, not the full remaining prefill
- Remove cross-chunk prefill block reservation (items 5 & 6 in threshold)
- Merge token_budget into rem_input_tokens (= FD_REM_INPUT_TOKENS - running_decode_count)
- Dual budget: rem_chunk_tokens (8192) + rem_input_tokens (16384 - decode_count)
- _get_num_new_tokens: min(remaining, rem_chunk_tokens, rem_input_tokens) with floor alignment
- _trigger_preempt: SGLang retract_decode (shortest output first, evict prefix cache)
- envs.py: add FD_INIT_NEW_TOKEN_RATIO, FD_MIN_NEW_TOKEN_RATIO_FACTOR, FD_NEW_TOKEN_RATIO_DECAY_STEPS, FD_RETRACT_DECODE_STEPS, FD_CLIP_MAX_NEW_TOKENS_ESTIMATION; remove 7 obsolete vars
…unded prefill admission
- Move cached_running_decode_reserved and scheduled_new_decode_reserved_tokens initialization before the RUNNING loop so both loops share the same reservation state.
- Add a _get_can_schedule_prefill_threshold_block check in the RUNNING loop's prefill path, aligned with the WAITING loop, to enforce the decode reservation constraint on chunked prefill continuation.
- Track the last-chunk decode reservation in the RUNNING loop and propagate it to the WAITING loop.
…chitecture
- RUNNING loop: continue only ONE chunked prefill, then break, instead of iterating over all running prefills sharing the rem_chunk_tokens budget (which caused fragmentation: 8 prefills each getting ~1024 tokens instead of one getting the full 8192).
- WAITING loop: split token consumption. Chunked (non-last) requests consume rem_chunk_tokens and trigger a break after one admit; non-chunked (last-chunk/short) requests consume rem_input_tokens only and continue to be admitted. Applied to both the WAITING and PREEMPTED branches.
- Remove the threshold check from the RUNNING loop (no longer needed with the single-prefill break).
- Move cached_running_decode_reserved back to the WAITING section initialization.
- Remove prealloc_dec_block_slot_num_threshold and enc_dec_block_num
- Allocate exactly 1 block, and only when the current block is exhausted
- Evict decode KV cache before preemption
- Self-preempt to avoid livelock when only 1 request remains
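The one-block-on-exhaustion rule can be sketched as a tiny predicate. This is a hypothetical helper for illustration only; `BLOCK_SIZE` and the function name are assumptions, not the actual FastDeploy code:

```python
BLOCK_SIZE = 64  # assumed KV-cache block size


def needs_new_block(seq_len: int) -> bool:
    """A decode step appends one token; a fresh block is needed only when
    every slot of the current last block is already occupied, i.e. the
    sequence length is an exact multiple of the block size."""
    return seq_len % BLOCK_SIZE == 0
```

Compared with pre-allocating a batch of decode blocks up front, this keeps at most one speculative block per request, at the cost of an allocation check every step.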
Decay the ratio only when all of the following conditions are met:
- has_decode_requests: decode requests are running
- not has_running_prefill: no chunked prefill in progress in the running queue
- not self.waiting: the waiting queue is empty (no new prefill admitted)
- not has_scheduled_prefill: no prefill was scheduled this round
- not preempted_reqs: no preemption occurred this round

This aligns with SGLang's is_extend_mode = has_running_prefill or bool(waiting), ensuring the ratio stays high while prefill is active (running or waiting).
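As a minimal sketch, the gate above collapses into a single predicate. Names mirror the bullet list; the function itself is an illustration, not the actual implementation:

```python
def should_decay_new_token_ratio(has_decode_requests: bool,
                                 has_running_prefill: bool,
                                 waiting: list,
                                 has_scheduled_prefill: bool,
                                 preempted_reqs: list) -> bool:
    """Decay only in a pure decode-only window: decode requests are running
    and no prefill activity (running, waiting, or scheduled this round) and
    no preemption occurred this round."""
    return (has_decode_requests
            and not has_running_prefill
            and not waiting
            and not has_scheduled_prefill
            and not preempted_reqs)
```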
- Add a has_scheduled_running_prefill guard in the RUNNING loop to prevent multiple in-flight prefill requests from being scheduled in one step
- Gate the WAITING loop on not has_scheduled_running_prefill so new prefill admission is skipped when an in-flight prefill chunk is scheduled
- Use an independent rem_chunk_tokens_waiting budget for the WAITING loop so RUNNING in-flight prefill consumption cannot starve new requests
- Replace break with req_index += 1 after in-flight prefill so decode requests can continue to be scheduled in the same step
- Priority reorder: move the in-flight prefill to self.running[0] each step
- Fix _fetch_request to use max_prefill_batch directly instead of available_batch(), which returns 0 during GPU execution
- Split the #running-req log field into #running-decode and #running-prefill
- Fix the decode log's #token field to show actual tokens generated per step instead of KV cache block-granularity usage (which jumped in steps of block_size=64)
- Remove the unused tokens_used parameter from both log methods
- Add an iteration summary: decode-only windows are now healthy, with 540+ tok/s throughput
- Document 4 remaining issues: dead reset_new_token_ratio_on_idle(), active_chunked_prefill_req missing in preempted_all()/wait_worker_inflight_requests_finish(), coarse last-chunk judgment, and missing prefill-side priority preemption
- Prioritize fixing issues 1 and 2 in the next iteration
The max_new variable was already clipped via min() where it was defined, but the scheduled_new_decode_reserved_tokens accumulation applied min() again, causing double clipping. Changed to simply add max_new.
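A sketch of the fixed accumulation, assuming the clip bound comes from FD_CLIP_MAX_NEW_TOKENS_ESTIMATION; the surrounding variable names are taken from the description, the function boundary is hypothetical:

```python
CLIP_MAX_NEW_TOKENS_ESTIMATION = 4096  # env-var default


def accumulate_reserved(scheduled_new_decode_reserved_tokens: int,
                        max_tokens: int, generated: int) -> int:
    """max_new is clipped exactly once, at the point where it is defined;
    the running reservation total then adds it with no further min()."""
    max_new = min(max_tokens - generated, CLIP_MAX_NEW_TOKENS_ESTIMATION)
    return scheduled_new_decode_reserved_tokens + max_new
```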
Thanks for your contribution!
Scheduler Refactor (ResourceManagerV1): Ratio-based Decode Reservation, Chunked Prefill, and Retract-style Preemption
Changed Files
- fastdeploy/engine/sched/resource_manager_v1.py
- fastdeploy/envs.py
- fastdeploy/scheduler/local_scheduler.py
- fastdeploy/engine/sched/scheduler_metrics_logger.py

1) resource_manager_v1.py: Core Implementation Changes

1.1 Introduce active_chunked_prefill_req (single active chunked prefill)
- self.active_chunked_prefill_req: Request | None = None
- _num_active_running_requests() (counts active_chunked_prefill_req into running metrics)
- _ensure_request_slot_allocated(request) (ensures tasks_list/stop_flags/req_dict slot correctness in mixed mode)
- Wire active_chunked_prefill_req handling into lifecycle paths: preempted_all(), wait_worker_inflight_requests_finish(), pre_recycle_resource(), finish_requests(), clear_data()
- update_metrics() and log_status() (running/waiting counters updated)

1.2 Decode reservation: replace fixed block reservation with new_token_ratio (ratio-based)
- Removed (the reserve_output_block_num family): init_reserve_output_block_num / decay_output_block_num / min_reserve_output_block_num, current_reserve_output_block_num / current_reserve_output_block_num_float, can_relax_prefill_strategy
- Added: init_new_token_ratio, min_new_token_ratio, new_token_ratio_decay, current_new_token_ratio, clip_max_new_tokens_estimation
- _calculate_decode_reserved_tokens_by_ratio() (ratio-based reserved-token estimation for running decode-phase requests)
- _calculate_decode_reserved_tokens_for_new_requests(new_decode_reserved_tokens) (per-cycle accumulation for newly admitted last-chunk requests)
- _update_new_token_ratio_after_preemption() (update the ratio after preemption based on the remaining decode state)
- reset_new_token_ratio_on_idle() (reset the ratio back to init when the system is fully idle)
- _get_can_schedule_prefill_threshold_block(...) signature extended with: is_last_chunk, new_decode_reserved_tokens, cached_running_decode_reserved
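A plausible sketch of the ratio-based estimation, assuming the reservation is ratio times a clipped estimate of each decode request's remaining output; field names and the exact formula are assumptions, not the actual code of _calculate_decode_reserved_tokens_by_ratio():

```python
def decode_reserved_tokens_by_ratio(running_decode_reqs,
                                    current_new_token_ratio: float,
                                    clip_max_new_tokens_estimation: int = 4096) -> int:
    """Instead of reserving a fixed number of blocks per decode request,
    reserve ratio * (clipped estimate of tokens still to be generated),
    summed over all running decode-phase requests."""
    reserved = 0
    for req in running_decode_reqs:
        remaining = min(req["max_tokens"] - len(req["output_token_ids"]),
                        clip_max_new_tokens_estimation)
        reserved += int(remaining * current_new_token_ratio)
    return reserved
```

A high ratio is pessimistic (decodes claim most of their worst-case KV space, so prefill admission is conservative); as the ratio decays toward the floor, more space is released to prefill.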
1.3 Chunked prefill scheduling and budgets (SGLang-aligned)
- chunked_prefill_size = envs.FD_CHUNKED_PREFILL_SIZE
- rem_chunk_tokens = chunked_prefill_size
- rem_input_tokens = envs.FD_REM_INPUT_TOKENS - running_decode_count
- _get_paged_prefill_tokens(num_new_tokens) (aligns the prefill budget to block_size)
- _is_last_prefill_chunk(request, num_new_tokens)
- _get_num_new_tokens(...): signature changed from (request, token_budget) to (request, rem_chunk_tokens, rem_input_tokens, existing_prefill_in_batch, ignore_rem_input_budget); consumes rem_chunk_tokens with block-aligned chunk sizing; rem_input_tokens gating (stricter once a prefill chunk is already admitted in the same cycle)
- schedule() main flow changes: if a chunked prefill exists in running, migrate one into active_chunked_prefill_req (enforce a single active unfinished chunked prefill); schedule active_chunked_prefill_req first (if the admission threshold passes); then admit from waiting under the shared rem_chunk_tokens / rem_input_tokens budgets; if a waiting request becomes a non-last chunk (i.e., chunked), admit it and break (at most one newly chunked waiting request per cycle)
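The dual-budget chunk sizing can be sketched as follows. This is a simplification of _get_num_new_tokens; BLOCK_SIZE and the exact floor-alignment rule are assumptions:

```python
BLOCK_SIZE = 64  # assumed KV-cache block size


def get_num_new_tokens(remaining_prefill: int,
                       rem_chunk_tokens: int,
                       rem_input_tokens: int) -> int:
    """Take the minimum of the request's remaining prefill tokens and both
    per-cycle budgets; if the result does not cover the whole remaining
    prefill (i.e. this is not the last chunk), floor-align it to block
    granularity so chunks never end mid-block."""
    n = min(remaining_prefill, rem_chunk_tokens, rem_input_tokens)
    if n < remaining_prefill:
        n = (n // BLOCK_SIZE) * BLOCK_SIZE
    return n
```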
1.4 Preemption refactor: retract-style decode preemption + eviction
- _trigger_preempt(): decode requests are sorted by (len(output_token_ids), -prompt_token_ids_len) and popped
- _evict_decode_kv_cache(remaining_req_count)
- preempted_req.is_retracted = True
- _update_new_token_ratio_after_preemption() is invoked
- preempted_all() extended to include active_chunked_prefill_req together with running
- use_extend_tables requests are reinserted into active_chunked_prefill_req or running as appropriate
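The retract ordering above amounts to a single sort key. A sketch with assumed request fields, not the actual _trigger_preempt() code:

```python
def retract_order(running_decode_reqs):
    """SGLang-style retract_decode ordering: preempt the request with the
    shortest output first (least progress lost on retraction); ties are
    broken toward the longer prompt (hence the negated prompt length),
    which frees more KV cache per retraction."""
    return sorted(running_decode_reqs,
                  key=lambda r: (len(r["output_token_ids"]),
                                 -r["prompt_token_ids_len"]))
```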
1.5 Scheduling order and decode scheduling gate
- New helper _schedule_decode_requests() handles decode-phase scheduling and block allocation/extend logic
- current_new_token_ratio decays linearly
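The linear decay can be sketched with the env-var defaults listed below in section 2; the per-step formula is an assumption modeled on SGLang's schedule, not the verified FastDeploy implementation:

```python
def step_new_token_ratio(current: float,
                         init_ratio: float = 0.7,    # FD_INIT_NEW_TOKEN_RATIO
                         min_factor: float = 0.30,   # FD_MIN_NEW_TOKEN_RATIO_FACTOR
                         decay_steps: int = 600      # FD_NEW_TOKEN_RATIO_DECAY_STEPS
                         ) -> float:
    """Decay linearly from init_ratio down to init_ratio * min_factor over
    decay_steps scheduler steps, then hold at the floor (the ratio is reset
    back to init_ratio on preemption or when the system goes idle)."""
    floor = init_ratio * min_factor
    step = (init_ratio - floor) / decay_steps
    return max(current - step, floor)
```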
1.6 Queue behavior change for rescheduled (preempted) requests
- reschedule_preempt_task(): changed from waiting.appendleft(request) to waiting.append(request) (FIFO append)
2) envs.py: Environment Variables Update

Removed:
- FD_RESERVE_OUTPUT_BLOCK_NUM_FOR_DECODE_WHEN_SCHEDULE_NEW_PREFILL
- FD_RESERVE_DECAY_OUTPUT_BLOCK_NUM_FOR_DECODE_WHEN_SCHEDULE_NEW_PREFILL
- FD_RESERVE_MIN_OUTPUT_BLOCK_NUM_FOR_DECODE_WHEN_SCHEDULE_NEW_PREFILL

Added:
- FD_INIT_NEW_TOKEN_RATIO (default 0.7)
- FD_MIN_NEW_TOKEN_RATIO_FACTOR (default 0.30)
- FD_NEW_TOKEN_RATIO_DECAY_STEPS (default 600)
- FD_RETRACT_DECODE_STEPS (default 20)
- FD_CLIP_MAX_NEW_TOKENS_ESTIMATION (default 4096)
- FD_CHUNKED_PREFILL_SIZE (default 8192)
- FD_REM_INPUT_TOKENS (default 16384)
3) local_scheduler.py: Remove block-accumulation admission filtering

In get_requests():
- Removed the required_total_blocks / current_prefill_tokens accumulation and the available_blocks break logic
- Requests are now appended directly (requests.append(request.raw)), leaving admission control and chunking to resource_manager_v1
4) scheduler_metrics_logger.py: Decode logging interval change
- DEFAULT_DECODE_LOG_INTERVAL: 5 -> 1