[Cherry-Pick][BugFix] Add lock to avoid generating nan (#7046) #7047
juncaipeng wants to merge 1 commit into PaddlePaddle:release/2.5
Pull request overview
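This PR adds a GPU KVCache mutex on the read and write paths that use the storage cache, to avoid NaN results caused by concurrent access with the worker. It also adjusts the warmup behavior of MooncakeStore.
Changes:
- In `PrefixCacheManager`, add a KVCache lock to the flow that issues storage prefetch / write-back tasks (and restrict that flow to synchronous mode).
- Adjust the warmup write size in `MooncakeStore.warmup()` and change how the warmup key is cleaned up.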
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
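| fastdeploy/cache_manager/prefix_cache_manager.py | Adds a KVCache mutex to the issue/wait flow of storage read and write tasks, reducing the risk of NaN caused by concurrent access |
| fastdeploy/cache_manager/transfer_factory/mooncake_store/mooncake_store.py | Adjusts the data size and cleanup strategy of the warmup performed during MooncakeStore initialization |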
Comments suppressed due to low confidence (1)
fastdeploy/cache_manager/prefix_cache_manager.py:1152
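- After gpu_cache_lock is acquired here, a keys/block_ids length mismatch raises ValueError but the lock is never released, which can block subsequent worker/transfer processes forever. Consider moving the logic that follows the acquire into try/finally so that _release_kvcache_lock() is guaranteed to run in finally.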
assert is_sync, "Only support is_sync=True for now."
self._acquire_kvcache_lock()
if len(task.keys) != len(task.gpu_block_ids):
    err_msg = (
        f"write_back_storage error: hash_keys({len(task.keys)}) != gpu_block_ids({len(task.gpu_block_ids)})"
    )
    logger.error(err_msg)
    raise ValueError(err_msg)
assert is_sync, "Only support is_sync=True for now."
self._acquire_kvcache_lock()

storage_block_ids = []
self.task_prefetch_event[task.task_id] = Event()
# issue task to cache_transfer_manager
self.cache_task_queue.put_transfer_task((CacheStatus.STORAGE2GPU, task))
if is_sync:
    storage_block_ids = self.wait_prefetch_storage_task(task.task_id)

self._release_kvcache_lock()
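issue_prefetch_storage_task has no try/finally protection between acquiring and releasing the lock. If put_transfer_task, wait_prefetch_storage_task, or anything else in between raises, the lock is never released and a deadlock follows. Wrap that section in try/finally and release the lock in finally.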
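A minimal sketch of the try/finally shape suggested here. The class `PrefetchSketch`, the method `issue_task`, and the use of a plain `threading.Lock` are all hypothetical stand-ins for illustration; the real `gpu_cache_lock` in FastDeploy may be a different kind of lock, and only the names `_acquire_kvcache_lock` / `_release_kvcache_lock` are taken from the diff.

```python
import threading


class PrefetchSketch:
    """Hypothetical stand-in for the manager; only the locking shape matters."""

    def __init__(self):
        # Assumption: a plain threading.Lock stands in for the real gpu_cache_lock.
        self.gpu_cache_lock = threading.Lock()

    def _acquire_kvcache_lock(self):
        self.gpu_cache_lock.acquire()

    def _release_kvcache_lock(self):
        self.gpu_cache_lock.release()

    def issue_task(self, do_transfer):
        """Run do_transfer() while holding the lock and always release afterwards."""
        self._acquire_kvcache_lock()
        try:
            return do_transfer()
        finally:
            # Runs on both the normal path and the exception path.
            self._release_kvcache_lock()


def failing_transfer():
    # Simulates the length-mismatch error raised while the lock is held.
    raise ValueError("simulated length mismatch")


sketch = PrefetchSketch()
try:
    sketch.issue_task(failing_transfer)
except ValueError:
    pass

# The lock must be free again even though the transfer raised.
lock_free_after_error = not sketch.gpu_cache_lock.locked()

# A later task can still take the lock, so there is no deadlock.
result = sketch.issue_task(lambda: [1, 2, 3])
```

With this shape the validation checks can stay after the acquire, because any exception they raise still passes through `finally`.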
assert is_sync, "Only support is_sync=True for now."
self._acquire_kvcache_lock()
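The new KVCache lock logic currently has no unit-test coverage, so exception paths (for example a length-mismatch error or a transfer-queue failure) can easily regress into "lock not released, deadlock". Consider adding unit tests that mock gpu_cache_lock, assert that acquire and release occur in pairs, and check that release also happens on the exception branches.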
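One possible shape for such a test, sketched with `unittest.mock`. `ManagerSketch` and `issue_prefetch` are hypothetical and do not reproduce `PrefixCacheManager`; the sketch assumes the lock exposes `acquire()` / `release()` and that the queue exposes `put_transfer_task`, as the diff suggests.

```python
from unittest import mock


class ManagerSketch:
    """Hypothetical manager that guards a queue call with an injected lock."""

    def __init__(self, gpu_cache_lock, cache_task_queue):
        self.gpu_cache_lock = gpu_cache_lock
        self.cache_task_queue = cache_task_queue

    def issue_prefetch(self, task):
        self.gpu_cache_lock.acquire()
        try:
            self.cache_task_queue.put_transfer_task(task)
        finally:
            self.gpu_cache_lock.release()


def check_release_on_success():
    lock, queue = mock.MagicMock(), mock.MagicMock()
    ManagerSketch(lock, queue).issue_prefetch("task")
    # acquire and release are each called exactly once on the normal path.
    lock.acquire.assert_called_once_with()
    lock.release.assert_called_once_with()
    return True


def check_release_on_error():
    lock, queue = mock.MagicMock(), mock.MagicMock()
    # Simulate a transfer-queue failure while the lock is held.
    queue.put_transfer_task.side_effect = RuntimeError("simulated queue failure")
    raised = False
    try:
        ManagerSketch(lock, queue).issue_prefetch("task")
    except RuntimeError:
        raised = True
    assert raised
    # The lock is still released exactly once on the exception path.
    lock.acquire.assert_called_once_with()
    lock.release.assert_called_once_with()
    return True
```

In the real test suite the same assertions would be made against the actual manager with its lock patched, rather than against a stand-in class.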
fastdeploy/cache_manager/transfer_factory/mooncake_store/mooncake_store.py
a29803d to c49898e
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/2.5 #7047 +/- ##
==============================================
Coverage ? 68.34%
==============================================
Files ? 390
Lines ? 54078
Branches ? 8519
==============================================
Hits ? 36958
Misses ? 14437
Partials ? 2683
Motivation
Add a lock to avoid generating NaN when using the storage cache.
Modifications
fastdeploy/cache_manager/prefix_cache_manager.py
Usage or Command
none
Accuracy Tests
none
Checklist
- Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.