[BugFix] Add lock to avoid generating nan when using storage cache#7046
[BugFix] Add lock to avoid generating nan when using storage cache#7046juncaipeng wants to merge 2 commits intoPaddlePaddle:developfrom
Conversation
|
Thanks for your contribution! |
There was a problem hiding this comment.
Pull request overview
该 PR 旨在通过对 KVCache storage cache 的读/写回任务加锁,避免在特定并发访问场景下出现 NaN(通过 FD_USE_KVCACHE_LOCK 控制启用)。
Changes:
- 在
issue_write_back_storage_task中对写回 storage 的流程加锁,并限制当前仅支持同步模式。 - 在
issue_prefetch_storage_task中对从 storage 预取到 GPU 的流程加锁,并限制当前仅支持同步模式。
Comments suppressed due to low confidence (1)
fastdeploy/cache_manager/prefix_cache_manager.py:1260
- 这里在 acquire lock 之后如果触发后续异常(例如 keys/gpu_block_ids 长度不一致时 raise ValueError,或 put_transfer_task 抛异常),会导致锁永远不释放,后续 worker 可能被永久阻塞。建议把参数校验放到加锁之前,并用 try/finally(或 contextmanager)保证 _release_kvcache_lock() 一定执行。
assert is_sync, "Only support is_sync=True for now."
self._acquire_kvcache_lock()
if len(task.keys) != len(task.gpu_block_ids):
err_msg = (
f"write_back_storage error: hash_keys({len(task.keys)}) != gpu_block_ids({len(task.gpu_block_ids)})"
)
logger.error(err_msg)
raise ValueError(err_msg)
| assert is_sync, "Only support is_sync=True for now." | ||
| self._acquire_kvcache_lock() | ||
|
|
There was a problem hiding this comment.
这里用 assert 来限制 is_sync=True:1) Python -O 会跳过 assert,约束不生效;2) 现有单测 tests/cache_manager/test_prefix_cache_manager.py 覆盖了 is_sync=False 的分支并期望抛 ValueError(key/gpu_block_ids 不匹配),这次改动会直接变成 AssertionError 导致用例失败。建议改为显式的运行时检查并抛出 NotImplementedError/ValueError,或保留对 is_sync=False 的兼容实现并同步更新单测。
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #7046 +/- ##
==========================================
Coverage ? 72.99%
==========================================
Files ? 399
Lines ? 56419
Branches ? 8919
==========================================
Hits ? 41183
Misses ? 12309
Partials ? 2927
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Motivation
Add lock to avoid generating nan when using storage cache
Modifications
fastdeploy/cache_manager/prefix_cache_manager.py
Usage or Command
none
Accuracy Tests
none
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.