Fully fix the memory leaks in FunASR itself (what remains is a small amount of memory growth caused by ORT's internal working set and the heap not returning pages to the kernel immediately) #2842
liuyt6515 wants to merge 1 commit into modelscope:main
Code Review
This pull request introduces a memory tracing framework and several memory management optimizations, including switching to the ONNX Runtime device allocator to reduce arena retention, using smart pointers, and explicitly releasing vector memory via shrink_to_fit and malloc_trim. Feedback identifies a hardcoded local path in the tracing utility and highlights critical performance bottlenecks caused by synchronous file I/O in the logging system and the excessive use of malloc_trim within high-frequency processing loops.
    FILE* fp = std::fopen(e, "a");
    if (fp) return fp;
  }
  FILE* fp2 = std::fopen("/data/project/liuyuntao/FUN_ASR/.cursor/debug-08e994.log", "a");

  std::lock_guard<std::mutex> lk(g_memtrace_mu);
  FILE* fp = OpenMemtraceLog();
  if (!fp) return;
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::system_clock::now().time_since_epoch())
                .count();
  std::fprintf(fp,
               "{\"sessionId\":\"08e994\",\"hypothesisId\":\"%s\",\"location\":\"memtrace\",\"message\":\"%s\","
               "\"data\":{\"trace_id\":%lld,\"rss_kb\":%llu,\"delta_from_decoder_start_kb\":%lld,"
               "\"data_a\":%lld,\"data_b\":%lld},\"timestamp\":%lld}\n",
               hypothesisId, phase, (long long)trace_id, (unsigned long long)rss, delta, data_a, data_b,
               (long long)ms);
  std::fclose(fp);
The logging implementation in MemtraceLog is a major performance bottleneck and likely contributes significantly to the reported 2.5x slowdown:

1. Global Serialization: The use of g_memtrace_mu serializes all logging across all threads, which is a significant bottleneck in a concurrent server environment.
2. File I/O Overhead: Opening and closing the log file on every single log entry is extremely expensive.
3. Frequent System Calls: ReadVmRSSKb (called at line 66) performs file I/O on /proc/self/status every time a log is generated.

Consider using an asynchronous logging approach where logs are buffered and written by a background thread, or keep the file handle open if debugging is enabled.
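The asynchronous approach suggested above could look roughly like the following. This is a minimal sketch, not code from the PR: `AsyncLineLogger` and its members are hypothetical names, and the class assumes that losing lines on a crash is acceptable for a debug trace. Callers only enqueue a string under a short lock; a single background thread writes batches through a file handle that stays open.

```cpp
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical helper (not part of the PR): buffers log lines in memory and
// lets one background thread write them through a handle that stays open.
class AsyncLineLogger {
 public:
  explicit AsyncLineLogger(const std::string& path)
      : fp_(std::fopen(path.c_str(), "a")), worker_([this] { Run(); }) {}

  ~AsyncLineLogger() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stop_ = true;
    }
    cv_.notify_one();
    worker_.join();  // the worker drains the queue before it exits
    if (fp_) std::fclose(fp_);
  }

  AsyncLineLogger(const AsyncLineLogger&) = delete;
  AsyncLineLogger& operator=(const AsyncLineLogger&) = delete;

  // Called from hot paths: takes the lock only long enough to enqueue.
  void Log(std::string line) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      queue_.push_back(std::move(line));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lk(mu_);
    for (;;) {
      cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
      std::deque<std::string> batch;
      batch.swap(queue_);  // take everything queued so far
      const bool stopping = stop_;
      lk.unlock();  // do the file I/O without holding the lock
      if (fp_) {
        for (const std::string& s : batch) std::fprintf(fp_, "%s\n", s.c_str());
        std::fflush(fp_);
      }
      if (stopping) return;
      lk.lock();
    }
  }

  FILE* fp_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::string> queue_;
  bool stop_ = false;
  std::thread worker_;  // declared last so it starts after the other members
};
```

With this shape, the cost on the decoding thread is one string move and one short critical section per record. Reading /proc/self/status would still happen on the caller's side, so sampling RSS less often remains a separate decision.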
#if defined(__linux__)
      malloc_trim(0);
      // #region agent log
      funasr::MemtraceLog("ws_after_chunk_trim", "WS4t", mem_sess.trace_id(), 0, 0);
      // #endregion
#endif
Executing malloc_trim(0) inside the chunk processing loop (while (buffer.size() >= 800 * 2 ...)) is extremely expensive and is likely the primary cause of the 2.5x performance drop mentioned in the PR description. malloc_trim forces the allocator to scan all arenas and release free pages back to the OS, which is a high-latency operation.

It should be removed from the chunk loop and only called at the end of an utterance (when is_final is true) or at a much lower frequency to balance memory footprint and throughput.
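One way to follow this advice is to gate the trim behind a small policy object. The sketch below is an assumption-laden illustration rather than the PR's code: `TrimPolicy`, `MaybeTrim`, and the minimum interval are hypothetical, and the right interval would have to be measured. It trims only on final results and never more often than the configured interval.

```cpp
#include <chrono>
#if defined(__linux__) && defined(__GLIBC__)
#include <malloc.h>  // malloc_trim is a glibc extension
#endif

// Hypothetical policy: decides when returning free heap pages to the OS is
// worth the latency cost.
class TrimPolicy {
 public:
  explicit TrimPolicy(std::chrono::milliseconds min_interval)
      : min_interval_(min_interval) {}

  // True only when the utterance is final and the previous trim is at least
  // min_interval_ old. Records the trim time when it returns true.
  bool ShouldTrim(bool is_final, std::chrono::steady_clock::time_point now) {
    if (!is_final) return false;
    if (has_trimmed_ && now - last_trim_ < min_interval_) return false;
    last_trim_ = now;
    has_trimmed_ = true;
    return true;
  }

 private:
  std::chrono::milliseconds min_interval_;
  std::chrono::steady_clock::time_point last_trim_{};
  bool has_trimmed_ = false;
};

// Call once per processed result; chunk results never trigger a trim.
inline void MaybeTrim(TrimPolicy& policy, bool is_final) {
  if (!policy.ShouldTrim(is_final, std::chrono::steady_clock::now())) return;
#if defined(__linux__) && defined(__GLIBC__)
  malloc_trim(0);
#endif
}
```

In a multi-connection server the policy object would need to be shared and protected by a lock, or kept per thread, because `ShouldTrim` mutates state.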
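Background and goals
The core goal is not to "blindly drive RSS down". It is to first use observability to localize memory behavior along the 2pass WebSocket pipeline stage by stage, and then apply release and structural optimizations to the high-risk paths. The key scenarios are:
● A single process serving many WebSocket connections (plus comparisons under load tests at different concurrency levels)
● How VmRSS changes at three kinds of moments: chunk, final, and connection close
● Distinguishing a "real leak" from "RSS that does not fall back because of allocator/runtime caching"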
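Main remediation measures
2.1 An end-to-end memory instrumentation system (NDJSON + trace_id)
V4 adds a unified memory tracing facility. It assigns a trace_id to each do_decoder call and records rss_kb at key stages, along with the delta relative to the starting point:
● Entry point and basic facilities: runtime/onnxruntime/include/memtrace.h, runtime/onnxruntime/src/memtrace.cpp
● WebSocket per-stage instrumentation: runtime/websocket/bin/websocket-server-2pass.cpp
● 2pass processing-chain instrumentation: runtime/onnxruntime/src/funasrruntime.cpp
● Instrumentation of key stages in the online/offline models: runtime/onnxruntime/src/paraformer-online.cpp, runtime/onnxruntime/src/paraformer.cpp
Value: the observation that "the container looks like it keeps growing" is broken down into concrete stages (online encoder, offline forward, result handling, post-trim state, and so on), which avoids misdiagnosis.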
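To make the idea of one NDJSON record per stage concrete, here is a minimal sketch of sampling VmRSS and formatting a record. It is not the PR's memtrace.cpp: the function bodies are illustrative, `ParseVmRSSKb` and `FormatMemtraceLine` are hypothetical names, and only the field names `trace_id`, `rss_kb`, and `delta_from_decoder_start_kb` are taken from the diff shown in the review.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Parses the "VmRSS:   12345 kB" line from /proc/self/status-style text.
// Returns -1 when the field is missing or malformed.
long long ParseVmRSSKb(const std::string& status_text) {
  std::istringstream in(status_text);
  std::string line;
  while (std::getline(in, line)) {
    if (line.compare(0, 6, "VmRSS:") == 0) {
      std::istringstream fields(line.substr(6));
      long long kb = -1;
      fields >> kb;
      return fields ? kb : -1;
    }
  }
  return -1;
}

// Reads the current process RSS on Linux; returns -1 elsewhere or on error.
long long ReadVmRSSKb() {
  std::ifstream f("/proc/self/status");
  if (!f) return -1;
  std::ostringstream buf;
  buf << f.rdbuf();
  return ParseVmRSSKb(buf.str());
}

// Formats one NDJSON record. Assumes `phase` is a plain identifier that
// needs no JSON escaping.
std::string FormatMemtraceLine(const std::string& phase, long long trace_id,
                               long long rss_kb, long long start_rss_kb) {
  std::ostringstream out;
  out << "{\"phase\":\"" << phase << "\",\"trace_id\":" << trace_id
      << ",\"rss_kb\":" << rss_kb
      << ",\"delta_from_decoder_start_kb\":" << (rss_kb - start_rss_kb) << "}";
  return out.str();
}
```

Keeping the parser separate from the file read makes it testable without /proc, and the delta field is what lets a reader compare stages within one trace_id without knowing the absolute baseline.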
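2.2 Explicit release and trim on the chunk/final paths
In the WebSocket 2pass service path, a uniform release step now runs after each chunk result and each final result is processed (freeing the result object, plus malloc_trim on Linux), with trace points to observe the effect:
● File: runtime/websocket/bin/websocket-server-2pass.cpp
Result: RSS visibly drops after chunk/final, and the first round after a cold start often shows a substantial reclaim. It is not guaranteed to return to the starting point every time, which is expected behavior with glibc and ORT.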
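2.3 Explicit state reset when a 2pass session ends (input_finished)
In the input_finished (final) path, session-level reset/end-utterance calls are added for the online state and the decoder, so that state does not accumulate across sentences or segments:
● File: runtime/onnxruntime/src/funasrruntime.cpp
This covers the online ASR/VAD cache reset, ending the utterance on the offline model, resetting WFST-related session state, and so on.
Result: "stateful" memory held after final is under control, and most samples on the final path show clear reclamation.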
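2.4 Temporary-memory handling on the offline Paraformer path
On the offline Forward path, V4 applies two kinds of handling:
● File: runtime/onnxruntime/src/paraformer.cpp
Result: the final path's pattern of "a high peak that can fall back" is more pronounced, and in most cases the residue can be pushed down to a low level.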
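The review summary mentions releasing vector memory explicitly via shrink_to_fit. As a generic illustration of why that matters for large temporaries, and not a reproduction of the changes in paraformer.cpp, the sketch below uses a hypothetical helper `ReleaseCapacity` to show that `clear()` alone keeps the heap buffer alive.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: frees a vector's heap buffer instead of only
// resetting its size. Swapping with an empty temporary hands the old buffer
// to the temporary, which is destroyed at the end of the statement.
template <typename T>
void ReleaseCapacity(std::vector<T>& v) {
  std::vector<T>().swap(v);
}

// Illustrates the difference: clear() leaves capacity untouched, so a large
// feature buffer would stay allocated until the vector itself is destroyed.
inline std::size_t CapacityAfterClear(std::size_t n) {
  std::vector<float> feats(n, 0.0f);
  feats.clear();
  return feats.capacity();  // still at least n
}
```

`shrink_to_fit()` after `clear()` usually has the same effect, but the standard defines it as a non-binding request. Either way, the freed block returns to the allocator and not necessarily to the kernel, which is consistent with RSS not always falling back.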
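2.5 Targeted optimization of online Paraformer (the chunk hotspot)
To address the RSS rise on online chunks that shows up repeatedly in the logs, V4 makes targeted adjustments:
● File: runtime/onnxruntime/src/paraformer-online.cpp
● Matching declaration changes: runtime/onnxruntime/src/paraformer-online.h
Result: the online path improves, but the conclusion that "the encoder stage is still the main point of increase" still holds. This indicates that the growth comes not only from container-side C++ objects but also from resident behavior at the ORT/allocator level.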
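2.6 Review of the resource reclamation chain on connection close
V4 also reviewed, and relies on, the logic that releases the online handle and the decoder when a connection closes:
● File: runtime/websocket/bin/websocket-server-2pass.cpp
This includes the on_close flag and the release calls made in the cleanup thread, such as FunTpassOnlineUninit and FunASRWfstDecoderUninit.
Result: connection-level resources do not pile up without bound, and there is no clear sign of a "pure leak" across requests.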
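The review summary also notes the use of smart pointers. One common way to make connection-close cleanup hard to miss is to wrap each C-style handle in a `std::unique_ptr` with a custom deleter. The sketch below uses stand-in functions (`FakeHandleInit`, `FakeHandleUninit`) in place of the real init/uninit pairs; none of these names come from FunASR, and the server's actual cleanup runs in a separate thread as described above.

```cpp
#include <memory>

// Stand-ins for a C-style init/uninit pair. Hypothetical, not the FunASR API.
struct FakeHandle { int id; };
static int g_live_handles = 0;
FakeHandle* FakeHandleInit(int id) { ++g_live_handles; return new FakeHandle{id}; }
void FakeHandleUninit(FakeHandle* h) { --g_live_handles; delete h; }
int LiveHandles() { return g_live_handles; }

// Deleter that routes destruction through the uninit function.
struct FakeHandleDeleter {
  void operator()(FakeHandle* h) const { FakeHandleUninit(h); }
};
using FakeHandlePtr = std::unique_ptr<FakeHandle, FakeHandleDeleter>;

// Per-connection state. When the object is destroyed, members are released
// in reverse declaration order: decoder first, then online.
struct ConnectionState {
  FakeHandlePtr online{FakeHandleInit(1)};
  FakeHandlePtr decoder{FakeHandleInit(2)};
};
```

With ownership expressed this way, every exit path that destroys the connection state also runs both uninit calls, so a live-handle counter like the one above can serve as a cheap leak check in tests.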
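Problems "substantially resolved or mitigated" according to this round of measurements
Note: "resolved" below is used in the engineering sense (the risk is under control and there is evidence for it). It does not mean that RSS is mathematically guaranteed to return to its initial value.
3.1 Resolved: inability to localize problems for lack of observability
● Memory changes can now be tracked precisely by trace_id through each segment of WS0→chunk→final→WS9.
● This removes the earlier limitation of having only docker stats to look at, which made causes hard to determine.
3.2 Mitigated: reclamation after the large peak on the final path
● The Run peak of the offline final pass is still there, but most samples fall back noticeably afterwards.
● The residue after final stays within a controllable range overall (a large improvement over the early impression that "every request seems to grow a lot").
3.3 Mitigated: state accumulation from sessions not being reset at the end
● With the input_finished session reset in place, state-related memory no longer keeps accumulating across rounds.
3.4 Mitigated: peak amplification from redundant copies on the online path
● After removing the by-value copy in CifSearch and shortening the lifetime of intermediate objects, redundant overlapping memory at the C++ layer is reduced.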
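The effect of a by-value parameter on peak memory can be shown with a type that counts its own copies. This is a generic sketch: `CountedBuffer`, `SearchByValue`, and `SearchByConstRef` are hypothetical and do not mirror the real CifSearch signature.

```cpp
#include <cstddef>
#include <vector>

// Counts copy constructions so the cost of the parameter type is visible.
struct CountedBuffer {
  static int copies;
  std::vector<float> data;
  explicit CountedBuffer(std::size_t n) : data(n, 0.0f) {}
  CountedBuffer(const CountedBuffer& o) : data(o.data) { ++copies; }
};
int CountedBuffer::copies = 0;

// By value: the whole buffer is duplicated for the duration of the call, so
// the caller's copy and the callee's copy are alive at the same time.
std::size_t SearchByValue(CountedBuffer hidden) { return hidden.data.size(); }

// By const reference: the callee reads the caller's buffer directly.
std::size_t SearchByConstRef(const CountedBuffer& hidden) {
  return hidden.data.size();
}
```

For a large encoder output, the by-value form briefly doubles the memory held for that tensor, which is the kind of overlap that inflates a per-chunk peak.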
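Problems not yet fully resolved but with a clear attribution
4.1 The encoder stage of online chunks is still the largest point of increase
● Across multiple logs, the main steps are still concentrated around the encoder Run in ParaformerOnline::ForwardChunk.
● After trim, RSS can still be higher than the WS0 value of the same trace. This is currently the main source of memory that is "not fully reclaimed".
4.2 RSS not returning to the starting point does not necessarily mean a leak
● Existing evidence shows that connection-level and session-level objects are reclaimed, and that over a whole connection the final RSS is sometimes lower than the initial RSS.
● The remaining problem looks more like "runtime/allocator retention plus working set" than an obvious object leak.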
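The following problems have been resolved or significantly mitigated:
● Leak diagnosis being a black box: the black box is now a white box, and problems can be traced to a specific stage and code path.
● Accumulation of session state: the explicit reset on input_finished largely removes the risk of state piling up across rounds.
● Result objects and temporaries not released promptly: the release chain after chunk/final is complete, and temporary object lifetimes are better controlled.
● Unnecessary memory copies on the online path: by-value copies and some lingering large objects have been fixed.
At the same time, it should be stated clearly that:
● The "RSS step caused by the online encoder" has not been fully eliminated.
● Current evidence points to "partly resident/cached behavior" rather than a "sustained linear leak".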
Key file index
● FunASR-main/runtime/onnxruntime/include/memtrace.h
● FunASR-main/runtime/onnxruntime/src/memtrace.cpp
● FunASR-main/runtime/websocket/bin/websocket-server-2pass.cpp
● FunASR-main/runtime/onnxruntime/src/funasrruntime.cpp
● FunASR-main/runtime/onnxruntime/src/paraformer.cpp
● FunASR-main/runtime/onnxruntime/src/paraformer-online.cpp
● FunASR-main/runtime/onnxruntime/src/paraformer-online.h
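Note: measured throughput is about 2.5x lower than the original version. With the original, a single CPU core at RTF = 1 sustained 5 concurrent streams; it now sustains only 2.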