HTTP/2 server: write response body in place, drop per-response body-sized buffer (fixes #1271) #1282
…JuliaWeb#1271)

The HTTP/2 server response-body write paths allocated a fresh Vector{UInt8} sized to the (windowed) response body on every response and copied the body into it before a single socket write:

- _write_data_frames_h2_server! (streaming / window-overflow batches)
- _write_h2_batch_via_single_buffer! (buffered-response fast path)

Under the h2 server's per-stream Threads.@spawn, those large, short-lived allocations are retained by the glibc malloc per-thread arenas and not returned to the OS, so process RSS climbs in proportion to response body size even though the Julia heap stays flat. The HTTP/1 path does not exhibit this because it writes the body in place (write(io, body)) with no intermediate copy. This is the mechanism behind JuliaWeb#1271 ("HTTP/2 server leaks memory proportional to response body size").

Write each DATA frame as a reused 9-byte header followed by the payload slice taken directly from the body via a view. A unit-range view of a Vector{UInt8} or Base.CodeUnits (both DenseVector) is a stride-1 StridedVector, so write(conn, ::view) takes the zero-copy pointer path straight to the socket, matching the HTTP/1 behavior. Per-response allocation drops from O(body size) to O(1) (the 9-byte header plus the small HEADERS-frame bytes).

Accepted server connections set TCP_NODELAY, so splitting the frame header and payload into separate writes does not incur Nagle latency.

All HTTP/2 server tests pass (the 2 pre-existing failures in the timeout middleware testset are present on master with and without this change).
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           master    #1282      +/-   ##
==========================================
+ Coverage   84.53%   84.62%   +0.09%
==========================================
  Files          28       28
  Lines       10774    10766       -8
==========================================
+ Hits         9108     9111       +3
+ Misses       1666     1655      -11

Thanks for digging into this. One tradeoff I want to think through before merging: this extends the v2.1.0 For TLS, I think that may mean more TLS application writes/records; for plain TCP, it likely means more syscalls. A benchmark that would help: compare |
Problem
Fixes #1271. The HTTP/2 server allocates a fresh Vector{UInt8} sized to the
(windowed) response body on every response and copies the body into it
before writing to the socket. Two paths do this:
- _write_h2_batch_via_single_buffer! (the buffered-response fast path)
- _write_data_frames_h2_server! (streaming / window-overflow batches)
On the h2 server's per-stream Threads.@spawn, these body-sized per-response
allocations drive sustained memory growth proportional to traffic × body size,
in both retained live heap and process RSS. The HTTP/1 path does not show
this: it writes the body in place (write(io, body)), with no intermediate copy.
Fix
Write each DATA frame as a reused 9-byte header followed by the payload slice
taken directly from the body via a view. A unit-range view of a
Vector{UInt8} or Base.CodeUnits (both DenseVector) is a stride-1
StridedVector, so write(conn, ::view) takes Reseau's zero-copy pointer
path straight to the socket, matching the HTTP/1 behavior. Per-response
allocation drops from O(body size) to O(1) (the 9-byte header plus the
small HEADERS-frame bytes).
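
As a rough illustration of the header-plus-view approach, here is a minimal sketch. It is not the PR's code: `fill_frame_header!`, `write_data_frames!`, and the `max_frame_size` keyword are hypothetical names, an `IOBuffer` stands in for the connection, and flow-control windows are ignored. The 9-byte layout (24-bit length, type, flags, reserved bit plus 31-bit stream identifier) follows the HTTP/2 frame format.

```julia
# Hypothetical helpers for illustration only; not the functions changed in this PR.
const FRAME_TYPE_DATA = 0x00
const FLAG_END_STREAM = 0x01

# Fill a reusable 9-byte HTTP/2 frame header in place.
function fill_frame_header!(header::Vector{UInt8}, len::Integer, type::UInt8,
                            flags::UInt8, stream_id::Integer)
    # 24-bit payload length, big-endian
    header[1] = (len >> 16) % UInt8
    header[2] = (len >> 8) % UInt8
    header[3] = len % UInt8
    header[4] = type
    header[5] = flags
    # reserved bit (0) followed by the 31-bit stream identifier, big-endian
    sid = UInt32(stream_id) & 0x7fffffff
    header[6] = (sid >> 24) % UInt8
    header[7] = (sid >> 16) % UInt8
    header[8] = (sid >> 8) % UInt8
    header[9] = sid % UInt8
    return header
end

# Write `body` as DATA frames: one small header write, then a view of the body.
function write_data_frames!(io::IO, body::DenseVector{UInt8}, stream_id::Integer;
                            max_frame_size::Int = 16_384)
    header = Vector{UInt8}(undef, 9)  # allocated once, reused for every frame
    n = length(body)
    offset = 0
    while true
        chunk = min(max_frame_size, n - offset)
        last_frame = offset + chunk == n
        fill_frame_header!(header, chunk, FRAME_TYPE_DATA,
                           last_frame ? FLAG_END_STREAM : 0x00, stream_id)
        write(io, header)
        # The view aliases `body`, so no body-sized intermediate buffer is built.
        chunk > 0 && write(io, view(body, (offset + 1):(offset + chunk)))
        offset += chunk
        last_frame && break
    end
    return n
end

# Type-level claim from the text: unit-range views of both containers are strided.
bytes = Vector{UInt8}("hello world")
@assert view(bytes, 1:5) isa StridedVector{UInt8}
@assert view(codeunits("hello world"), 1:5) isa StridedVector{UInt8}
```

With a 10-byte body and `max_frame_size = 4`, this sketch emits three frames carrying 4, 4, and 2 payload bytes, with END_STREAM set only on the last one. The only per-call allocation it makes itself is the 9-byte header.
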
Accepted server connections enable TCP_NODELAY, so splitting the frame header
and payload into separate writes does not introduce Nagle latency.
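
For readers less familiar with the option, the sketch below shows the same idea using Julia's standard `Sockets` library, where `Sockets.nagle(sock, false)` disables Nagle's algorithm (the equivalent of setting TCP_NODELAY). This is an assumption-laden stand-in: how Reseau sets the option on accepted connections is not shown in this PR, and the loopback server, port hint, and byte contents here are made up for the example.

```julia
using Sockets

# Listen on a free loopback port at or above the hint (illustrative values).
port, server = listenany(ip"127.0.0.1", 8000)

# Connect a client in the background so that `accept` can return.
client_task = @async connect(ip"127.0.0.1", port)
conn = accept(server)

# Disable Nagle's algorithm on the accepted connection (i.e. set TCP_NODELAY),
# so a small write is sent promptly instead of waiting to be coalesced.
Sockets.nagle(conn, false)

header  = zeros(UInt8, 9)          # stand-in for a 9-byte frame header
payload = Vector{UInt8}("hello")   # stand-in for a body slice

write(conn, header)    # first write: frame header
write(conn, payload)   # second write: payload

client = fetch(client_task)
received = read(client, length(header) + length(payload))

close(conn); close(client); close(server)
```

The two writes still arrive as one contiguous byte stream on the client side; the option only affects when small segments are sent, not their order or content.
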
Results
Two-process repro: a standalone HTTP/2 server, driven with 100,000 measured
requests of the same response body. Server-process memory was sampled via a
/mem endpoint after a settled forced GC.
Before, the server retained ~20 MiB of live heap (after settled forced GC) and
+28 MiB RSS over the run. After, both stay flat on the identical HTTP/2-only
workload, with zero client failures.
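
The handler behind the /mem endpoint is not shown in this PR, so the following is only a guess at what such a sampler might look like. `sample_memory`, `current_rss_bytes`, and the `settle_passes` keyword are hypothetical names. `Base.gc_live_bytes` is an unexported Base function, and the `/proc/self/statm` read assumes Linux; on other systems the sketch falls back to `Sys.maxrss()`, which reports peak rather than current RSS.

```julia
# Hypothetical memory sampler; names and structure are assumptions, not PR code.

# Current resident set size in bytes (Linux), or peak RSS as a fallback.
function current_rss_bytes()
    statm = "/proc/self/statm"
    if isfile(statm)
        # Second field of statm is the resident set size in pages.
        resident_pages = parse(Int, split(read(statm, String))[2])
        page_size = Int(ccall(:getpagesize, Cint, ()))
        return resident_pages * page_size
    end
    return Int(Sys.maxrss())  # peak, not current, RSS; fallback only
end

# Force several full collections so the heap settles, then take a snapshot.
function sample_memory(; settle_passes::Int = 3)
    for _ in 1:settle_passes
        GC.gc(true)
    end
    return (live_heap_bytes = Base.gc_live_bytes(),
            rss_bytes = current_rss_bytes())
end

# Usage: snapshot before and after a workload and compare the two.
before = sample_memory()
# ... drive requests against the server here ...
after = sample_memory()
heap_delta = after.live_heap_bytes - before.live_heap_bytes
rss_delta = after.rss_bytes - before.rss_bytes
```

Sampling both numbers matters here because the two can diverge: live heap reflects what the Julia GC still considers reachable, while RSS also includes memory the allocator has not returned to the OS.
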
Testing
- test/http2_server_tests.jl passes, including the flow-control / window /
  large-body DATA-frame-splitting testsets that exercise this code.
- The 2 failures in the timeout middleware testset occur on master
  independently of this change (verified against the pristine base): a
  separate, pre-existing issue, not introduced here.