HTTP/2 server: write response body in place, drop per-response body-sized buffer (fixes #1271) #1282

Open
canemon-markov wants to merge 1 commit into JuliaWeb:master from canemon-markov:fix/h2-server-zerocopy-response-body

@canemon-markov

Problem

Fixes #1271. The HTTP/2 server allocates a fresh Vector{UInt8} sized to the
(windowed) response body on every response and copies the body into it
before writing to the socket. Two paths do this:

  • _write_h2_batch_via_single_buffer! — the buffered-response fast path
  • _write_data_frames_h2_server! — streaming / window-overflow batches

On the h2 server's per-stream Threads.@spawn, these body-sized per-response
allocations drive sustained memory growth proportional to traffic × body size —
in both retained live heap and process RSS. The HTTP/1 path does not show
this: it writes the body in place (write(io, body)), with no intermediate copy.
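
To make the difference between the two write patterns concrete, here is a minimal sketch. It is not the HTTP.jl code: `write_via_copy` and `write_in_place` are hypothetical names, frame headers are left out, and an `IOBuffer` stands in for the socket.

```julia
# Hypothetical helpers for illustration; not functions from HTTP.jl.

# Copying pattern: allocate a buffer sized to the body, copy, then write once.
function write_via_copy(io::IO, body::Vector{UInt8})
    buf = Vector{UInt8}(undef, length(body))  # O(body size) allocation per call
    copyto!(buf, body)
    return write(io, buf)
end

# In-place pattern: hand the body to `write` directly, no intermediate buffer.
write_in_place(io::IO, body::Vector{UInt8}) = write(io, body)

body = rand(UInt8, 1 << 20)  # 1 MiB stand-in for a response body

sink_copy = IOBuffer()
sink_direct = IOBuffer()
n_copy = write_via_copy(sink_copy, body)
n_direct = write_in_place(sink_direct, body)
out_copy = take!(sink_copy)
out_direct = take!(sink_direct)
```

Both functions put the same bytes on the wire; only the first allocates a temporary the size of the body on every call.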

Fix

Write each DATA frame as a reused 9-byte header followed by the payload slice
taken directly from the body via a view. A unit-range view of a
Vector{UInt8} or Base.CodeUnits (both DenseVector) is a stride-1
StridedVector, so write(conn, ::view) takes Reseau's zero-copy pointer
path straight to the socket — matching the HTTP/1 behavior. Per-response
allocation drops from O(body size) to O(1) (the 9-byte header plus the
small HEADERS-frame bytes).
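
The approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `data_frame_header!` and `write_data_frames` are hypothetical names, an `IOBuffer` stands in for the connection, flow-control windows are ignored, and the body is assumed to be 1-indexed. The header layout follows the HTTP/2 frame format (24-bit length, type, flags, 31-bit stream id), and 16,384 is the protocol's default maximum frame size.

```julia
# Sketch only: `data_frame_header!` and `write_data_frames` are hypothetical names.

const FRAME_HEADER_LEN = 9
const FRAME_TYPE_DATA = 0x00
const FLAG_END_STREAM = 0x01

# Fill `hdr` with an HTTP/2 frame header:
# 24-bit length, 8-bit type, 8-bit flags, 1 reserved bit + 31-bit stream id.
function data_frame_header!(hdr::Vector{UInt8}, len::Integer, flags::UInt8, stream_id::Integer)
    hdr[1] = (len >> 16) % UInt8
    hdr[2] = (len >> 8) % UInt8
    hdr[3] = len % UInt8
    hdr[4] = FRAME_TYPE_DATA
    hdr[5] = flags
    sid = UInt32(stream_id) & 0x7fffffff  # clear the reserved bit
    hdr[6] = (sid >> 24) % UInt8
    hdr[7] = (sid >> 16) % UInt8
    hdr[8] = (sid >> 8) % UInt8
    hdr[9] = sid % UInt8
    return hdr
end

# Write `body` as DATA frames: one reused header buffer, payload written as a view.
function write_data_frames(io::IO, body::AbstractVector{UInt8}, stream_id::Integer;
                           max_frame_size::Int = 16_384)
    hdr = Vector{UInt8}(undef, FRAME_HEADER_LEN)  # the only allocation, reused per frame
    total = length(body)
    offset = 0
    while true
        n = min(max_frame_size, total - offset)
        is_last = offset + n == total
        data_frame_header!(hdr, n, is_last ? FLAG_END_STREAM : 0x00, stream_id)
        write(io, hdr)
        # No body-sized copy: the payload is a unit-range view into `body`.
        n > 0 && write(io, view(body, (offset + 1):(offset + n)))
        offset += n
        is_last && break
    end
    return nothing
end

body = rand(UInt8, 40_000)
sink = IOBuffer()
write_data_frames(sink, body, 1)
framed = take!(sink)
payload_view_is_strided = view(body, 1:10) isa StridedVector{UInt8}
```

With a 40,000-byte body the sketch emits three frames (16,384 + 16,384 + 7,232 payload bytes), each preceded by the same 9-byte header buffer, and sets END_STREAM only on the last one.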

Accepted server connections enable TCP_NODELAY, so splitting the frame header
and payload into separate writes does not introduce Nagle latency.

Results

Two-process repro: a standalone HTTP/2 server, driven with 100,000 measured
requests of the same response body. Server-process memory sampled via a /mem
endpoint after a settled forced GC.
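
The `/mem` handler itself is not shown in this PR. As a rough sketch of what such a sample could look like, the hypothetical `memory_snapshot` below forces a few full collections and then reads the live heap size. It relies on `Base.gc_live_bytes`, an unexported Base function, and on `Sys.maxrss`, which reports peak rather than current RSS, so unlike the RSS Δ column it can never decrease.

```julia
# Hypothetical helper; the PR does not show its /mem implementation.
function memory_snapshot(; gc_passes::Int = 3)
    for _ in 1:gc_passes
        GC.gc(true)  # full collection, repeated so the heap can settle
    end
    return (live_bytes = Base.gc_live_bytes(),  # bytes the GC considers live
            max_rss_bytes = Sys.maxrss())       # peak RSS, not current RSS
end

before = memory_snapshot()
junk = [rand(UInt8, 1 << 16) for _ in 1:64]  # allocate some short-lived data
junk = nothing                                # drop the reference
after = memory_snapshot()
delta_kb = (after.live_bytes - before.live_bytes) / 1024
```

Comparing `live_bytes` before and after a fixed batch of requests gives a live heap Δ in the same spirit as the table.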

| Scenario | Protocol | Requests | Live heap Δ | RSS Δ | Server proto counts | Client failures |
| --- | --- | --- | --- | --- | --- | --- |
| Before (HTTP.jl 2.1.0) | h2 | 100,000 | +20,508 KB | +28.0 MB | h1=0 h2=100,111 other=0 | 0 |
| After (this PR) | h2 | 100,000 | −0.0 KB | −2.9 MB | h1=0 h2=100,111 other=0 | 0 |

Before, the server retained ~20 MiB of live heap (after settled forced GC) and
+28 MiB RSS over the run. After, both stay flat on the identical HTTP/2-only
workload, with zero client failures.

Testing

  • Full test/http2_server_tests.jl passes, including the flow-control / window
    / large-body DATA-frame-splitting testsets that exercise this code.
  • Note: the timeout-middleware testset has 2 failures that are present on
    master independently of this change (verified against the pristine base).
    This is a separate, pre-existing issue, not introduced here.

…JuliaWeb#1271)

The HTTP/2 server response-body write paths allocated a fresh Vector{UInt8}
sized to the (windowed) response body on every response and copied the body
into it before a single socket write:

  - _write_data_frames_h2_server!        (streaming / window-overflow batches)
  - _write_h2_batch_via_single_buffer!   (buffered-response fast path)

Under the h2 server's per-stream Threads.@spawn, those large, short-lived
allocations are retained by the glibc malloc per-thread arenas and not
returned to the OS, so process RSS climbs in proportion to response body size
even though the Julia heap stays flat. The HTTP/1 path does not exhibit this
because it writes the body in place (write(io, body)) with no intermediate
copy. This is the mechanism behind JuliaWeb#1271 ("HTTP/2 server leaks memory
proportional to response body size").

Write each DATA frame as a reused 9-byte header followed by the payload slice
taken directly from the body via a view. A unit-range view of a Vector{UInt8}
or Base.CodeUnits (both DenseVector) is a stride-1 StridedVector, so
write(conn, ::view) takes the zero-copy pointer path straight to the socket --
matching the HTTP/1 behavior. Per-response allocation drops from O(body size)
to O(1) (the 9-byte header plus the small HEADERS-frame bytes).

Accepted server connections set TCP_NODELAY, so splitting the frame header and
payload into separate writes does not incur Nagle latency.

All HTTP/2 server tests pass (the 2 pre-existing failures in the timeout
middleware testset are present on master with and without this change).

codecov Bot commented Jun 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.62%. Comparing base (4e2b698) to head (e9590a8).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1282      +/-   ##
==========================================
+ Coverage   84.53%   84.62%   +0.09%     
==========================================
  Files          28       28              
  Lines       10774    10766       -8     
==========================================
+ Hits         9108     9111       +3     
+ Misses       1666     1655      -11     


quinnj commented Jun 5, 2026

Member

Thanks for digging into this. One tradeoff I want to think through before merging: this extends the v2.1.0 BytesBody fix down into the lower DATA-frame writers, which is good for avoiding the body-sized copy, but it also changes the write shape from one contiguous buffered write per DATA batch into separate writes for the frame header and payload.

For TLS, I think that may mean more TLS application writes/records; for plain TCP, it likely means more syscalls. A benchmark that would help: compare master/v2.1.0 vs this branch for a large fixed Vector{UInt8} response over both h2c and h2/TLS, tracking RSS/live heap plus req/s/latency, and ideally write/syscall counts. That would tell us whether the RSS win comes with a meaningful throughput/latency cost.
