Fix: bound SSL handshake to prevent leaked ESTABLISHED sockets#3296
Open
yerixu wants to merge 1 commit into
Open
Fix: bound SSL handshake to prevent leaked ESTABLISHED sockets#3296yerixu wants to merge 1 commit into
yerixu wants to merge 1 commit into
Conversation
Socket::SSLHandshake polled the fd via unbounded bthread_fd_wait, so a peer that completed the TCP handshake but never sent a TLS Hello (e.g. server not actually configured for SSL) parked the handshake bthread forever. That bthread holds a Socket reference through the in-flight WriteRequest, so the underlying fd was never closed and the connection stayed ESTABLISHED until OS keepalive eventually broke it. Channel destruction does not help: ~Channel only removes the SocketMap ref; SetFailed/OnFailed wake _epollout_butex but not the per-fd butex used by bthread_fd_wait, and the fd close path runs only after the Socket's ref count drops to zero, which the stuck bthread prevents. With a periodic retry, ESTABLISHED sockets accumulate without bound. Switch the two bthread_fd_wait calls in SSLHandshake to bthread_fd_timedwait, with the deadline derived from a new gflag ssl_handshake_timeout_ms (default 5000ms; <=0 keeps the old behavior). On timeout the existing failure path runs: CheckConnected returns -1 -> AfterAppConnected calls SetFailed + ReleaseAllFailedWriteRequests, and the fd_guard in CheckConnectedAndKeepWrite closes the fd at scope exit. The server-side handshake invoked from DoRead benefits from the same bound.
Open
Contributor
There was a problem hiding this comment.
Pull request overview
This pull request adds a bounded wait to Socket::SSLHandshake to prevent bthreads from blocking forever when a peer completes the TCP handshake but never completes the TLS handshake, which can otherwise keep sockets stuck in ESTABLISHED and prevent fd cleanup.
Changes:
- Introduces
--ssl_handshake_timeout_ms(default 5000ms;<=0disables) to cap total SSL handshake duration. - Replaces unbounded
bthread_fd_waitwithbthread_fd_timedwaitin SSL handshake read/write wait paths. - Adds warning logs when the SSL handshake hits the configured timeout.
Comments suppressed due to low confidence (1)
src/brpc/socket.cpp:2052
- Same as the WANT_READ branch: handshake timeout here results in
errno=ETIMEDOUTbubbling up frombthread_fd_timedwait. This may be misinterpreted by higher layers as a TCP connect timeout (and can affectconnect_timeout_as_unreachablelogic). Consider using a dedicated error code for SSL handshake timeouts or ensuring these timeouts are not counted as connect timeouts.
if (errno == ETIMEDOUT) {
LOG(WARNING) << "SSL handshake timed out after "
<< handshake_timeout_ms
<< "ms while waiting to send on fd=" << fd
<< " remote_side=" << _remote_side;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+2032
to
+2036
| if (errno == ETIMEDOUT) { | ||
| LOG(WARNING) << "SSL handshake timed out after " | ||
| << handshake_timeout_ms | ||
| << "ms while waiting for peer data on fd=" | ||
| << fd << " remote_side=" << _remote_side; |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Socket::SSLHandshake polled the fd via unbounded bthread_fd_wait, so a peer that completed the TCP handshake but never sent a TLS Hello (e.g. server not actually configured for SSL) parked the handshake bthread forever. That bthread holds a Socket reference through the in-flight WriteRequest, so the underlying fd was never closed and the connection stayed ESTABLISHED until OS keepalive eventually broke it.
Channel destruction does not help: ~Channel only removes the SocketMap ref; SetFailed/OnFailed wake _epollout_butex but not the per-fd butex used by bthread_fd_wait, and the fd close path runs only after the Socket's ref count drops to zero, which the stuck bthread prevents. With a periodic retry, ESTABLISHED sockets accumulate without bound.
Switch the two bthread_fd_wait calls in SSLHandshake to bthread_fd_timedwait, with the deadline derived from a new gflag ssl_handshake_timeout_ms (default 5000ms; <=0 keeps the old behavior).
On timeout the existing failure path runs: CheckConnected returns -1 ->
AfterAppConnected calls SetFailed + ReleaseAllFailedWriteRequests, and the fd_guard in CheckConnectedAndKeepWrite closes the fd at scope exit. The server-side handshake invoked from DoRead benefits from the same bound.
What problem does this PR solve?
Issue Number: resolve
Problem Summary:
What is changed and the side effects?
Changed:
Side effects:
Performance effects:
Breaking backward compatibility:
Check List: