Skip to content

Fix: bound SSL handshake to prevent leaked ESTABLISHED sockets#3296

Open
yerixu wants to merge 1 commit into
apache:masterfrom
yerixu:fix/ssl-handshake-timeout
Open

Fix: bound SSL handshake to prevent leaked ESTABLISHED sockets#3296
yerixu wants to merge 1 commit into
apache:masterfrom
yerixu:fix/ssl-handshake-timeout

Conversation

@yerixu
Copy link
Copy Markdown

@yerixu yerixu commented May 13, 2026

Socket::SSLHandshake polled the fd via unbounded bthread_fd_wait, so a peer that completed the TCP handshake but never sent a TLS Hello (e.g. server not actually configured for SSL) parked the handshake bthread forever. That bthread holds a Socket reference through the in-flight WriteRequest, so the underlying fd was never closed and the connection stayed ESTABLISHED until OS keepalive eventually broke it.

Channel destruction does not help: ~Channel only removes the SocketMap ref; SetFailed/OnFailed wake _epollout_butex but not the per-fd butex used by bthread_fd_wait, and the fd close path runs only after the Socket's ref count drops to zero, which the stuck bthread prevents. With a periodic retry, ESTABLISHED sockets accumulate without bound.

Switch the two bthread_fd_wait calls in SSLHandshake to bthread_fd_timedwait, with the deadline derived from a new gflag ssl_handshake_timeout_ms (default 5000ms; <=0 keeps the old behavior).
On timeout the existing failure path runs: CheckConnected returns -1 ->
AfterAppConnected calls SetFailed + ReleaseAllFailedWriteRequests, and the fd_guard in CheckConnectedAndKeepWrite closes the fd at scope exit. The server-side handshake invoked from DoRead benefits from the same bound.

What problem does this PR solve?

Issue Number: resolve

Problem Summary:

What is changed and the side effects?

Changed:

Side effects:

  • Performance effects:

  • Breaking backward compatibility:


Check List:

Socket::SSLHandshake polled the fd via unbounded bthread_fd_wait, so a
peer that completed the TCP handshake but never sent a TLS Hello (e.g.
server not actually configured for SSL) parked the handshake bthread
forever. That bthread holds a Socket reference through the in-flight
WriteRequest, so the underlying fd was never closed and the connection
stayed ESTABLISHED until OS keepalive eventually broke it.

Channel destruction does not help: ~Channel only removes the SocketMap
ref; SetFailed/OnFailed wake _epollout_butex but not the per-fd butex
used by bthread_fd_wait, and the fd close path runs only after the
Socket's ref count drops to zero, which the stuck bthread prevents.
With a periodic retry, ESTABLISHED sockets accumulate without bound.

Switch the two bthread_fd_wait calls in SSLHandshake to
bthread_fd_timedwait, with the deadline derived from a new gflag
ssl_handshake_timeout_ms (default 5000ms; <=0 keeps the old behavior).
On timeout the existing failure path runs: CheckConnected returns -1 ->
AfterAppConnected calls SetFailed + ReleaseAllFailedWriteRequests, and
the fd_guard in CheckConnectedAndKeepWrite closes the fd at scope exit.
The server-side handshake invoked from DoRead benefits from the same
bound.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds a bounded wait to Socket::SSLHandshake to prevent bthreads from blocking forever when a peer completes the TCP handshake but never completes the TLS handshake, which can otherwise keep sockets stuck in ESTABLISHED and prevent fd cleanup.

Changes:

  • Introduces --ssl_handshake_timeout_ms (default 5000ms; <=0 disables) to cap total SSL handshake duration.
  • Replaces unbounded bthread_fd_wait with bthread_fd_timedwait in SSL handshake read/write wait paths.
  • Adds warning logs when the SSL handshake hits the configured timeout.
Comments suppressed due to low confidence (1)

src/brpc/socket.cpp:2052

  • Same as the WANT_READ branch: handshake timeout here results in errno=ETIMEDOUT bubbling up from bthread_fd_timedwait. This may be misinterpreted by higher layers as a TCP connect timeout (and can affect connect_timeout_as_unreachable logic). Consider using a dedicated error code for SSL handshake timeouts or ensuring these timeouts are not counted as connect timeouts.
                if (errno == ETIMEDOUT) {
                    LOG(WARNING) << "SSL handshake timed out after "
                                 << handshake_timeout_ms
                                 << "ms while waiting to send on fd=" << fd
                                 << " remote_side=" << _remote_side;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/brpc/socket.cpp
Comment on lines +2032 to +2036
if (errno == ETIMEDOUT) {
LOG(WARNING) << "SSL handshake timed out after "
<< handshake_timeout_ms
<< "ms while waiting for peer data on fd="
<< fd << " remote_side=" << _remote_side;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants