[WIP] Fix connection lifecycle bugs, prevent PCB pool exhaustion, and harden closure handling#52
Open
timschneider wants to merge 2 commits intoDuet3D:mainfrom
Open
[WIP] Fix connection lifecycle bugs, prevent PCB pool exhaustion, and harden closure handling#52timschneider wants to merge 2 commits intoDuet3D:mainfrom
timschneider wants to merge 2 commits intoDuet3D:mainfrom
Conversation
Contributor
|
Thanks again for your contributions! It looks like I will adopt this project and maintain it from now on. Are you happy with your PR yet? If yes, I will review it shortly. |
- Fix inverted NETCONN_MORE flag: data was buffered when it should flush and flushed when it should buffer, affecting every TCP write - Add ERR_WOULDBLOCK spin guard in Write loop to prevent busy-wait when send buffer is full with no progress - Classify ERR_ABRT/ERR_CONN as otherEndClosed in Write and Poll instead of falling through to Terminate - Guard Write against terminating connection on ERR_WOULDBLOCK after partial write success - Return actual bytes written (total) instead of requested length - Handle ERR_TIMEOUT in Poll recv to prevent idle connection termination - Detect RST-destroyed PCB in closePending instead of waiting for ack timer, and use Terminate(true) to avoid unnecessary round-trip to RRF - Null-guard listener->Notify() in Close and Terminate to prevent crash on MQTT/outgoing connections where listener is nullptr - Include aborted state in GetSummarySocketStatus bitmap so RRF gets prompt notification of aborted connections - Fix use-after-free in Listener::Stop by saving conn pointer before netconn_delete - Clear connection listener pointer before FTP data listener Stop to prevent dangling pointer dereference on connection close Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
122e19e to
8127a25
Compare
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author
|
@chrishamm i've updated the PR to only include the minimal needed code changes and the description to explain each change. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
Every claim in this report is verified against source code with file paths and line numbers.
Change 1: NETCONN_MORE Flag Inversion in Write()
File:
src/Connection.cpp:100Base code:
Fixed code:
What NETCONN_MORE means
NETCONN_MOREis defined as0x02inlwip/api.h:64. The lwIP documentation inapi_lib.c:982states:The flag value
0x02maps directly toTCP_WRITE_FLAG_MORE(tcpbase.h:74), which is passed totcp_write(). WhenTCP_WRITE_FLAG_MOREis set, lwIP interprets it as "more data coming, don't flush yet" — the PSH flag is not set on the last segment, telling the TCP stack to buffer rather than push.Why the base code was wrong
The base code sets
NETCONN_MOREwhenpush == true. This means:NETCONN_MOREset → TCP told "don't flush, more coming" → data buffered instead of flushedNETCONN_MOREnot set → TCP told "nothing more" → data flushed prematurelyThe semantics are exactly inverted.
How push is set by RRF
In
SocketServer.cpp:1664:RRF sets
FlagPushwhen it wants the data flushed to the remote client (e.g., end of an HTTP response). The base code defeats this intent on every write.Faulty execution path
FlagPushset →push = trueflag = NETCONN_COPY | NETCONN_MORE→ tcp_write told "more data coming"Impact: Every TCP write in the system has inverted flush semantics, causing latency on HTTP responses and premature flushes on partial data.
Change 2: ERR_WOULDBLOCK Spin Guard in Write Loop
File:
src/Connection.cpp:112-114Base code:
Fixed code:
Why the base code was wrong
When
netconn_write_partlyreturnsERR_WOULDBLOCKwithwritten == 0, the send buffer is full and no data was written. The base loop conditiontotal += writtenadds 0, sototal < lengthremains true. The loop retries immediately — busy-waiting until the send buffer drains or the send timeout triggers anotherERR_WOULDBLOCK.The connection has
netconn_set_sendtimeout(conn, MaxReadWriteTime)set (viaListener.cpp:232for accepted connections,Connection.cpp:265for outgoing connections). So eachERR_WOULDBLOCKiteration blocks for up toMaxReadWriteTimemilliseconds before returning. Without the spin guard, the loop can re-enter the blocking write multiple times.Why the new spin guard is needed for correctness with Change 4
Without this guard, when the loop exits with
rc == ERR_WOULDBLOCK, the post-loop error handling at line 118 entersif (rc != ERR_OK). In the base code, theelsebranch callsTerminate(false)— killing the connection for a transient condition. With Change 4's fix (else if (rc != ERR_WOULDBLOCK)), ERR_WOULDBLOCK is excluded from termination. But without the spin guard, the loop would spin instead of breaking cleanly.Faulty execution path
total = 1460netconn_write_partlyreturnsERR_WOULDBLOCK,written = 0total += 0, still< 2048), retries immediatelyChange 3: ERR_ABRT/ERR_CONN Classification in Write()
File:
src/Connection.cpp:120Base code:
Fixed code:
lwIP error code definitions (lwip/err.h:52-89)
All four indicate the connection is no longer viable. ERR_ABRT occurs when lwIP aborts the connection internally (e.g., PCB killed due to resource pressure). ERR_CONN occurs when writing to a connection that has been disconnected.
Why the base code was wrong
In the base code, ERR_ABRT and ERR_CONN fall through to the
elsebranch:Terminate(false)sets state toConnState::abortedand (in base code) callslistener->Notify(). The correct semantic for "remote end gone" isotherEndClosed, which is what RRF expects for clean socket teardown viaWiFiSocket::Poll()(WiFiSocket.cpp:166-178).Faulty execution path
netconn_write_partly→ returnsERR_ABRTelse→Terminate(false)→ state =abortedlistener->Notify()crashes if listener is nullptr (outgoing connection)abortedstate requires RRF to discover via polling, then send connAbort, rather than the directotherEndClosed→ RRFPoll()→SocketState::peerDisconnectingcleanup pathChange 4: ERR_WOULDBLOCK Non-Termination Guard in Write()
File:
src/Connection.cpp:124Base code:
Fixed code:
Why this is needed
With the spin guard (Change 2), the write loop can now exit with
rc == ERR_WOULDBLOCKandwritten == 0— a partial write where some data was sent (total > 0) but the send buffer is full. This is a transient condition, not an error.Without this guard, ERR_WOULDBLOCK after a partial write falls through to
Terminate(false)— killing a connection that successfully sent data and just needs the remote end to acknowledge.Faulty execution path
written == 0total = 1460,rc = ERR_WOULDBLOCKrc != ERR_OK→ true, enters error handlingERR_WOULDBLOCK != ERR_RST/CLSD/ABRT/CONN→ doesn't match Change 3else→Terminate(false)→ connection killed after successful 1460-byte writeelse if (rc != ERR_WOULDBLOCK)→ skips termination, returnstotal = 1460What happens to the unsent 588 bytes
SocketServer.cpp:1668-1672:The return value is compared against
acceptedLength. If less was written,lastErroris set. RRF does not see this — it already receivedacceptedLengthvia SPI (hspi.transfer32at line 1666) beforeWrite()was called. RRF decremented itstxBufferSpacebyacceptedLength(WiFiSocket.cpp:349). On the nextPoll(), RRF refreshestxBufferSpacefromresp.writeBufferSpace(WiFiSocket.cpp:237), which self-corrects any accounting discrepancy.Change 5: Return
totalInstead ofreturn lengthFile:
src/Connection.cpp:141Base code:
return length;Fixed code:
return total;Why the base code was wrong
lengthis the requested write size.totalis the actual bytes written. After a partial write (e.g., send buffer full), the base returns the requested amount, not the actual amount. This makesWrite()lie about its result.Where the return value is used
SocketServer.cpp:1668-1672:With the base code,
writtenalways equalsacceptedLength(sincelength == acceptedLength), solastErroris never set even when a partial write occurs. The fix makes partial writes observable throughlastError.RRF does not see
Write()'s return value —acceptedLengthis sent via SPI beforeWrite()executes (SocketServer.cpp:1666). However,lastErroris reported via the diagnostics command and aids debugging.Change 6: ERR_TIMEOUT and ERR_ABRT Handling in Poll() Recv
File:
src/Connection.cpp:170-171Base code:
Fixed code:
ERR_TIMEOUT: How it can be returned
Accepted connections have timeouts configured in
Listener.cpp:231-232:Outgoing connections similarly in
Connection.cpp:263-264:In lwIP's
netconn_recv_data(api_lib.c:584-686):netconn_is_nonblocking()is true, asys_arch_mbox_tryfetchis used → returnsERR_WOULDBLOCKon empty mailboxrecv_timeoutset, asys_arch_mbox_fetchwith timeout is used → returnsERR_TIMEOUTon timeoutWhen
netconn_is_nonblockingis true, the non-blocking path is taken first. However, the lwIP implementation checks non-blocking OR theNETCONN_DONTBLOCKapiflags. The actual control flow depends on the lwIP version and platform-specificsys_arch_mbox_tryfetchbehavior. If for any reason the non-blocking tryfetch succeeds but returns a timeout indication (platform-specific), or if the non-blocking flag is cleared during connection state transitions, ERR_TIMEOUT can surface.Why ERR_TIMEOUT must not enter the error handler
In the base code, ERR_TIMEOUT passes the
rc != ERR_WOULDBLOCKcheck, enters the error handler, doesn't matchERR_RST/CLSD/CONN, and falls through to:This kills the connection for a timeout — a transient, non-error condition.
ERR_ABRT in Poll() recv
ERR_ABRT (
err.h:82, value -13: "Connection aborted") can be returned when lwIP internally aborts the connection (e.g., PCB killed under memory pressure viatcp_kill_prio). In the base code, ERR_ABRT doesn't matchERR_RST/CLSD/CONNand falls through toTerminate(false). The fix classifies it asotherEndClosed, which is the correct semantic — the connection is gone.Faulty execution path (ERR_TIMEOUT)
netconn_recv_tcp_pbuf_flagsreturnsERR_TIMEOUTERR_TIMEOUT != ERR_WOULDBLOCK→ enters error handlerERR_TIMEOUT != ERR_RST/CLSD/CONN→ falls toTerminate(false)Faulty execution path (ERR_ABRT)
tcp_kill_prioto free PCBsnetconn_recv_tcp_pbuf_flagsreturnsERR_ABRTERR_RST/CLSD/CONNTerminate(false)→ state =abortedERR_ABRT→ state =otherEndClosed→ cleaner RRF teardownChange 7: closePending — Null PCB Detection and Terminate(true)
File:
src/Connection.cpp:199-214Base code:
Fixed code:
Three bugs fixed
Bug A: Null PCB causes MaxAckTime delay
When the remote end sends RST while in closePending, lwIP frees the PCB —
conn->pcb.tcpbecomes NULL.Base code:
conn->pcb.tcp && !conn->pcb.tcp->unackedevaluates tofalse(short-circuit on null). Falls through to the timeout branch, which waits up toMaxAckTimebefore callingTerminate(false). The connection sits in closePending doing nothing for the full timeout duration.Fixed code:
!conn->pcb.tcpevaluates totrue→ immediately transitions tocloseReady. No delay.Bug B: Unsent queue not checked
The base code only checks
!conn->pcb.tcp->unacked. The TCP send pipeline is:tcp_write()→ unsent →tcp_output()→ unacked → ACK received → freed. It is possible forunsent != 0 && unacked == 0— e.g., when an ACK drains the in-flight data buttcp_outputhasn't yet moved queued segments fromunsentto the wire in the same cycle, or the Nagle algorithm is holding data.Base code transitions to
closeReadywhenunackedis empty, regardless ofunsent. The subsequentnetconn_close()in the closeReady path does calltcp_close()which attempts to flush, but this is less controlled than allowing the normal send path to drain both queues.Fixed code:
(!conn->pcb.tcp->unsent && !conn->pcb.tcp->unacked)— only transitions to closeReady when both queues are fully drained.Bug C: Terminate(false) causes unnecessary round-trip to RRF
How closePending is entered —
Close()atConnection.cpp:225:Close()is called fromSocketServer.cpp:1637via theconnCloseSPI command:So closePending means RRF already requested the close. When the ack timer expires:
Terminate(false)→SetState(ConnState::aborted)→ RRF must discover the aborted state viaGetSummarySocketStatusbitmap or round-robin polling, then sendconnAbort→Terminate(true)→ finallyfreeTerminate(true)→SetState(ConnState::free)→ slot immediately availableThe round-trip is:
abortedWiFiSocket::Poll()discoversConnState::aborted(WiFiSocket.cpp:242-248):Terminate()sendsconnAbortback to ESP (WiFiSocket.cpp:284-298)Terminate(true)→ finallyfreeThis round-trip is pointless because RRF already asked to close.
Terminate(true)skips it.Faulty execution path
conn->pcb.tcpis null → first condition false → waits MaxAckTimeTerminate(false)→ state =aborted!conn->pcb.tcp→ immediatecloseReady→Close()→freein next Poll()Change 8: Listener Null Guard in Close()
File:
src/Connection.cpp:244-247Base code:
Fixed code:
Why
listenercan be nullOutgoing (MQTT) connections:
ConnectCallbackatConnection.cpp:497:Connected()stores the listener parameter directly atConnection.cpp:322:FTP data connections after accept: After Change 11 (FTP listener nullptr),
c->listeneris explicitly set tonullptratListener.cpp:237before the listener is deleted.Constructor: The constructor (
Connection.cpp:19-23) does not initializelistener:listeneris not in the initializer list. It is only set inConnected().How Close() is reached for outgoing connections
The
connCloseSPI command (SocketServer.cpp:1637) callsClose()on any valid socket number, including outgoing MQTT connections. Also,Write()callsClose()at line 137 whencloseAfterSendingis true.Faulty execution path
Connect()→ConnectCallback→Connected(nullptr, conn)listenerisnullptrotherEndClosedClose()entered with stateotherEndClosedlistener->Notify()→ null pointer dereference → crashChange 9: Listener Null Guard and Conditional Notify in Terminate()
File:
src/Connection.cpp:313-317Base code:
Fixed code:
Two fixes
Fix A: Null guard — same rationale as Change 8. Outgoing connections have
listener == nullptr.Fix B: Only notify on external (free) — when
external == false, state is set toaborted. The connection slot is not freed — it transitions from a live state toaborted, which still occupies the slot.Listener::Notify()wakes theListenerTaskto accept new connections, but no slot is available since the aborted connection hasn't been freed yet. Notifying is pointless and potentially harmful (listener attempts accept, fails to find free slot).When
external == true, state is set tofree. The slot is actually available, so notifying the listener to accept new connections is correct.Where Terminate(true) is called
SocketServer.cpp:1625— connAbort command:Connection::Get(...).Terminate(true)Connection.cpp:288—Connect()failure path:Terminate(true)Where Terminate(false) is called
Connection.cpp:186—Poll()recv error fallthroughConnection.cpp:127—Write()error (with Change 4, only for non-WOULDBLOCK errors)Connection.cpp:212— closePending timeout (in base code; fixed toTerminate(true)by Change 7)Faulty execution path
listener == nullptrPoll()recv → unclassified errorTerminate(false)→ state =abortedlistener->Notify()→ null pointer dereference → crashChange 10: Aborted State in GetSummarySocketStatus Bitmap
File:
src/Connection.cpp:435-437Base code:
Fixed code:
How the bitmap flows to RRF
GetSummarySocketStatus()is called fromSocketServer.cpp:1687:Connection::GetSummarySocketStatus(resp.connectedSockets, resp.otherEndClosedSockets);ConnStatusResponsevia SPI to RRFWiFiInterface::UpdateSocketStatus()(WiFiInterface.cpp:1722-1732):SetNeedsPolling()causes RRF to poll that specific socket in the next cycleWiFiSocket::Poll()discoversConnState::aborted(WiFiSocket.cpp:242-248) and cleans upWhy the base code was wrong
Without the fix, aborted sockets are invisible in the bitmap. RRF discovers them only through round-robin polling — iterating through all sockets sequentially. This delays cleanup of aborted connections, which hold resources (connection slots, memory) until discovered.
The
connGetStatusSPI command is sent by RRF for each socket individually. The bitmap is a piggyback optimization that tells RRF which sockets need urgent attention. Excludingabortedfrom the bitmap defeats this optimization.Faulty execution path
Terminate(false)→ state =abortedconnGetStatusfor socket 3 → bitmap returnedabortedstate not flagged inotherEndClosedSocketsUpdateSocketStatusdoesn't callSetNeedsPolling()for aborted socketChange 11: Use-After-Free in Listener::Stop()
File:
src/Listener.cpp:130-137Base code:
Fixed code:
Why the base code was wrong
Listener::Stop()is an instance method.connisthis->conn— the listener's ownnetconnpointer. Afternetconn_delete(conn):connis a dangling pointerforloop iterateslisteners[]looking forlistener->conn == connlistener == this,listener->connreadsthis->conn— the same dangling pointerlistener->conn == conncompares two copies of the same dangling pointer valuedelete listenerdestroysthisdelete this, the next loop iteration accessesthis->connagain (UB —thisis freed)Using a stack-local
savedConnpreserves the pointer value beforenetconn_deletefrees the object and beforedelete listenerfreesthis. The comparison uses a stable stack value instead of reading freed memory.Why Stop() is not re-entrant safe without this fix
Stop()is called from:Listener::Start()line 35 (when replacing a listener)Listener::Stop(uint16_t port)static method line 160 (closing all listeners on a port)Listener::ListenerTaskline 238 (after accepting FTP data connection)In the ListenerTask path (Change 12),
listener->Stop()deletes the listener. Any subsequent access to the listener's members throughthisis UB. ThesavedConnpattern ensures the netconn comparison uses a stack variable that survives thedelete.Faulty execution path
listener->Stop()callednetconn_delete(conn)frees the netconnlistener->conn == conn— both are the same dangling pointer, comparison "works" by luckdelete listenerfreesthislisteners[i]access is fine (static array), but if another listener'sconnhappened to be allocated at the same address as the freed netconn → false match → wrong listener deletedChange 12: FTP Data Listener Dangling Pointer
File:
src/Listener.cpp:237Base code:
Fixed code:
Why the base code was wrong
This code is in
ListenerTask, after a connectioncis accepted on an FTP data port.listener->Stop()atListener.cpp:139doesdelete listener, which frees the Listener object. Butc->listenerstill points to the deleted listener — a dangling pointer.When the FTP data connection later calls
Close()orTerminate():Close()at line 244:listener->Notify()→ dereferences freed memoryTerminate()at line 313:listener->Notify()→ dereferences freed memoryWith Change 8 and 9's null guards, setting
c->listener = nullptrbeforeStop()prevents the dangling pointer dereference. The null guard skipsNotify()entirely.Connection lifecycle for FTP data
FTP data connections are one-shot: a listener is created for a specific data transfer, accepts one connection, then the listener is stopped. The connection outlives the listener (it still needs to transfer the data). After the transfer completes, the connection closes — at which point it must not reference the deleted listener.
WiFiInterface::OpenDataPort()(WiFiInterface.cpp:1735-1749) creates a fresh listener for each FTP data transfer, confirming the one-shot pattern.Faulty execution path
cwithc->listener = listenerlistener->Stop()→delete listener→ listener freedc->listeneris now a dangling pointer to freed memoryc->Close()→ reacheslistener->Notify()→ dereferences freed memory → crash or corruptionSummary