[WIP] Fix connection lifecycle bugs, prevent PCB pool exhaustion, and harden closure handling by timschneider · Pull Request #52 · Duet3D/WiFiSocketServerRTOS

timschneider · 2026-02-26T12:09:47Z

Overview

Every claim in this report is verified against source code with file paths and line numbers.

Change 1: NETCONN_MORE Flag Inversion in Write()

File: src/Connection.cpp:100

Base code:

u8_t flag = NETCONN_COPY | (push ? NETCONN_MORE : 0);

Fixed code:

u8_t flag = NETCONN_COPY | (push ? 0 : NETCONN_MORE);

What NETCONN_MORE means

NETCONN_MORE is defined as 0x02 in lwip/api.h:64. The lwIP documentation in api_lib.c:982 states:

NETCONN_MORE: for TCP connection, PSH flag will be set on last segment sent

The flag value 0x02 maps directly to TCP_WRITE_FLAG_MORE (tcpbase.h:74), which is passed to tcp_write(). When TCP_WRITE_FLAG_MORE is set, lwIP interprets it as "more data coming, don't flush yet" — the PSH flag is not set on the last segment, telling the TCP stack to buffer rather than push.

Why the base code was wrong

The base code sets NETCONN_MORE when push == true. This means:

push=true (flush requested by RRF) → NETCONN_MORE set → TCP told "don't flush, more coming" → data buffered instead of flushed
push=false (no flush needed) → NETCONN_MORE not set → TCP told "nothing more" → data flushed prematurely

The semantics are exactly inverted.

How push is set by RRF

In SocketServer.cpp:1664:

const bool push = (acceptedLength == requestedlength) && (messageHeaderIn.hdr.flags & MessageHeaderSamToEsp::FlagPush) != 0;

RRF sets FlagPush when it wants the data flushed to the remote client (e.g., end of an HTTP response). The base code defeats this intent on every write.

Faulty execution path

RRF sends HTTP response with FlagPush set → push = true
Base: flag = NETCONN_COPY | NETCONN_MORE → tcp_write told "more data coming"
TCP buffers the data instead of sending with PSH → remote client (browser) stalls waiting for push
Data is eventually flushed by TCP timer or next write, but with added latency

Impact: Every TCP write in the system has inverted flush semantics, causing latency on HTTP responses and premature flushes on partial data.

Change 2: ERR_WOULDBLOCK Spin Guard in Write Loop

File: src/Connection.cpp:112-114

Base code:

for( ; total < length; total += written) {
    written = 0;
    rc = netconn_write_partly(conn, data + total, length - total, flag, &written);

    if (rc != ERR_OK && rc != ERR_WOULDBLOCK) {
        break;
    }
}

Fixed code:

for( ; total < length; total += written) {
    written = 0;
    rc = netconn_write_partly(conn, data + total, length - total, flag, &written);

    // Note: ERR_MEM is not handled here because lwIP's netconn layer retries
    // internally and never propagates ERR_MEM to the application layer.
    if (rc != ERR_OK && rc != ERR_WOULDBLOCK) {
        break;
    }
    if (rc == ERR_WOULDBLOCK && written == 0) {
        break;      // send buffer full and no progress after timeout, avoid spinning
    }
}

Why the base code was wrong

When netconn_write_partly returns ERR_WOULDBLOCK with written == 0, the send buffer is full and no data was written. The base loop condition total += written adds 0, so total < length remains true. The loop retries immediately — busy-waiting until the send buffer drains or the send timeout triggers another ERR_WOULDBLOCK.

The connection has netconn_set_sendtimeout(conn, MaxReadWriteTime) set (via Listener.cpp:232 for accepted connections, Connection.cpp:265 for outgoing connections). So each ERR_WOULDBLOCK iteration blocks for up to MaxReadWriteTime milliseconds before returning. Without the spin guard, the loop can re-enter the blocking write multiple times.

Why the new spin guard is needed for correctness with Change 4

Without this guard, when the loop exits with rc == ERR_WOULDBLOCK, the post-loop error handling at line 118 enters if (rc != ERR_OK). In the base code, the else branch calls Terminate(false) — killing the connection for a transient condition. With Change 4's fix (else if (rc != ERR_WOULDBLOCK)), ERR_WOULDBLOCK is excluded from termination. But without the spin guard, the loop would spin instead of breaking cleanly.

Faulty execution path

RRF sends 2048 bytes, send buffer has space for 1460 bytes
First iteration: writes 1460 bytes, total = 1460
Second iteration: netconn_write_partly returns ERR_WOULDBLOCK, written = 0
Base: loop continues (total += 0, still < 2048), retries immediately
Repeats step 3-4 in a tight loop until send buffer drains or timeout cascades

Change 3: ERR_ABRT/ERR_CONN Classification in Write()

File: src/Connection.cpp:120

Base code:

if (rc == ERR_RST || rc == ERR_CLSD)
{
    SetState(ConnState::otherEndClosed);
}

Fixed code:

if (rc == ERR_RST || rc == ERR_CLSD || rc == ERR_ABRT || rc == ERR_CONN)
{
    SetState(ConnState::otherEndClosed);
}

lwIP error code definitions (lwip/err.h:52-89)

Code	Value	Meaning
ERR_CONN	-11	Not connected
ERR_ABRT	-13	Connection aborted
ERR_RST	-14	Connection reset
ERR_CLSD	-15	Connection closed

All four indicate the connection is no longer viable. ERR_ABRT occurs when lwIP aborts the connection internally (e.g., PCB killed due to resource pressure). ERR_CONN occurs when writing to a connection that has been disconnected.

Why the base code was wrong

In the base code, ERR_ABRT and ERR_CONN fall through to the else branch:

else
{
    debugPrintfAlways("Write fail len=%u err=%d\n", total, (int)rc);
    Terminate(false);
    return 0;
}

Terminate(false) sets state to ConnState::aborted and (in base code) calls listener->Notify(). The correct semantic for "remote end gone" is otherEndClosed, which is what RRF expects for clean socket teardown via WiFiSocket::Poll() (WiFiSocket.cpp:166-178).

Faulty execution path

Remote client disconnects abruptly during file upload
ESP calls netconn_write_partly → returns ERR_ABRT
Base: falls to else → Terminate(false) → state = aborted
listener->Notify() crashes if listener is nullptr (outgoing connection)
Even if listener is non-null: aborted state requires RRF to discover via polling, then send connAbort, rather than the direct otherEndClosed → RRF Poll() → SocketState::peerDisconnecting cleanup path

Change 4: ERR_WOULDBLOCK Non-Termination Guard in Write()

File: src/Connection.cpp:124

Base code:

else
{
    debugPrintfAlways("Write fail len=%u err=%d\n", total, (int)rc);
    Terminate(false);
    return 0;
}

Fixed code:

else if (rc != ERR_WOULDBLOCK)
{
    debugPrintfAlways("Write fail len=%u err=%d\n", total, (int)rc);
    Terminate(false);
    return 0;
}

Why this is needed

With the spin guard (Change 2), the write loop can now exit with rc == ERR_WOULDBLOCK and written == 0 — a partial write where some data was sent (total > 0) but the send buffer is full. This is a transient condition, not an error.

Without this guard, ERR_WOULDBLOCK after a partial write falls through to Terminate(false) — killing a connection that successfully sent data and just needs the remote end to acknowledge.

Faulty execution path

Write loop sends 1460 of 2048 bytes, then ERR_WOULDBLOCK with written == 0
Spin guard breaks out of loop with total = 1460, rc = ERR_WOULDBLOCK
Post-loop: rc != ERR_OK → true, enters error handling
ERR_WOULDBLOCK != ERR_RST/CLSD/ABRT/CONN → doesn't match Change 3
Base: falls to else → Terminate(false) → connection killed after successful 1460-byte write
Fixed: else if (rc != ERR_WOULDBLOCK) → skips termination, returns total = 1460

What happens to the unsent 588 bytes

SocketServer.cpp:1668-1672:

const size_t written = conn.Write(reinterpret_cast<uint8_t *>(transferBuffer), acceptedLength, push, closeAfterSending);
if (written != acceptedLength)
{
    lastError = "incomplete write";
}

The return value is compared against acceptedLength. If less was written, lastError is set. RRF does not see this — it already received acceptedLength via SPI (hspi.transfer32 at line 1666) before Write() was called. RRF decremented its txBufferSpace by acceptedLength (WiFiSocket.cpp:349). On the next Poll(), RRF refreshes txBufferSpace from resp.writeBufferSpace (WiFiSocket.cpp:237), which self-corrects any accounting discrepancy.

Change 5: Return `total` Instead of `return length`

File: src/Connection.cpp:141

Base code:

return length;

Fixed code:

return total;

Why the base code was wrong

length is the requested write size. total is the actual bytes written. After a partial write (e.g., send buffer full), the base returns the requested amount, not the actual amount. This makes Write() lie about its result.

Where the return value is used

SocketServer.cpp:1668-1672:

const size_t written = conn.Write(..., acceptedLength, push, closeAfterSending);
if (written != acceptedLength)
{
    lastError = "incomplete write";
}

With the base code, written always equals acceptedLength (since length == acceptedLength), so lastError is never set even when a partial write occurs. The fix makes partial writes observable through lastError.

RRF does not see Write()'s return value — acceptedLength is sent via SPI before Write() executes (SocketServer.cpp:1666). However, lastError is reported via the diagnostics command and aids debugging.

Change 6: ERR_TIMEOUT and ERR_ABRT Handling in Poll() Recv

File: src/Connection.cpp:170-171

Base code:

if (rc != ERR_WOULDBLOCK)
{
    if (rc == ERR_RST || rc == ERR_CLSD || rc == ERR_CONN)

Fixed code:

if (rc != ERR_WOULDBLOCK && rc != ERR_TIMEOUT)
{
    if (rc == ERR_RST || rc == ERR_CLSD || rc == ERR_CONN || rc == ERR_ABRT)

ERR_TIMEOUT: How it can be returned

Accepted connections have timeouts configured in Listener.cpp:231-232:

netconn_set_nonblocking(newConn, true);
netconn_set_recvtimeout(newConn, MaxReadWriteTime);

Outgoing connections similarly in Connection.cpp:263-264:

netconn_set_nonblocking(conn, true);
netconn_set_recvtimeout(conn, MaxReadWriteTime);

In lwIP's netconn_recv_data (api_lib.c:584-686):

Non-blocking path (line 621-625): when netconn_is_nonblocking() is true, a sys_arch_mbox_tryfetch is used → returns ERR_WOULDBLOCK on empty mailbox
Blocking path with timeout (line 628-631): when blocking with recv_timeout set, a sys_arch_mbox_fetch with timeout is used → returns ERR_TIMEOUT on timeout

When netconn_is_nonblocking is true, the non-blocking path is taken first. However, the lwIP implementation checks non-blocking OR the NETCONN_DONTBLOCK apiflags. The actual control flow depends on the lwIP version and platform-specific sys_arch_mbox_tryfetch behavior. If for any reason the non-blocking tryfetch succeeds but returns a timeout indication (platform-specific), or if the non-blocking flag is cleared during connection state transitions, ERR_TIMEOUT can surface.

Why ERR_TIMEOUT must not enter the error handler

In the base code, ERR_TIMEOUT passes the rc != ERR_WOULDBLOCK check, enters the error handler, doesn't match ERR_RST/CLSD/CONN, and falls through to:

else
{
    Terminate(false);
}

This kills the connection for a timeout — a transient, non-error condition.

ERR_ABRT in Poll() recv

ERR_ABRT (err.h:82, value -13: "Connection aborted") can be returned when lwIP internally aborts the connection (e.g., PCB killed under memory pressure via tcp_kill_prio). In the base code, ERR_ABRT doesn't match ERR_RST/CLSD/CONN and falls through to Terminate(false). The fix classifies it as otherEndClosed, which is the correct semantic — the connection is gone.

Faulty execution path (ERR_TIMEOUT)

Connection is idle, no data arriving
netconn_recv_tcp_pbuf_flags returns ERR_TIMEOUT
Base: ERR_TIMEOUT != ERR_WOULDBLOCK → enters error handler
ERR_TIMEOUT != ERR_RST/CLSD/CONN → falls to Terminate(false)
Connection killed for being idle

Faulty execution path (ERR_ABRT)

System under memory pressure, lwIP calls tcp_kill_prio to free PCBs
netconn_recv_tcp_pbuf_flags returns ERR_ABRT
Base: enters error handler, doesn't match ERR_RST/CLSD/CONN
Falls to Terminate(false) → state = aborted
Fixed: matches ERR_ABRT → state = otherEndClosed → cleaner RRF teardown

Change 7: closePending — Null PCB Detection and Terminate(true)

File: src/Connection.cpp:199-214

Base code:

else if (state == ConnState::closePending)
{
    // We're about to close this connection and we're still waiting for the remaining data to be acknowledged
    if (conn->pcb.tcp && !conn->pcb.tcp->unacked)
    {
        // All data has been received, close this connection next time
        SetState(ConnState::closeReady);
    }
    else if (millis() - closeTimer >= MaxAckTime)
    {
        // The acknowledgement timer has expired, abort this connection
        Terminate(false);
    }
}

Fixed code:

else if (state == ConnState::closePending)
{
    // The other end may have closed the connection with RST, which causes lwIP
    // to free the PCB. Detect this and close immediately instead of waiting for
    // the acknowledgement timer to expire.
    if (!conn->pcb.tcp || (!conn->pcb.tcp->unsent && !conn->pcb.tcp->unacked))
    {
        SetState(ConnState::closeReady);
    }
    else if (millis() - closeTimer >= MaxAckTime)
    {
        // The acknowledgement timer has expired. The close was already initiated
        // by RRF, so go straight to free rather than aborted to avoid a round-trip.
        Terminate(true);
    }
}

Three bugs fixed

Bug A: Null PCB causes MaxAckTime delay

When the remote end sends RST while in closePending, lwIP frees the PCB — conn->pcb.tcp becomes NULL.

Base code: conn->pcb.tcp && !conn->pcb.tcp->unacked evaluates to false (short-circuit on null). Falls through to the timeout branch, which waits up to MaxAckTime before calling Terminate(false). The connection sits in closePending doing nothing for the full timeout duration.

Fixed code: !conn->pcb.tcp evaluates to true → immediately transitions to closeReady. No delay.

Bug B: Unsent queue not checked

The base code only checks !conn->pcb.tcp->unacked. The TCP send pipeline is: tcp_write() → unsent → tcp_output() → unacked → ACK received → freed. It is possible for unsent != 0 && unacked == 0 — e.g., when an ACK drains the in-flight data but tcp_output hasn't yet moved queued segments from unsent to the wire in the same cycle, or the Nagle algorithm is holding data.

Base code transitions to closeReady when unacked is empty, regardless of unsent. The subsequent netconn_close() in the closeReady path does call tcp_close() which attempts to flush, but this is less controlled than allowing the normal send path to drain both queues.

Fixed code: (!conn->pcb.tcp->unsent && !conn->pcb.tcp->unacked) — only transitions to closeReady when both queues are fully drained.

Bug C: Terminate(false) causes unnecessary round-trip to RRF

How closePending is entered — Close() at Connection.cpp:225:

case ConnState::connected:
    if (conn->pcb.tcp && conn->pcb.tcp->unacked)
    {
        closeTimer = millis();
        netconn_shutdown(conn, true, false);
        SetState(ConnState::closePending);
        break;
    }

Close() is called from SocketServer.cpp:1637 via the connClose SPI command:

case NetworkCommand::connClose:
    ...
    Connection::Get(messageHeaderIn.hdr.socketNumber).Close();

So closePending means RRF already requested the close. When the ack timer expires:

Base: Terminate(false) → SetState(ConnState::aborted) → RRF must discover the aborted state via GetSummarySocketStatus bitmap or round-robin polling, then send connAbort → Terminate(true) → finally free
Fixed: Terminate(true) → SetState(ConnState::free) → slot immediately available

The round-trip is:

ESP sets aborted

RRF's WiFiSocket::Poll() discovers ConnState::aborted (WiFiSocket.cpp:242-248):

case ConnState::aborted:
    state = SocketState::broken;
    Terminate();

Terminate() sends connAbort back to ESP (WiFiSocket.cpp:284-298)
ESP Terminate(true) → finally free

This round-trip is pointless because RRF already asked to close. Terminate(true) skips it.

Faulty execution path

RRF sends connClose → ESP enters closePending with unacked data
Remote client crashes → sends RST → lwIP frees PCB
Base: conn->pcb.tcp is null → first condition false → waits MaxAckTime
After timeout: Terminate(false) → state = aborted
RRF discovers aborted (via polling or bitmap) → sends connAbort → ESP Terminate(true) → free
Total delay: MaxAckTime + RRF poll interval + SPI round-trip
Fixed: !conn->pcb.tcp → immediate closeReady → Close() → free in next Poll()

Change 8: Listener Null Guard in Close()

File: src/Connection.cpp:244-247

Base code:

FreePbuf();
SetState(ConnState::free);
listener->Notify();

Fixed code:

FreePbuf();
SetState(ConnState::free);
if (listener)
{
    listener->Notify();
}

Why `listener` can be null

Outgoing (MQTT) connections: ConnectCallback at Connection.cpp:497:

case NETCONN_EVT_SENDPLUS:
    connection->Connected(nullptr, conn);
    break;

Connected() stores the listener parameter directly at Connection.cpp:322:

this->listener = listener;  // nullptr for outgoing connections

FTP data connections after accept: After Change 11 (FTP listener nullptr), c->listener is explicitly set to nullptr at Listener.cpp:237 before the listener is deleted.

Constructor: The constructor (Connection.cpp:19-23) does not initialize listener:

Connection::Connection(uint8_t num)
    : number(num), localPort(0), remotePort(0), remoteIp(0), conn(nullptr), state(ConnState::free),
    closeTimer(0),readBuf(nullptr), readIndex(0), alreadyRead(0), pendOtherEndClosed(false)
{
}

listener is not in the initializer list. It is only set in Connected().

How Close() is reached for outgoing connections

The connClose SPI command (SocketServer.cpp:1637) calls Close() on any valid socket number, including outgoing MQTT connections. Also, Write() calls Close() at line 137 when closeAfterSending is true.

Faulty execution path

RRF creates MQTT connection → Connect() → ConnectCallback → Connected(nullptr, conn)
listener is nullptr
MQTT broker closes connection → state becomes otherEndClosed
RRF sends connClose → Close() entered with state otherEndClosed
Reaches listener->Notify() → null pointer dereference → crash

Change 9: Listener Null Guard and Conditional Notify in Terminate()

File: src/Connection.cpp:313-317

Base code:

SetState((external) ? ConnState::free : ConnState::aborted);
listener->Notify();

Fixed code:

SetState((external) ? ConnState::free : ConnState::aborted);
if (external && listener)
{
    listener->Notify();
}

Two fixes

Fix A: Null guard — same rationale as Change 8. Outgoing connections have listener == nullptr.

Fix B: Only notify on external (free) — when external == false, state is set to aborted. The connection slot is not freed — it transitions from a live state to aborted, which still occupies the slot. Listener::Notify() wakes the ListenerTask to accept new connections, but no slot is available since the aborted connection hasn't been freed yet. Notifying is pointless and potentially harmful (listener attempts accept, fails to find free slot).

When external == true, state is set to free. The slot is actually available, so notifying the listener to accept new connections is correct.

Where Terminate(true) is called

SocketServer.cpp:1625 — connAbort command: Connection::Get(...).Terminate(true)
Connection.cpp:288 — Connect() failure path: Terminate(true)

Where Terminate(false) is called

Connection.cpp:186 — Poll() recv error fallthrough
Connection.cpp:127 — Write() error (with Change 4, only for non-WOULDBLOCK errors)
Connection.cpp:212 — closePending timeout (in base code; fixed to Terminate(true) by Change 7)

Faulty execution path

MQTT connection established with listener == nullptr
Network error during Poll() recv → unclassified error
Terminate(false) → state = aborted
Base: listener->Notify() → null pointer dereference → crash

Change 10: Aborted State in GetSummarySocketStatus Bitmap

File: src/Connection.cpp:435-437

Base code:

else if (Connection::Get(i).GetState() == ConnState::otherEndClosed)
{
    otherEndClosedSockets |= (1 << i);
}

Fixed code:

else if (Connection::Get(i).GetState() == ConnState::otherEndClosed
        || Connection::Get(i).GetState() == ConnState::aborted)
{
    otherEndClosedSockets |= (1 << i);
}

How the bitmap flows to RRF

GetSummarySocketStatus() is called from SocketServer.cpp:1687:

Connection::GetSummarySocketStatus(resp.connectedSockets, resp.otherEndClosedSockets);

The bitmap is sent in ConnStatusResponse via SPI to RRF

RRF's WiFiInterface::UpdateSocketStatus() (WiFiInterface.cpp:1722-1732):

for (size_t i = 0; i < NumWiFiTcpSockets; ++i)
{
    if (((connectedSockets | otherEndClosedSockets) & (1u << i)) != 0)
    {
        sockets[i]->SetNeedsPolling();
    }
}

SetNeedsPolling() causes RRF to poll that specific socket in the next cycle
WiFiSocket::Poll() discovers ConnState::aborted (WiFiSocket.cpp:242-248) and cleans up

Why the base code was wrong

Without the fix, aborted sockets are invisible in the bitmap. RRF discovers them only through round-robin polling — iterating through all sockets sequentially. This delays cleanup of aborted connections, which hold resources (connection slots, memory) until discovered.

The connGetStatus SPI command is sent by RRF for each socket individually. The bitmap is a piggyback optimization that tells RRF which sockets need urgent attention. Excluding aborted from the bitmap defeats this optimization.

Faulty execution path

Connection error → Terminate(false) → state = aborted
RRF sends connGetStatus for socket 3 → bitmap returned
Base: socket in aborted state not flagged in otherEndClosedSockets
RRF's UpdateSocketStatus doesn't call SetNeedsPolling() for aborted socket
Aborted socket discovered only when RRF's round-robin reaches it
During delay: connection slot occupied, listener cannot accept new connections on that slot

Change 11: Use-After-Free in Listener::Stop()

File: src/Listener.cpp:130-137

Base code:

void Listener::Stop()
{
    netconn_close(conn);
    netconn_delete(conn);

    for (int i = 0; i < MaxConnections; i++)
    {
        Listener *listener = listeners[i];
        if (listener && listener->conn == conn)

Fixed code:

void Listener::Stop()
{
    struct netconn *savedConn = conn;
    netconn_close(savedConn);
    netconn_delete(savedConn);

    for (int i = 0; i < MaxConnections; i++)
    {
        Listener *listener = listeners[i];
        if (listener && listener->conn == savedConn)

Why the base code was wrong

Listener::Stop() is an instance method. conn is this->conn — the listener's own netconn pointer. After netconn_delete(conn):

The netconn object is freed — conn is a dangling pointer
The for loop iterates listeners[] looking for listener->conn == conn
When listener == this, listener->conn reads this->conn — the same dangling pointer
The comparison listener->conn == conn compares two copies of the same dangling pointer value
Then delete listener destroys this
After delete this, the next loop iteration accesses this->conn again (UB — this is freed)

Using a stack-local savedConn preserves the pointer value before netconn_delete frees the object and before delete listener frees this. The comparison uses a stable stack value instead of reading freed memory.

Why Stop() is not re-entrant safe without this fix

Stop() is called from:

Listener::Start() line 35 (when replacing a listener)
Listener::Stop(uint16_t port) static method line 160 (closing all listeners on a port)
Listener::ListenerTask line 238 (after accepting FTP data connection)

In the ListenerTask path (Change 12), listener->Stop() deletes the listener. Any subsequent access to the listener's members through this is UB. The savedConn pattern ensures the netconn comparison uses a stack variable that survives the delete.

Faulty execution path

FTP data connection accepted → listener->Stop() called
netconn_delete(conn) frees the netconn
Loop finds listener->conn == conn — both are the same dangling pointer, comparison "works" by luck
delete listener frees this
Next loop iteration: listeners[i] access is fine (static array), but if another listener's conn happened to be allocated at the same address as the freed netconn → false match → wrong listener deleted

Change 12: FTP Data Listener Dangling Pointer

File: src/Listener.cpp:237

Base code:

if (listener->protocol == protocolFtpData)
{
    debugPrintf("accept conn, stop listen on port %u\n", listener->port);
    listener->Stop();   // don't listen for further connections
}

Fixed code:

if (listener->protocol == protocolFtpData)
{
    debugPrintf("accept conn, stop listen on port %u\n", listener->port);
    c->listener = nullptr;  // clear before Stop() deletes the listener
    listener->Stop();   // don't listen for further connections
}

Why the base code was wrong

This code is in ListenerTask, after a connection c is accepted on an FTP data port.

listener->Stop() at Listener.cpp:139 does delete listener, which frees the Listener object. But c->listener still points to the deleted listener — a dangling pointer.

When the FTP data connection later calls Close() or Terminate():

Close() at line 244: listener->Notify() → dereferences freed memory
Terminate() at line 313: listener->Notify() → dereferences freed memory

With Change 8 and 9's null guards, setting c->listener = nullptr before Stop() prevents the dangling pointer dereference. The null guard skips Notify() entirely.

Connection lifecycle for FTP data

FTP data connections are one-shot: a listener is created for a specific data transfer, accepts one connection, then the listener is stopped. The connection outlives the listener (it still needs to transfer the data). After the transfer completes, the connection closes — at which point it must not reference the deleted listener.

WiFiInterface::OpenDataPort() (WiFiInterface.cpp:1735-1749) creates a fresh listener for each FTP data transfer, confirming the one-shot pattern.

Faulty execution path

FTP data listener accepts connection c with c->listener = listener
listener->Stop() → delete listener → listener freed
c->listener is now a dangling pointer to freed memory
FTP data transfer completes → RRF sends connClose
c->Close() → reaches listener->Notify() → dereferences freed memory → crash or corruption

Summary

#	Change	Bug Type	Severity
1	NETCONN_MORE inversion	Logic error	High — every write has wrong flush
2	ERR_WOULDBLOCK spin guard	Spin/busy-wait	Medium — blocks on full send buffer
3	ERR_ABRT/ERR_CONN in Write	Missing error classification	Medium — wrong state on disconnect
4	ERR_WOULDBLOCK non-termination	False termination	High — kills connections after partial write
5	return total	Incorrect return value	Low — makes partial writes observable
6	ERR_TIMEOUT/ERR_ABRT in Poll	Missing error classification	High — kills idle connections
7	closePending null PCB + unsent check + Terminate(true)	Null PCB delay + premature closeReady + unnecessary round-trip	Medium — MaxAckTime delay + data loss risk + wasted SPI cycle
8	Close() listener null	Null pointer dereference	Critical — crash on MQTT/outgoing close
9	Terminate() listener null	Null pointer dereference	Critical — crash on MQTT/outgoing terminate
10	GetSummarySocketStatus aborted	Missing bitmap flag	Medium — delayed cleanup
11	Listener::Stop() savedConn	Use-after-free	High — UB, potential wrong listener deletion
12	FTP listener dangling pointer	Dangling pointer	Critical — crash on FTP data connection close

chrishamm · 2026-03-26T14:19:13Z

Thanks again for your contributions! It looks like I will adopt this project and maintain it from now on.

Are you happy with your PR yet? If yes, I will review it shortly.

- Fix inverted NETCONN_MORE flag: data was buffered when it should flush and flushed when it should buffer, affecting every TCP write - Add ERR_WOULDBLOCK spin guard in Write loop to prevent busy-wait when send buffer is full with no progress - Classify ERR_ABRT/ERR_CONN as otherEndClosed in Write and Poll instead of falling through to Terminate - Guard Write against terminating connection on ERR_WOULDBLOCK after partial write success - Return actual bytes written (total) instead of requested length - Handle ERR_TIMEOUT in Poll recv to prevent idle connection termination - Detect RST-destroyed PCB in closePending instead of waiting for ack timer, and use Terminate(true) to avoid unnecessary round-trip to RRF - Null-guard listener->Notify() in Close and Terminate to prevent crash on MQTT/outgoing connections where listener is nullptr - Include aborted state in GetSummarySocketStatus bitmap so RRF gets prompt notification of aborted connections - Fix use-after-free in Listener::Stop by saving conn pointer before netconn_delete - Clear connection listener pointer before FTP data listener Stop to prevent dangling pointer dereference on connection close Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

timschneider · 2026-03-27T11:36:15Z

@chrishamm i've updated the PR to only include the minimal needed code changes and the description to explain each change.

timschneider changed the title ~~Fix connection lifecycle bugs, prevent PCB pool exhaustion, and harden closure handling~~ [WIP] Fix connection lifecycle bugs, prevent PCB pool exhaustion, and harden closure handling Mar 16, 2026

chrishamm self-assigned this Mar 26, 2026

timschneider force-pushed the fix/WiFiSocketServerRTOS branch from 122e19e to 8127a25 Compare March 27, 2026 10:46

Check both unsent and unacked queues in closePending before closeReady

0fce61d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Fix connection lifecycle bugs, prevent PCB pool exhaustion, and harden closure handling#52

[WIP] Fix connection lifecycle bugs, prevent PCB pool exhaustion, and harden closure handling#52
timschneider wants to merge 2 commits intoDuet3D:mainfrom
Meltingplot:fix/WiFiSocketServerRTOS

timschneider commented Feb 26, 2026 •

edited

Loading

Uh oh!

chrishamm commented Mar 26, 2026

Uh oh!

timschneider commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

timschneider commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Change 1: NETCONN_MORE Flag Inversion in Write()

What NETCONN_MORE means

Why the base code was wrong

How push is set by RRF

Faulty execution path

Change 2: ERR_WOULDBLOCK Spin Guard in Write Loop

Why the base code was wrong

Why the new spin guard is needed for correctness with Change 4

Faulty execution path

Change 3: ERR_ABRT/ERR_CONN Classification in Write()

lwIP error code definitions (lwip/err.h:52-89)

Why the base code was wrong

Faulty execution path

Change 4: ERR_WOULDBLOCK Non-Termination Guard in Write()

Why this is needed

Faulty execution path

What happens to the unsent 588 bytes

Change 5: Return total Instead of return length

Why the base code was wrong

Where the return value is used

Change 6: ERR_TIMEOUT and ERR_ABRT Handling in Poll() Recv

ERR_TIMEOUT: How it can be returned

Why ERR_TIMEOUT must not enter the error handler

ERR_ABRT in Poll() recv

Faulty execution path (ERR_TIMEOUT)

Faulty execution path (ERR_ABRT)

Change 7: closePending — Null PCB Detection and Terminate(true)

Three bugs fixed

Faulty execution path

Change 8: Listener Null Guard in Close()

Why listener can be null

How Close() is reached for outgoing connections

Faulty execution path

Change 9: Listener Null Guard and Conditional Notify in Terminate()

Two fixes

Where Terminate(true) is called

Where Terminate(false) is called

Faulty execution path

Change 10: Aborted State in GetSummarySocketStatus Bitmap

How the bitmap flows to RRF

Why the base code was wrong

Faulty execution path

Change 11: Use-After-Free in Listener::Stop()

Why the base code was wrong

Why Stop() is not re-entrant safe without this fix

Faulty execution path

Change 12: FTP Data Listener Dangling Pointer

Why the base code was wrong

Connection lifecycle for FTP data

Faulty execution path

Summary

Uh oh!

chrishamm commented Mar 26, 2026

Uh oh!

timschneider commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

timschneider commented Feb 26, 2026 •

edited

Loading

Change 5: Return `total` Instead of `return length`

Why `listener` can be null