Skip to content

[WIP] Fix connection lifecycle bugs, prevent PCB pool exhaustion, and harden closure handling#52

Open
timschneider wants to merge 2 commits intoDuet3D:mainfrom
Meltingplot:fix/WiFiSocketServerRTOS
Open

[WIP] Fix connection lifecycle bugs, prevent PCB pool exhaustion, and harden closure handling#52
timschneider wants to merge 2 commits intoDuet3D:mainfrom
Meltingplot:fix/WiFiSocketServerRTOS

Conversation

@timschneider
Copy link
Copy Markdown

@timschneider timschneider commented Feb 26, 2026

Overview

Every claim in this report is verified against source code with file paths and line numbers.


Change 1: NETCONN_MORE Flag Inversion in Write()

File: src/Connection.cpp:100

Base code:

u8_t flag = NETCONN_COPY | (push ? NETCONN_MORE : 0);

Fixed code:

u8_t flag = NETCONN_COPY | (push ? 0 : NETCONN_MORE);

What NETCONN_MORE means

NETCONN_MORE is defined as 0x02 in lwip/api.h:64. The lwIP documentation in api_lib.c:982 states:

NETCONN_MORE: for TCP connection, PSH flag will be set on last segment sent

The flag value 0x02 maps directly to TCP_WRITE_FLAG_MORE (tcpbase.h:74), which is passed to tcp_write(). When TCP_WRITE_FLAG_MORE is set, lwIP interprets it as "more data coming, don't flush yet" — the PSH flag is not set on the last segment, telling the TCP stack to buffer rather than push.

Why the base code was wrong

The base code sets NETCONN_MORE when push == true. This means:

  • push=true (flush requested by RRF) → NETCONN_MORE set → TCP told "don't flush, more coming" → data buffered instead of flushed
  • push=false (no flush needed) → NETCONN_MORE not set → TCP told "nothing more" → data flushed prematurely

The semantics are exactly inverted.

How push is set by RRF

In SocketServer.cpp:1664:

const bool push = (acceptedLength == requestedlength) && (messageHeaderIn.hdr.flags & MessageHeaderSamToEsp::FlagPush) != 0;

RRF sets FlagPush when it wants the data flushed to the remote client (e.g., end of an HTTP response). The base code defeats this intent on every write.

Faulty execution path

  1. RRF sends HTTP response with FlagPush set → push = true
  2. Base: flag = NETCONN_COPY | NETCONN_MORE → tcp_write told "more data coming"
  3. TCP buffers the data instead of sending with PSH → remote client (browser) stalls waiting for push
  4. Data is eventually flushed by TCP timer or next write, but with added latency

Impact: Every TCP write in the system has inverted flush semantics, causing latency on HTTP responses and premature flushes on partial data.


Change 2: ERR_WOULDBLOCK Spin Guard in Write Loop

File: src/Connection.cpp:112-114

Base code:

for( ; total < length; total += written) {
    written = 0;
    rc = netconn_write_partly(conn, data + total, length - total, flag, &written);

    if (rc != ERR_OK && rc != ERR_WOULDBLOCK) {
        break;
    }
}

Fixed code:

for( ; total < length; total += written) {
    written = 0;
    rc = netconn_write_partly(conn, data + total, length - total, flag, &written);

    // Note: ERR_MEM is not handled here because lwIP's netconn layer retries
    // internally and never propagates ERR_MEM to the application layer.
    if (rc != ERR_OK && rc != ERR_WOULDBLOCK) {
        break;
    }
    if (rc == ERR_WOULDBLOCK && written == 0) {
        break;      // send buffer full and no progress after timeout, avoid spinning
    }
}

Why the base code was wrong

When netconn_write_partly returns ERR_WOULDBLOCK with written == 0, the send buffer is full and no data was written. The base loop condition total += written adds 0, so total < length remains true. The loop retries immediately — busy-waiting until the send buffer drains or the send timeout triggers another ERR_WOULDBLOCK.

The connection has netconn_set_sendtimeout(conn, MaxReadWriteTime) set (via Listener.cpp:232 for accepted connections, Connection.cpp:265 for outgoing connections). So each ERR_WOULDBLOCK iteration blocks for up to MaxReadWriteTime milliseconds before returning. Without the spin guard, the loop can re-enter the blocking write multiple times.

Why the new spin guard is needed for correctness with Change 4

Without this guard, when the loop exits with rc == ERR_WOULDBLOCK, the post-loop error handling at line 118 enters if (rc != ERR_OK). In the base code, the else branch calls Terminate(false) — killing the connection for a transient condition. With Change 4's fix (else if (rc != ERR_WOULDBLOCK)), ERR_WOULDBLOCK is excluded from termination. But without the spin guard, the loop would spin instead of breaking cleanly.

Faulty execution path

  1. RRF sends 2048 bytes, send buffer has space for 1460 bytes
  2. First iteration: writes 1460 bytes, total = 1460
  3. Second iteration: netconn_write_partly returns ERR_WOULDBLOCK, written = 0
  4. Base: loop continues (total += 0, still < 2048), retries immediately
  5. Repeats step 3-4 in a tight loop until send buffer drains or timeout cascades

Change 3: ERR_ABRT/ERR_CONN Classification in Write()

File: src/Connection.cpp:120

Base code:

if (rc == ERR_RST || rc == ERR_CLSD)
{
    SetState(ConnState::otherEndClosed);
}

Fixed code:

if (rc == ERR_RST || rc == ERR_CLSD || rc == ERR_ABRT || rc == ERR_CONN)
{
    SetState(ConnState::otherEndClosed);
}

lwIP error code definitions (lwip/err.h:52-89)

Code Value Meaning
ERR_CONN -11 Not connected
ERR_ABRT -13 Connection aborted
ERR_RST -14 Connection reset
ERR_CLSD -15 Connection closed

All four indicate the connection is no longer viable. ERR_ABRT occurs when lwIP aborts the connection internally (e.g., PCB killed due to resource pressure). ERR_CONN occurs when writing to a connection that has been disconnected.

Why the base code was wrong

In the base code, ERR_ABRT and ERR_CONN fall through to the else branch:

else
{
    debugPrintfAlways("Write fail len=%u err=%d\n", total, (int)rc);
    Terminate(false);
    return 0;
}

Terminate(false) sets state to ConnState::aborted and (in base code) calls listener->Notify(). The correct semantic for "remote end gone" is otherEndClosed, which is what RRF expects for clean socket teardown via WiFiSocket::Poll() (WiFiSocket.cpp:166-178).

Faulty execution path

  1. Remote client disconnects abruptly during file upload
  2. ESP calls netconn_write_partly → returns ERR_ABRT
  3. Base: falls to elseTerminate(false) → state = aborted
  4. listener->Notify() crashes if listener is nullptr (outgoing connection)
  5. Even if listener is non-null: aborted state requires RRF to discover via polling, then send connAbort, rather than the direct otherEndClosed → RRF Poll()SocketState::peerDisconnecting cleanup path

Change 4: ERR_WOULDBLOCK Non-Termination Guard in Write()

File: src/Connection.cpp:124

Base code:

else
{
    debugPrintfAlways("Write fail len=%u err=%d\n", total, (int)rc);
    Terminate(false);
    return 0;
}

Fixed code:

else if (rc != ERR_WOULDBLOCK)
{
    debugPrintfAlways("Write fail len=%u err=%d\n", total, (int)rc);
    Terminate(false);
    return 0;
}

Why this is needed

With the spin guard (Change 2), the write loop can now exit with rc == ERR_WOULDBLOCK and written == 0 — a partial write where some data was sent (total > 0) but the send buffer is full. This is a transient condition, not an error.

Without this guard, ERR_WOULDBLOCK after a partial write falls through to Terminate(false) — killing a connection that successfully sent data and just needs the remote end to acknowledge.

Faulty execution path

  1. Write loop sends 1460 of 2048 bytes, then ERR_WOULDBLOCK with written == 0
  2. Spin guard breaks out of loop with total = 1460, rc = ERR_WOULDBLOCK
  3. Post-loop: rc != ERR_OK → true, enters error handling
  4. ERR_WOULDBLOCK != ERR_RST/CLSD/ABRT/CONN → doesn't match Change 3
  5. Base: falls to elseTerminate(false) → connection killed after successful 1460-byte write
  6. Fixed: else if (rc != ERR_WOULDBLOCK) → skips termination, returns total = 1460

What happens to the unsent 588 bytes

SocketServer.cpp:1668-1672:

const size_t written = conn.Write(reinterpret_cast<uint8_t *>(transferBuffer), acceptedLength, push, closeAfterSending);
if (written != acceptedLength)
{
    lastError = "incomplete write";
}

The return value is compared against acceptedLength. If less was written, lastError is set. RRF does not see this — it already received acceptedLength via SPI (hspi.transfer32 at line 1666) before Write() was called. RRF decremented its txBufferSpace by acceptedLength (WiFiSocket.cpp:349). On the next Poll(), RRF refreshes txBufferSpace from resp.writeBufferSpace (WiFiSocket.cpp:237), which self-corrects any accounting discrepancy.


Change 5: Return total Instead of return length

File: src/Connection.cpp:141

Base code:

return length;

Fixed code:

return total;

Why the base code was wrong

length is the requested write size. total is the actual bytes written. After a partial write (e.g., send buffer full), the base returns the requested amount, not the actual amount. This makes Write() lie about its result.

Where the return value is used

SocketServer.cpp:1668-1672:

const size_t written = conn.Write(..., acceptedLength, push, closeAfterSending);
if (written != acceptedLength)
{
    lastError = "incomplete write";
}

With the base code, written always equals acceptedLength (since length == acceptedLength), so lastError is never set even when a partial write occurs. The fix makes partial writes observable through lastError.

RRF does not see Write()'s return value — acceptedLength is sent via SPI before Write() executes (SocketServer.cpp:1666). However, lastError is reported via the diagnostics command and aids debugging.


Change 6: ERR_TIMEOUT and ERR_ABRT Handling in Poll() Recv

File: src/Connection.cpp:170-171

Base code:

if (rc != ERR_WOULDBLOCK)
{
    if (rc == ERR_RST || rc == ERR_CLSD || rc == ERR_CONN)

Fixed code:

if (rc != ERR_WOULDBLOCK && rc != ERR_TIMEOUT)
{
    if (rc == ERR_RST || rc == ERR_CLSD || rc == ERR_CONN || rc == ERR_ABRT)

ERR_TIMEOUT: How it can be returned

Accepted connections have timeouts configured in Listener.cpp:231-232:

netconn_set_nonblocking(newConn, true);
netconn_set_recvtimeout(newConn, MaxReadWriteTime);

Outgoing connections similarly in Connection.cpp:263-264:

netconn_set_nonblocking(conn, true);
netconn_set_recvtimeout(conn, MaxReadWriteTime);

In lwIP's netconn_recv_data (api_lib.c:584-686):

  • Non-blocking path (line 621-625): when netconn_is_nonblocking() is true, a sys_arch_mbox_tryfetch is used → returns ERR_WOULDBLOCK on empty mailbox
  • Blocking path with timeout (line 628-631): when blocking with recv_timeout set, a sys_arch_mbox_fetch with timeout is used → returns ERR_TIMEOUT on timeout

When netconn_is_nonblocking is true, the non-blocking path is taken first. However, the lwIP implementation checks non-blocking OR the NETCONN_DONTBLOCK apiflags. The actual control flow depends on the lwIP version and platform-specific sys_arch_mbox_tryfetch behavior. If for any reason the non-blocking tryfetch succeeds but returns a timeout indication (platform-specific), or if the non-blocking flag is cleared during connection state transitions, ERR_TIMEOUT can surface.

Why ERR_TIMEOUT must not enter the error handler

In the base code, ERR_TIMEOUT passes the rc != ERR_WOULDBLOCK check, enters the error handler, doesn't match ERR_RST/CLSD/CONN, and falls through to:

else
{
    Terminate(false);
}

This kills the connection for a timeout — a transient, non-error condition.

ERR_ABRT in Poll() recv

ERR_ABRT (err.h:82, value -13: "Connection aborted") can be returned when lwIP internally aborts the connection (e.g., PCB killed under memory pressure via tcp_kill_prio). In the base code, ERR_ABRT doesn't match ERR_RST/CLSD/CONN and falls through to Terminate(false). The fix classifies it as otherEndClosed, which is the correct semantic — the connection is gone.

Faulty execution path (ERR_TIMEOUT)

  1. Connection is idle, no data arriving
  2. netconn_recv_tcp_pbuf_flags returns ERR_TIMEOUT
  3. Base: ERR_TIMEOUT != ERR_WOULDBLOCK → enters error handler
  4. ERR_TIMEOUT != ERR_RST/CLSD/CONN → falls to Terminate(false)
  5. Connection killed for being idle

Faulty execution path (ERR_ABRT)

  1. System under memory pressure, lwIP calls tcp_kill_prio to free PCBs
  2. netconn_recv_tcp_pbuf_flags returns ERR_ABRT
  3. Base: enters error handler, doesn't match ERR_RST/CLSD/CONN
  4. Falls to Terminate(false) → state = aborted
  5. Fixed: matches ERR_ABRT → state = otherEndClosed → cleaner RRF teardown

Change 7: closePending — Null PCB Detection and Terminate(true)

File: src/Connection.cpp:199-214

Base code:

else if (state == ConnState::closePending)
{
    // We're about to close this connection and we're still waiting for the remaining data to be acknowledged
    if (conn->pcb.tcp && !conn->pcb.tcp->unacked)
    {
        // All data has been received, close this connection next time
        SetState(ConnState::closeReady);
    }
    else if (millis() - closeTimer >= MaxAckTime)
    {
        // The acknowledgement timer has expired, abort this connection
        Terminate(false);
    }
}

Fixed code:

else if (state == ConnState::closePending)
{
    // The other end may have closed the connection with RST, which causes lwIP
    // to free the PCB. Detect this and close immediately instead of waiting for
    // the acknowledgement timer to expire.
    if (!conn->pcb.tcp || (!conn->pcb.tcp->unsent && !conn->pcb.tcp->unacked))
    {
        SetState(ConnState::closeReady);
    }
    else if (millis() - closeTimer >= MaxAckTime)
    {
        // The acknowledgement timer has expired. The close was already initiated
        // by RRF, so go straight to free rather than aborted to avoid a round-trip.
        Terminate(true);
    }
}

Three bugs fixed

Bug A: Null PCB causes MaxAckTime delay

When the remote end sends RST while in closePending, lwIP frees the PCB — conn->pcb.tcp becomes NULL.

Base code: conn->pcb.tcp && !conn->pcb.tcp->unacked evaluates to false (short-circuit on null). Falls through to the timeout branch, which waits up to MaxAckTime before calling Terminate(false). The connection sits in closePending doing nothing for the full timeout duration.

Fixed code: !conn->pcb.tcp evaluates to true → immediately transitions to closeReady. No delay.

Bug B: Unsent queue not checked

The base code only checks !conn->pcb.tcp->unacked. The TCP send pipeline is: tcp_write()unsenttcp_output()unacked → ACK received → freed. It is possible for unsent != 0 && unacked == 0 — e.g., when an ACK drains the in-flight data but tcp_output hasn't yet moved queued segments from unsent to the wire in the same cycle, or the Nagle algorithm is holding data.

Base code transitions to closeReady when unacked is empty, regardless of unsent. The subsequent netconn_close() in the closeReady path does call tcp_close() which attempts to flush, but this is less controlled than allowing the normal send path to drain both queues.

Fixed code: (!conn->pcb.tcp->unsent && !conn->pcb.tcp->unacked) — only transitions to closeReady when both queues are fully drained.

Bug C: Terminate(false) causes unnecessary round-trip to RRF

How closePending is entered — Close() at Connection.cpp:225:

case ConnState::connected:
    if (conn->pcb.tcp && conn->pcb.tcp->unacked)
    {
        closeTimer = millis();
        netconn_shutdown(conn, true, false);
        SetState(ConnState::closePending);
        break;
    }

Close() is called from SocketServer.cpp:1637 via the connClose SPI command:

case NetworkCommand::connClose:
    ...
    Connection::Get(messageHeaderIn.hdr.socketNumber).Close();

So closePending means RRF already requested the close. When the ack timer expires:

  • Base: Terminate(false)SetState(ConnState::aborted) → RRF must discover the aborted state via GetSummarySocketStatus bitmap or round-robin polling, then send connAbortTerminate(true) → finally free
  • Fixed: Terminate(true)SetState(ConnState::free) → slot immediately available

The round-trip is:

  1. ESP sets aborted
  2. RRF's WiFiSocket::Poll() discovers ConnState::aborted (WiFiSocket.cpp:242-248):
    case ConnState::aborted:
        state = SocketState::broken;
        Terminate();
  3. Terminate() sends connAbort back to ESP (WiFiSocket.cpp:284-298)
  4. ESP Terminate(true) → finally free

This round-trip is pointless because RRF already asked to close. Terminate(true) skips it.

Faulty execution path

  1. RRF sends connClose → ESP enters closePending with unacked data
  2. Remote client crashes → sends RST → lwIP frees PCB
  3. Base: conn->pcb.tcp is null → first condition false → waits MaxAckTime
  4. After timeout: Terminate(false) → state = aborted
  5. RRF discovers aborted (via polling or bitmap) → sends connAbort → ESP Terminate(true) → free
  6. Total delay: MaxAckTime + RRF poll interval + SPI round-trip
  7. Fixed: !conn->pcb.tcp → immediate closeReadyClose()free in next Poll()

Change 8: Listener Null Guard in Close()

File: src/Connection.cpp:244-247

Base code:

FreePbuf();
SetState(ConnState::free);
listener->Notify();

Fixed code:

FreePbuf();
SetState(ConnState::free);
if (listener)
{
    listener->Notify();
}

Why listener can be null

Outgoing (MQTT) connections: ConnectCallback at Connection.cpp:497:

case NETCONN_EVT_SENDPLUS:
    connection->Connected(nullptr, conn);
    break;

Connected() stores the listener parameter directly at Connection.cpp:322:

this->listener = listener;  // nullptr for outgoing connections

FTP data connections after accept: After Change 11 (FTP listener nullptr), c->listener is explicitly set to nullptr at Listener.cpp:237 before the listener is deleted.

Constructor: The constructor (Connection.cpp:19-23) does not initialize listener:

Connection::Connection(uint8_t num)
    : number(num), localPort(0), remotePort(0), remoteIp(0), conn(nullptr), state(ConnState::free),
    closeTimer(0),readBuf(nullptr), readIndex(0), alreadyRead(0), pendOtherEndClosed(false)
{
}

listener is not in the initializer list. It is only set in Connected().

How Close() is reached for outgoing connections

The connClose SPI command (SocketServer.cpp:1637) calls Close() on any valid socket number, including outgoing MQTT connections. Also, Write() calls Close() at line 137 when closeAfterSending is true.

Faulty execution path

  1. RRF creates MQTT connection → Connect()ConnectCallbackConnected(nullptr, conn)
  2. listener is nullptr
  3. MQTT broker closes connection → state becomes otherEndClosed
  4. RRF sends connClose → Close() entered with state otherEndClosed
  5. Reaches listener->Notify()null pointer dereference → crash

Change 9: Listener Null Guard and Conditional Notify in Terminate()

File: src/Connection.cpp:313-317

Base code:

SetState((external) ? ConnState::free : ConnState::aborted);
listener->Notify();

Fixed code:

SetState((external) ? ConnState::free : ConnState::aborted);
if (external && listener)
{
    listener->Notify();
}

Two fixes

Fix A: Null guard — same rationale as Change 8. Outgoing connections have listener == nullptr.

Fix B: Only notify on external (free) — when external == false, state is set to aborted. The connection slot is not freed — it transitions from a live state to aborted, which still occupies the slot. Listener::Notify() wakes the ListenerTask to accept new connections, but no slot is available since the aborted connection hasn't been freed yet. Notifying is pointless and potentially harmful (listener attempts accept, fails to find free slot).

When external == true, state is set to free. The slot is actually available, so notifying the listener to accept new connections is correct.

Where Terminate(true) is called

  • SocketServer.cpp:1625 — connAbort command: Connection::Get(...).Terminate(true)
  • Connection.cpp:288Connect() failure path: Terminate(true)

Where Terminate(false) is called

  • Connection.cpp:186Poll() recv error fallthrough
  • Connection.cpp:127Write() error (with Change 4, only for non-WOULDBLOCK errors)
  • Connection.cpp:212 — closePending timeout (in base code; fixed to Terminate(true) by Change 7)

Faulty execution path

  1. MQTT connection established with listener == nullptr
  2. Network error during Poll() recv → unclassified error
  3. Terminate(false) → state = aborted
  4. Base: listener->Notify()null pointer dereference → crash

Change 10: Aborted State in GetSummarySocketStatus Bitmap

File: src/Connection.cpp:435-437

Base code:

else if (Connection::Get(i).GetState() == ConnState::otherEndClosed)
{
    otherEndClosedSockets |= (1 << i);
}

Fixed code:

else if (Connection::Get(i).GetState() == ConnState::otherEndClosed
        || Connection::Get(i).GetState() == ConnState::aborted)
{
    otherEndClosedSockets |= (1 << i);
}

How the bitmap flows to RRF

  1. GetSummarySocketStatus() is called from SocketServer.cpp:1687:
    Connection::GetSummarySocketStatus(resp.connectedSockets, resp.otherEndClosedSockets);
  2. The bitmap is sent in ConnStatusResponse via SPI to RRF
  3. RRF's WiFiInterface::UpdateSocketStatus() (WiFiInterface.cpp:1722-1732):
    for (size_t i = 0; i < NumWiFiTcpSockets; ++i)
    {
        if (((connectedSockets | otherEndClosedSockets) & (1u << i)) != 0)
        {
            sockets[i]->SetNeedsPolling();
        }
    }
  4. SetNeedsPolling() causes RRF to poll that specific socket in the next cycle
  5. WiFiSocket::Poll() discovers ConnState::aborted (WiFiSocket.cpp:242-248) and cleans up

Why the base code was wrong

Without the fix, aborted sockets are invisible in the bitmap. RRF discovers them only through round-robin polling — iterating through all sockets sequentially. This delays cleanup of aborted connections, which hold resources (connection slots, memory) until discovered.

The connGetStatus SPI command is sent by RRF for each socket individually. The bitmap is a piggyback optimization that tells RRF which sockets need urgent attention. Excluding aborted from the bitmap defeats this optimization.

Faulty execution path

  1. Connection error → Terminate(false) → state = aborted
  2. RRF sends connGetStatus for socket 3 → bitmap returned
  3. Base: socket in aborted state not flagged in otherEndClosedSockets
  4. RRF's UpdateSocketStatus doesn't call SetNeedsPolling() for aborted socket
  5. Aborted socket discovered only when RRF's round-robin reaches it
  6. During delay: connection slot occupied, listener cannot accept new connections on that slot

Change 11: Use-After-Free in Listener::Stop()

File: src/Listener.cpp:130-137

Base code:

void Listener::Stop()
{
    netconn_close(conn);
    netconn_delete(conn);

    for (int i = 0; i < MaxConnections; i++)
    {
        Listener *listener = listeners[i];
        if (listener && listener->conn == conn)

Fixed code:

void Listener::Stop()
{
    struct netconn *savedConn = conn;
    netconn_close(savedConn);
    netconn_delete(savedConn);

    for (int i = 0; i < MaxConnections; i++)
    {
        Listener *listener = listeners[i];
        if (listener && listener->conn == savedConn)

Why the base code was wrong

Listener::Stop() is an instance method. conn is this->conn — the listener's own netconn pointer. After netconn_delete(conn):

  1. The netconn object is freed — conn is a dangling pointer
  2. The for loop iterates listeners[] looking for listener->conn == conn
  3. When listener == this, listener->conn reads this->conn — the same dangling pointer
  4. The comparison listener->conn == conn compares two copies of the same dangling pointer value
  5. Then delete listener destroys this
  6. After delete this, the next loop iteration accesses this->conn again (UB — this is freed)

Using a stack-local savedConn preserves the pointer value before netconn_delete frees the object and before delete listener frees this. The comparison uses a stable stack value instead of reading freed memory.

Why Stop() is not re-entrant safe without this fix

Stop() is called from:

  • Listener::Start() line 35 (when replacing a listener)
  • Listener::Stop(uint16_t port) static method line 160 (closing all listeners on a port)
  • Listener::ListenerTask line 238 (after accepting FTP data connection)

In the ListenerTask path (Change 12), listener->Stop() deletes the listener. Any subsequent access to the listener's members through this is UB. The savedConn pattern ensures the netconn comparison uses a stack variable that survives the delete.

Faulty execution path

  1. FTP data connection accepted → listener->Stop() called
  2. netconn_delete(conn) frees the netconn
  3. Loop finds listener->conn == conn — both are the same dangling pointer, comparison "works" by luck
  4. delete listener frees this
  5. Next loop iteration: listeners[i] access is fine (static array), but if another listener's conn happened to be allocated at the same address as the freed netconn → false match → wrong listener deleted

Change 12: FTP Data Listener Dangling Pointer

File: src/Listener.cpp:237

Base code:

if (listener->protocol == protocolFtpData)
{
    debugPrintf("accept conn, stop listen on port %u\n", listener->port);
    listener->Stop();   // don't listen for further connections
}

Fixed code:

if (listener->protocol == protocolFtpData)
{
    debugPrintf("accept conn, stop listen on port %u\n", listener->port);
    c->listener = nullptr;  // clear before Stop() deletes the listener
    listener->Stop();   // don't listen for further connections
}

Why the base code was wrong

This code is in ListenerTask, after a connection c is accepted on an FTP data port.

listener->Stop() at Listener.cpp:139 does delete listener, which frees the Listener object. But c->listener still points to the deleted listener — a dangling pointer.

When the FTP data connection later calls Close() or Terminate():

  • Close() at line 244: listener->Notify() → dereferences freed memory
  • Terminate() at line 313: listener->Notify() → dereferences freed memory

With Change 8 and 9's null guards, setting c->listener = nullptr before Stop() prevents the dangling pointer dereference. The null guard skips Notify() entirely.

Connection lifecycle for FTP data

FTP data connections are one-shot: a listener is created for a specific data transfer, accepts one connection, then the listener is stopped. The connection outlives the listener (it still needs to transfer the data). After the transfer completes, the connection closes — at which point it must not reference the deleted listener.

WiFiInterface::OpenDataPort() (WiFiInterface.cpp:1735-1749) creates a fresh listener for each FTP data transfer, confirming the one-shot pattern.

Faulty execution path

  1. FTP data listener accepts connection c with c->listener = listener
  2. listener->Stop()delete listener → listener freed
  3. c->listener is now a dangling pointer to freed memory
  4. FTP data transfer completes → RRF sends connClose
  5. c->Close() → reaches listener->Notify()dereferences freed memory → crash or corruption

Summary

# Change Bug Type Severity
1 NETCONN_MORE inversion Logic error High — every write has wrong flush
2 ERR_WOULDBLOCK spin guard Spin/busy-wait Medium — blocks on full send buffer
3 ERR_ABRT/ERR_CONN in Write Missing error classification Medium — wrong state on disconnect
4 ERR_WOULDBLOCK non-termination False termination High — kills connections after partial write
5 return total Incorrect return value Low — makes partial writes observable
6 ERR_TIMEOUT/ERR_ABRT in Poll Missing error classification High — kills idle connections
7 closePending null PCB + unsent check + Terminate(true) Null PCB delay + premature closeReady + unnecessary round-trip Medium — MaxAckTime delay + data loss risk + wasted SPI cycle
8 Close() listener null Null pointer dereference Critical — crash on MQTT/outgoing close
9 Terminate() listener null Null pointer dereference Critical — crash on MQTT/outgoing terminate
10 GetSummarySocketStatus aborted Missing bitmap flag Medium — delayed cleanup
11 Listener::Stop() savedConn Use-after-free High — UB, potential wrong listener deletion
12 FTP listener dangling pointer Dangling pointer Critical — crash on FTP data connection close

@timschneider timschneider changed the title Fix connection lifecycle bugs, prevent PCB pool exhaustion, and harden closure handling [WIP] Fix connection lifecycle bugs, prevent PCB pool exhaustion, and harden closure handling Mar 16, 2026
@chrishamm chrishamm self-assigned this Mar 26, 2026
@chrishamm
Copy link
Copy Markdown
Contributor

Thanks again for your contributions! It looks like I will adopt this project and maintain it from now on.

Are you happy with your PR yet? If yes, I will review it shortly.

- Fix inverted NETCONN_MORE flag: data was buffered when it should flush
  and flushed when it should buffer, affecting every TCP write
- Add ERR_WOULDBLOCK spin guard in Write loop to prevent busy-wait when
  send buffer is full with no progress
- Classify ERR_ABRT/ERR_CONN as otherEndClosed in Write and Poll instead
  of falling through to Terminate
- Guard Write against terminating connection on ERR_WOULDBLOCK after
  partial write success
- Return actual bytes written (total) instead of requested length
- Handle ERR_TIMEOUT in Poll recv to prevent idle connection termination
- Detect RST-destroyed PCB in closePending instead of waiting for ack
  timer, and use Terminate(true) to avoid unnecessary round-trip to RRF
- Null-guard listener->Notify() in Close and Terminate to prevent crash
  on MQTT/outgoing connections where listener is nullptr
- Include aborted state in GetSummarySocketStatus bitmap so RRF gets
  prompt notification of aborted connections
- Fix use-after-free in Listener::Stop by saving conn pointer before
  netconn_delete
- Clear connection listener pointer before FTP data listener Stop to
  prevent dangling pointer dereference on connection close

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@timschneider timschneider force-pushed the fix/WiFiSocketServerRTOS branch from 122e19e to 8127a25 Compare March 27, 2026 10:46
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@timschneider
Copy link
Copy Markdown
Author

@chrishamm i've updated the PR to only include the minimal needed code changes and the description to explain each change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants