Outstanding work: ledger removal, send_op errors, async API, B5a hardening, espflash native, sector-selective flashing #18

@zackees

Tracking outstanding items from ISSUES.md that have been investigated and scoped but not yet implemented. Each section is independently actionable.

1. Remove client-side firmware ledger; make device-side verify-flash authoritative

Severity: CORRECTNESS — the disk ledger can drift from reality.

The current crates/fbuild-deploy/src/firmware_ledger.rs records SHA256 hashes of firmware.bin / bootloader.bin / partitions.bin after each successful deploy and consults that record on the next deploy to short-circuit the flash. This is fast (no device round-trip) but wrong when anything else writes to flash between fbuild deploys:

  • Manual esptool write-flash from a terminal
  • Arduino IDE flash
  • OTA update
  • Another fbuild instance on a different machine using the same device

In any of those cases the ledger says "current" but the device runs a different image, and fbuild silently skips a needed deploy.

Fix: Delete the ledger entirely. The device-side verify-flash pre-check (already wired into the daemon's deploy handler; see ISSUES.md "Performance: Fast deploy via verify-then-skip") becomes the sole authoritative pre-check. It uses esptool's FLASH_MD5SUM stub command, so the device itself reports an MD5 digest for each region and a match means the region's contents are what we would have written. Measured cost: ~6 s for a 2.4 MB ESP32-S3 image; the 76% speedup over a full re-flash is preserved.

Concrete steps:

  1. Delete crates/fbuild-deploy/src/firmware_ledger.rs
  2. Remove the firmware_ledger field from DaemonContext (crates/fbuild-daemon/src/context.rs)
  3. Remove the ledger pre-check and post-deploy record_deployment call from crates/fbuild-daemon/src/handlers/operations.rs
  4. Remove the compute_boot_parts_hashes helper added in Issue B4 (no longer needed — verify-flash covers all 3 regions)
  5. Update fbuild-deploy/src/lib.rs to drop the firmware_ledger re-export
  6. Mark Issue B4 in ISSUES.md as superseded by device-side verify

Side effect: the deploy handler no longer needs to compute SHA256s of source files / build flags / boot+parts artefacts. That work disappears entirely. The verify-flash call (~6 s when device matches) replaces the ledger-skip path (~0 s when ledger says match).
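
The verify-then-skip decision described above can be sketched in a few lines. This is a language-neutral illustration in Python, not the daemon's Rust code: `Region`, `can_skip_flash`, and the `device_md5_by_offset` mapping are hypothetical names, and the device digests are assumed to have been collected already.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One flash region: where it lives on the device and what should be there."""
    name: str
    offset: int
    image: bytes


def local_md5(region: Region) -> str:
    """MD5 of the local image bytes, as a lowercase hex string."""
    return hashlib.md5(region.image).hexdigest()


def can_skip_flash(regions, device_md5_by_offset) -> bool:
    """Return True only if the device reported a matching digest for every region.

    `device_md5_by_offset` is a hypothetical {offset: hex digest} mapping.
    A missing entry counts as a mismatch, so the safe default is to flash.
    """
    return all(
        device_md5_by_offset.get(r.offset) == local_md5(r) for r in regions
    )
```

The important property is that nothing on the host is trusted: if the device cannot confirm a region, the deploy proceeds.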


2. Structured error returns from send_op in fbuild-python (Issue F follow-up)

Severity: USABILITY — Python callers can't branch on failure modes.

send_op in crates/fbuild-python/src/lib.rs currently returns a bare bool and prints [fbuild] operation failed: ... / [fbuild] stderr: ... to stderr. Python consumers (FastLED, autoresearch) can only check "did it succeed?"; they cannot distinguish "port not found" from "build failed" from "timeout" programmatically.

Fix: Either return a result struct (OperationResult { success, message, exit_code, stdout, stderr }) or raise a typed Python exception (FbuildDeployError, FbuildBuildError, FbuildPortError). Breaking API change — schedule for the next minor version bump.
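
To make the two options concrete, here is a minimal sketch of how they might look from the Python side. The field names and exception names come from the proposal above; the `FbuildError` base class, the `kind` tag, and `raise_for_result` are assumptions added for illustration, not an existing fbuild API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationResult:
    """Result struct with the fields named in the proposal."""
    success: bool
    message: str
    exit_code: int
    stdout: str = ""
    stderr: str = ""


class FbuildError(Exception):
    """Hypothetical common base so callers can catch every failure at once."""

    def __init__(self, result: OperationResult):
        super().__init__(result.message)
        self.result = result


class FbuildBuildError(FbuildError):
    pass


class FbuildDeployError(FbuildError):
    pass


class FbuildPortError(FbuildError):
    pass


# Hypothetical failure tags; how the daemon classifies failures is not decided here.
_ERRORS = {
    "build": FbuildBuildError,
    "deploy": FbuildDeployError,
    "port": FbuildPortError,
}


def raise_for_result(result: OperationResult, kind=None) -> OperationResult:
    """Return the result on success, otherwise raise the matching typed error."""
    if result.success:
        return result
    raise _ERRORS.get(kind, FbuildError)(result)
```

With either shape a caller can branch on the failure mode, for example retrying on `FbuildPortError` while surfacing `FbuildBuildError` to the user, and can still read `exit_code` and `stderr` from the attached result.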


3. Native async API in fbuild-python

Severity: USABILITY — async callers pay for thread-executor wrapping.

Today the Python bindings expose only synchronous methods. FbuildSerialAdapter._run_in_thread wraps every call in a thread executor so async callers don't block their event loop, but this triggers an asyncio warning under some configurations and adds latency to every call.

Fix: Add AsyncSerialMonitor, AsyncDaemon, etc. that use PyO3's pyo3-asyncio (or the newer pyo3-async-runtimes) to expose async def methods callable directly under asyncio.run(...). Existing sync API stays for compatibility.
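
For context, the thread-executor pattern that the native async API would replace looks roughly like the sketch below. It assumes `_run_in_thread` behaves like `asyncio.to_thread`; `read_line_blocking` is a stand-in for a synchronous binding call, not a real fbuild method.

```python
import asyncio
import time


def read_line_blocking() -> str:
    """Stand-in for a synchronous binding call; sleeps to simulate serial I/O."""
    time.sleep(0.05)
    return "hello from device"


async def read_line_via_thread() -> str:
    """Run the blocking call on a worker thread so the event loop stays responsive."""
    return await asyncio.to_thread(read_line_blocking)


async def main():
    ticks = 0

    async def ticker():
        # Counts event-loop iterations while the blocking call is in flight.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    line = await read_line_via_thread()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return line, ticks
```

The loop keeps ticking while the call runs, which is the point of the wrapper, but every call pays for a thread hand-off. A native `async def` method would let the caller `await` the operation directly without that hop.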


4. B5a hardening leftovers: SO_LINGER 0 + SetConsoleCtrlHandler

Severity: ROBUSTNESS — covered by listener-level B5a fix; these close remaining edge cases.

The listener-level B5a fix (SO_EXCLUSIVEADDRUSE on Windows + bind retry + stale-PID cleanup) is in. Two deeper hooks remain deferred:

  • SO_LINGER 0 on accepted client sockets: axum's accept loop currently doesn't expose per-connection socket options. After a hard kill, the dangling CLOSE_WAIT state on accepted sockets can outlive the daemon. Setting SO_LINGER 0 on every accepted socket would force an immediate RST on close instead of the FIN/CLOSE_WAIT/TIME_WAIT dance. Requires hooking into the axum accept loop or using a custom IncomingStream.
  • SetConsoleCtrlHandler on Windows: tokio::signal::ctrl_c() covers CTRL_C_EVENT but not CTRL_CLOSE_EVENT (window close), CTRL_LOGOFF_EVENT, or CTRL_SHUTDOWN_EVENT. The daemon currently dies without running its graceful shutdown path on those events.
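
As an illustration of the first item, this is what "SO_LINGER 0" means at the socket-option level. The sketch uses Python's standard `socket` module rather than the daemon's Rust/axum stack, and the helper names are hypothetical; the Rust fix would set the same option on each accepted connection.

```python
import socket
import struct
import sys


def linger_zero_option() -> bytes:
    """Pack `struct linger` with l_onoff=1, l_linger=0 for setsockopt.

    Winsock declares both fields as u_short; POSIX systems use int.
    """
    fmt = "HH" if sys.platform == "win32" else "ii"
    return struct.pack(fmt, 1, 0)


def abort_on_close(sock: socket.socket) -> None:
    """Enable SO_LINGER with a zero timeout so close() resets the connection."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger_zero_option())


# In an accept loop this would be applied to every accepted connection:
#     conn, _addr = listener.accept()
#     abort_on_close(conn)
```

The trade-off is that unsent data is discarded on close, which is acceptable here because the option only matters when the daemon is going away anyway.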

Regression test exists at crates/fbuild-daemon/tests/port_recovery.rs (run with --ignored).


5. espflash native library integration (replace esptool subprocess)

Severity: PERFORMANCE — could cut verify cost from ~6 s to ~1.5 s.

The current verify-flash pre-check shells out to esptool (Python). Cost breakdown for the ~5.9 s verify of a 2.4 MB image:

| Phase | Estimated cost |
| --- | --- |
| Python interpreter startup | ~1 s |
| Subprocess spawn + esptool init | ~0.5 s |
| Connect + reset chip into bootloader | ~1 s |
| SYNC handshake + stub flasher upload | ~1 s |
| Baud rate change | ~0.5 s |
| FLASH_MD5SUM execution (3 regions, 2.4 MB) | ~1 s |
| Cleanup + reset back to app | ~0.5 s |

The actual MD5 work is ~1 s; the rest is process / Python overhead.
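
A quick tally of the estimates above shows how lopsided the split is. The numbers are the per-phase estimates from the breakdown, not new measurements; they sum to about 5.5 s, a little under the measured ~5.9 s, which is plausible for rough per-phase figures.

```python
# Per-phase estimates in seconds, copied from the breakdown above.
phases = {
    "Python interpreter startup": 1.0,
    "Subprocess spawn + esptool init": 0.5,
    "Connect + reset chip into bootloader": 1.0,
    "SYNC handshake + stub flasher upload": 1.0,
    "Baud rate change": 0.5,
    "FLASH_MD5SUM execution": 1.0,
    "Cleanup + reset back to app": 0.5,
}

total = sum(phases.values())                 # 5.5
md5_work = phases["FLASH_MD5SUM execution"]  # 1.0
overhead = total - md5_work                  # 4.5
overhead_share = overhead / total            # about 0.82

print(f"total ~{total} s, overhead ~{overhead} s ({overhead_share:.0%})")
```

Roughly four fifths of the estimated time is setup and teardown rather than hashing, which is the portion a native implementation could shrink.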

Alternative: the espflash crate (4.3.0) is a Rust-native ESP32 flasher protocol implementation maintained by ESP-RS. It exposes the SLIP protocol, stub flasher loading, and FLASH_MD5SUM natively. Add espflash = { version = "4", default-features = false, features = ["serialport"] } and reuse the daemon's existing serial port lease instead of spawning a subprocess.

Estimated win: ~5.9 s → ~1.5–2 s for verify. Subsequent verifies in the same session could reuse the loaded stub flasher and drop further to <1 s.

Effort: medium. The espflash library API is documented but not as stable as the CLI. Might need an adapter layer to convert between fbuild's Esp32Deployer and espflash's Flasher type. Worth a spike.


6. Sector-selective flashing (only write regions that differ)

Severity: PERFORMANCE — minor win, but trivial to add once #1 is done.

Currently, when verify-flash reports a mismatch, the daemon falls through to write-flash which writes all three regions (bootloader + partitions + firmware), even if only one differs. For the common case "only firmware changed," we waste ~1 s rewriting bootloader.bin and partitions.bin.

Fix: parse the verify-flash output (or run three separate verify calls with --diff) to determine which regions matched, then call write-flash with only the offset/file pairs for the mismatched regions. Saves ~1 s on the typical "firmware-only" deploy.

Should be tackled after #1 (ledger removal) so the flow is: device tells us what differs → we write only what differs → device verifies the write.
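
The selection step can be sketched as follows, assuming the per-region match results are already known. The function name, the `matched` mapping, and the offsets in the usage example are illustrative assumptions; only the `<offset> <file>` pair form of write-flash is taken from esptool, and port, chip, and baud flags are omitted.

```python
def write_flash_args(regions, matched):
    """Build the offset/file argument pairs for the regions that need writing.

    regions: list of (offset, path) tuples for every region in the image.
    matched: hypothetical {offset: bool} result of the verify step.
    An offset missing from `matched` is treated as a mismatch and rewritten.
    """
    args = []
    for offset, path in regions:
        if not matched.get(offset, False):
            args += [hex(offset), path]
    return args


# Example layout; real offsets come from the board's partition scheme.
regions = [
    (0x0, "bootloader.bin"),
    (0x8000, "partitions.bin"),
    (0x10000, "firmware.bin"),
]
firmware_only = {0x0: True, 0x8000: True, 0x10000: False}
print(write_flash_args(regions, firmware_only))  # ['0x10000', 'firmware.bin']
```

An empty list means every region matched and write-flash can be skipped entirely, which is the same outcome as the verify-then-skip path.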


Cross-references
