Skip to content

confrisk: preparatory changes#1154

Open
hieblmi wants to merge 5 commits into
lightninglabs:masterfrom
hieblmi:pre-dyn-conf-fixes
Open

confrisk: preparatory changes#1154
hieblmi wants to merge 5 commits into
lightninglabs:masterfrom
hieblmi:pre-dyn-conf-fixes

Conversation

@hieblmi

@hieblmi hieblmi commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

extracted commits from #1141
This PR pulls out general correctness and notification-delivery fixes ahead of the dynamic confirmation work.

It fixes a few independent issues in static address loop-in/deposit handling and hardens notification fanout so slow subscribers cannot
block unrelated notification delivery.

What

  • Cancel static loop-in invoices when initialization fails before the swap is stored.
  • Finalize withdrawn deposits asynchronously so final-state cleanup cannot block the FSM.
  • Fix static loop-in state queries so failed swaps are included in final-state results.
  • Prevent slow notification subscribers from blocking manager fanout.
  • Queue must-deliver notification types while keeping optional reservation notifications best-effort.

Why

These changes reduce failure modes that are not specific to dynamic confirmations, but become more important as static loop-ins rely more
heavily on notification-driven flows.

Moving them first keeps the dynamic confirmation PR focused on the actual feature behavior while landing independent fixes separately.

hieblmi added 4 commits June 19, 2026 15:04
What:

- Cancel a private swap invoice if init fails before storage.

- Use a detached timeout context for invoice cleanup.

- Reuse the helper for monitor-driven invoice cancellation.

- Add tests for early-init invoice cancellation.

Why:

- A failed pre-storage swap cannot be recovered on restart.

- Leaving its invoice active can accumulate orphan invoices.
What:

- Send finalization notifications from a goroutine.

- Stop blocking the deposit FSM on the manager receive loop.

- Add tests for blocked manager and shutdown races.

Why:

- Final states should not stall while deposit locks are held.

- Startup recovery can handle shutdown before notification delivery.
What:

- Fix state-list query formatting for final loop-in states.

- Add a failed loop-in to the final-state store test.

Why:

- The previous formatting excluded the last state in the list.

- Failed swaps must remain visible to callers querying final states.
What:

- Make reservation fanout nonblocking for slow subscribers.

- Keep recovery and sweep notifications blocking until delivered.

- Make required fanout cancellation-aware.

- Cover slow and required subscribers in manager tests.

Why:

- A slow optional subscriber must not stall the manager lock.

- Required work requests still need reliable local delivery.
@hieblmi

hieblmi commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator Author

@claude please review

@gemini-code-assist

Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces preparatory fixes for dynamic confirmation work by improving the robustness of static loop-in and notification delivery flows. The changes focus on preventing system stalls caused by slow subscribers, ensuring proper cleanup of transient resources like invoices during failure scenarios, and improving the reliability of state tracking for loop-in swaps.

Highlights

  • Notification Fanout Hardening: Implemented a per-subscriber FIFO queue to prevent slow subscribers from blocking the notification manager, and added logic to drop reservation notifications for slow subscribers.
  • Static Loop-In Invoice Cleanup: Added automatic cancellation of static loop-in invoices when initialization fails before the swap is persisted, preventing orphan invoice accumulation.
  • Asynchronous Deposit Finalization: Modified deposit finalization to occur asynchronously, ensuring that busy manager loops do not stall withdrawal confirmation.
  • State Query Improvements: Updated static loop-in state queries to correctly include failed swaps in final-state results.
New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@hieblmi hieblmi changed the title confrisk: preparatory PR confrisk: preparatory changes Jun 19, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces non-blocking notification delivery for subscribers, asynchronous deposit finalization, and automatic invoice cleanup on initialization failures. Specifically, a per-subscriber FIFO queue is added to prevent slow subscribers from blocking the main loop, and FinalizeDepositAction is updated to run asynchronously. Additionally, InitHtlcAction now ensures that pre-created invoices are canceled if an error occurs before persistence. Feedback on the changes highlights a potential memory leak in the FIFO queue implementation where sliced elements are not zeroed out, which could prevent garbage collection of pointer types.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread notifications/manager.go
Comment on lines +145 to +147
ntfn := q.pending[0]
q.pending = q.pending[1:]
q.Unlock()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In Go, slicing a slice (e.g., q.pending = q.pending[1:]) does not zero out the retired element. If T is a pointer type, the underlying array will retain a reference to the popped notification, preventing it from being garbage collected until that index in the underlying array is overwritten or the queue is garbage collected. To avoid this memory leak, explicitly zero out the popped element before slicing.

			ntfn := q.pending[0]
			var zero T
			q.pending[0] = zero
			q.pending = q.pending[1:]
			q.Unlock()

@claude

claude Bot commented Jun 19, 2026

Copy link
Copy Markdown

Code Review

Overall this is a well-scoped preparatory PR that addresses real failure modes. The fixes are independent and the PR description maps cleanly to the diffs. A few things worth flagging before merge:


notifications/manager.go — notification queue

removeSubscriber fix is required for compilation. Adding enqueue func(any) to the subscriber struct makes it non-comparable, so the old s != sub is a compile error in Go. The fix to s.recvChan != sub.recvChan is correct and necessary.

Race between queue goroutine and close(notifChan).
SubscribeStaticLoopInSweepRequests and SubscribeUnfinishedSwaps both close notifChan via context.AfterFunc. The queue goroutine concurrently sends to that same channel. There is a window where the goroutine is in case recvChan <- ntfn: while context.AfterFunc fires close(notifChan), causing a "send on closed channel" panic. The recover() block is there exactly for this reason, but using recover() to absorb a closed-channel panic is a code smell — it would silently swallow a double-close bug too. Consider using a dedicated done chan struct{} that the subscriber owns and closes independently of notifChan, or use sync.Once around the close. At minimum add a comment explaining why the recover is needed (the AfterFunc race), not just what it catches.

Unbounded queue growth.
q.pending is an unbounded slice. A subscriber that never drains (e.g. its read loop is stuck) will accumulate notifications indefinitely. For must-deliver types (StaticLoopInSweep, UnfinishedSwap) this is intentional, but worth a brief comment and perhaps a log.Warnf when the queue depth exceeds some threshold.

queueNotification fallback path is unreachable.
The enqueue == nil branch falls back to a blocking send. Looking at callers, every subscriber that goes through queueNotification (StaticLoopInSweep, UnfinishedSwap) sets enqueue. The nil branch exists for defensive reasons — a comment or a log.Warnf("enqueue is nil, falling back to blocking send") would clarify intent.

Test: TestManager_SlowSubscriberDoesNotBlock.
The test uses require.Len(t, slowChan, 1) at the end. This only works because SubscribeReservations creates a buffered-1 channel and the slow subscriber never reads. The assertion is correct but it silently depends on channel capacity being exactly 1. A brief comment would help.


staticaddr/loopin/sql_store.gotoJointStringStates fix

This is a subtle but significant bug. The SQL query uses:

(',' || $1 || ',') LIKE ('%,' || u.update_state || ',%')

With the old {...} wrapper, the parameter was e.g. ",{HtlcTimeoutSwept,Succeeded,SucceededTransitioningFailed,Failed},". The curly braces break matching for:

  • The first element (HtlcTimeoutSwept): surrounded by {... so ,HtlcTimeoutSwept, never appears.
  • The last element (Failed): trailed by } so ,Failed, never appears (only ,Failed},).

Middle elements like Succeeded worked by coincidence (flanked by real commas). The fix correctly produces a plain comma-separated string that the LIKE pattern can match uniformly.

The new test adds a Failed-state swap and asserts it appears in finalizedSwaps — good direct coverage of the bug.

Minor: result ordering. The test asserts finalizedSwaps[0] is Succeeded and finalizedSwaps[1] is Failed based on insertion order. The SQL has ORDER BY swaps.id, so insertion order should be stable, but it's worth a brief comment in the test explaining the expected order.


staticaddr/deposit/actions.go — async FinalizeDepositAction

The goroutine is correctly bounded by f.quitChan rather than a request context, and the comment explains the shutdown-race reasoning. A few notes:

  • If the manager's read loop of finalizedDepositChan is slow or blocked for a long time (not shutdown), the goroutine lingers. This is probably fine in practice but worth a line in the comment.
  • The action now always returns fsm.NoOp — the previous fsm.OnError on context cancellation is intentionally dropped. The test TestFinalizeDepositActionIgnoresRequestCancellation covers this, but the comment in FinalizeDepositAction could explicitly note this behaviour change.

staticaddr/loopin/actions.go — invoice cleanup

The deferred cleanup via the invoiceNeedsCleanup flag and returnError helper is clean. context.WithoutCancel + timeout in cancelSwapInvoice is the right approach to ensure cleanup runs even when the caller's context is already done. The refactoring to reuse cancelSwapInvoice in MonitorInvoiceAndHtlcTxAction is a good simplification.

One readability nit: event is set by returnError and later read inside the defer. The closed-over variable makes the flow non-obvious on first read. A short comment above the defer (e.g. // event is set by returnError below; defer reads it to decide whether to cancel) would help.


Test coverage

Coverage is solid: the three notification tests (SlowSubscriberDoesNotBlock, UnfinishedSwapNotificationWaitsForSubscriber, StaticLoopInSweepNotificationQueuesForSlowSubscriber) directly exercise the new queue semantics. The three FinalizeDepositAction tests cover the non-blocking guarantee, request-context independence, and already-canceled context. The two InitHtlcAction tests verify invoice cancellation on server error and fee-guard failure. No gaps noticed.


Summary

All five issues in the PR description appear correctly fixed. The SQL bug fix in particular was non-obvious and well-caught. Main asks before merge:

  1. Add a comment (or fix the race more cleanly) around the recover() in newNotificationQueue.
  2. Add the ordering comment in TestGetStaticAddressLoopInSwapsByStates.

Neither is a blocker but both would reduce confusion for future readers.

@hieblmi hieblmi force-pushed the pre-dyn-conf-fixes branch from adfd82a to 283bc5f Compare June 19, 2026 14:25
@hieblmi hieblmi requested a review from starius June 19, 2026 14:30
What:

- Add per-subscriber queues for must-deliver notifications.

- Use queued delivery for sweep requests and unfinished swaps.

- Match subscribers by receive channel when removing them.

- Add tests for queued delivery and cleanup.

Why:

- Required fanout should not block the manager lock.

- Subscribers still need ordered delivery after brief backpressure.
@hieblmi hieblmi force-pushed the pre-dyn-conf-fixes branch from f321b97 to cb83ba9 Compare June 19, 2026 14:44
@hieblmi

hieblmi commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator Author

@claude please re-review after I added queued notification handling of reservations as well.

@hieblmi hieblmi self-assigned this Jun 19, 2026
@claude

claude Bot commented Jun 19, 2026

Copy link
Copy Markdown

Code Review

PR: confrisk: preparatory changes
Summary: This PR extracts correctness fixes from #1141 ahead of the dynamic confirmation feature. It addresses four independent issues: orphaned invoices on init failure, blocking deposit finalization, failed swaps missing from state queries, and slow subscribers blocking notification fanout.

The overall direction is sound — each fix targets a real failure mode, test coverage is strong, and the changes are well-scoped. A few areas worth discussing:


notifications/manager.go — Per-subscriber FIFO queue

Positive:

  • The queue design is clean: a per-subscriber goroutine drains a local []T slice, so a slow subscriber cannot stall the fanout loop.
  • The removeSubscriber comparison change from s != sub to s.recvChan != sub.recvChan is a required correctness fix, not optional — adding enqueue func(any) to subscriber makes the struct non-comparable in Go, so the old == comparison would be a compile error.

Concerns:

  1. PR description vs. code mismatch — The description says "Queue must-deliver notification types while keeping optional reservation notifications best-effort." But SubscribeReservations also receives an enqueue queue. Only SubscribeHtlcConfirmed remains best-effort (using its existing timeout select). Either the description needs updating or the intent is for reservations to stay best-effort (i.e., SubscribeReservations should not set enqueue).

  2. recover() for closed-channel panic — The goroutine guards against writing to a closed channel using recover():

    defer func() {
        if recover() != nil {
            log.Debugf("subscriber channel closed before notification delivery")
        }
    }()

    This race arises because context.AfterFunc closes notifChan while the queue goroutine may still be writing to it. Giving the goroutine ownership of the channel's close (close it in the goroutine's deferred cleanup after ctx.Done()) would eliminate the panic path entirely. Using recover() to paper over a write-to-closed-channel panic can also mask unrelated panics in the goroutine.

  3. Unbounded queueq.pending grows without limit. A subscriber that is consistently slow (e.g., stuck in a long DB operation) will accumulate an ever-growing backlog. Adding a configurable cap with a drop-and-log policy would prevent unbounded memory growth in degraded conditions.

  4. Head element not nil'd after dequeue — After q.pending = q.pending[1:] the old position still holds a reference in the backing array, delaying GC for pointer-typed notifications. Consider:

    ntfn := q.pending[0]
    q.pending[0] = nil   // release pointer for GC
    q.pending = q.pending[1:]

staticaddr/loopin/actions.go — Invoice cleanup on init failure

Positive:

  • The deferred invoiceNeedsCleanup flag is a clean pattern: it avoids duplicating cancel calls on every error path, and the final invoiceNeedsCleanup = false after successful persistence makes the lifecycle clear.
  • context.WithoutCancel(ctx) with a 5-second timeout is the right idiom for cleanup that must run even after a canceled request context.

Minor:

  • cancelSwapInvoice accepts a ctx but immediately discards it via context.WithoutCancel(ctx). The function only uses the passed context to derive a detached one. The function could just take no context argument and use context.Background() directly, making the intent clearer and the caller simpler. Alternatively, the docstring should explain why the parent context is intentionally abandoned.

staticaddr/deposit/actions.go — Async finalization

Positive:

  • Sending on finalizedDepositChan in a goroutine prevents the FSM from stalling while the manager loop is busy holding other locks.
  • Using quitChan (tied to the FSM lifetime, not the request context) as the cancellation signal is correct — the shutdown comment explains the intended restart behavior clearly.

Test coverage: TestFinalizeDepositActionDoesNotBlock, TestFinalizeDepositActionIgnoresRequestCancellation, and TestFinalizeDepositActionIgnoresCanceledContext together cover the important edge cases well.


staticaddr/loopin/sql_store.go — State list query fix

This is a real bug fix. The SQL query uses:

(',' || $1 || ',') LIKE ('%,' || u.update_state || ',%')

The old toJointStringStates produced {State1,State2}, which after wrapping became ',{State1,State2},'. The curly braces broke the comma-separated pattern match for the first and last states (e.g., Succeeded became {Succeeded and Failed became Failed}), silently excluding them from results. The fix (removing {}) is correct.

Minor: The updated sql_store_test.go asserts on specific slice indices (finalizedSwaps[0], finalizedSwaps[1]) relying on the ORDER BY swaps.id clause preserving insertion order. This is reasonable but implicit — a brief comment noting the dependency on the SQL ordering would make the test intent clearer.


Test quality

Tests are well-structured and cover the important scenarios:

  • Notification queue: slow-subscriber non-blocking, queued delivery for unfinished-swap and sweep notifications.
  • Invoice cleanup: server error path, fee guard failure path.
  • Deposit finalization: blocks vs. non-blocking, context cancellation resilience.
  • State query: failed swap now appears in FinalStates results.

The shared assertQueuedSwapHashNotifications helper is a nice DRY abstraction across the two queue test cases.


Summary

Area Verdict
SQL state query bug fix ✅ Correct and important
Async deposit finalization ✅ Clean fix, good tests
Invoice cleanup on init failure ✅ Correct, minor naming suggestion
Notification queue design ⚠️ Works, but see: recover() concern, unbounded growth, PR description mismatch
Test coverage ✅ Good

The main items worth resolving before merge are the PR description/code mismatch on reservation notification queuing, and the recover() approach for the closed-channel race.

@starius starius left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 💾
A couple of comments left.


func toJointStringStates(states []fsm.StateType) string {
return "{" + strings.Join(toStrings(states), ",") + "}"
return strings.Join(toStrings(states), ",")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix! Can you adjust the commit message to explain that now it covers the first and the last states as well, and add a test for first state, please?

return
}

f.cancelSwapInvoice(ctx)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix is correct for pre-storage situations.

What happens with an invoice in SignHtlcTx/MonitorInvoiceAndHtlcTx → OnError → UnlockDeposits paths?

Comment on lines 648 to 660
case <-deadlineChan:
// If the server didn't pay the invoice on time, we
// cancel the invoice and keep monitoring the htlc tx
// confirmation. We also need to unlock the deposits to
// re-enable them for loop-ins and withdrawals.
cancelInvoice()

event := f.UnlockDepositsAction(ctx, nil)
if event != fsm.OnError {
f.Errorf("unable to unlock deposits after " +
"payment deadline")
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This guard is dead code. UnlockDepositsAction returns fsm.OnError on both paths — the success path returns fsm.OnError (line 879), and the failure path returns f.HandleError(err), which is also fsm.OnError. So event != fsm.OnError is never true and the "unable to unlock deposits after payment deadline" error is never logged, including when the unlock genuinely fails.

Concretely: when the payment deadline fires we cancel the invoice, try to unlock the deposits, and keep monitoring the htlc. If TransitionDeposits(..., deposit.Deposited) fails here we get no signal for it — the contextual log can't fire — and the FSM keeps running as though the deposits were freed while they actually stay locked until the swap resolves another way (e.g. the htlc timeout sweep). The deposits are the user's funds, so this is a "don't silently fail an operation" issue: low severity, but strictly worse than what's trivially available, since we can just check the real error.

I confirmed the premise on a live regtest run: on a failed init the loop-in FSM entered UnlockDeposits, the deposit transitioned LoopingIn -> Deposited (a successful unlock), and the FSM still emitted OnError -> Failed — i.e. UnlockDepositsAction returns fsm.OnError even on success, which is exactly why this guard can never fire.

Suggested fix: https://gist.github.com/starius/915dee5b117c07f2042170d992274ec6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants