Fail HTLCs from late counterparty commitment updates in ChannelMonitor #4434
Fail HTLCs from late counterparty commitment updates in ChannelMonitor #4434joostjager wants to merge 6 commits intolightningdevkit:mainfrom
Conversation
|
👋 Thanks for assigning @wpaulino as a reviewer! |
2d36c03 to
cce8ccb
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4434 +/- ##
==========================================
+ Coverage 85.97% 86.05% +0.08%
==========================================
Files 159 159
Lines 104722 105759 +1037
Branches 104722 105759 +1037
==========================================
+ Hits 90030 91013 +983
- Misses 12191 12227 +36
- Partials 2501 2519 +18
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
👋 The first review has been submitted! Do you think this PR is ready for a second reviewer? If so, click here to assign a second reviewer. |
cce8ccb to
ad3036e
Compare
TheBlueMatt
left a comment
There was a problem hiding this comment.
Hmmmmmmmmm, its really not quite so simple. If a monitor update is InProgress we generally expect the in-memory ChannelMonitor to have actually seen the new state update (though we expect that on restart there's a good chance it won't have that information and it'll be replayed). That means the monitor is theoretically "in charge" of resolving the HTLCs in question. While I think this patch is correct, I'm a bit hesitant to end up with two things in charge of resolving a specific HTLC. ISTM it also wouldn't be so complicated to immediately mark HTLCs as failed in the monitor upon receiving updates that try to add new HTLCs after a channel has been closed.
Isn't the problem with this that the monitor can't see whether the commitment was already sent to the peer or blocked by async persistence? |
|
Right, the monitor has to assume it may have been sent to the peer. That's fine, though, it should be able to fail the HTLC(s) once the commitment transaction is locked. |
|
Hmm, it seems that the original issue isn't an issue then. I confirmed that mining the force close commitment tx indeed generates the payment failed event. Or is there still something missing? |
|
In the general case ( |
|
The one that I identified wasn't a crash case. An assertion in the fuzzer identified that a payment was lost, but with current understanding I think it's just that the fuzzer didn't push the force-close forward enough (mine). Will close it for now and can look again if the payment remains lost even with confirmation of the commit tx. Unit test did confirm that it should just work. |
|
Reopening as this fix is now required for #4351. |
Now that we've decided that for deferred writes we no longer want to expect the in-memory |
6421a28 to
8dac2fa
Compare
8dac2fa to
da66b8e
Compare
TheBlueMatt
left a comment
There was a problem hiding this comment.
Now that we've decided that for deferred writes we no longer want to expect the in-memory ChannelMonitor to have seen the changes, I think the above doesn't apply anymore, and we might have to accept that two things are in charge for resolving a specific HTLC.
I don't think so. Instead, what we'll need to do is detect a new local counterparty state ChannelMontiorUpdate after the channel has close and track it to fail the HTLCs from it once the commitment transaction has 6 confs.
|
Oh actually no I don't think we need to wait for 6 confs since we never sent the message, ie we can fail it immediately. |
Does the monitor know that a message was never sent? #4434 (comment) suggests that we should assume we did send. I steered Claude to a potential fix: main...joostjager:rust-lightning:chain-mon-internal-deferred-writes-fix-contract. I based it on the deferred writes branch so that tests can get delayed in-memory updates. Is this the direction you have in mind? It is unfortunate that we need to add this extra logic to a system that is already complicated. |
TheBlueMatt
left a comment
There was a problem hiding this comment.
Okay two more commits then the last commit looks good, will circle back to the dependent PR so we can land the stack.
0ee49e2 to
631a3c1
Compare
|
🔔 1st Reminder Hey @wpaulino! This PR has been waiting for your review. |
631a3c1 to
56c813f
Compare
56c813f to
25c9ca3
Compare
|
🔔 1st Reminder Hey @wpaulino! This PR has been waiting for your review. |
|
🔔 2nd Reminder Hey @wpaulino! This PR has been waiting for your review. |
|
🔔 3rd Reminder Hey @wpaulino! This PR has been waiting for your review. |
wpaulino
left a comment
There was a problem hiding this comment.
LGTM, though CI is sad likely due to the failed_back_htlc_ids change.
baa58e4 to
f1364ba
Compare
|
The round-trip check failed because |
| None => { | ||
| debug_assert!(false, "flush count exceeded queue length"); | ||
| return; | ||
| }, |
There was a problem hiding this comment.
In release builds, this silently returns without logging when count exceeds the queue length. This can happen if flush is called concurrently (e.g., from TestChainMonitor::release_pending_monitor_events auto-flush racing with the background processor's explicit flush). A log_error! before the return would make such races observable in production, rather than silently under-flushing:
| None => { | |
| debug_assert!(false, "flush count exceeded queue length"); | |
| return; | |
| }, | |
| None => { | |
| debug_assert!(false, "flush count exceeded queue length"); | |
| log_error!(logger, "flush count exceeded queue length"); | |
| return; | |
| }, |
Automated ReviewAfter thorough analysis of the deferred monitor operation queuing/flushing mechanism and the Prior review comments about the breaking public API ( No new major issues found in this pass. 🤖 Generated with Claude Code |
|
I think feel free to squash so we can land this. Claude had one comment that might be worth fixing, up to you. |
f1364ba to
9999385
Compare
When a ChannelMonitorUpdate containing a new counterparty commitment is dispatched (e.g. via deferred writes) before a channel force-closes but only applied to the in-memory monitor after the commitment transaction has already confirmed on-chain, the outbound HTLCs in that update must be failed back. Add fail_htlcs_from_update_after_funding_spend to ChannelMonitorImpl which detects this race condition during update_monitor. When a LatestCounterpartyCommitmentTXInfo or LatestCounterpartyCommitment update is applied and the funding output has already been spent, the function iterates all outbound HTLCs from the update and creates OnchainEvent::HTLCUpdate entries for those that need to be failed back. These entries mature after ANTI_REORG_DELAY blocks, giving time for the peer to potentially broadcast the newer commitment. HTLCs that appear as non-dust outputs in the confirmed commitment (whether counterparty or holder) are skipped, as they will be resolved on-chain via the normal HTLC timeout/success path. HTLCs already fulfilled by the counterparty (tracked in counterparty_fulfilled_htlcs) are also skipped. Duplicate failures from previously-known counterparty commitments are handled gracefully by the ChannelManager. AI tools were used in preparing this commit.
9999385 to
337bef1
Compare
|
Squashed, but needs to be rebased after #4351. Added the Claude suggestion to also log excessive flush count. |
| }); | ||
| } else { | ||
| // Funding spend still awaiting ANTI_REORG_DELAY: queue the failure. | ||
| let (txid, transaction, height, block_hash) = pending_spend_entry.clone().unwrap(); |
There was a problem hiding this comment.
Nit/perf: pending_spend_entry.clone().unwrap() is called inside the per-HTLC loop, cloning the Option<Transaction> each iteration. For a commitment with many HTLCs, this does N clones of potentially large transaction data. Consider destructuring once before the loop:
| let (txid, transaction, height, block_hash) = pending_spend_entry.clone().unwrap(); | |
| let pending_data = pending_spend_entry.clone(); | |
| for (source, payment_hash, amount_msat) in htlcs { | |
| if is_source_known(source) { | |
| continue; | |
| } | |
| if self.counterparty_fulfilled_htlcs.get(&SentHTLCId::from_source(source)).is_some() { | |
| continue; | |
| } | |
| let htlc_value_satoshis = Some(amount_msat / 1000); | |
| let logger = WithContext::from(logger, None, None, Some(payment_hash)); | |
| // Defensively mark the HTLC as failed back so the expiry-based failure | |
| // path in `block_connected` doesn't generate a duplicate `HTLCUpdate` | |
| // event for the same source. | |
| self.failed_back_htlc_ids.insert(SentHTLCId::from_source(source)); | |
| if let Some(confirmed_txid) = self.funding_spend_confirmed { |
Or alternatively, extract the pending_data tuple once and clone only transaction per entry (Txid and BlockHash are Copy).
| fn update_channel( | ||
| &self, channel_id: ChannelId, update: &ChannelMonitorUpdate, | ||
| ) -> ChannelMonitorUpdateStatus { | ||
| // `ChannelMonitorUpdate`'s `channel_id` is `None` prior to 0.0.121 and all channels in those | ||
| // versions are V1-established. For 0.0.121+ the `channel_id` fields is always `Some`. | ||
| debug_assert_eq!(update.channel_id.unwrap(), channel_id); | ||
| // Update the monitor that watches the channel referred to by the given outpoint. | ||
| let monitors = self.monitors.read().unwrap(); | ||
| match monitors.get(&channel_id) { | ||
| None => { | ||
| let logger = WithContext::from(&self.logger, None, Some(channel_id), None); | ||
| log_error!(logger, "Failed to update channel monitor: no such monitor registered"); | ||
|
|
||
| // We should never ever trigger this from within ChannelManager. Technically a | ||
| // user could use this object with some proxying in between which makes this | ||
| // possible, but in tests and fuzzing, this should be a panic. | ||
| #[cfg(debug_assertions)] | ||
| panic!("ChannelManager generated a channel update for a channel that was not yet registered!"); | ||
| #[cfg(not(debug_assertions))] | ||
| ChannelMonitorUpdateStatus::InProgress | ||
| }, | ||
| Some(monitor_state) => { | ||
| let monitor = &monitor_state.monitor; | ||
| let logger = WithChannelMonitor::from(&self.logger, &monitor, None); | ||
| log_trace!(logger, "Updating ChannelMonitor to id {}", update.update_id,); | ||
|
|
||
| // We hold a `pending_monitor_updates` lock through `update_monitor` to ensure we | ||
| // have well-ordered updates from the users' point of view. See the | ||
| // `pending_monitor_updates` docs for more. | ||
| let mut pending_monitor_updates = | ||
| monitor_state.pending_monitor_updates.lock().unwrap(); | ||
| let update_res = monitor.update_monitor( | ||
| update, | ||
| &self.broadcaster, | ||
| &self.fee_estimator, | ||
| &self.logger, | ||
| ); | ||
|
|
||
| let update_id = update.update_id; | ||
| let persist_res = if update_res.is_err() { | ||
| // Even if updating the monitor returns an error, the monitor's state will | ||
| // still be changed. Therefore, we should persist the updated monitor despite the error. | ||
| // We don't want to persist a `monitor_update` which results in a failure to apply later | ||
| // while reading `channel_monitor` with updates from storage. Instead, we should persist | ||
| // the entire `channel_monitor` here. | ||
| log_warn!(logger, "Failed to update ChannelMonitor. Going ahead and persisting the entire ChannelMonitor"); | ||
| self.persister.update_persisted_channel( | ||
| monitor.persistence_key(), | ||
| None, | ||
| monitor, | ||
| ) | ||
| } else { | ||
| self.persister.update_persisted_channel( | ||
| monitor.persistence_key(), | ||
| Some(update), | ||
| monitor, | ||
| ) | ||
| }; | ||
| match persist_res { | ||
| ChannelMonitorUpdateStatus::InProgress => { | ||
| pending_monitor_updates.push(update_id); | ||
| log_debug!( | ||
| logger, | ||
| "Persistence of ChannelMonitorUpdate id {:?} in progress", | ||
| update_id, | ||
| ); | ||
| }, | ||
| ChannelMonitorUpdateStatus::Completed => { | ||
| log_debug!( | ||
| logger, | ||
| "Persistence of ChannelMonitorUpdate id {:?} completed", | ||
| update_id, | ||
| ); | ||
| }, | ||
| ChannelMonitorUpdateStatus::UnrecoverableError => { | ||
| // Take the monitors lock for writing so that we poison it and any future | ||
| // operations going forward fail immediately. | ||
| core::mem::drop(pending_monitor_updates); | ||
| core::mem::drop(monitors); | ||
| let _poison = self.monitors.write().unwrap(); | ||
| let err_str = "ChannelMonitor[Update] persistence failed unrecoverably. This indicates we cannot continue normal operation and must shut down."; | ||
| log_error!(logger, "{}", err_str); | ||
| panic!("{}", err_str); | ||
| }, | ||
| } | ||
|
|
||
| // We may need to start monitoring for any alternative funding transactions. | ||
| if let Some(ref chain_source) = self.chain_source { | ||
| for (funding_outpoint, funding_script) in | ||
| update.internal_renegotiated_funding_data() | ||
| { | ||
| log_trace!( | ||
| logger, | ||
| "Registering renegotiated funding outpoint {} with the filter to monitor confirmations and spends", | ||
| funding_outpoint | ||
| ); | ||
| chain_source.register_tx(&funding_outpoint.txid, &funding_script); | ||
| chain_source.register_output(WatchedOutput { | ||
| block_hash: None, | ||
| outpoint: funding_outpoint, | ||
| script_pubkey: funding_script, | ||
| }); | ||
| } | ||
| } | ||
|
|
||
| if update_res.is_err() { | ||
| ChannelMonitorUpdateStatus::InProgress | ||
| } else { | ||
| persist_res | ||
| } | ||
| }, | ||
| if !self.deferred { | ||
| return self.update_channel_internal(channel_id, update); | ||
| } | ||
|
|
||
| let mut pending_ops = self.pending_ops.lock().unwrap(); | ||
| pending_ops.push_back(PendingMonitorOp::Update { channel_id, update: update.clone() }); | ||
| ChannelMonitorUpdateStatus::InProgress | ||
| } |
There was a problem hiding this comment.
The deferred update_channel path blindly queues without checking whether channel_id exists in self.monitors or has a corresponding NewMonitor in pending_ops. Compare with watch_channel which carefully validates against both.
While the ChannelManager guarantees ordering (always watch_channel before update_channel), an explicit check here would catch bugs earlier — especially since update_channel_internal panics in debug builds if the monitor doesn't exist, and that panic would be deferred to flush time (far from the call site), making it harder to diagnose.
Consider adding a debug_assert! that the channel_id is either in self.monitors or has a pending NewMonitor:
debug_assert!(
self.monitors.read().unwrap().contains_key(&channel_id)
|| pending_ops.iter().any(|op| matches!(op, PendingMonitorOp::NewMonitor { channel_id: id, .. } if *id == channel_id)),
"update_channel called for unknown channel"
);| // Check HTLC sources against all previously-known commitments to find truly new | ||
| // ones. After the update has been applied, `prev_counterparty_commitment_txid` holds | ||
| // what was `current` before this update, so it represents the already-known | ||
| // counterparty state. HTLCs already present in any of these will be handled by | ||
| // `fail_unbroadcast_htlcs` when the spending transaction confirms. | ||
| let is_source_known = |source: &HTLCSource| { | ||
| if let Some(ref txid) = self.funding.prev_counterparty_commitment_txid { | ||
| if let Some(htlc_list) = self.funding.counterparty_claimable_outpoints.get(txid) { | ||
| if htlc_list.iter().any(|(_, s)| s.as_ref().map(|s| s.as_ref()) == Some(source)) | ||
| { | ||
| return true; | ||
| } | ||
| } | ||
| } | ||
| // Note that we don't care about the case where a counterparty sent us a fresh local commitment transaction | ||
| // post-closure (with the `ChannelManager` still operating the channel). First of all we only care about | ||
| // resolving outbound HTLCs, which fundamentally have to be initiated by us. However we also don't mind | ||
| // looking at the current holder commitment transaction's HTLCs as any fresh outbound HTLCs will have to | ||
| // first come in a locally-initiated update to the counterparty's commitment transaction which we can, by | ||
| // refusing to apply the update, prevent the counterparty from ever seeing (as no messages can be sent until | ||
| // the monitor is updated). Thus, the HTLCs we care about can never appear in the holder commitment | ||
| // transaction. | ||
| if holder_commitment_htlcs!(self, CURRENT_WITH_SOURCES).any(|(_, s)| s == Some(source)) | ||
| { | ||
| return true; | ||
| } | ||
| if let Some(mut iter) = holder_commitment_htlcs!(self, PREV_WITH_SOURCES) { | ||
| if iter.any(|(_, s)| s == Some(source)) { | ||
| return true; | ||
| } | ||
| } | ||
| false | ||
| }; |
There was a problem hiding this comment.
The is_source_known closure only checks prev_counterparty_commitment_txid (what was current before this update), not all entries in counterparty_claimable_outpoints. The comment at line 4413 explains the reasoning, and the holder commitment fallback covers additional cases.
Just want to confirm a subtle assumption: this correctness relies on the fact that the monitor always holds the immediately-prior counterparty commitment in prev after the update is applied (line 3502). If a single ChannelMonitorUpdate ever contained two LatestCounterpartyCommitment steps, the second step's iteration here would check prev (which is now the first step's commitment, not the original pre-batch state), and HTLCs from the first step would appear "known" even if they were truly new. The code is correct as long as each ChannelMonitorUpdate contains at most one counterparty commitment step — is that guaranteed at the ChannelManager level?
Closes #4431
Builds on top of #4351