dont commit while holding the lock on cleanup thread#1813
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8e4903309c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| removeStaleEntries(zk_children, entries_by_key); | ||
|
|
||
| LOG_INFO(storage.log, "ExportPartition Manifest Updating task: finished polling for new entries. Number of entries: {}", entries_by_key.size()); | ||
| } |
There was a problem hiding this comment.
Keep cleanup lock alive during deferred commit recovery
The cleanup_lock is created inside the inner scope and is destroyed when that scope ends, before deferred commit recovery starts; this allows another replica to acquire exports_cleanup_lock and run the same recovery path in parallel for the same task. In transient destination failures, each replica then calls handleCommitFailure and increments the shared commit_attempts, so max_retries can be exhausted much faster than intended and the task can flip to FAILED prematurely.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1061618543
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| /// A replica exported the last part but the commit never landed. Try to fix it. | ||
| try | ||
| { | ||
| ExportPartitionUtils::commit(work.metadata, work.destination_storage, zk, log_ptr, work.entry_path, work.context, storage); |
There was a problem hiding this comment.
Serialize deferred commit with cancellation
Running ExportPartitionUtils::commit after releasing export_merge_tree_partition_mutex introduces a race with StorageReplicatedMergeTree::killExportPartition: the kill path can now acquire the mutex and set /status to KILLED while this deferred commit is in flight, and commit then writes COMPLETED with an unconditional trySet(..., -1), effectively undoing a successful cancellation. Before this change, this cleanup-commit path was mutex-serialized with killExportPartition, so this regression can cause user-issued kill requests to be lost under concurrent recovery commits.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I think we can live with it for now
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Don't commit while holding the export manifests locks. This avoid contention when the network / commit process is slow.
Documentation entry for user-facing changes
...
CI/CD Options
Exclude tests:
Regression jobs to run: