out_azure_kusto: fix timer/flush race condition causing SIGSEGV#11303
out_azure_kusto: fix timer/flush race condition causing SIGSEGV#11303yaananth wants to merge 2 commits into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR introduces thread synchronization and timer management to the Azure Kusto output plugin. Changes add a mutex ( Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60–90 minutes
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/out_azure_kusto/azure_kusto.c (1)
668-683: Potential use-after-free in retry loop after unlocking file.After gzip compression fails:
- Line 674:
azure_kusto_store_file_unlock(file)setsfile->locked = FLB_FALSE- Line 675:
file->failures += 1(small window but OK)- Line 681:
sleep(backoff_time + jitter)- Line 683:
continueretries using the samefilepointerDuring the sleep, another thread (e.g.,
cb_azure_kusto_exitor another flush) could callazure_kusto_store_file_deletewhich now seeslocked=FALSEand proceeds to free the file. When the retry loop continues, it accesses freed memory.The same pattern exists at lines 710-721 and lines 739-750.
Consider keeping the file locked during retries and only unlocking on final failure or success:
🔎 Suggested approach
Keep the file locked during the entire retry loop. Only unlock when:
- Maximum retries reached (before delete/inactive), or
- Successful ingestion (before delete)
if (ret != 0) { flb_plg_error(ctx->ins, "scheduler_kusto_ingest :: cannot gzip payload"); flb_free(buffer); flb_sds_destroy(payload); flb_sds_destroy(tag_sds); retry_count++; - if (file){ - azure_kusto_store_file_unlock(file); + if (file) { + /* Keep file locked during retries to prevent deletion by other threads */ file->failures += 1; } /* Add jitter: random value between 0 and backoff_time */Apply similar changes to the other retry failure paths (lines 710-721, 739-750).
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
plugins/out_azure_kusto/azure_kusto.c(18 hunks)plugins/out_azure_kusto/azure_kusto.h(2 hunks)plugins/out_azure_kusto/azure_kusto_ingest.c(1 hunks)plugins/out_azure_kusto/azure_kusto_store.c(11 hunks)tests/runtime/out_azure_kusto.c(4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-17T18:45:02.283Z
Learnt from: edsiper
Repo: fluent/fluent-bit PR: 11286
File: src/flb_http_client.c:1657-1715
Timestamp: 2025-12-17T18:45:02.283Z
Learning: In the Fluent Bit OAuth2 implementation (src/flb_oauth2.c), the function flb_oauth2_get_access_token returns a pointer to the cached token stored in the OAuth2 context (ctx->access_token), not a new allocation. The OAuth2 context manages the token lifecycle, so callers must not free the returned token pointer.
Applied to files:
plugins/out_azure_kusto/azure_kusto.c
🧬 Code graph analysis (2)
plugins/out_azure_kusto/azure_kusto_store.c (1)
src/flb_fstore.c (3)
flb_fstore_file_delete(259-277)flb_fstore_file_create(153-190)flb_fstore_file_inactive(238-256)
plugins/out_azure_kusto/azure_kusto.c (1)
plugins/out_azure_kusto/azure_kusto_store.c (5)
azure_kusto_store_file_lock(808-811)azure_kusto_store_file_unlock(822-825)azure_kusto_store_file_cleanup(694-716)azure_kusto_store_file_delete(732-755)azure_kusto_store_file_inactive(658-680)
🔇 Additional comments (15)
plugins/out_azure_kusto/azure_kusto.h (1)
130-147: LGTM! Well-documented synchronization primitives.The addition of
upload_timerfor timer lifecycle management andfiles_mutexfor protecting concurrent access to the file list are appropriate for addressing the race condition. The comment clearly documents the mutex's purpose.plugins/out_azure_kusto/azure_kusto_ingest.c (1)
641-651: LGTM! Correct ordering prevents use-after-free on queue failure.Moving the file deletion to after successful queue enqueue is the right fix. The previous approach deleted the file after blob creation but before queue enqueue, which could lose data if queue enqueue failed. The unlock-before-delete sequence aligns with the new
azure_kusto_store_file_deletesemantics that skip locked files.plugins/out_azure_kusto/azure_kusto_store.c (4)
143-153: Verify safe list iteration after deletion inside the loop.When
fsf->data == NULL, you delete the file viaflb_fstore_file_deleteand thencontinue. Themk_list_foreach_safemacro should handle safe deletion during iteration, but verify that thetmppointer (frommk_list_foreach_safe) correctly allows continuing after deletion. Looking at the code, this function uses a regularmk_list_foreach_safewhich captures the next pointer before the iteration body, so this should be safe.
248-286: LGTM! Proper synchronization during file creation.The mutex correctly guards the critical section where the file is created and
fsf->datais assigned. All error paths properly release the lock before returning. This prevents race conditions where another thread could access a partially initialized file.
736-752: Verify all callers handle the -1 return when file is locked.The skip-if-locked guard returns -1 when the file cannot be deleted because it's locked. This is correct for preventing races, but callers must handle this return value appropriately. For example,
azure_kusto_queued_ingestionlogs an error if delete fails, which is acceptable. Ensure all other call sites either:
- Already unlocked the file before calling (as done in azure_kusto.c line 755)
- Handle the -1 return appropriately
528-548: LGTM! Proper cleanup with dangling pointer mitigation.Setting
fsf->data = NULLafter freeing theazure_kusto_fileis a good defensive measure to prevent use-after-free via dangling pointers. The mutex protection around the cleanup iteration is appropriate.tests/runtime/out_azure_kusto.c (2)
31-45: LGTM! Appropriate test helper for directory cleanup.The
nftw-based recursive removal is a clean approach for test cleanup. Thestatcheck beforenftwprevents errors on non-existent paths.
329-416: Good regression test for the race condition.The test uses aggressive timing parameters (
Flush=0.5,upload_timeout=1s) to maximize the chance of triggering the race between the timer callback and flush/exit paths. The test comments clearly document the expected behavior before and after the fix. The Windows guard is appropriate since the test relies on POSIX-specific APIs (nftw,ftw.h).Note: This test relies on ASan to detect the UAF. Consider adding a comment noting that this test should be run with ASan enabled for full verification.
plugins/out_azure_kusto/azure_kusto.c (7)
146-147: LGTM! Proper mutex release on error path.The mutex must be released before returning on error. This was a bug in the previous code where the token_mutex would remain locked on memory allocation failure.
1556-1560: LGTM! Correct timer cancellation ordering.Canceling the upload timer before cleanup prevents the timer callback from racing with the exit cleanup. Setting
upload_timer = NULLafter destruction is good defensive practice.
426-557: LGTM! Correct lock protocol in ingest_all_chunks.The synchronization pattern is correct:
- Lock mutex (line 427)
- Mark chunk as locked (line 459) before releasing mutex
- Release mutex for I/O (line 460)
- Re-acquire mutex on all error paths before returning
- Re-acquire mutex before next iteration (line 552)
This ensures the chunk remains protected (via its
lockedflag) while I/O is in progress, preventing other threads from deleting it.
615-640: LGTM! Proper lock acquisition and file marking before I/O.The pattern of locking the mutex, marking the file as locked, then releasing the mutex before I/O is correct. This prevents other threads from deleting the file while I/O is in progress.
1339-1346: LGTM! Correct unlock-before-delete pattern.Unlocking the file before calling delete/inactive is necessary since those functions now skip locked files. This ensures the file can be properly cleaned up.
973-991: LGTM! Proper initialization of new synchronization primitives.Both
upload_timer(NULL) andfiles_mutexare properly initialized. The mutex is initialized alongside the other mutexes, maintaining consistency.
1259-1260: LGTM! Timer handle capture for lifecycle management.Passing
&ctx->upload_timertoflb_sched_timer_cb_createallows the plugin to cancel the timer during shutdown, preventing the race condition where the timer callback could access freed resources.
There was a problem hiding this comment.
💡 Codex Review
fluent-bit/plugins/out_azure_kusto/azure_kusto.c
Lines 449 to 453 in 97759fa
Inside ingest_all_chunks the function still holds ctx->files_mutex when it routes exhausted retries to azure_kusto_store_file_delete / azure_kusto_store_file_inactive (mutex taken at line 426). Those helpers now take the same files_mutex internally (see azure_kusto_store.c:736-742 and 664-669), so the first chunk that exceeds scheduler_max_retries will block on a recursive lock and leave the mutex permanently held, freezing subsequent flush/timer processing.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
97759fa to
361baa1
Compare
|
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days. |
79ae281 to
731c8de
Compare
Signed-off-by: Yashwanth Anantharaju <yaananth@users.noreply.github.com>
Signed-off-by: Yashwanth Anantharaju <yaananth@users.noreply.github.com>
731c8de to
0370157
Compare
Summary
Fix Azure Kusto buffered-file lifecycle races between flush, upload timer, and shutdown paths. The race was observed as intermittent SIGSEGV crashes while the upload timer was reading buffered file contents around task/file cleanup.
Crash signature observed in
out_azure_kustobuffered mode:One failing run logged task destruction immediately before the SIGSEGV, which is consistent with the upload timer and flush/task cleanup paths racing over buffered file state.
Changes:
fsf->databefore freeing Azure Kusto file contextsbuffer_file_delete_earlyis enabled, delete buffered files only after queue enqueue succeedsTesting
git diff --check origin/master..HEADcmake --build build --target flb-plugin-out_azure_kusto -j2cmake --build build --target flb-rt-out_azure_kusto -j2ctest --test-dir build -R '^flb-rt-out_azure_kusto$' --output-on-failureGITHUB_EVENT_NAME=pull_request GITHUB_BASE_REF=master uv run --quiet --with gitpython python .github/scripts/commit_prefix_check.py