out_azure_kusto: fix timer/flush race condition causing SIGSEGV by yaananth · Pull Request #11303 · fluent/fluent-bit

yaananth · 2025-12-19T15:21:26Z

Summary

Fix Azure Kusto buffered-file lifecycle races between flush, upload timer, and shutdown paths. The race was observed as intermittent SIGSEGV crashes while the upload timer was reading buffered file contents around task/file cleanup.

Crash signature observed in out_azure_kusto buffered mode:

#0  memcpy()
#1  cio_file_content_copy()
#2  flb_fstore_file_content_copy()
#3  construct_request_buffer() at plugins/out_azure_kusto/azure_kusto.c
#4  cb_azure_kusto_ingest() at plugins/out_azure_kusto/azure_kusto.c
#5  flb_sched_event_handler()
#6  output_thread()

One failing run logged task destruction immediately before the SIGSEGV, which is consistent with the upload timer and flush/task cleanup paths racing over buffered file state.

Changes:

guard buffered flush, upload timer, and exit paths so they do not concurrently inspect or mutate fstore file lists
keep the upload timer handle and invalidate it before shutdown destroys plugin state
clear fsf->data before freeing Azure Kusto file contexts
when buffer_file_delete_early is enabled, delete buffered files only after queue enqueue succeeds
add a runtime regression test that restarts with buffered backlog, drives a flush, and lets a short upload timer run

Testing

git diff --check origin/master..HEAD
cmake --build build --target flb-plugin-out_azure_kusto -j2
cmake --build build --target flb-rt-out_azure_kusto -j2
ctest --test-dir build -R '^flb-rt-out_azure_kusto$' --output-on-failure
GITHUB_EVENT_NAME=pull_request GITHUB_BASE_REF=master uv run --quiet --with gitpython python .github/scripts/commit_prefix_check.py

coderabbitai · 2025-12-19T15:21:47Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bfe8f74d-27be-4428-acfe-4b12ded61e3b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR introduces thread synchronization and timer management to the Azure Kusto output plugin. Changes add a mutex (files_mutex) to protect buffered file lists and concurrent access across multiple functions, a timer handle (upload_timer) for scheduler management, and ensure proper lock discipline around file operations and error paths.

Changes

Cohort / File(s)	Summary
Header definitions `plugins/out_azure_kusto/azure_kusto.h`	Adds `upload_timer` (scheduler timer handle) and `files_mutex` (pthread_mutex_t) fields to `struct flb_azure_kusto` to protect concurrent file-buffer access and manage timer lifecycle.
Core synchronization and timer setup `plugins/out_azure_kusto/azure_kusto.c`	Initializes and manages `files_mutex` and `upload_timer` throughout plugin lifecycle; adds mutex lock/unlock guards around file-buffer iterations and modifications in functions like `ingest_all_chunks`, `cb_azure_kusto_ingest`, and `cb_azure_kusto_flush`; ensures error paths include proper cleanup and unlocking.
File store operations `plugins/out_azure_kusto/azure_kusto_store.c`	Surrounds file retrieval, creation, metadata operations, and cleanup with mutex protection; adds locked-file checks to prevent race conditions; ensures proper lock release in all code paths including error handling.
Ingestion queue logic `plugins/out_azure_kusto/azure_kusto_ingest.c`	Moves temporary file deletion to after successful enqueue (rather than before), and ensures file is unlocked before deletion to accommodate locked-file semantics.
Test suite `tests/runtime/out_azure_kusto.c`	Adds helper utilities (`flb_kusto_rm_rf`, `flb_kusto_unlink_cb`) and two new test functions: `flb_test_azure_kusto_buffering_backlog` (backlog replay on restart) and `flb_test_azure_kusto_timer_flush_race` (timer/flush race condition, non-Windows only).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Thread-safety critical: Mutex placement and lock/unlock patterns throughout multiple functions require careful verification to prevent deadlocks and race conditions.
Error-path complexity: All error paths in azure_kusto.c and azure_kusto_store.c must properly release locks and clean up resources.
Timing sensitivity: File deletion and enqueue ordering in azure_kusto_ingest.c affects data retention and retry logic; requires understanding of the ingestion pipeline.
Cross-file interactions: Lock discipline spans three main implementation files, making it important to verify consistency and ordering across module boundaries.

Suggested labels

backport to v4.0.x

Suggested reviewers

edsiper

Poem

🐰 Locks and timers, oh what a dance!
Mutexes guard our files' chance.
No more races on the ether,
Azure Kusto flows together!
Thread-safe buffering hops along,
This synchronization's strong! 🔐

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: fixing a race condition in the Azure Kusto plugin that causes SIGSEGV between timer and flush operations.
Docstring Coverage	✅ Passed	Docstring coverage is 84.21% which is sufficient. The required threshold is 80.00%.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

plugins/out_azure_kusto/azure_kusto.c (1)
668-683: Potential use-after-free in retry loop after unlocking file.

After gzip compression fails:

Line 674: azure_kusto_store_file_unlock(file) sets file->locked = FLB_FALSE

Line 675: file->failures += 1 (small window but OK)

Line 681: sleep(backoff_time + jitter)

Line 683: continue retries using the same file pointer

During the sleep, another thread (e.g., cb_azure_kusto_exit or another flush) could call azure_kusto_store_file_delete which now sees locked=FALSE and proceeds to free the file. When the retry loop continues, it accesses freed memory.

The same pattern exists at lines 710-721 and lines 739-750.

Consider keeping the file locked during retries and only unlocking on final failure or success:
🔎 Suggested approach

Keep the file locked during the entire retry loop. Only unlock when:

Maximum retries reached (before delete/inactive), or

Successful ingestion (before delete)
             if (ret != 0) {
                 flb_plg_error(ctx->ins, "scheduler_kusto_ingest :: cannot gzip payload");
                 flb_free(buffer);
                 flb_sds_destroy(payload);
                 flb_sds_destroy(tag_sds);
                 retry_count++;
-                if (file){
-                    azure_kusto_store_file_unlock(file);
+                if (file) {
+                    /* Keep file locked during retries to prevent deletion by other threads */
                     file->failures += 1;
                 }
                 /* Add jitter: random value between 0 and backoff_time */
Apply similar changes to the other retry failure paths (lines 710-721, 739-750).

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9f26a9 and 97759fa.

📒 Files selected for processing (5)

plugins/out_azure_kusto/azure_kusto.c (18 hunks)
plugins/out_azure_kusto/azure_kusto.h (2 hunks)
plugins/out_azure_kusto/azure_kusto_ingest.c (1 hunks)
plugins/out_azure_kusto/azure_kusto_store.c (11 hunks)
tests/runtime/out_azure_kusto.c (4 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-12-17T18:45:02.283Z

Learnt from: edsiper
Repo: fluent/fluent-bit PR: 11286
File: src/flb_http_client.c:1657-1715
Timestamp: 2025-12-17T18:45:02.283Z
Learning: In the Fluent Bit OAuth2 implementation (src/flb_oauth2.c), the function flb_oauth2_get_access_token returns a pointer to the cached token stored in the OAuth2 context (ctx->access_token), not a new allocation. The OAuth2 context manages the token lifecycle, so callers must not free the returned token pointer.

Applied to files:

plugins/out_azure_kusto/azure_kusto.c

🧬 Code graph analysis (2)

plugins/out_azure_kusto/azure_kusto_store.c (1)

src/flb_fstore.c (3)

flb_fstore_file_delete (259-277)

flb_fstore_file_create (153-190)

flb_fstore_file_inactive (238-256)

plugins/out_azure_kusto/azure_kusto.c (1)

plugins/out_azure_kusto/azure_kusto_store.c (5)

azure_kusto_store_file_lock (808-811)

azure_kusto_store_file_unlock (822-825)

azure_kusto_store_file_cleanup (694-716)

azure_kusto_store_file_delete (732-755)

azure_kusto_store_file_inactive (658-680)

🔇 Additional comments (15)

plugins/out_azure_kusto/azure_kusto.h (1)

130-147: LGTM! Well-documented synchronization primitives.

The addition of upload_timer for timer lifecycle management and files_mutex for protecting concurrent access to the file list are appropriate for addressing the race condition. The comment clearly documents the mutex's purpose.

plugins/out_azure_kusto/azure_kusto_ingest.c (1)

641-651: LGTM! Correct ordering prevents use-after-free on queue failure.

Moving the file deletion to after successful queue enqueue is the right fix. The previous approach deleted the file after blob creation but before queue enqueue, which could lose data if queue enqueue failed. The unlock-before-delete sequence aligns with the new azure_kusto_store_file_delete semantics that skip locked files.

plugins/out_azure_kusto/azure_kusto_store.c (4)

143-153: Verify safe list iteration after deletion inside the loop.

When fsf->data == NULL, you delete the file via flb_fstore_file_delete and then continue. The mk_list_foreach_safe macro should handle safe deletion during iteration, but verify that the tmp pointer (from mk_list_foreach_safe) correctly allows continuing after deletion. Looking at the code, this function uses a regular mk_list_foreach_safe which captures the next pointer before the iteration body, so this should be safe.

248-286: LGTM! Proper synchronization during file creation.

The mutex correctly guards the critical section where the file is created and fsf->data is assigned. All error paths properly release the lock before returning. This prevents race conditions where another thread could access a partially initialized file.

736-752: Verify all callers handle the -1 return when file is locked.

The skip-if-locked guard returns -1 when the file cannot be deleted because it's locked. This is correct for preventing races, but callers must handle this return value appropriately. For example, azure_kusto_queued_ingestion logs an error if delete fails, which is acceptable. Ensure all other call sites either:

Already unlocked the file before calling (as done in azure_kusto.c line 755)

Handle the -1 return appropriately

528-548: LGTM! Proper cleanup with dangling pointer mitigation.

Setting fsf->data = NULL after freeing the azure_kusto_file is a good defensive measure to prevent use-after-free via dangling pointers. The mutex protection around the cleanup iteration is appropriate.

tests/runtime/out_azure_kusto.c (2)

31-45: LGTM! Appropriate test helper for directory cleanup.

The nftw-based recursive removal is a clean approach for test cleanup. The stat check before nftw prevents errors on non-existent paths.

329-416: Good regression test for the race condition.

The test uses aggressive timing parameters (Flush=0.5, upload_timeout=1s) to maximize the chance of triggering the race between the timer callback and flush/exit paths. The test comments clearly document the expected behavior before and after the fix. The Windows guard is appropriate since the test relies on POSIX-specific APIs (nftw, ftw.h).

Note: This test relies on ASan to detect the UAF. Consider adding a comment noting that this test should be run with ASan enabled for full verification.

plugins/out_azure_kusto/azure_kusto.c (7)

146-147: LGTM! Proper mutex release on error path.

The mutex must be released before returning on error. This was a bug in the previous code where the token_mutex would remain locked on memory allocation failure.

1556-1560: LGTM! Correct timer cancellation ordering.

Canceling the upload timer before cleanup prevents the timer callback from racing with the exit cleanup. Setting upload_timer = NULL after destruction is good defensive practice.

426-557: LGTM! Correct lock protocol in ingest_all_chunks.

The synchronization pattern is correct:

Lock mutex (line 427)

Mark chunk as locked (line 459) before releasing mutex

Release mutex for I/O (line 460)

Re-acquire mutex on all error paths before returning

Re-acquire mutex before next iteration (line 552)

This ensures the chunk remains protected (via its locked flag) while I/O is in progress, preventing other threads from deleting it.

615-640: LGTM! Proper lock acquisition and file marking before I/O.

The pattern of locking the mutex, marking the file as locked, then releasing the mutex before I/O is correct. This prevents other threads from deleting the file while I/O is in progress.

1339-1346: LGTM! Correct unlock-before-delete pattern.

Unlocking the file before calling delete/inactive is necessary since those functions now skip locked files. This ensures the file can be properly cleaned up.

973-991: LGTM! Proper initialization of new synchronization primitives.

Both upload_timer (NULL) and files_mutex are properly initialized. The mutex is initialized alongside the other mutexes, maintaining consistency.

1259-1260: LGTM! Timer handle capture for lifecycle management.

Passing &ctx->upload_timer to flb_sched_timer_cb_create allows the plugin to cancel the timer during shutdown, preventing the race condition where the timer callback could access freed resources.

chatgpt-codex-connector

💡 Codex Review

fluent-bit/plugins/out_azure_kusto/azure_kusto.c

Lines 449 to 453 in 97759fa

    
           if (ctx->delete_on_max_upload_error){ 
        
               azure_kusto_store_file_delete(ctx, chunk); 
        
           } 
        
           else{ 
        
               azure_kusto_store_file_inactive(ctx, chunk);

Avoid deadlocking when dropping failed chunks

Inside ingest_all_chunks the function still holds ctx->files_mutex when it routes exhausted retries to azure_kusto_store_file_delete / azure_kusto_store_file_inactive (mutex taken at line 426). Those helpers now take the same files_mutex internally (see azure_kusto_store.c:736-742 and 664-669), so the first chunk that exceeds scheduler_max_retries will block on a recursive lock and leave the mutex permanently held, freezing subsequent flush/timer processing.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

github-actions · 2026-05-10T02:23:53Z

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

Signed-off-by: Yashwanth Anantharaju <yaananth@users.noreply.github.com>

yaananth requested review from cosmo0920 and edsiper as code owners December 19, 2025 15:21

github-actions Bot added the docs-required label Dec 19, 2025

yaananth marked this pull request as draft December 19, 2025 15:21

coderabbitai Bot reviewed Dec 19, 2025

View reviewed changes

chatgpt-codex-connector Bot reviewed Dec 19, 2025

View reviewed changes

Comment thread plugins/out_azure_kusto/azure_kusto.c Outdated

yaananth force-pushed the fix/azure-kusto-timer-flush-race branch from 97759fa to 361baa1 Compare December 19, 2025 15:32

github-actions Bot added the Stale label May 10, 2026

yaananth mentioned this pull request May 29, 2026

out_azure_kusto: fix buffered upload races #11884

Closed

yaananth force-pushed the fix/azure-kusto-timer-flush-race branch 6 times, most recently from 79ae281 to 731c8de Compare May 29, 2026 19:26

yaananth added 2 commits May 29, 2026 15:41

out_azure_kusto: fix buffered upload races

f506208

Signed-off-by: Yashwanth Anantharaju <yaananth@users.noreply.github.com>

tests: runtime: add azure kusto buffering regression

0370157

Signed-off-by: Yashwanth Anantharaju <yaananth@users.noreply.github.com>

yaananth force-pushed the fix/azure-kusto-timer-flush-race branch from 731c8de to 0370157 Compare May 29, 2026 19:42

github-actions Bot removed the Stale label May 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

out_azure_kusto: fix timer/flush race condition causing SIGSEGV#11303

out_azure_kusto: fix timer/flush race condition causing SIGSEGV#11303
yaananth wants to merge 2 commits into
fluent:masterfrom
yaananth:fix/azure-kusto-timer-flush-race

yaananth commented Dec 19, 2025 •

edited

Loading

Uh oh!

coderabbitai Bot commented Dec 19, 2025 •

edited

Loading

Review skipped

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

github-actions Bot commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	if (ctx->delete_on_max_upload_error){
	azure_kusto_store_file_delete(ctx, chunk);
	}
	else{
	azure_kusto_store_file_inactive(ctx, chunk);

Conversation

yaananth commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Uh oh!

coderabbitai Bot commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

github-actions Bot commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yaananth commented Dec 19, 2025 •

edited

Loading

coderabbitai Bot commented Dec 19, 2025 •

edited

Loading