[Bug] BlockFileCacheTest flaky: background thread interference and async open timeout #64189

@heguanhui

Search before asking

  • I had searched in the issues and found no similar issues.

Version

master

What's Wrong?

BlockFileCacheTest has three types of flaky failures:


Type 1: Background thread interference — ttl_modify failure

The test_file_cache() helper creates a BlockFileCache that starts background threads. These threads asynchronously modify cache state between test assertions, causing file_block->state() to return EMPTY instead of SKIP_CACHE:

[ RUN      ] BlockFileCacheTest.ttl_modify
be/test/io/cache/block_file_cache_test.cpp:447: Failure
Expected: file_block->state() == io::FileBlock::State::SKIP_CACHE
Actual:    EMPTY == SKIP_CACHE

Root cause: the background evict_in_advance thread evicts releasable DOWNLOADED blocks, freeing space so that try_reserve() unexpectedly succeeds, keeping the state as EMPTY instead of transitioning to SKIP_CACHE.


Type 2: Background thread interference — io_error failure

Same root cause as Type 1, but it manifests as an incorrect block count: EMPTY blocks are not removed because references held by background threads push use_count() above 2:

[ RUN      ] BlockFileCacheTest.io_error
be/test/io/cache/block_file_cache_test.cpp:530: Failure
Expected: mgr.get_file_blocks_num(key) == 9
Actual:    10 == 9

Root cause: a FileBlocksHolder destructor removes an EMPTY block only when use_count()==2. If the background thread or another holder still holds a reference (use_count()>2), the removal is skipped and the block stays in the queue.
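
To illustrate why an extra reference blocks removal, here is a minimal sketch of the use_count() condition. It is not the real FileBlocksHolder code: Block, remove_if_unreferenced, and the vector-based queue are hypothetical stand-ins, and only the use_count()==2 check comes from the description above.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical stand-in for a cached block; not the real io::FileBlock.
struct Block {
    int id;
};

// Removes the block from `queue` only when the queue and the holder are the
// sole owners (use_count() == 2). Returns true if the block was removed.
bool remove_if_unreferenced(std::vector<std::shared_ptr<Block>>& queue,
                            const std::shared_ptr<Block>& holder_ref) {
    if (holder_ref.use_count() != 2) {
        return false; // another owner (e.g. a background thread) still holds it
    }
    queue.erase(std::remove(queue.begin(), queue.end(), holder_ref), queue.end());
    return true;
}
```

In this sketch, a third shared_ptr copy taken by any other thread is enough to make the call return false, which matches the "10 == 9" symptom: one block that should have been dropped is still counted.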


Type 3: Insufficient async open timeout — evict_privilege_order_for_ttl failure

initialize() starts a background disk I/O loading thread that sets _async_open_done=true only on completion. Tests wait only 100ms (100 iterations × 1ms), which is insufficient under high CPU load:

[ RUN      ] BlockFileCacheTest.evict_privilege_order_for_ttl
be/test/io/cache/block_file_cache_test.cpp:6980: Failure
Expected: cache.get_or_set(key1, offset, 100000, context1) to succeed
  (async open not completed, cache not ready)

Root cause: the 100ms total timeout is too short when the system is under load. The background loading thread needs more time to complete disk I/O and set _async_open_done=true.
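
A bounded polling helper for this wait could look roughly like the sketch below. The signature is an assumption: the real helper presumably queries the cache object, whereas this version polls a plain std::atomic<bool> so that it stays self-contained. The defaults of 1000 iterations and 10ms are the values proposed in this issue.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Polls `done` until it becomes true or the retry budget is exhausted.
// Defaults: 1000 iterations x 10 ms = 10 s total, as proposed in this issue.
// Returns true if the flag was observed set, false on timeout.
bool wait_for_async_open(const std::atomic<bool>& done,
                         int max_iterations = 1000,
                         std::chrono::milliseconds interval = std::chrono::milliseconds(10)) {
    for (int i = 0; i < max_iterations; ++i) {
        if (done.load(std::memory_order_acquire)) {
            return true;
        }
        std::this_thread::sleep_for(interval);
    }
    // One last check after the final sleep.
    return done.load(std::memory_order_acquire);
}
```

Returning a bool lets the caller assert on the result, so a timeout surfaces as an explicit "async open did not finish" failure instead of a later, less obvious assertion.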


What You Expected?

Tests should be deterministic and not affected by background threads or timing issues.

How to Reproduce?

Run BlockFileCacheTest repeatedly under CPU load. The flaky failures appear intermittently.

Anything Else?

All three are test defects, not business code defects. The fix:

  1. For Type 1 & 2: Save the config values, set all background thread intervals to 10000000ms for the duration of test_file_cache and test_file_cache_memory_storage, and restore the saved values afterwards, so that background threads cannot interfere.
  2. For Type 3: Extract wait_for_async_open() helper with 1000 iterations × 10ms (10s total timeout), replace all 91 inline wait loops. The 10ms sleep interval also avoids exacerbating CPU pressure under load compared to the original 1ms interval.
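
The save-and-restore step in item 1 can be sketched with a small RAII guard. Everything named here is hypothetical: ScopedConfigOverride and demo_config::background_interval_ms are illustrative only, since the issue does not list the actual config names or say how the fix restores them.

```cpp
#include <cstdint>

// Hypothetical stand-in for a mutable config value; the real config names
// are not given in this issue.
namespace demo_config {
std::int64_t background_interval_ms = 100;
}

// Overrides a config value on construction and restores the saved value on
// destruction, including when the test exits early.
template <typename T>
class ScopedConfigOverride {
public:
    ScopedConfigOverride(T& target, T value) : _target(target), _saved(target) {
        _target = value;
    }
    ~ScopedConfigOverride() { _target = _saved; }

    ScopedConfigOverride(const ScopedConfigOverride&) = delete;
    ScopedConfigOverride& operator=(const ScopedConfigOverride&) = delete;

private:
    T& _target; // config variable being overridden
    T _saved;   // value to restore
};
```

A test helper would create one guard per interval at its top, for example ScopedConfigOverride<std::int64_t> guard(demo_config::background_interval_ms, 10000000);, so that later tests see the original values again.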

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
