Search before asking
Version
master
What's Wrong?
BlockFileCacheTest has three types of flaky failures:
Type 1: Background thread interference — ttl_modify failure
The test_file_cache() helper creates a BlockFileCache that starts background threads. These threads asynchronously modify cache state between test assertions, causing file_block->state() to return EMPTY instead of SKIP_CACHE:
[ RUN ] BlockFileCacheTest.ttl_modify
be/test/io/cache/block_file_cache_test.cpp:447: Failure
Expected: file_block->state() == io::FileBlock::State::SKIP_CACHE
Actual: EMPTY == SKIP_CACHE
Root cause: the background evict_in_advance thread evicts releasable DOWNLOADED blocks, freeing space so that try_reserve() unexpectedly succeeds, keeping the state as EMPTY instead of transitioning to SKIP_CACHE.
Type 2: Background thread interference — io_error failure
Same root cause as Type 1, but manifests as incorrect block count because EMPTY blocks are not removed due to use_count>2 from background thread references:
[ RUN ] BlockFileCacheTest.io_error
be/test/io/cache/block_file_cache_test.cpp:530: Failure
Expected: mgr.get_file_blocks_num(key) == 9
Actual: 10 == 9
Root cause: when a FileBlocksHolder destructor tries to remove EMPTY blocks (use_count()==2), the background thread or another holder still holds a reference (use_count()>2), preventing removal and leaving the block in the queue.
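The ownership check described above can be illustrated with a minimal sketch. It assumes the block is owned through std::shared_ptr, and it uses a hypothetical Block type and try_remove function rather than the real io::FileBlock or FileBlocksHolder code:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for a cache block; not the real io::FileBlock.
struct Block {
    int id;
};

// Mimics the check described above: the block is removed only when the queue
// and the calling holder are its sole owners (use_count() == 2).
// `holder_ref` must be the holder's own shared_ptr, not the queue element,
// and is passed by reference so the call itself adds no extra owner.
bool try_remove(std::vector<std::shared_ptr<Block>>& queue,
                const std::shared_ptr<Block>& holder_ref) {
    assert(holder_ref != nullptr);
    if (holder_ref.use_count() != 2) {
        return false; // another owner (e.g. a background thread) still holds it
    }
    for (auto it = queue.begin(); it != queue.end(); ++it) {
        if (*it == holder_ref) {
            queue.erase(it);
            return true;
        }
    }
    return false;
}
```

With a third reference alive, try_remove returns false and the block stays in the queue, which matches the observed count of 10 instead of 9.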
Type 3: Insufficient async open timeout — evict_privilege_order_for_ttl failure
initialize() starts a background disk I/O loading thread that sets _async_open_done=true only on completion. Tests wait only 100ms (100 iterations × 1ms), which is insufficient under high CPU load:
[ RUN ] BlockFileCacheTest.evict_privilege_order_for_ttl
be/test/io/cache/block_file_cache_test.cpp:6980: Failure
Expected: cache.get_or_set(key1, offset, 100000, context1) to succeed
(async open not completed, cache not ready)
Root cause: the 100ms total timeout is too short when the system is under load. The background loading thread needs more time to complete disk I/O and set _async_open_done=true.
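A bounded polling helper with a longer budget might look like the sketch below. The name wait_for_flag and the std::atomic<bool> flag are assumptions for illustration; the real test would poll whatever accessor the cache exposes for its async open state. The defaults follow the 1000 iterations × 10ms proposed in this issue:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: polls `done` up to `max_iterations` times, sleeping
// `interval` between polls. Returns true as soon as the flag is set and
// false if the whole budget elapses first.
bool wait_for_flag(const std::atomic<bool>& done, int max_iterations = 1000,
                   std::chrono::milliseconds interval = std::chrono::milliseconds(10)) {
    assert(max_iterations > 0);
    for (int i = 0; i < max_iterations; ++i) {
        if (done.load()) {
            return true;
        }
        std::this_thread::sleep_for(interval);
    }
    return done.load(); // one last check after the final sleep
}
```

A test would then assert on the return value, so a timeout fails at the wait itself rather than at a later, less obvious assertion.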
What You Expected?
Tests should be deterministic and not affected by background threads or timing issues.
How to Reproduce?
Run BlockFileCacheTest repeatedly under CPU load. The flaky failures appear intermittently.
Anything Else?
All three are test defects, not business code defects. The fix:
- For Type 1 & 2: Save and restore the config values, and set all background thread intervals to 10000000ms during test_file_cache and test_file_cache_memory_storage to prevent background thread interference.
- For Type 3: Extract a wait_for_async_open() helper with 1000 iterations × 10ms (10s total timeout) and replace all 91 inline wait loops with it. Compared to the original 1ms interval, the 10ms sleep interval also avoids adding CPU pressure under load.
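The save-and-restore step for Type 1 & 2 could be expressed as a small RAII guard, sketched here under assumptions: ScopedConfigOverride and example_background_interval_ms are hypothetical names, and the real config variables and their types may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical RAII guard: overrides a config value for the guard's lifetime
// and restores the saved value on destruction, even if the test exits early.
template <typename T>
class ScopedConfigOverride {
public:
    ScopedConfigOverride(T& target, T value) : _target(target), _saved(target) {
        _target = value;
    }
    ~ScopedConfigOverride() { _target = _saved; }

    ScopedConfigOverride(const ScopedConfigOverride&) = delete;
    ScopedConfigOverride& operator=(const ScopedConfigOverride&) = delete;

private:
    T& _target;
    T _saved;
};

// Stand-in for one background-thread interval setting (illustrative name).
std::int64_t example_background_interval_ms = 100;

// Example use inside a test helper: the interval is so large that the
// background thread effectively never runs while the guard is alive.
void run_with_background_threads_paused() {
    ScopedConfigOverride<std::int64_t> guard(example_background_interval_ms, 10000000);
    assert(example_background_interval_ms == 10000000);
    // ... test assertions that must not race with background threads ...
}
```

Restoring in the destructor keeps the override from leaking into later tests that rely on the default intervals.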
Are you willing to submit PR?