Support automated interpolation search #14383

Closed
joshkang97 wants to merge 8 commits into facebook:main from joshkang97:auto_interpolation_search

@joshkang97
Contributor

@joshkang97 joshkang97 commented Feb 24, 2026

Summary

Add automatic per-block interpolation search selection (kAuto mode) for index blocks. During SST construction, each index block's key distribution is analyzed using the coefficient of variation (CV) of gaps between restart-point keys. Blocks with uniformly distributed keys are flagged via a new bit in the data block footer, and at read time, kAuto resolves to interpolation search for uniform blocks and binary search otherwise.
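The read-time choice described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `SearchType` enum, `ResolveSearchType`, and both lower-bound helpers are hypothetical names, and keys are modeled as plain `uint64_t` values rather than encoded index entries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the search-type option (not the RocksDB enum).
enum class SearchType { kBinary, kInterpolation, kAuto };

// kAuto resolves to interpolation search only for blocks flagged uniform.
inline SearchType ResolveSearchType(SearchType configured, bool block_is_uniform) {
  if (configured != SearchType::kAuto) return configured;
  return block_is_uniform ? SearchType::kInterpolation : SearchType::kBinary;
}

// Index of the first key >= target, or keys.size() if there is none.
inline size_t BinaryLowerBound(const std::vector<uint64_t>& keys, uint64_t target) {
  size_t lo = 0, hi = keys.size();
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (keys[mid] < target) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// Same contract as BinaryLowerBound, but each probe is placed where the
// target would sit if the keys in [lo, hi) were evenly spaced.
inline size_t InterpolationLowerBound(const std::vector<uint64_t>& keys, uint64_t target) {
  size_t lo = 0, hi = keys.size();
  while (lo < hi) {
    uint64_t first = keys[lo], last = keys[hi - 1];
    if (target <= first) return lo;
    if (target > last) return hi;
    // Here first < target <= last, so (last - first) is non-zero.
    long double frac = static_cast<long double>(target - first) /
                       static_cast<long double>(last - first);
    size_t probe = lo + static_cast<size_t>(frac * static_cast<long double>(hi - 1 - lo));
    assert(probe < hi);
    if (keys[probe] < target) lo = probe + 1; else hi = probe;
  }
  return lo;
}
```

On evenly spaced keys the interpolated probe tends to land on or next to the answer, while on clustered keys it can degrade toward a linear scan, which is consistent with the motivation for deciding per block.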

Key changes

  • New BlockSearchType::kAuto enum value: Resolves per-block at read time to either kInterpolation or kBinary based on the block's uniformity flag. Falls back to kBinary on older versions that don't recognize it.
  • Write-path uniformity analysis: BlockBuilder::ScanForUniformity() uses Welford's online algorithm to incrementally compute the CV of key gaps at restart points. The result is stored in a new bit (bit 30) of the data block footer's packed restart count.
  • New table option uniform_cv_threshold (default: -1, i.e. disabled): Controls how strict the uniformity check is. Set to negative to disable. Exposed in C++, Java (JNI), and db_bench.
  • Code reorganization: Block entry decode helpers (DecodeEntry, DecodeKey, DecodeKeyV4, ReadBe64FromKey) moved from block.cc to a new shared header block_util.h so they can be reused by BlockBuilder on the write path.
  • New histogram BLOCK_KEY_DISTRIBUTION_CV: Records the CV (scaled by 10000) of each index block's key distribution for observability.
  • Java bindings: IndexSearchType.kAuto, uniformCvThreshold getter/setter, JNI portal constructor signature updated, and HistogramType.BLOCK_KEY_DISTRIBUTION_CV added.
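The write-path analysis can be illustrated with a small sketch of Welford's online algorithm applied to the gaps between consecutive keys. All names here (`WelfordStats`, `GapCv`, `IsUniform`) are hypothetical, keys are modeled as `uint64_t`, and the use of population variance and of a minimum key count are assumptions for illustration; the actual `ScanForUniformity()` may differ on each of these points, including the scale of the threshold.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Online mean/variance accumulator (Welford's algorithm).
struct WelfordStats {
  uint64_t n = 0;
  double mean = 0.0;
  double m2 = 0.0;  // running sum of squared deviations from the mean

  void Add(double x) {
    ++n;
    double delta = x - mean;
    mean += delta / static_cast<double>(n);
    m2 += delta * (x - mean);
  }

  // Population variance (an assumption; a sample variance would divide by n - 1).
  double Variance() const { return n > 0 ? m2 / static_cast<double>(n) : 0.0; }

  // CV = stddev / mean. Returns +infinity when the mean is not positive,
  // so degenerate inputs are never treated as uniform.
  double CoefficientOfVariation() const {
    if (n == 0 || mean <= 0.0) return std::numeric_limits<double>::infinity();
    return std::sqrt(Variance()) / mean;
  }
};

// CV of the gaps between consecutive (sorted) keys.
inline double GapCv(const std::vector<uint64_t>& keys) {
  WelfordStats stats;
  for (size_t i = 1; i < keys.size(); ++i) {
    assert(keys[i] >= keys[i - 1]);  // keys must be sorted
    stats.Add(static_cast<double>(keys[i] - keys[i - 1]));
  }
  return stats.CoefficientOfVariation();
}

// A negative threshold disables the check, mirroring the option's semantics.
inline bool IsUniform(const std::vector<uint64_t>& keys, double cv_threshold) {
  if (cv_threshold < 0.0) return false;
  if (keys.size() < 3) return false;  // assumption: too few gaps to judge
  return GapCv(keys) <= cv_threshold;
}
```

For example, keys `{0, 10, 20, 30, 40}` have identical gaps and a CV of 0, whereas `{0, 1, 2, 3, 1000000}` is dominated by one large gap and has a CV well above 1.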

Test Plan

  • IndexBlockTest.IndexValueEncodingTest parameterized to include kAuto search type alongside kBinary and kInterpolation, verifying correct seek/iteration behavior across all combinations of key distributions, restart intervals, and key lengths.
  • Uniformity detection validated: blocks with uniform key distribution correctly set is_uniform = true, blocks with clustered/non-uniform keys set is_uniform = false.
  • Stress test coverage
  • Updated check_format_compatible to also include a "uniform" dataset. With the default uniform_cv_threshold=-1, this does not cause any incompatibility issues. When manually changing the threshold (e.g. uniform_cv_threshold=1000), I see bad block contents, which is expected.
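One plausible reading of the "bad block contents" result is that a reader unaware of the new flag treats bit 30 as part of the restart count. The sketch below illustrates that under an assumed layout: bit 31 reserved for an existing flag, bit 30 for the uniformity flag, and the low 30 bits for the count. The constants and function names are hypothetical and are not taken from data_block_footer.h.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout, for illustration only:
//   bit 31    : an existing flag (left untouched here)
//   bit 30    : new "uniform" flag
//   bits 0-29 : number of restart points
constexpr uint32_t kTopFlagBit = 1u << 31;
constexpr uint32_t kUniformBit = 1u << 30;
constexpr uint32_t kNewCountMask = kUniformBit - 1u;  // low 30 bits
constexpr uint32_t kOldCountMask = kTopFlagBit - 1u;  // low 31 bits

// Writer that knows about the uniform flag.
inline uint32_t PackFooter(uint32_t num_restarts, bool is_uniform) {
  assert(num_restarts <= kNewCountMask);
  return is_uniform ? (num_restarts | kUniformBit) : num_restarts;
}

// Reader that knows about the uniform flag.
inline uint32_t NewReaderNumRestarts(uint32_t packed) { return packed & kNewCountMask; }
inline bool NewReaderIsUniform(uint32_t packed) { return (packed & kUniformBit) != 0; }

// Reader that predates the flag and treats bit 30 as part of the count.
inline uint32_t OldReaderNumRestarts(uint32_t packed) { return packed & kOldCountMask; }
```

Under these assumptions, a block with 16 restarts and the flag set decodes to 16 in the new reader but to 16 + 2^30 in the old one, an implausible count that a sanity check would reject. With the flag never set (the disabled default), both readers agree.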

Benchmark

readrandom with fillrandom,compact -seed=1 --statistics:

| Benchmark | Branch | Params | avg ops/s | % change vs main | CV P50 |
|-----------|--------|--------|-----------|------------------|--------|
| readrandom | main | binary_search, shortening=1 | 335,791 | baseline | N/A |
| readrandom | feature | binary_search, shortening=1 (default) | 335,749 | -0.0% | 1,500 |
| readrandom | feature | auto_search, shortening=1 (kAuto) | 366,832 | +9.2% | 1,500 |
| readrandom | feature | interpolation_search, shortening=1 | 366,598 | +9.2% | 1,500 |
| readrandom | feature | auto_search, shortening=2 (kAuto) | 344,631 | +2.6% | 1,030,000 |
| readrandom | feature | interpolation_search, shortening=2 | 201,178 | -40.1% | 1,030,000 |

As seen with shortening=2, a non-uniform distribution produces a high CV, so kAuto does not use interpolation search for those blocks and avoids the regression seen when interpolation search is forced.
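As a rough worked reading of the CV P50 column, assuming it reports the median of the BLOCK_KEY_DISTRIBUTION_CV histogram and therefore carries the 10000x scaling described under Key changes:

```latex
\mathrm{CV} = \frac{\sigma_{\text{gap}}}{\mu_{\text{gap}}},
\qquad
\text{recorded} = 10000 \cdot \mathrm{CV}

\text{shortening=1:}\quad \frac{1{,}500}{10000} = 0.15
\qquad
\text{shortening=2:}\quad \frac{1{,}030{,}000}{10000} = 103
```

Under that assumption, the gap standard deviation is about 15% of the mean gap with shortening=1 and about 103 times the mean gap with shortening=2.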

Write benchmark

There is some write overhead: each restart entry of a block is scanned upon Finish. In practice this is very low because it is currently only applied to index blocks.

See the CPU profile (https://fburl.com/strobelight/io5hwj9h) of -benchmarks=fillseq,compact -compression_type=none -disable_wal=1. Only 0.08% of CPU is attributed to ScanForUniformity.

@meta-cla meta-cla Bot added the CLA Signed label Feb 24, 2026
@joshkang97 joshkang97 force-pushed the auto_interpolation_search branch from f166cef to 6af3fe8 Compare February 24, 2026 23:05
@github-actions

github-actions Bot commented Feb 25, 2026

⚠️ clang-tidy: 1 warning(s) on changed lines

Completed in 1294.5s.

Summary by check

| Check | Count |
|-------|-------|
| cert-err58-cpp | 1 |
| Total | 1 |

Details

tools/ldb_cmd.cc (1 warning(s))
tools/ldb_cmd.cc:91:31: warning: initialization of 'ARG_UNIFORM_CV_THRESHOLD' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]

@joshkang97 joshkang97 force-pushed the auto_interpolation_search branch 4 times, most recently from c4281ee to 06a66a7 Compare February 27, 2026 01:02
@joshkang97 joshkang97 force-pushed the auto_interpolation_search branch from 06a66a7 to 74e959a Compare February 28, 2026 00:13
@joshkang97 joshkang97 marked this pull request as ready for review February 28, 2026 00:14
@joshkang97 joshkang97 changed the title from "[WIP] Support automated interpolation search" to "Support automated interpolation search" Feb 28, 2026
@meta-codesync

meta-codesync Bot commented Feb 28, 2026

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D94738890.

Contributor

@xingbowang xingbowang left a comment

Should we add a release note?
Any plan to expand beyond index block?

Comment thread include/rocksdb/table.h Outdated
@joshkang97
Contributor Author

Any plan to expand beyond index block?

It is doable, but it will require a format version bump, and gains are likely smaller because there are far fewer restart points to search through in data blocks.

@meta-codesync

meta-codesync Bot commented Mar 3, 2026

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D94738890.

Contributor

@pdillinger pdillinger left a comment

I don't see crash test coverage. This is especially required for features skipping the "experimental" state of production readiness.

I don't see any testing that the uniform_cv_threshold actually affects the uniformity hint, and that negative values act as a kill switch.

I'm investigating the potential format compatibility concerns.

Comment thread include/rocksdb/table.h Outdated
Comment thread include/rocksdb/table.h Outdated
Comment thread include/rocksdb/table.h Outdated
Comment thread include/rocksdb/table.h
Comment thread table/block_based/block_builder.cc Outdated
Comment thread table/block_based/block_util.h Outdated
Comment thread table/block_based/data_block_footer.h Outdated
@joshkang97 joshkang97 requested a review from pdillinger March 4, 2026 22:39
@meta-codesync

meta-codesync Bot commented Mar 4, 2026

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D94738890.

Contributor

@pdillinger pdillinger left a comment

Looks awesome overall!

EOF

# Generate a file with uniformly distributed keys
uniform_input_data=$input_data_path/uniform_data
Contributor

This appears incomplete, not using the data anywhere

Contributor Author

Hm doesn't this automatically get ingested in

for f in `ls -1 $input_data_dir`

Comment thread tools/generate_random_db.sh Outdated
Comment thread include/rocksdb/table.h
Comment thread tools/generate_random_db.sh Outdated
@meta-codesync

meta-codesync Bot commented Mar 5, 2026

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D94738890.

@meta-codesync

meta-codesync Bot commented Mar 6, 2026

@joshkang97 merged this pull request in 3ad23b2.

doxtop pushed a commit to flyingw/rocksdb that referenced this pull request Apr 7, 2026
Summary:
Add automatic per-block interpolation search selection (`kAuto` mode) for index blocks. During SST construction, each index block's key distribution is analyzed using the coefficient of variation (CV) of gaps between restart-point keys. Blocks with uniformly distributed keys are flagged via a new bit in the data block footer, and at read time, `kAuto` resolves to interpolation search for uniform blocks and binary search otherwise.

## Key changes

- **New `BlockSearchType::kAuto` enum value**: Resolves per-block at read time to either `kInterpolation` or `kBinary` based on the block's uniformity flag. Falls back to `kBinary` on older versions that don't recognize it.
- **Write-path uniformity analysis**: `BlockBuilder::ScanForUniformity()` uses Welford's online algorithm to incrementally compute the CV of key gaps at restart points. The result is stored in a new bit (bit 30) of the data block footer's packed restart count.
- **New table option `uniform_cv_threshold`** (default: `-1`, i.e. disabled): Controls how strict the uniformity check is. Set to negative to disable. Exposed in C++, Java (JNI), and `db_bench`.
- **Code reorganization**: Block entry decode helpers (`DecodeEntry`, `DecodeKey`, `DecodeKeyV4`, `ReadBe64FromKey`) moved from `block.cc` to a new shared header `block_util.h` so they can be reused by `BlockBuilder` on the write path.
- **New histogram `BLOCK_KEY_DISTRIBUTION_CV`**: Records the CV (scaled by 10000) of each index block's key distribution for observability.
- **Java bindings**: `IndexSearchType.kAuto`, `uniformCvThreshold` getter/setter, JNI portal constructor signature updated, and `HistogramType.BLOCK_KEY_DISTRIBUTION_CV` added.

Pull Request resolved: facebook#14383

Test Plan:
- `IndexBlockTest.IndexValueEncodingTest` parameterized to include `kAuto` search type alongside `kBinary` and `kInterpolation`, verifying correct seek/iteration behavior across all combinations of key distributions, restart intervals, and key lengths.
- Uniformity detection validated: blocks with uniform key distribution correctly set `is_uniform = true`, blocks with clustered/non-uniform keys set `is_uniform = false`.
- Stress test coverage
- Updated check_format_compatible to also include a "uniform" dataset. With the default `uniform_cv_threshold=-1`, this does not cause any incompatibility issues. When manually changing the threshold (e.g. `uniform_cv_threshold=1000`), I see `bad block contents`, which is expected.

## Benchmark

readrandom with `fillrandom,compact -seed=1 --statistics`:

| Benchmark | Branch | Params | avg ops/s | % change vs main | CV P50 |
|-----------|--------|--------|-----------|------------------|--------|
| readrandom | main | `binary_search, shortening=1` | 335,791 | baseline | N/A |
| readrandom | feature | `binary_search, shortening=1` (default) | 335,749 | -0.0% | 1,500 |
| readrandom | feature | `auto_search, shortening=1` (kAuto) | 366,832 | **+9.2%** | 1,500 |
| readrandom | feature | `interpolation_search, shortening=1` | 366,598 | **+9.2%** | 1,500 |
| readrandom | feature | `auto_search, shortening=2` (kAuto) | 344,631 | **+2.6%** | 1,030,000 |
| readrandom | feature | `interpolation_search, shortening=2` | 201,178 | **-40.1%** | 1,030,000 |

As seen with shortening=2, a non-uniform distribution produces a high CV, so `kAuto` does not use interpolation search for those blocks and avoids the regression seen when interpolation search is forced.

## Write benchmark

There is some write overhead: each restart entry of a block is scanned upon Finish. In practice this is very low because it is currently only applied to index blocks.

See the CPU profile (https://fburl.com/strobelight/io5hwj9h) of `-benchmarks=fillseq,compact -compression_type=none -disable_wal=1`. Only 0.08% of CPU is attributed to `ScanForUniformity`.

Reviewed By: pdillinger

Differential Revision: D94738890

Pulled By: joshkang97

fbshipit-source-id: 9661ac593c5fef89d49f3a8a027f1338a0c96766