GH-50007: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters by HuaHuaY · Pull Request #50008 · apache/arrow

HuaHuaY · 2026-05-21T10:18:59Z

Rationale for this change

This PR follows apache/arrow-rs#9628. It reduces the on-disk size of Bloom filters, so specifying an ndv larger than the actual number of distinct values no longer increases disk usage.

Bloom filters now support folding mode: allocate a conservatively large filter (sized for worst-case NDV), insert all values during writing, then fold down at flush time to meet a target FPP. This eliminates the need to guess NDV upfront and produces optimally-sized filters automatically.

What changes are included in this PR?

BloomFilterBuilder will try to fold the bloom filter before writing it to the output stream.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

The type of ndv in BloomFilterOptions changes from int32_t to std::optional<int64_t>, and the ndv argument of OptimalNumOfBytes and OptimalNumOfBits in BlockSplitBloomFilter changes from uint32_t to uint64_t.

BloomFilterOptions gains a new field, fold, which defaults to true.
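
For readers who want to picture the new option surface, here is a minimal sketch. SketchBloomFilterOptions and ValidateSketchOptions are hypothetical names, not the real parquet::BloomFilterOptions; the fpp default and the fpp range check are assumptions. Only the ndv type, the fold default, and the ndv >= 0 check come from this PR.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical mirror of the fields discussed in this PR; NOT the real
// parquet::BloomFilterOptions definition.
struct SketchBloomFilterOptions {
  std::optional<int64_t> ndv;  // unset means no caller-provided estimate
  double fpp = 0.05;           // assumed default, for illustration only
  bool fold = true;            // the PR states the new field defaults to true
};

// Throws std::invalid_argument when the options are unusable.
// The fpp range check is an assumption; the ndv check mirrors the one
// quoted later in the review thread.
inline void ValidateSketchOptions(const SketchBloomFilterOptions& opts) {
  if (!(opts.fpp > 0.0 && opts.fpp < 1.0)) {
    throw std::invalid_argument("fpp must be in (0, 1), got " +
                                std::to_string(opts.fpp));
  }
  if (opts.ndv.has_value() && opts.ndv.value() < 0) {
    throw std::invalid_argument("ndv must be >= 0, got " +
                                std::to_string(opts.ndv.value()));
  }
}
```

With folding enabled, leaving ndv unset or passing a generous upper bound would both be reasonable choices, since the filter is shrunk before it is written.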

GitHub Issue: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters #50007

github-actions · 2026-05-21T10:20:34Z

⚠️ GitHub issue #50007 has been automatically assigned in GitHub to PR creator.

HuaHuaY · 2026-05-21T12:56:23Z

@wgtmac @alamb @etseidl @emkornfield Please take a look.

alamb · 2026-05-21T18:11:16Z

I am not likely to have time to review C++ code in the arrow repository unfortunately

wgtmac

Thanks @HuaHuaY for adding this quickly!

wgtmac · 2026-05-29T02:26:19Z

          std::to_string(bloom_filter_options.fpp));
    }
+    if (bloom_filter_options.ndv.has_value() && bloom_filter_options.ndv.value() < 0) {
+      throw ParquetException("Bloom filter number of distinct values must be >= 0, got " +


What is the expected behavior of 0?

It will create the smallest possible bloom filter.
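
To illustrate why ndv == 0 ends up at the minimum size, here is a rough sketch of the usual SBBF sizing arithmetic. SketchOptimalNumOfBytes and both constants are assumptions for illustration; the real bounds and rounding live in BlockSplitBloomFilter and may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed bounds for illustration only; check the real constants in the
// Arrow sources before relying on them.
constexpr uint64_t kSketchMinBytes = 32;                    // one 256-bit block
constexpr uint64_t kSketchMaxBytes = 128ULL * 1024 * 1024;  // assumed upper cap

// Sketch: size a split block bloom filter for `ndv` distinct values at
// false-positive probability `fpp`, rounded up to a power of two.
inline uint64_t SketchOptimalNumOfBytes(uint64_t ndv, double fpp) {
  assert(fpp > 0.0 && fpp < 1.0);
  // Bits needed when every value sets 8 bits inside one block.
  const double bits = -8.0 * static_cast<double>(ndv) /
                      std::log(1.0 - std::pow(fpp, 1.0 / 8.0));
  const double max_bits = static_cast<double>(kSketchMaxBytes * 8);
  uint64_t num_bits =
      bits >= max_bits ? kSketchMaxBytes * 8 : static_cast<uint64_t>(bits);
  // Clamp to the lower bound; ndv == 0 always lands here.
  if (num_bits < kSketchMinBytes * 8) num_bits = kSketchMinBytes * 8;
  // Round up to the next power of two.
  uint64_t pow2 = kSketchMinBytes * 8;
  while (pow2 < num_bits) pow2 <<= 1;
  return pow2 / 8;
}
```

Under these assumptions, ndv == 0 yields zero required bits, which is then clamped up to a single block.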

wgtmac · 2026-05-29T03:53:17Z

cc @mapleFU @adamreeve

wgtmac

Generally LGTM. I left some nits.

mapleFU

Generally LGTM

HuaHuaY · 2026-06-02T11:17:12Z

@pitrou @mapleFU Please take a look.

pitrou · 2026-06-02T11:59:36Z

+    }
+    ++num_folds;
+  }
+  return num_folds;


With this algorithm the actual size reduction will always be a power of 2 (group_size = UINT32_C(1) << num_folds). Why aren't we trying to be more granular?

BlockSplitBloomFilter::Init checks (num_bytes & (num_bytes - 1)) != 0. I didn't find this limitation in the Parquet documentation, but if we break the rule, old Parquet readers will not be able to read the bloom filter.

Gotcha. We probably don't want to produce data that would be incompatible with old readers.

Does the power-of-two constraint serve a purpose? Perhaps we can remove it in a separate PR.

In any case, can you add a comment somewhere mentioning this restriction?

I have added a comment in front of group_size.

  // A fold group is a consecutive run of blocks ORed into one output block.
  // Keeping the group size as (1 << num_folds) preserves a power-of-two bitset
  // size. Folding by this power-of-two group size keeps the old-to-new bucket
  // remapping aligned with bucket lookup and avoids false negatives.
  const uint32_t group_size = UINT32_C(1) << num_folds;

After more thought, I think the actual size reduction must be a power of 2. The block index is calculated by static_cast<uint32_t>(((hash >> 32) * NumBlocks()) >> 32);, as required by the Parquet specification, and we must ensure that after the fold every hash still maps to the block that holds its bits.
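
The remapping argument can be checked with a small standalone sketch. BlockIndex copies the expression quoted above; Block and FoldBlocks are hypothetical names for illustration and are not the code in this PR. When num_blocks is a power of two, the block index after folding equals the old index shifted right by num_folds, so ORing consecutive runs of blocks keeps every inserted bit reachable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Block = std::array<uint32_t, 8>;  // one 256-bit SBBF block

// Block selection, as quoted in the thread.
inline uint32_t BlockIndex(uint64_t hash, uint32_t num_blocks) {
  return static_cast<uint32_t>(((hash >> 32) * num_blocks) >> 32);
}

// Sketch: OR each consecutive run of (1 << num_folds) blocks into one block.
inline std::vector<Block> FoldBlocks(const std::vector<Block>& blocks,
                                     uint32_t num_folds) {
  const uint32_t group_size = UINT32_C(1) << num_folds;
  // The block count must be a power of two and divisible by the group size.
  assert(!blocks.empty() && (blocks.size() & (blocks.size() - 1)) == 0);
  assert(blocks.size() % group_size == 0);
  std::vector<Block> folded(blocks.size() / group_size, Block{});
  for (size_t i = 0; i < blocks.size(); ++i) {
    Block& out = folded[i / group_size];  // old block i lands in i >> num_folds
    for (size_t w = 0; w < out.size(); ++w) out[w] |= blocks[i][w];
  }
  return folded;
}
```

Because folding only ORs bits together, it can raise the false-positive rate but cannot introduce false negatives, as long as lookups use the folded block count.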

pitrou · 2026-06-02T12:12:53Z

+            filter.GetBitsetSize());
+  for (uint64_t hash : hashes) {
+    EXPECT_TRUE(filter.FindHash(hash));
+  }


Should we check that most non-inserted values are not found, with an actual FPP value below kFpp?

I will have each test round compute the FPP over 10,000 values that were not inserted.
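
A check of this kind could look roughly like the sketch below. SketchSbbf is a minimal stand-in written for illustration, not Arrow's BlockSplitBloomFilter, and Mix64 (SplitMix64) stands in for the real hash function; the salt constants are the ones from the Parquet bloom filter specification.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal split block bloom filter for illustration only.
class SketchSbbf {
 public:
  using Block = std::array<uint32_t, 8>;

  explicit SketchSbbf(uint32_t num_blocks) : blocks_(num_blocks) {
    assert(num_blocks > 0);
  }

  void InsertHash(uint64_t hash) {
    Block& block = blocks_[BlockIndex(hash)];
    for (size_t i = 0; i < block.size(); ++i) block[i] |= Mask(hash, i);
  }

  bool FindHash(uint64_t hash) const {
    const Block& block = blocks_[BlockIndex(hash)];
    for (size_t i = 0; i < block.size(); ++i) {
      if ((block[i] & Mask(hash, i)) == 0) return false;
    }
    return true;
  }

 private:
  size_t BlockIndex(uint64_t hash) const {
    return static_cast<size_t>(((hash >> 32) * blocks_.size()) >> 32);
  }

  // One bit per 32-bit word, chosen from the low 32 bits of the hash.
  static uint32_t Mask(uint64_t hash, size_t i) {
    static constexpr uint32_t kSalt[8] = {
        0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
        0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};
    const uint32_t key = static_cast<uint32_t>(hash);
    return UINT32_C(1) << ((key * kSalt[i]) >> 27);
  }

  std::vector<Block> blocks_;
};

// SplitMix64, used here only as a stand-in for the real hash function.
inline uint64_t Mix64(uint64_t x) {
  x += 0x9E3779B97F4A7C15ULL;
  x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
  x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
  return x ^ (x >> 31);
}

// Fraction of `probes` (none of which were inserted) reported as present.
inline double MeasureFpp(const SketchSbbf& filter,
                         const std::vector<uint64_t>& probes) {
  size_t false_positives = 0;
  for (uint64_t h : probes) false_positives += filter.FindHash(h) ? 1 : 0;
  return probes.empty() ? 0.0
                        : static_cast<double>(false_positives) /
                              static_cast<double>(probes.size());
}
```

A test would then insert one set of hashes, probe a disjoint set, and compare the measured rate against the target, allowing some slack for sampling noise.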

mapleFU · 2026-06-02T13:59:55Z

+                          (static_cast<double>(num_blocks) * kBytesPerFilterBlock * 8);
+  const auto max_folds = static_cast<uint32_t>(std::countr_zero(num_blocks));
+
+  if (avg_fill == 0.0) {


I've somewhat forgotten: can this really happen when writing a Parquet file?

If all values in a column chunk are null, avg_fill will be 0.

Does it still need a BF or a fold in this scenario? Or would this path lead to zero-cost folding?

I think there is a difference between "no bloom filter" and "a bloom filter with no values". The latter can filter out every non-null value. And there is currently no way to indicate that a bloom filter exists but has no value through metadata.

What about
(1) skipping the fold and just replacing the filter with a smallest one, without any copying?

And there is currently no way to indicate that a bloom filter exists but has no value through metadata.

In theory you're right. In production, I believe null_count == num_values works?

skipping the fold and just replacing the filter with a smallest one, without any copying

Good idea. I will change the code soon.

I believe null_count == num_values works

I don't think it's good to mix two separate components. Also, null_count is an optional value and may not actually exist.

I have updated the commits and now only fold when total_set_bits is not equal to 0.
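
The decision discussed in this thread can be sketched as follows. ChooseNumFolds and EstimateFpp are hypothetical names, and the FPP model is an assumption for illustration: a lookup checks 8 bits, so the rate is approximated as fill^8, and ORing two independent, equally filled halves is approximated as fill' = 1 - (1 - fill)^2. The estimator in the PR may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Rough model (assumption): a lookup checks 8 bits, so FPP ~ fill^8.
inline double EstimateFpp(double fill) { return std::pow(fill, 8.0); }

// Sketch: how many times the filter can be halved while the estimated FPP
// stays at or below target_fpp. Returns 0 when no fold is safe.
inline uint32_t ChooseNumFolds(uint64_t total_set_bits, uint32_t num_blocks,
                               double target_fpp) {
  assert(num_blocks > 0 && (num_blocks & (num_blocks - 1)) == 0);
  // Empty filter: nothing to fold. The caller swaps in a smallest-size
  // filter instead, as agreed in the thread.
  if (total_set_bits == 0) return 0;
  const double total_bits = static_cast<double>(num_blocks) * 256.0;
  double fill = static_cast<double>(total_set_bits) / total_bits;
  uint32_t num_folds = 0;
  for (uint32_t blocks = num_blocks; blocks > 1; blocks /= 2) {
    // Fill ratio after ORing pairs of blocks, assuming independent bits.
    const double next_fill = 1.0 - (1.0 - fill) * (1.0 - fill);
    if (EstimateFpp(next_fill) > target_fpp) break;
    fill = next_fill;
    ++num_folds;
  }
  return num_folds;
}
```

The loop can never fold below one block, so the result is bounded by log2(num_blocks), which matches the max_folds bound in the quoted diff.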

mapleFU · 2026-06-02T14:16:53Z

+  const auto* bitset32 = reinterpret_cast<const uint32_t*>(data_->data());
+  const uint32_t num_words = num_bytes_ / static_cast<uint32_t>(sizeof(uint32_t));
+  for (uint32_t i = 0; i < num_words; ++i) {
+    total_set_bits += static_cast<uint64_t>(std::popcount(bitset32[i]));


I'm not sure whether internal::CountSetBits would be easier to understand here (though popcount is correct and a bit faster).

I have changed it to internal::CountSetBits. internal::CountSetBits may be faster because it counts 64 bits at a time.
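
As a rough picture of what counting 64 bits at a time means, here is a standalone sketch. SketchCountSetBits is a hypothetical helper written for illustration; it is not Arrow's internal::CountSetBits and does not share its signature.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch: count set bits in a byte buffer, 64 bits per step, with a
// byte-wise tail for lengths that are not a multiple of 8.
inline uint64_t SketchCountSetBits(const uint8_t* data, size_t num_bytes) {
  assert(data != nullptr || num_bytes == 0);
  uint64_t total = 0;
  size_t i = 0;
  for (; i + sizeof(uint64_t) <= num_bytes; i += sizeof(uint64_t)) {
    uint64_t word;
    // memcpy avoids unaligned and strict-aliasing problems.
    std::memcpy(&word, data + i, sizeof(word));
    total += std::bitset<64>(word).count();
  }
  for (; i < num_bytes; ++i) total += std::bitset<8>(data[i]).count();
  return total;
}
```

Compared with the 32-bit loop in the quoted diff, this halves the number of iterations and avoids the reinterpret_cast on the buffer.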

HuaHuaY requested a review from wgtmac as a code owner May 21, 2026 10:19

github-actions Bot added the awaiting review Awaiting review label May 21, 2026

github-actions Bot added Component: Parquet Component: C++ labels May 21, 2026

Comment thread cpp/src/parquet/bloom_filter_writer.cc Outdated

github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 21, 2026

wgtmac reviewed May 29, 2026

HuaHuaY force-pushed the sbbf_filters branch from 4495c53 to 9565196 Compare May 29, 2026 07:55

wgtmac approved these changes May 29, 2026

Comment thread cpp/src/parquet/properties.h Outdated

Comment thread cpp/src/parquet/bloom_filter.cc Outdated

Comment thread cpp/src/parquet/properties.h

mapleFU reviewed May 29, 2026

Comment thread cpp/src/parquet/bloom_filter_writer.cc Outdated

Comment thread cpp/src/parquet/bloom_filter.cc Outdated

pitrou reviewed Jun 2, 2026

mapleFU approved these changes Jun 2, 2026

mapleFU reviewed Jun 2, 2026

Comment thread cpp/src/parquet/bloom_filter_reader_writer_test.cc

add bloom filter folding to automatically size SBBF filters

b5e1a0b

HuaHuaY force-pushed the sbbf_filters branch from a363cb5 to b5e1a0b Compare June 4, 2026 08:58

fix review

d57601b

HuaHuaY force-pushed the sbbf_filters branch from d26cb2e to d57601b Compare June 4, 2026 09:34

fix review

edcc583
