Skip to content

GH-50007: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters#50008

Open
HuaHuaY wants to merge 3 commits into
apache:mainfrom
HuaHuaY:sbbf_filters
Open

GH-50007: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters#50008
HuaHuaY wants to merge 3 commits into
apache:mainfrom
HuaHuaY:sbbf_filters

Conversation

@HuaHuaY
Copy link
Copy Markdown
Contributor

@HuaHuaY HuaHuaY commented May 21, 2026

Rationale for this change

This PR follows apache/arrow-rs#9628. It supports optimizing the disk usage of the Bloom filter. So specifying an ndv value larger than the actual value will not affect disk usage.

Bloom filters now support folding mode: allocate a conservatively large filter (sized for worst-case NDV), insert all values during writing, then fold down at flush time to meet a target FPP. This eliminates the need to guess NDV upfront and produces optimally-sized filters automatically.

What changes are included in this PR?

BloomFilterBuilder will try to fold the bloom filter before writing it to the output stream.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

The type of ndv in BloomFilterOptions is changed from int32_t to std::optional<int64_t>. And the argument type of OptimalNumOfBytes and OptimalNumOfBits in BlockSplitBloomFilter is changed from uint32_t ndv to uint64_t ndv.

Add a new field fold in BloomFilterOptions and default value is true.

@HuaHuaY HuaHuaY requested a review from wgtmac as a code owner May 21, 2026 10:19
@github-actions github-actions Bot added the awaiting review Awaiting review label May 21, 2026
@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #50007 has been automatically assigned in GitHub to PR creator.

Comment thread cpp/src/parquet/bloom_filter_writer.cc Outdated
@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 21, 2026
@HuaHuaY
Copy link
Copy Markdown
Contributor Author

HuaHuaY commented May 21, 2026

@wgtmac @alamb @etseidl @emkornfield Please take a look.

@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 21, 2026

I am not likely to have time to review C++ code in the arrow repository unfortunately

Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HuaHuaY for adding this quickly!

Comment thread cpp/src/parquet/properties.h
Comment thread cpp/src/parquet/properties.h Outdated
std::to_string(bloom_filter_options.fpp));
}
if (bloom_filter_options.ndv.has_value() && bloom_filter_options.ndv.value() < 0) {
throw ParquetException("Bloom filter number of distinct values must be >= 0, got " +
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the expected behavior of 0?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will create a smallest bloom filter.

Comment thread cpp/src/parquet/bloom_filter.h
Comment thread cpp/src/parquet/bloom_filter.h Outdated
Comment thread cpp/src/parquet/bloom_filter.cc Outdated
Comment thread cpp/src/parquet/bloom_filter.cc
Comment thread cpp/src/parquet/bloom_filter_reader_writer_test.cc Outdated
@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented May 29, 2026

cc @mapleFU @adamreeve

Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM. I left some nits.

Comment thread cpp/src/parquet/properties.h Outdated
Comment thread cpp/src/parquet/bloom_filter.cc Outdated
Comment thread cpp/src/parquet/properties.h
Copy link
Copy Markdown
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM

Comment thread cpp/src/parquet/bloom_filter_writer.cc Outdated
Comment thread cpp/src/parquet/bloom_filter.cc Outdated
@HuaHuaY
Copy link
Copy Markdown
Contributor Author

HuaHuaY commented Jun 2, 2026

@pitrou @mapleFU Please take a look.

Comment thread cpp/src/parquet/bloom_filter.cc Outdated
Comment thread cpp/src/parquet/bloom_filter.cc
Comment thread cpp/src/parquet/bloom_filter.cc
}
++num_folds;
}
return num_folds;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this algorithm the actual size reduction will always be a power of 2 (group_size = UINT32_C(1) << num_folds). Why aren't we trying to be more granular?

Copy link
Copy Markdown
Contributor Author

@HuaHuaY HuaHuaY Jun 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BlockSplitBloomFilter::Init will check (num_bytes & (num_bytes - 1)) != 0. I didn't find this limitation in the Parquet documentation. But If we break the rule, old parquet reader will not be able to read the bloom filter.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. We probably don't want to produce data that would be incompatible with old readers.

Does the power-of-two constraint serve a purpose? Perhaps we can remove it in a separate PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, can you add a comment somewhere mentioning this restriction?

Copy link
Copy Markdown
Contributor Author

@HuaHuaY HuaHuaY Jun 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have add a comment in front of group_size.

  // A fold group is a consecutive run of blocks ORed into one output block.
  // Keeping the group size as (1 << num_folds) preserves a power-of-two bitset
  // size. Folding by this power-of-two group size keeps the old-to-new bucket
  // remapping aligned with bucket lookup and avoids false negatives.
  const uint32_t group_size = UINT32_C(1) << num_folds;

After more thinking, I think the actual size reduction must be a power of 2. Because the block index is calculated by static_cast<uint32_t>(((hash >> 32) * NumBlocks()) >> 32);, which is required by parquet's document. And we must ensure that the calculated block index is the same before and after the fold.

Comment thread cpp/src/parquet/bloom_filter_writer.cc Outdated
filter.GetBitsetSize());
for (uint64_t hash : hashes) {
EXPECT_TRUE(filter.FindHash(hash));
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check that most non-inserted values are not found, with an actual FPP value below kFpp?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will let each round of testing calculate the FPP for the 10,000 numbers that have not been inserted.

Comment thread cpp/src/parquet/bloom_filter.cc Outdated
(static_cast<double>(num_blocks) * kBytesPerFilterBlock * 8);
const auto max_folds = static_cast<uint32_t>(std::countr_zero(num_blocks));

if (avg_fill == 0.0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I little bit forgot would this really happens when writing a parquet file?

Copy link
Copy Markdown
Contributor Author

@HuaHuaY HuaHuaY Jun 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all values in a column chunk are null, avg_fill will be 0.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it still need a BF or fold in this scenerio? Or this path would lead to zero cost folding?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are differences between "not have a bloom filter" and "bloom filter has no values". The latter can filter every not null values. And there is currently no way to indicate that a bloom filter exists but has no value through metadata.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about
(1) without folding, just replace to a smallest one without any copying

And there is currently no way to indicate that a bloom filter exists but has no value through metadata.

In theory you're right. In production, I believe null_count == num_values works?

Copy link
Copy Markdown
Contributor Author

@HuaHuaY HuaHuaY Jun 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without folding, just replace to a smallest one without any copying

Good idea. I will change the code soon.

I believe null_count == num_values works

I don't think it's good to mix two separate components. Also, null_count is an optional value and may not actually exist.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the commits and now only fold when total_set_bits is not equal to 0.

Comment thread cpp/src/parquet/bloom_filter.cc Outdated
const auto* bitset32 = reinterpret_cast<const uint32_t*>(data_->data());
const uint32_t num_words = num_bytes_ / static_cast<uint32_t>(sizeof(uint32_t));
for (uint32_t i = 0; i < num_words; ++i) {
total_set_bits += static_cast<uint64_t>(std::popcount(bitset32[i]));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know whether internal::CountSetBits easy to understand here ( though popcount is right and a bit faster)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changd to internal::CountSetBits. internal::CountSetBits may be faster because it counts once every 64 bits.

Comment thread cpp/src/parquet/bloom_filter_reader_writer_test.cc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants