GH-50007: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters#50008
GH-50007: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters#50008HuaHuaY wants to merge 3 commits into
Conversation
|
|
|
@wgtmac @alamb @etseidl @emkornfield Please take a look. |
|
I am not likely to have time to review C++ code in the arrow repository unfortunately |
| std::to_string(bloom_filter_options.fpp)); | ||
| } | ||
| if (bloom_filter_options.ndv.has_value() && bloom_filter_options.ndv.value() < 0) { | ||
| throw ParquetException("Bloom filter number of distinct values must be >= 0, got " + |
There was a problem hiding this comment.
What is the expected behavior of 0?
There was a problem hiding this comment.
It will create a smallest bloom filter.
wgtmac
left a comment
There was a problem hiding this comment.
Generally LGTM. I left some nits.
| } | ||
| ++num_folds; | ||
| } | ||
| return num_folds; |
There was a problem hiding this comment.
With this algorithm the actual size reduction will always be a power of 2 (group_size = UINT32_C(1) << num_folds). Why aren't we trying to be more granular?
There was a problem hiding this comment.
BlockSplitBloomFilter::Init will check (num_bytes & (num_bytes - 1)) != 0. I didn't find this limitation in the Parquet documentation. But If we break the rule, old parquet reader will not be able to read the bloom filter.
There was a problem hiding this comment.
Gotcha. We probably don't want to produce data that would be incompatible with old readers.
Does the power-of-two constraint serve a purpose? Perhaps we can remove it in a separate PR.
There was a problem hiding this comment.
In any case, can you add a comment somewhere mentioning this restriction?
There was a problem hiding this comment.
I have add a comment in front of group_size.
// A fold group is a consecutive run of blocks ORed into one output block.
// Keeping the group size as (1 << num_folds) preserves a power-of-two bitset
// size. Folding by this power-of-two group size keeps the old-to-new bucket
// remapping aligned with bucket lookup and avoids false negatives.
const uint32_t group_size = UINT32_C(1) << num_folds;After more thinking, I think the actual size reduction must be a power of 2. Because the block index is calculated by static_cast<uint32_t>(((hash >> 32) * NumBlocks()) >> 32);, which is required by parquet's document. And we must ensure that the calculated block index is the same before and after the fold.
| filter.GetBitsetSize()); | ||
| for (uint64_t hash : hashes) { | ||
| EXPECT_TRUE(filter.FindHash(hash)); | ||
| } |
There was a problem hiding this comment.
Should we check that most non-inserted values are not found, with an actual FPP value below kFpp?
There was a problem hiding this comment.
I will let each round of testing calculate the FPP for the 10,000 numbers that have not been inserted.
| (static_cast<double>(num_blocks) * kBytesPerFilterBlock * 8); | ||
| const auto max_folds = static_cast<uint32_t>(std::countr_zero(num_blocks)); | ||
|
|
||
| if (avg_fill == 0.0) { |
There was a problem hiding this comment.
I little bit forgot would this really happens when writing a parquet file?
There was a problem hiding this comment.
If all values in a column chunk are null, avg_fill will be 0.
There was a problem hiding this comment.
Does it still need a BF or fold in this scenerio? Or this path would lead to zero cost folding?
There was a problem hiding this comment.
I think there are differences between "not have a bloom filter" and "bloom filter has no values". The latter can filter every not null values. And there is currently no way to indicate that a bloom filter exists but has no value through metadata.
There was a problem hiding this comment.
What about
(1) without folding, just replace to a smallest one without any copying
And there is currently no way to indicate that a bloom filter exists but has no value through metadata.
In theory you're right. In production, I believe null_count == num_values works?
There was a problem hiding this comment.
without folding, just replace to a smallest one without any copying
Good idea. I will change the code soon.
I believe null_count == num_values works
I don't think it's good to mix two separate components. Also, null_count is an optional value and may not actually exist.
There was a problem hiding this comment.
I have updated the commits and now only fold when total_set_bits is not equal to 0.
| const auto* bitset32 = reinterpret_cast<const uint32_t*>(data_->data()); | ||
| const uint32_t num_words = num_bytes_ / static_cast<uint32_t>(sizeof(uint32_t)); | ||
| for (uint32_t i = 0; i < num_words; ++i) { | ||
| total_set_bits += static_cast<uint64_t>(std::popcount(bitset32[i])); |
There was a problem hiding this comment.
I don't know whether internal::CountSetBits easy to understand here ( though popcount is right and a bit faster)
There was a problem hiding this comment.
I have changd to internal::CountSetBits. internal::CountSetBits may be faster because it counts once every 64 bits.
Rationale for this change
This PR follows apache/arrow-rs#9628. It supports optimizing the disk usage of the Bloom filter. So specifying an ndv value larger than the actual value will not affect disk usage.
What changes are included in this PR?
BloomFilterBuilderwill try to fold the bloom filter before writing it to the output stream.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.
The type of
ndvinBloomFilterOptionsis changed fromint32_ttostd::optional<int64_t>. And the argument type ofOptimalNumOfBytesandOptimalNumOfBitsinBlockSplitBloomFilteris changed fromuint32_t ndvtouint64_t ndv.Add a new field
foldinBloomFilterOptionsand default value istrue.