refactor(encoding): remove cardinality stats from dict decisions #6022
Code Review

Summary: Clean refactor that removes HyperLogLog-based cardinality estimation and switches to actual size-based dictionary encoding decisions. This eliminates HLL estimation errors at the cost of attempting encoding before deciding.

P1 - Performance Consideration

The new approach always attempts dictionary encoding for eligible blocks, then checks the size ratio. For high-cardinality data (e.g., many unique strings), this means building the full dictionary (O(n) with HashMap insertions) before rejecting it, whereas the old HLL approach would bail out early. Consider whether benchmarks have been run to validate that this change does not regress write performance for high-cardinality columns. If the encoding is fast enough in practice, this is acceptable.

Minor Notes (not blocking)

Overall, this is a good simplification that removes a dependency and makes the code more predictable. LGTM pending performance validation. 👍
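To make the trade-off concrete, here is a minimal Rust sketch of a size-based dictionary decision of the kind the review describes. All names (`dict_encode`, `should_use_dict`, the 4-byte index assumption, and the ratio threshold) are illustrative, not the actual `lance-encoding` API: the point is that the dictionary is fully built (O(n) HashMap insertions) before the encoder can reject it based on the measured size ratio.

```rust
use std::collections::HashMap;

/// Build a dictionary encoding: unique values plus per-row indices.
/// Note this does O(n) work even if the result is later thrown away.
fn dict_encode(values: &[&str]) -> (Vec<String>, Vec<u32>) {
    let mut dict: Vec<String> = Vec::new();
    let mut lookup: HashMap<&str, u32> = HashMap::new();
    let mut indices = Vec::with_capacity(values.len());
    for &v in values {
        let idx = *lookup.entry(v).or_insert_with(|| {
            dict.push(v.to_string());
            (dict.len() - 1) as u32
        });
        indices.push(idx);
    }
    (dict, indices)
}

/// Decide based on the actual encoded size ratio rather than an
/// HLL cardinality estimate. Sizes are approximated as raw bytes
/// for plain encoding, and dictionary bytes + 4-byte indices for
/// dictionary encoding (a simplifying assumption for this sketch).
fn should_use_dict(values: &[&str], max_ratio: f64) -> bool {
    let plain_size: usize = values.iter().map(|v| v.len()).sum();
    let (dict, indices) = dict_encode(values);
    let dict_size: usize =
        dict.iter().map(|v| v.len()).sum::<usize>() + indices.len() * 4;
    (dict_size as f64) <= (plain_size as f64) * max_ratio
}

fn main() {
    // Low cardinality: dictionary pays off.
    let low = ["category_red"; 8];
    // High cardinality: the dictionary is built, then rejected.
    let high = ["u1", "u2", "u3", "u4"];
    println!("low:  {}", should_use_dict(&low, 0.8));
    println!("high: {}", should_use_dict(&high, 0.8));
}
```

The upside the review notes is that the decision is exact (no estimation error); the cost is the wasted encoding work on high-cardinality blocks, which is why it asks for write-path benchmarks.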
Force-pushed from 53aa607 to 0a08547
This PR is a follow-up to #5891; it removes all remaining usage of HLL.
It drops `Stat::Cardinality` and all related exact-dedup computation from `lance-encoding` statistics.

Parts of this PR were drafted with assistance from Codex (with gpt-5.3-codex) and fully reviewed and edited by me. I take full responsibility for all changes.