refactor(encoding): remove cardinality stats from dict decisions #6022

Open

Xuanwo wants to merge 3 commits into main from xuanwo/hll-cleanup

Conversation


@Xuanwo Xuanwo commented Feb 26, 2026

This PR is a follow-up to #5891; it removes all remaining usage of HLL (HyperLogLog).

  • Remove Stat::Cardinality and all related exact-dedup computation from lance-encoding statistics.
  • Stop using cardinality-based pre-estimation in the primitive dictionary decision.
  • Switch to a concrete decision flow: attempt dictionary encoding first, then accept it only when the actual encoded size ratio is below the threshold.
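The decision flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual lance-encoding code: the function names, the size model, and the threshold value are all hypothetical.

```rust
use std::collections::HashMap;

// Hypothetical threshold: keep the dictionary encoding only if it is
// meaningfully smaller than the plain encoding.
const DICT_SIZE_RATIO_THRESHOLD: f64 = 0.8;

/// Build a dictionary over all values (O(n) HashMap insertions) and
/// return a rough encoded size: dictionary bytes plus 4-byte indices.
fn dictionary_encoded_size(values: &[&str]) -> usize {
    let mut dict: HashMap<&str, u32> = HashMap::new();
    for &v in values {
        let next = dict.len() as u32;
        dict.entry(v).or_insert(next);
    }
    let dict_bytes: usize = dict.keys().map(|k| k.len()).sum();
    dict_bytes + values.len() * 4
}

fn plain_size(values: &[&str]) -> usize {
    values.iter().map(|v| v.len()).sum()
}

/// Concrete decision flow: encode first, then compare actual sizes.
/// No cardinality pre-estimation is involved.
fn choose_dictionary(values: &[&str]) -> bool {
    let encoded = dictionary_encoded_size(values);
    (encoded as f64) < (plain_size(values) as f64) * DICT_SIZE_RATIO_THRESHOLD
}

fn main() {
    // Low cardinality: one distinct value repeated, dictionary wins.
    let low_card = vec!["strawberry"; 1000];
    println!("low cardinality  -> dictionary: {}", choose_dictionary(&low_card));

    // High cardinality: every value unique, dictionary is rejected --
    // but only after the full O(n) build, which is the cost the review
    // comment below raises.
    let high: Vec<String> = (0..1000).map(|i| format!("unique-value-{i}")).collect();
    let refs: Vec<&str> = high.iter().map(|s| s.as_str()).collect();
    println!("high cardinality -> dictionary: {}", choose_dictionary(&refs));
}
```

Note that with this flow the high-cardinality case still pays for the full dictionary build before being rejected, whereas the old HLL estimate could bail out early.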

Parts of this PR were drafted with assistance from Codex (with gpt-5.3-codex) and fully reviewed and edited by me. I take full responsibility for all changes.

@github-actions (Contributor) commented:

Code Review

Summary: Clean refactor that removes HyperLogLog-based cardinality estimation and switches to actual size-based dictionary encoding decisions. This eliminates HLL estimation errors at the cost of attempting encoding before deciding.

P1 - Performance Consideration

The new approach always attempts dictionary encoding for eligible blocks, then checks the size ratio. For high-cardinality data (e.g., many unique strings), this means full dictionary building (O(n) with HashMap insertions) before rejecting, whereas the old HLL approach would bail early.

Consider whether benchmarks have been run to validate that this change does not regress write performance for high-cardinality columns. If the encoding attempt is fast enough in practice, this is acceptable.

Minor Notes (not blocking)

  1. Capacity planning (dict.rs:137,287,292): Using num_values instead of estimated cardinality for dictionary_buffer capacity may over-allocate temporarily for high-cardinality data. This is fine since Vec grows as needed, but worth being aware of.

  2. Legacy encoder threshold guard (previous/encoder.rs:547): The check threshold > u8::MAX as u64 silently caps the threshold at 255. This seems intentional for the HashSet-based precheck but could surprise users setting larger values via env var.
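The silent capping behavior described in note 2 can be sketched as follows (hypothetical function and variable names, not the actual code at previous/encoder.rs:547):

```rust
/// Clamp a user-supplied threshold to fit the u8-based HashSet precheck.
/// A value set via env var above 255 is silently capped, which is the
/// surprise the note above points out.
fn effective_precheck_threshold(threshold: u64) -> u8 {
    if threshold > u8::MAX as u64 {
        u8::MAX // e.g. an env var of 1000 silently becomes 255
    } else {
        threshold as u8
    }
}

fn main() {
    println!("{}", effective_precheck_threshold(42));   // passes through
    println!("{}", effective_precheck_threshold(1000)); // capped at 255
}
```

A log line or documentation note when the cap is applied would make the behavior less surprising to users.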


Overall, this is a good simplification that removes a dependency and makes the code more predictable. LGTM pending performance validation. 👍

Xuanwo force-pushed the xuanwo/hll-cleanup branch from 53aa607 to 0a08547 on February 26, 2026 at 08:24.

codecov bot commented Feb 26, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.

Files with missing lines               Patch %  Lines
rust/lance-encoding/src/statistics.rs  83.33%   1 Missing ⚠️

