refactor(encoding): remove cardinality stats from dict decisions #6022

Open

Xuanwo wants to merge 3 commits into main from xuanwo/hll-cleanup

Conversation


@Xuanwo Xuanwo commented Feb 26, 2026

This PR is a follow-up to #5891; it removes all remaining usage of HLL (HyperLogLog).

  • Remove Stat::Cardinality and all related exact-dedup computation from lance-encoding statistics.
  • Stop using cardinality-based pre-estimation in the primitive dictionary decision.
  • Switch to a concrete decision flow: attempt dictionary encoding first, then accept it only when the actual encoded size ratio is below the threshold.
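The decision flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual lance-encoding code: the function names, the size model, and the threshold value are all hypothetical.

```rust
use std::collections::HashMap;

// Hypothetical threshold: keep the dictionary encoding only if it is
// meaningfully smaller than the plain encoding.
const DICT_SIZE_RATIO_THRESHOLD: f64 = 0.8;

/// Build a dictionary over all values (O(n) HashMap insertions) and
/// return a rough encoded size: dictionary bytes plus 4-byte indices.
fn dictionary_encoded_size(values: &[&str]) -> usize {
    let mut dict: HashMap<&str, u32> = HashMap::new();
    for &v in values {
        let next = dict.len() as u32;
        dict.entry(v).or_insert(next);
    }
    let dict_bytes: usize = dict.keys().map(|k| k.len()).sum();
    dict_bytes + values.len() * 4
}

fn plain_size(values: &[&str]) -> usize {
    values.iter().map(|v| v.len()).sum()
}

/// Concrete decision flow: encode first, then compare actual sizes.
/// No cardinality pre-estimation is involved.
fn choose_dictionary(values: &[&str]) -> bool {
    let encoded = dictionary_encoded_size(values);
    (encoded as f64) < (plain_size(values) as f64) * DICT_SIZE_RATIO_THRESHOLD
}

fn main() {
    // Low cardinality: one distinct value repeated, dictionary wins.
    let low_card = vec!["strawberry"; 1000];
    println!("low cardinality  -> dictionary: {}", choose_dictionary(&low_card));

    // High cardinality: every value unique, dictionary is rejected --
    // but only after the full O(n) build, which is the cost the review
    // comment below raises.
    let high: Vec<String> = (0..1000).map(|i| format!("unique-value-{i}")).collect();
    let refs: Vec<&str> = high.iter().map(|s| s.as_str()).collect();
    println!("high cardinality -> dictionary: {}", choose_dictionary(&refs));
}
```

Note that with this flow the high-cardinality case still pays for the full dictionary build before being rejected, whereas the old HLL estimate could bail out early.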

Parts of this PR were drafted with assistance from Codex (with gpt-5.3-codex) and fully reviewed and edited by me. I take full responsibility for all changes.

@github-actions (Contributor) commented:

Code Review

Summary: Clean refactor that removes HyperLogLog-based cardinality estimation and switches to actual size-based dictionary encoding decisions. This eliminates HLL estimation errors at the cost of attempting encoding before deciding.

P1 - Performance Consideration

The new approach always attempts dictionary encoding for eligible blocks, then checks the size ratio. For high-cardinality data (e.g., many unique strings), this means full dictionary building (O(n) with HashMap insertions) before rejecting, whereas the old HLL approach would bail early.

Consider whether benchmarks have been run to validate that this change does not regress write performance for high-cardinality columns. If the encoding attempt is fast enough in practice, this is acceptable.

Minor Notes (not blocking)

  1. Capacity planning (dict.rs:137,287,292): Using num_values instead of estimated cardinality for dictionary_buffer capacity may over-allocate temporarily for high-cardinality data. This is fine since Vec grows as needed, but worth being aware of.

  2. Legacy encoder threshold guard (previous/encoder.rs:547): The check threshold > u8::MAX as u64 silently caps the threshold at 255. This seems intentional for the HashSet-based precheck but could surprise users setting larger values via env var.
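The silent capping behavior described in note 2 can be sketched as follows (hypothetical function and variable names, not the actual code at previous/encoder.rs:547):

```rust
/// Clamp a user-supplied threshold to fit the u8-based HashSet precheck.
/// A value set via env var above 255 is silently capped, which is the
/// surprise the note above points out.
fn effective_precheck_threshold(threshold: u64) -> u8 {
    if threshold > u8::MAX as u64 {
        u8::MAX // e.g. an env var of 1000 silently becomes 255
    } else {
        threshold as u8
    }
}

fn main() {
    println!("{}", effective_precheck_threshold(42));   // passes through
    println!("{}", effective_precheck_threshold(1000)); // capped at 255
}
```

A log line or documentation note when the cap is applied would make the behavior less surprising to users.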


Overall, this is a good simplification that removes a dependency and makes the code more predictable. LGTM pending performance validation. 👍

Xuanwo force-pushed the xuanwo/hll-cleanup branch from 53aa607 to 0a08547 on February 26, 2026 at 08:24.

codecov bot commented Feb 26, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.

Files with missing lines               Patch %  Lines
rust/lance-encoding/src/statistics.rs  83.33%   1 Missing ⚠️

