perf: add dict-values compression controls with lz4 default#6059
perf: add dict-values compression controls with lz4 default#6059
Conversation
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
PR Review: encoding: add dict-values compression controls with lz4 defaultClean, well-structured PR. Tests and docs are solid. Two items worth addressing: P1:
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…mpression # Conflicts: # rust/lance-encoding/src/compression.rs # rust/lance-encoding/src/encodings/logical/primitive.rs
westonpace
left a comment
There was a problem hiding this comment.
This is a good cleanup and nice additional docs. Also, thanks for the extensive tests. I have just a few doc suggestions.
| threshold. If the number of unique values is less than the threshold then we apply dictionary encoding. The | ||
| configuration variable defines the divisor that we apply and it defaults to 2 which means we apply dictionary | ||
| encoding if we estimate that less than half the values are unique. | ||
| Dictionary encoding is attempted for supported primitive pages and gated by a few heuristics. |
There was a problem hiding this comment.
Aren't all pages primitive pages? A naive read might be that we cannot do dictionary encoding of something like List<u32> but I think we can?
|
|
||
| Dictionary encoding is effective for columns with low cardinality where the same values repeat many times. | ||
| The dictionary is stored once per page and indices are stored in place of the actual values. | ||
| - `lance-encoding:dict-divisor` (default `2`): cardinality budget starts from `num_values / divisor` |
There was a problem hiding this comment.
I don't understand this sentence. It might be worth having a paragraph explaining these two properties instead of a brief line for each. Also, maybe we should avoid jargony words like cardinality in a user-facing doc.
| This is likely to change in future versions. | ||
| There are additional global guards available as environment variables: | ||
|
|
||
| - `LANCE_ENCODING_DICT_TOO_SMALL` (minimum page size before trying dictionary encoding, default `100`) |
There was a problem hiding this comment.
Is this 100 values or 100 bytes?
| There are additional global guards available as environment variables: | ||
|
|
||
| - `LANCE_ENCODING_DICT_TOO_SMALL` (minimum page size before trying dictionary encoding, default `100`) | ||
| - `LANCE_ENCODING_DICT_DIVISOR` (fallback divisor when field metadata is not set) |
There was a problem hiding this comment.
Let's say what the default is here (we list it above)
| - `LANCE_ENCODING_DICT_TOO_SMALL` (minimum page size before trying dictionary encoding, default `100`) | ||
| - `LANCE_ENCODING_DICT_DIVISOR` (fallback divisor when field metadata is not set) | ||
| - `LANCE_ENCODING_DICT_MAX_CARDINALITY` (upper cap for dictionary entries, default `100000`) | ||
| - `LANCE_ENCODING_DICT_SIZE_RATIO` (fallback ratio when field metadata is not set) |
There was a problem hiding this comment.
Let's add the default here (we list it above)
|
I'm going to merge with all comments addressed. |
…rmat#6059) This PR adds dictionary-values-specific general compression controls and aligns docs with current dictionary encoding behavior. This PR also changed the default compression of dict values from `zstd` to `lz4` as it's a better trade off of speed and size. Users prefer smaller size can always pick change to zstd by setting the newly added metadata keys. - Added new metadata keys: - `lance-encoding:dict-values-compression` - `lance-encoding:dict-values-compression-level` - Added new env var fallbacks: - `LANCE_ENCODING_DICT_VALUES_COMPRESSION` - `LANCE_ENCODING_DICT_VALUES_COMPRESSION_LEVEL` - Set default dictionary-values general compression scheme to `lz4`. --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.3-codex`) and fully reviewed and edited by me. I take full responsibility for all changes.**
This PR adds dictionary-values-specific general compression controls and aligns docs with current dictionary encoding behavior.
This PR also changed the default compression of dict values from
zstdtolz4as it's a better trade off of speed and size. Users prefer smaller size can always pick change to zstd by setting the newly added metadata keys.lance-encoding:dict-values-compressionlance-encoding:dict-values-compression-levelLANCE_ENCODING_DICT_VALUES_COMPRESSIONLANCE_ENCODING_DICT_VALUES_COMPRESSION_LEVELlz4.Parts of this PR were drafted with assistance from Codex (with
gpt-5.3-codex) and fully reviewed and edited by me. I take full responsibility for all changes.