Skip to content

perf: add dict-values compression controls with lz4 default#6059

Merged
Xuanwo merged 7 commits intomainfrom
xuanwo/dict-values-compression
Mar 3, 2026
Merged

perf: add dict-values compression controls with lz4 default#6059
Xuanwo merged 7 commits intomainfrom
xuanwo/dict-values-compression

Conversation

@Xuanwo
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo commented Feb 28, 2026

This PR adds dictionary-values-specific general compression controls and aligns docs with current dictionary encoding behavior.

This PR also changed the default compression of dict values from zstd to lz4 as it's a better trade off of speed and size. Users prefer smaller size can always pick change to zstd by setting the newly added metadata keys.

  • Added new metadata keys:
    • lance-encoding:dict-values-compression
    • lance-encoding:dict-values-compression-level
  • Added new env var fallbacks:
    • LANCE_ENCODING_DICT_VALUES_COMPRESSION
    • LANCE_ENCODING_DICT_VALUES_COMPRESSION_LEVEL
  • Set default dictionary-values general compression scheme to lz4.

Parts of this PR were drafted with assistance from Codex (with gpt-5.3-codex) and fully reviewed and edited by me. I take full responsibility for all changes.

@github-actions
Copy link
Copy Markdown
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@github-actions
Copy link
Copy Markdown
Contributor

PR Review: encoding: add dict-values compression controls with lz4 default

Clean, well-structured PR. Tests and docs are solid. Two items worth addressing:

P1: try_general_compression "none" early-return is a broader behavioral change

The change in compression.rs to add an early return Ok(None) when compression == "none" affects all callers of try_general_compression via create_block_compressor, not just dictionary values. Previously, setting compression=none on a field would skip user-requested compression but still allow automatic compression for large blocks (>32KiB). Now it fully disables general compression.

This is arguably the correct behavior (honoring an explicit none), but it's a semantic change beyond dict-values. The existing test_none_compression test only covers the miniblock and pervalue paths — it doesn't exercise create_block_compressor with compression=none + data exceeding MIN_BLOCK_SIZE_FOR_GENERAL_COMPRESSION. Consider adding a block-compressor test case to lock in this behavior.

Minor: dummy field type in build_dict_values_compressor_field

The constructed field uses DataType::UInt16 with name "". The get_merged_field_params in DefaultCompressionStrategy calls self.params.get_field_params(&field.name, &field.data_type()), which checks user-configured type-level params. If a user happened to configure compression params for UInt16, those would bleed into dictionary value compression. Very unlikely in practice but worth a comment noting why UInt16/"" was chosen (inherited from the previous dummy field).

@Xuanwo Xuanwo changed the title encoding: add dict-values compression controls with lz4 default perf: add dict-values compression controls with lz4 default Feb 28, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 28, 2026

Codecov Report

❌ Patch coverage is 95.45455% with 9 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
.../lance-encoding/src/encodings/logical/primitive.rs 94.85% 5 Missing and 4 partials ⚠️

📢 Thoughts on this report? Let us know!

Xuanwo added 2 commits March 3, 2026 16:46
…mpression

# Conflicts:
#	rust/lance-encoding/src/compression.rs
#	rust/lance-encoding/src/encodings/logical/primitive.rs
Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good cleanup and nice additional docs. Also, thanks for the extensive tests. I have just a few doc suggestions.

Comment thread docs/src/format/file/encoding.md Outdated
threshold. If the number of unique values is less than the threshold then we apply dictionary encoding. The
configuration variable defines the divisor that we apply and it defaults to 2 which means we apply dictionary
encoding if we estimate that less than half the values are unique.
Dictionary encoding is attempted for supported primitive pages and gated by a few heuristics.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't all pages primitive pages? A naive read might be that we cannot do dictionary encoding of something like List<u32> but I think we can?

Comment thread docs/src/format/file/encoding.md Outdated

Dictionary encoding is effective for columns with low cardinality where the same values repeat many times.
The dictionary is stored once per page and indices are stored in place of the actual values.
- `lance-encoding:dict-divisor` (default `2`): cardinality budget starts from `num_values / divisor`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this sentence. It might be worth having a paragraph explaining these two properties instead of a brief line for each. Also, maybe we should avoid jargony words like cardinality in a user-facing doc.

Comment thread docs/src/format/file/encoding.md Outdated
This is likely to change in future versions.
There are additional global guards available as environment variables:

- `LANCE_ENCODING_DICT_TOO_SMALL` (minimum page size before trying dictionary encoding, default `100`)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this 100 values or 100 bytes?

Comment thread docs/src/format/file/encoding.md Outdated
There are additional global guards available as environment variables:

- `LANCE_ENCODING_DICT_TOO_SMALL` (minimum page size before trying dictionary encoding, default `100`)
- `LANCE_ENCODING_DICT_DIVISOR` (fallback divisor when field metadata is not set)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say what the default is here (we list it above)

Comment thread docs/src/format/file/encoding.md Outdated
- `LANCE_ENCODING_DICT_TOO_SMALL` (minimum page size before trying dictionary encoding, default `100`)
- `LANCE_ENCODING_DICT_DIVISOR` (fallback divisor when field metadata is not set)
- `LANCE_ENCODING_DICT_MAX_CARDINALITY` (upper cap for dictionary entries, default `100000`)
- `LANCE_ENCODING_DICT_SIZE_RATIO` (fallback ratio when field metadata is not set)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the default here (we list it above)

@Xuanwo
Copy link
Copy Markdown
Collaborator Author

Xuanwo commented Mar 3, 2026

I'm going to merge with all comments addressed.

@Xuanwo Xuanwo merged commit 8d936af into main Mar 3, 2026
29 checks passed
@Xuanwo Xuanwo deleted the xuanwo/dict-values-compression branch March 3, 2026 16:57
wjones127 pushed a commit to wjones127/lance that referenced this pull request Mar 4, 2026
…rmat#6059)

This PR adds dictionary-values-specific general compression controls and
aligns docs with current dictionary encoding behavior.

This PR also changed the default compression of dict values from `zstd`
to `lz4` as it's a better trade off of speed and size. Users prefer
smaller size can always pick change to zstd by setting the newly added
metadata keys.

- Added new metadata keys:
  - `lance-encoding:dict-values-compression`
  - `lance-encoding:dict-values-compression-level`
- Added new env var fallbacks:
  - `LANCE_ENCODING_DICT_VALUES_COMPRESSION`
  - `LANCE_ENCODING_DICT_VALUES_COMPRESSION_LEVEL`
- Set default dictionary-values general compression scheme to `lz4`.

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.3-codex`) and fully reviewed and edited by me. I take full
responsibility for all changes.**
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants