Skip to content

gh-150583: Fix zstd compression with digested ZstdDict#150586

Open
Rogdham wants to merge 1 commit into
python:mainfrom
Rogdham:fix-zst-non-deterministic-compression-with-digested-zstddict
Open

gh-150583: Fix zstd compression with digested ZstdDict#150586
Rogdham wants to merge 1 commit into
python:mainfrom
Rogdham:fix-zst-non-deterministic-compression-with-digested-zstddict

Conversation

@Rogdham
Copy link
Copy Markdown
Contributor

@Rogdham Rogdham commented May 29, 2026

Changes

Force setting a compression_level of ZSTD_CLEVEL_DEFAULT when creating the ZstdCompressor object.

Question for reviewers: should I use _zstd_set_c_level(ZSTD_CLEVEL_DEFAULT) instead?

See reproducer in #150583.

Analysis

When creating a ZstdCompressor object (_zstd_ZstdCompressor_new_impl), the compression_level field is unset, and as a result keeps its value from whatever was in memory (not deterministic), assuming no level is passed in the constructor.

Then, when set, the C dict is loaded into the compression context with _zstd_load_c_dict, which in turn calls _zstd_load_impl.

In the case of a digested dictionary, _get_CDict(zd, self->compression_level) is now called, which in turns calls ZSTD_createCDict(..., compressionLevel).

So an arbitrary compression level is passed to that call to libzstd.

Bug impact

Prerequisites:

  • Compress data with compression.zstd
  • Do not specify a compression level (neither with level nor through options)
  • Pass a digested dictionary with zstd_dict (e.g. with ZstdDict(...).as_digested_dict)

Consequence: the compression level used is arbitrary, and may change from one run to another.

Misc

Backport suggestion: to 3.15 and 3.14.

AI disclosure: analysis was performed with help of an LLM.

@emmatyping 👋

Copy link
Copy Markdown
Member

@sobolevn sobolevn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add a NEWS entry for the fixed bug :)
Also, add a unit test.

@Rogdham
Copy link
Copy Markdown
Contributor Author

Rogdham commented Jun 1, 2026

Noted for the news entry, I will add it.

For the unit test, I am unsure about what to add. Because the bug was using whatever was left in memory when the object was initialized, testing this bug means making a non-deterministic test (it could work on not from one run to the other). The reproducer from the ticket worked around that by calling the code many times in a loop, but this is not ideal either for tests because it increases the time to run tests (plus it is still non-deterministic, although with decreased probability), and it is unclear how many times it should loop.

And then the question of why we would test this case specifically and not any other possibility of arguments during init (or other methods) appears. Should that be a job of a static code analyzer instead? But maybe this is my lack of experience in C projects that is showing.

Anyways, I am unsure if/how such cases are tested in cpython codebase, and what should I do for tests here. I would appreciate if you could point me in the right direction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants