Skip to content

Support dictionary-compressed virtual chunk locations#22

Open
Shane98c wants to merge 7 commits into
mainfrom
Shane98c/vendor-fzstd-compressed-locations
Open

Support dictionary-compressed virtual chunk locations#22
Shane98c wants to merge 7 commits into
mainfrom
Shane98c/vendor-fzstd-compressed-locations

Conversation

@Shane98c

Copy link
Copy Markdown
Collaborator

Closes #3. Closes #21.

Decodes icechunk's dictionary-compressed virtual chunk locations (Manifest.location_dictionary + ChunkRef.compressed_location, from earth-mover/icechunk#1776).

Two commits, each green on its own:

  1. vendor fzstd with dictionary decompression support — behavior-preserving.
  2. decode dictionary-compressed virtual chunk locations — the feature.

Why a vendored, patched fzstd?

compressed_location needs zstd dictionary decompression, which released fzstd@0.1.1 lacks. Tested against all 244 real metadata fixtures:

Decoder Result
fzstd@0.1.1 byte-exact on all 244, but no dictionary API
zstdify (1.4.0 and main) silently corrupts 77/244 (~32%) — wrong bytes, no error (vs zstd CLI)
fzstd PR#18 correct on all 244 + decodes real dictionary frames

zstdify is disqualified (silent corruption), and dictionary support exists only in the unmerged, unreleased fzstd#18 (can't be cleanly npm installed; upstream dormant). So we vendor its source under src/vendor/fzstd/ (MIT, attributed — see its README.md for provenance, re-vendor steps, and drop criteria).

Notes

  • One deliberate patch to the vendored code (!dic && on a fast path) fixes a dictionary/output-buffer aliasing bug in PR#18 that silently corrupts output when the dictionary length equals the decompressed size. Reproduced vs the zstd CLI; contributed upstream as handlerug/fzstd#1.
  • Decode matches the Rust ref_to_payload (priority, the 1024-byte bound, strict UTF-8). The bound is enforced from the frame header before decompression, so a malformed frame can't force a huge allocation.

Testing

360 tests pass (npm ci, typecheck, Prettier clean). New manifest-dictionary.test.ts covers end-to-end decode, missing-dictionary error, the size bound, the aliasing-fix regression, and malformed-frame rejection. Fixtures are zstd-CLI-generated (no upstream fixture uses this feature yet).

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request vendors the fzstd library to support zstd dictionary decompression for virtual chunk locations in manifests, resolving dependency limitations and data corruption issues. It updates the manifest parser to decode compressed locations using the manifest's zstd dictionary, implements safety checks to bound decompression memory allocation, and aligns payload extraction priority with the Rust implementation. Comprehensive test coverage is also added. Feedback on the changes points out a potential integer overflow in zstdFrameAllocSize due to 32-bit bitwise shift operations in JavaScript, suggesting standard arithmetic exponentiation and division as a safer alternative.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +60 to +61
const base = 1 << (10 + (wd >> 3));
windowSize = base + (base >> 3) * (wd & 7);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

In JavaScript, bitwise operators (like << and >>) operate on 32-bit signed integers. If the window descriptor wd has a large exponent (e.g., wd >> 3 >= 21), 10 + (wd >> 3) will be >= 31, causing 1 << (10 + (wd >> 3)) to overflow or wrap around to a negative or incorrect value. Similarly, base >> 3 will overflow if base is >= 2^31.

To prevent this and ensure correct handling of large window sizes (especially when validating untrusted input), use standard arithmetic exponentiation (2 ** ...) and division (/ 8) instead of bitwise shifts.

Suggested change
const base = 1 << (10 + (wd >> 3));
windowSize = base + (base >> 3) * (wd & 7);
const base = 2 ** (10 + (wd >> 3));
windowSize = base + (base / 8) * (wd & 7);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

location being interpreted as null Support compressed virtual chunk locations (needs fzstd dictionary support)

1 participant