Support dictionary-compressed virtual chunk locations#22
Conversation
There was a problem hiding this comment.
Code Review
This pull request vendors the fzstd library to support zstd dictionary decompression for virtual chunk locations in manifests, resolving dependency limitations and data corruption issues. It updates the manifest parser to decode compressed locations using the manifest's zstd dictionary, implements safety checks to bound decompression memory allocation, and aligns payload extraction priority with the Rust implementation. Comprehensive test coverage is also added. Feedback on the changes points out a potential integer overflow in zstdFrameAllocSize due to 32-bit bitwise shift operations in JavaScript, suggesting standard arithmetic exponentiation and division as a safer alternative.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| const base = 1 << (10 + (wd >> 3)); | ||
| windowSize = base + (base >> 3) * (wd & 7); |
There was a problem hiding this comment.
In JavaScript, bitwise operators (like << and >>) operate on 32-bit signed integers. If the window descriptor wd has a large exponent (e.g., wd >> 3 >= 21), 10 + (wd >> 3) will be >= 31, causing 1 << (10 + (wd >> 3)) to overflow or wrap around to a negative or incorrect value. Similarly, base >> 3 will overflow if base is >= 2^31.
To prevent this and ensure correct handling of large window sizes (especially when validating untrusted input), use standard arithmetic exponentiation (2 ** ...) and division (/ 8) instead of bitwise shifts.
| const base = 1 << (10 + (wd >> 3)); | |
| windowSize = base + (base >> 3) * (wd & 7); | |
| const base = 2 ** (10 + (wd >> 3)); | |
| windowSize = base + (base / 8) * (wd & 7); |
Closes #3. Closes #21.
Decodes icechunk's dictionary-compressed virtual chunk locations (
Manifest.location_dictionary+ChunkRef.compressed_location, from earth-mover/icechunk#1776).Two commits, each green on its own:
Why a vendored, patched fzstd?
compressed_locationneeds zstd dictionary decompression, which releasedfzstd@0.1.1lacks. Tested against all 244 real metadata fixtures:fzstd@0.1.1zstdify(1.4.0 andmain)zstdCLI)fzstdPR#18zstdifyis disqualified (silent corruption), and dictionary support exists only in the unmerged, unreleased fzstd#18 (can't be cleanlynpm installed; upstream dormant). So we vendor its source undersrc/vendor/fzstd/(MIT, attributed — see itsREADME.mdfor provenance, re-vendor steps, and drop criteria).Notes
!dic &&on a fast path) fixes a dictionary/output-buffer aliasing bug in PR#18 that silently corrupts output when the dictionary length equals the decompressed size. Reproduced vs thezstdCLI; contributed upstream as handlerug/fzstd#1.ref_to_payload(priority, the 1024-byte bound, strict UTF-8). The bound is enforced from the frame header before decompression, so a malformed frame can't force a huge allocation.Testing
360 tests pass (
npm ci, typecheck, Prettier clean). Newmanifest-dictionary.test.tscovers end-to-end decode, missing-dictionary error, the size bound, the aliasing-fix regression, and malformed-frame rejection. Fixtures arezstd-CLI-generated (no upstream fixture uses this feature yet).