Skip to content

feat: unified chunk grid with rectilinear chunk/shard support#3802

Open
maxrjones wants to merge 114 commits intozarr-developers:mainfrom
maxrjones:poc/unified-chunk-grid
Open

feat: unified chunk grid with rectilinear chunk/shard support#3802
maxrjones wants to merge 114 commits intozarr-developers:mainfrom
maxrjones:poc/unified-chunk-grid

Conversation

@maxrjones
Copy link
Copy Markdown
Member

@maxrjones maxrjones commented Mar 21, 2026

Summary

This PR contains an alternative implementation of the rectilinear chunk grid extension, building on the work in #3534 (RLE helpers, validation logic, and test cases were directly adopted). While the core feature of variable-sized chunks is the same, the internal architecture differs in ways that impact extensibility, performance, and release safety.

I appreciate the patience of those who contributed to #3534, and everyone who's been waiting on this feature. I know it's frustrating to see a new PR after #3534 was so close. That PR provided fundamental components, and I hope people will see the value here. I really believe it is worth the churn for the following reasons:

Key differences from #3534

  1. Extensibility. Each dimension is represented by a type implementing the DimensionGrid protocol (FixedDimension, VaryingDimension). Adding a new dimension type (e.g. TiledDimension for periodic patterns like days-per-month) requires implementing that protocol — no changes to indexing, codecs, or the ChunkGrid class. A prototype was built to verify this.
  2. Performance. The indexing pipeline queries each dimension independently with scalar calls rather than constructing N-d coordinate tuples per chunk lookup. This avoids allocation overhead in the inner loop of every indexer. VaryingDimension uses precomputed prefix sums for O(log n) lookups via binary search. See https://github.com/maxrjones/zarr-chunk-grid-tests for a performance comparison.
  3. Feature flag. Rectilinear chunk grids are gated behind zarr.config.set({'array.rectilinear_chunks': True}) (or ZARR_ARRAY__RECTILINEAR_CHUNKS=True), disabled by default. This gives downstream libraries time to adapt before the API is finalized, and us an opportunity to gracefully finalize the API.
  4. Rectilinear sharding. Shard boundaries can be rectilinear while inner chunks remain regular, with validation that each shard edge is divisible by the inner chunk size. This is tested end-to-end and documented in the user guide.

Design document: docs/design/chunk-grid.md covers the full design, rationale, and a suggested PR sequence for splitting this into reviewable increments, if needed.

Downstream POCs (all passing):

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Mar 29, 2026
@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Mar 29, 2026
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Mar 30, 2026
@maxrjones
Copy link
Copy Markdown
Member Author

I added a real-world example based on the data that @tinaok shared in #3534 at https://zarr--3802.org.readthedocs.build/en/3802/user-guide/examples/rectilinear_chunks/. I could adjust the source branches for the juv-compliant notebook before we merge.

@d-v-b would you have any time for a review on this PR?

return _shards_out, _chunks_out


class _RegularChunkGridMeta(type):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this for? I'm guessing it would not be necessary if we had a clear abstraction boundary between the in-memory ChunkGrid and the metadata chunk grids.

maxrjones and others added 2 commits March 30, 2026 09:47
…metadata (#7)

* chore: simplify sharding codec validation against varying chunk grid metadata

* test: restore test strength
| Fully variable | Arbitrary per-chunk sizes | `[5, 12, 3, 20]` |

A registry-based plugin system adds complexity without clear benefit.
Prior iterations on the chunk grid design were based on the Zarr V3 spec's definition of chunk grids as an extension point alongside codecs, dtypes, etc. Therefore, we started designing the chunk grid implementation following a similar registry based approach. However, in practice chunk grids are fundamentally different than codecs. Codecs are independent; supporting `zstd` tells you nothing about `gzip`. Chunk grids are not: every regular grid is a valid rectilinear grid. A registry-based plugin system makes sense for codecs but adds complexity without clear benefit for chunk grids. Here we start from some basic goals and propose a more fitting design for supporting different chunk grids in zarr-python.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽

Nit: registry based --> registry-based

…chunk grid (#8)

* refactor: allow regular-style chunk grid declaration for rectilinear chunk grid

The rectilinear chunk grid spec allows bare integers per dimension (meaning
"regular step size"), distinct from explicit single-element edge lists. This
commit widens `RectilinearChunkGrid.chunk_shapes` to `tuple[int | tuple[int, ...], ...]`
so bare ints are preserved for faithful JSON round-tripping.

Additionally:
- unifies `_validate_chunk_shapes` to handle both regular and rectilinear validation;
  `_parse_chunk_shape` now delegates to it
- adds `from_sizes` method to `ChunkGrid`, accepting `int | Sequence[int]` per dimension
- removes `from_regular` and `from_rectilinear` methods from `ChunkGrid`
- removes `parse_chunk_grid` from `chunk_grids.py` (JSON → ChunkGrid shortcut that
  bypassed the metadata layer)
- removes `serialize_chunk_grid`, `_infer_chunk_grid_name`, and serialization helpers
  from `chunk_grids.py` (ChunkGrid never needs to be serialized; metadata DTOs handle it)
- renames `parse_chunk_grid` in `v3.py` to `parse_chunk_grid_metadata` to disambiguate
- moves the rectilinear feature flag to `RectilinearChunkGrid.__post_init__`
- simplifies sharding codec validation into a single divisibility check for both
  regular and rectilinear grids
- updates `validate_rectilinear_edges` to skip bare-int dimensions
- refactors chunk grid tests to functional style with parametrization
- adds docstrings to all test functions

* chore: remove .claude

* refactor: rename chunk_grid parsing function

---------

Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants