diff --git a/changes/3802.feature.md b/changes/3802.feature.md new file mode 100644 index 0000000000..8199b5b718 --- /dev/null +++ b/changes/3802.feature.md @@ -0,0 +1,11 @@ +Add support for rectilinear (variable-sized) chunk grids. This feature is experimental and +must be explicitly enabled via ``zarr.config.set({'array.rectilinear_chunks': True})``. + +Rectilinear chunks can be used through: + +- **Creating arrays**: Pass nested sequences (e.g., ``[[10, 20, 30], [50, 50]]``) to ``chunks`` + in ``zarr.create_array``, ``zarr.from_array``, ``zarr.zeros``, ``zarr.ones``, ``zarr.full``, + ``zarr.open``, and related functions, or to ``chunk_shape`` in ``zarr.create``. +- **Opening existing arrays**: Arrays stored with the ``rectilinear`` chunk grid are read + transparently via ``zarr.open`` and ``zarr.open_array``. +- **Rectilinear sharding**: Shard boundaries can be rectilinear while inner chunks remain regular. diff --git a/design/chunk-grid.md b/design/chunk-grid.md new file mode 100644 index 0000000000..ac16a4b264 --- /dev/null +++ b/design/chunk-grid.md @@ -0,0 +1,619 @@ +# Unified Chunk Grid + +Version: 6 + +Design document for adding rectilinear (variable) chunk grid support to **zarr-python**, conforming to the [rectilinear chunk grid extension spec](https://github.com/zarr-developers/zarr-extensions/pull/25). 
+ +**Related:** + +- [#3750](https://github.com/zarr-developers/zarr-python/issues/3750) (single ChunkGrid proposal) +- [#3534](https://github.com/zarr-developers/zarr-python/pull/3534) (rectilinear implementation) +- [#3735](https://github.com/zarr-developers/zarr-python/pull/3735) (chunk grid module/registry) +- [ZEP0003](https://github.com/zarr-developers/zeps/blob/main/draft/ZEP0003.md) (variable chunking spec) +- [zarr-specs#370](https://github.com/zarr-developers/zarr-specs/pull/370) (sharding v1.1: non-divisible subchunks) +- [zarr-extensions#25](https://github.com/zarr-developers/zarr-extensions/pull/25) (rectilinear extension) +- [zarr-extensions#34](https://github.com/zarr-developers/zarr-extensions/issues/34) (sharding + rectilinear) + +## Problem + +Chunk grids form a hierarchy — the rectilinear grid is strictly more general than the regular grid. Any regular grid is expressible as a rectilinear grid. There is no known chunk grid that is both (a) more general than rectilinear and (b) retains the axis-aligned tessellation properties Zarr assumes. All known grids are special cases: + +| Grid type | Description | Example | +|---|---|---| +| Regular | Uniform chunk size, boundary chunks padded with fill_value | `[10, 10, 10, 10]` | +| Regular-bounded (zarrs) | Uniform chunk size, boundary chunks trimmed to array extent | `[10, 10, 10, 5]` | +| HPC boundary-padded | Regular interior, larger boundary chunks ([VirtualiZarr#217](https://github.com/zarr-developers/VirtualiZarr/issues/217)) | `[10, 8, 8, 8, 10]` | +| Fully variable | Arbitrary per-chunk sizes | `[5, 12, 3, 20]` | + +Prior iterations on the chunk grid design were based on the Zarr V3 spec's definition of chunk grids as an extension point alongside codecs, dtypes, etc. Therefore, we started designing the chunk grid implementation following a similar registry-based approach. However, in practice chunk grids are fundamentally different than codecs. 
Codecs are independent; supporting `zstd` tells you nothing about `gzip`. Chunk grids are not: every regular grid is a valid rectilinear grid. A registry-based plugin system makes sense for codecs but adds complexity without clear benefit for chunk grids. Here we start from some basic goals and propose a more fitting design for supporting different chunk grids in zarr-python. + +## Goals + +1. **Follow the zarr extension proposal.** The implementation should conform to the [rectilinear chunk grid spec](https://github.com/zarr-developers/zarr-extensions/pull/25), not innovate on the metadata format. +2. **Minimize changes to the public API.** Users creating regular arrays should see no difference. Rectilinear is additive. +3. **Maintain backwards compatibility.** Existing code using `RegularChunkGrid`, `.chunks`, or `isinstance` checks should continue to work (with deprecation warnings where appropriate). +4. **Design for future iteration.** The internal architecture should allow refactoring (e.g., metadata/array separation, new dimension types) without breaking the public API. +5. **Minimize downstream changes.** xarray, VirtualiZarr, Icechunk, Cubed, etc. should need minimal updates. +6. **Minimize time to stable release.** Ship behind a feature flag, stabilize through real-world usage, promote to stable API. +7. **The new API should be useful.** `read_chunk_sizes`/`write_chunk_sizes`, `ChunkGrid.__getitem__`, `is_regular` — these should solve real problems, not just expose internals. +8. **Extensible for other serialization structures.** The per-dimension design should support future encodings (tile, temporal) without changes to indexing or codecs. + +## Design + +### Design choices + +1. **A chunk grid is a concrete arrangement of chunks.** Not an abstract tiling pattern. This means that the chunk grid is bound to specific array dimensions, which enables the chunk grid to answer any question about any chunk (offset, size, count) without external parameters. +2. 
**One implementation, multiple serialization forms.** A single `ChunkGrid` class handles all chunking logic. The serialization format (`"regular"` vs `"rectilinear"`) is chosen by the metadata layer, not the grid. +3. **No chunk grid registry.** Simple name-based dispatch in the metadata layer's `parse_chunk_grid()`. +4. **Fixed vs Varying per dimension.** `FixedDimension(size, extent)` for uniform chunks; `VaryingDimension(edges, extent)` for per-chunk edge lengths with precomputed prefix sums. Avoids expanding regular dimensions into lists of identical values. +5. **Transparent transitions.** Operations like `resize()` can move an array from regular to rectilinear chunking. + +### Internal representation + +```python +@dataclass(frozen=True) +class FixedDimension: + """Uniform chunk size. Boundary chunks contain less data but are + encoded at full size by the codec pipeline.""" + size: int # chunk edge length (>= 0) + extent: int # array dimension length + + def __post_init__(self) -> None: + # validates size >= 0 and extent >= 0 + + @property + def nchunks(self) -> int: + if self.size == 0: + return 0 + return ceildiv(self.extent, self.size) + + def index_to_chunk(self, idx: int) -> int: + return idx // self.size # raises IndexError if OOB + def chunk_offset(self, chunk_ix: int) -> int: + return chunk_ix * self.size # raises IndexError if OOB + def chunk_size(self, chunk_ix: int) -> int: + return self.size # always uniform; raises IndexError if OOB + def data_size(self, chunk_ix: int) -> int: + return max(0, min(self.size, self.extent - chunk_ix * self.size)) # raises IndexError if OOB + @property + def unique_edge_lengths(self) -> Iterable[int]: + return (self.size,) # O(1) + def indices_to_chunks(self, indices: NDArray) -> NDArray: + return indices // self.size + def with_extent(self, new_extent: int) -> FixedDimension: + return FixedDimension(size=self.size, extent=new_extent) + def resize(self, new_extent: int) -> FixedDimension: + return 
FixedDimension(size=self.size, extent=new_extent)
+
+@dataclass(frozen=True)
+class VaryingDimension:
+    """Explicit per-chunk sizes. The last chunk may extend past the array
+    extent, in which case data_size clips to the valid region while
+    chunk_size returns the full edge length for codec processing."""
+    edges: tuple[int, ...]       # per-chunk edge lengths (all > 0)
+    cumulative: tuple[int, ...]  # prefix sums for O(log n) lookup
+    extent: int                  # array dimension length (may be < sum(edges))
+
+    def __init__(self, edges: Sequence[int], extent: int) -> None:
+        # validates edges non-empty, all > 0, extent >= 0, extent <= sum(edges)
+        # computes cumulative via itertools.accumulate
+        # uses object.__setattr__ for frozen dataclass
+
+    @property
+    def nchunks(self) -> int:
+        # number of chunks that overlap [0, extent)
+        if self.extent == 0:
+            return 0
+        return bisect.bisect_left(self.cumulative, self.extent) + 1
+
+    @property
+    def ngridcells(self) -> int:
+        return len(self.edges)
+
+    def index_to_chunk(self, idx: int) -> int:
+        return bisect.bisect_right(self.cumulative, idx)  # raises IndexError if OOB
+    def chunk_offset(self, chunk_ix: int) -> int:
+        return self.cumulative[chunk_ix - 1] if chunk_ix > 0 else 0  # raises IndexError if OOB
+    def chunk_size(self, chunk_ix: int) -> int:
+        return self.edges[chunk_ix]  # raises IndexError if OOB
+    def data_size(self, chunk_ix: int) -> int:
+        offset = self.chunk_offset(chunk_ix)
+        return max(0, min(self.edges[chunk_ix], self.extent - offset))  # raises IndexError if OOB
+    @property
+    def unique_edge_lengths(self) -> Iterable[int]:
+        # lazy generator: yields unseen values, short-circuits deduplication
+    def indices_to_chunks(self, indices: NDArray) -> NDArray:
+        return np.searchsorted(self.cumulative, indices, side='right')
+    def with_extent(self, new_extent: int) -> VaryingDimension:
+        # validates cumulative[-1] >= new_extent (O(1)), re-binds extent
+        return VaryingDimension(self.edges, extent=new_extent)
+    def resize(self, new_extent: int) -> 
VaryingDimension:
+        # grow past edge sum: append chunk of size (new_extent - sum(edges))
+        # shrink or grow within edge sum: preserve all edges, re-bind extent
+```
+
+Both types implement the `DimensionGrid` protocol: `nchunks`, `ngridcells`, `extent`, `index_to_chunk`, `chunk_offset`, `chunk_size`, `data_size`, `indices_to_chunks`, `unique_edge_lengths`, `with_extent`, `resize`. Memory usage scales with the number of chunks along *varying* dimensions only; a fixed dimension costs O(1) regardless of how many chunks it defines.
+
+All per-chunk methods (`chunk_offset`, `chunk_size`, `data_size`) raise `IndexError` for out-of-bounds chunk indices, providing consistent fail-fast behavior across both dimension types.
+
+The two size methods serve different consumers:
+
+| Method | Returns | Consumer |
+|---|---|---|
+| `chunk_size` | Buffer size for codec processing | Codec pipeline (`ArraySpec.shape`) |
+| `data_size` | Valid data region within the buffer | Indexing pipeline (`chunk_selection` slicing) |
+
+For `FixedDimension`, these differ only at the boundary. For `VaryingDimension`, these differ only when the last chunk extends past the extent (i.e., `extent < sum(edges)`). This matches current zarr-python behavior: `get_chunk_spec` passes the full `chunk_shape` to the codec for all chunks, and the indexer generates a `chunk_selection` that clips the decoded buffer.
+
+### DimensionGrid Protocol
+
+```python
+@runtime_checkable
+class DimensionGrid(Protocol):
+    """Structural interface shared by FixedDimension and VaryingDimension."""
+
+    @property
+    def nchunks(self) -> int: ...
+    @property
+    def ngridcells(self) -> int: ...
+    @property
+    def extent(self) -> int: ...
+    def index_to_chunk(self, idx: int) -> int: ...
+    def chunk_offset(self, chunk_ix: int) -> int: ...  # raises IndexError if OOB
+    def chunk_size(self, chunk_ix: int) -> int: ...  # raises IndexError if OOB
+    def data_size(self, chunk_ix: int) -> int: ...  # raises IndexError if OOB
+    def indices_to_chunks(self, indices: NDArray[np.intp]) -> NDArray[np.intp]: ...
+    @property
+    def unique_edge_lengths(self) -> Iterable[int]: ...
+    def with_extent(self, new_extent: int) -> DimensionGrid: ...
+    def resize(self, new_extent: int) -> DimensionGrid: ...
+```
+
+The protocol is `@runtime_checkable`, so `isinstance(dim, DimensionGrid)` structural checks work at runtime; more importantly, callers program against this shared interface rather than branching on the concrete dimension type.
+
+`nchunks` and `ngridcells` differ when `extent < sum(edges)`: `nchunks` counts only chunks that overlap `[0, extent)`, while `ngridcells` counts total defined grid cells (i.e., `len(edges)`). For `FixedDimension`, both are equal. For `VaryingDimension`, they differ after a resize that shrinks the extent below the edge sum.
+
+### ChunkSpec
+
+```python
+@dataclass(frozen=True)
+class ChunkSpec:
+    slices: tuple[slice, ...]     # valid data region in array coordinates
+    codec_shape: tuple[int, ...]  # buffer shape for codec processing
+
+    @property
+    def shape(self) -> tuple[int, ...]:
+        return tuple(s.stop - s.start for s in self.slices)
+
+    @property
+    def is_boundary(self) -> bool:
+        return self.shape != self.codec_shape
+```
+
+For interior chunks, `shape == codec_shape`. For boundary chunks of a regular grid, `codec_shape` is the full declared chunk size while `shape` is clipped. For rectilinear grids, `shape == codec_shape` unless the last chunk extends past the extent.
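The `chunk_size`/`data_size` split and the `nchunks` semantics above can be exercised with a minimal, self-contained sketch of the two dimension types. This is a simplification for illustration — `FixedDim`/`VaryingDim` are hypothetical names, not the actual zarr-python classes:

```python
import bisect
import math
from dataclasses import dataclass, field
from itertools import accumulate


@dataclass(frozen=True)
class FixedDim:
    """Sketch of FixedDimension: uniform chunk size; the extent clips data."""
    size: int    # chunk edge length
    extent: int  # array dimension length

    @property
    def nchunks(self) -> int:
        return 0 if self.size == 0 else math.ceil(self.extent / self.size)

    def chunk_size(self, i: int) -> int:
        return self.size  # codec buffer is always the full declared size

    def data_size(self, i: int) -> int:
        # boundary chunks hold only the remaining data
        return max(0, min(self.size, self.extent - i * self.size))


@dataclass(frozen=True)
class VaryingDim:
    """Sketch of VaryingDimension: explicit edges, prefix sums for lookup."""
    edges: tuple[int, ...]
    extent: int
    cumulative: tuple[int, ...] = field(init=False)

    def __post_init__(self) -> None:
        object.__setattr__(self, "cumulative", tuple(accumulate(self.edges)))

    @property
    def nchunks(self) -> int:
        # chunks overlapping [0, extent)
        if self.extent == 0:
            return 0
        return bisect.bisect_left(self.cumulative, self.extent) + 1

    def chunk_offset(self, i: int) -> int:
        return self.cumulative[i - 1] if i > 0 else 0

    def chunk_size(self, i: int) -> int:
        return self.edges[i]

    def data_size(self, i: int) -> int:
        # clips the last chunk when extent < sum(edges), e.g. after a shrink
        return max(0, min(self.edges[i], self.extent - self.chunk_offset(i)))


fixed = FixedDim(size=10, extent=95)
print(fixed.nchunks)                                 # 10
print(fixed.chunk_size(9), fixed.data_size(9))       # 10 5

varying = VaryingDim(edges=(10, 20, 30), extent=55)  # shrunk below sum(edges)=60
print(varying.nchunks)                               # 3
print(varying.chunk_size(2), varying.data_size(2))   # 30 25
```

When `extent == sum(edges)` the two sizes agree for every chunk of a varying dimension, matching the consumer table above.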
+
+### API
+
+```python
+# Creating arrays
+arr = zarr.create_array(shape=(95, 200), chunks=(10, 20))  # regular
+varr = zarr.create_array(shape=(60, 100), chunks=[[10, 20, 30], [25, 25, 25, 25]])  # rectilinear
+
+# ChunkGrid as a collection
+grid = arr._chunk_grid  # behavioral ChunkGrid (bound to array shape)
+grid.grid_shape  # (10, 10) — number of chunks per dimension
+grid.ndim        # 2
+grid.is_regular  # True if all dimensions are Fixed
+
+spec = grid[0, 1]  # ChunkSpec for chunk at grid position (0, 1)
+spec.slices       # (slice(0, 10), slice(20, 40))
+spec.shape        # (10, 20) — data shape
+spec.codec_shape  # (10, 20) — same for interior chunks
+
+boundary = grid[9, 0]  # boundary chunk (extent=95, size=10)
+boundary.shape         # (5, 20) — data shape, clipped at the extent
+boundary.codec_shape   # (10, 20) — codec sees the full buffer
+
+grid[99, 99]  # None — out of bounds
+
+for spec in grid:  # iterate all chunks
+    ...
+
+# .chunks property: retained for regular grids, raises NotImplementedError for rectilinear
+arr.chunks  # (10, 20)
+
+# .read_chunk_sizes / .write_chunk_sizes: works for all grids (dask-style)
+arr.write_chunk_sizes  # ((10, 10, ..., 10, 5), (20, 20, ..., 20))
+```
+
+`ChunkGrid.__getitem__` constructs `ChunkSpec` using `chunk_size` for `codec_shape` and `data_size` for `slices`:
+
+```python
+def __getitem__(self, coords: int | tuple[int, ...]) -> ChunkSpec | None:
+    if isinstance(coords, int):
+        coords = (coords,)
+    slices = []
+    codec_shape = []
+    for dim, ix in zip(self.dimensions, coords):
+        if ix < 0 or ix >= dim.nchunks:
+            return None
+        offset = dim.chunk_offset(ix)
+        slices.append(slice(offset, offset + dim.data_size(ix)))
+        codec_shape.append(dim.chunk_size(ix))
+    return ChunkSpec(tuple(slices), tuple(codec_shape))
+```
+
+#### Construction
+
+Both `from_regular` and `from_rectilinear` require `array_shape`, binding the extent per dimension at construction time. This is a core design choice: a chunk grid is a concrete arrangement for a specific array, not an abstract tiling pattern.
+ +```python +# Regular grid — all FixedDimension +grid = ChunkGrid.from_regular(array_shape=(100, 200), chunk_shape=(10, 20)) + +# Rectilinear grid — extent = sum(edges) when shape matches +grid = ChunkGrid.from_rectilinear([[10, 20, 30], [25, 25, 25, 25]], array_shape=(60, 100)) + +# Rectilinear grid with boundary clipping — last chunk extends past array extent +# e.g., shape=(55, 90) but edges sum to (60, 100): data_size clips at extent +grid = ChunkGrid.from_rectilinear([[10, 20, 30], [25, 25, 25, 25]], array_shape=(55, 90)) + +# Direct construction +grid = ChunkGrid(dimensions=(FixedDimension(10, 100), VaryingDimension([10, 20, 30], 55))) +``` + +When `extent < sum(edges)`, the dimension is always stored as `VaryingDimension` (even if all edges are identical) to preserve the explicit edge count. The last chunk's `chunk_size` returns the full declared edge (codec buffer) while `data_size` clips to the extent. This mirrors how `FixedDimension` handles boundary chunks in regular grids. + +#### Serialization + +```python +# Regular grid: +{"name": "regular", "configuration": {"chunk_shape": [10, 20]}} + +# Rectilinear grid (with RLE compression and "kind" field): +{"name": "rectilinear", "configuration": {"kind": "inline", "chunk_shapes": [[10, 20, 30], [[25, 4]]]}} +``` + +Both names deserialize to the same `ChunkGrid` class. The serialized form does not include the array extent — that comes from `shape` in array metadata and is combined with the chunk grid when constructing a behavioral `ChunkGrid` via `ChunkGrid.from_metadata()`. + +**The `ChunkGrid` does not serialize itself.** The format choice (`"regular"` vs `"rectilinear"`) belongs to `ArrayV3Metadata`. Serialization and deserialization are handled by the metadata-layer chunk grid classes (`RegularChunkGrid` and `RectilinearChunkGrid` in `metadata/v3.py`), which provide `to_dict()` and `from_dict()` methods. 
+ +For `create_array`, the format is inferred from the `chunks` argument: a flat tuple produces `"regular"`, a nested list produces `"rectilinear"`. The `_is_rectilinear_chunks()` helper detects nested sequences like `[[10, 20], [5, 5]]`. + +##### Rectilinear spec compliance + +The rectilinear format requires `"kind": "inline"` (validated by `validate_rectilinear_kind()`). Per the spec, each element of `chunk_shapes` can be: + +- A bare integer `m`: repeated until `sum >= array_extent` +- A list of bare integers: explicit per-chunk sizes +- A mixed array of bare integers and `[value, count]` RLE pairs + +RLE compression is used when serializing: runs of identical sizes become `[value, count]` pairs, singletons stay as bare integers. + +```python +# compress_rle([10, 10, 10, 5]) -> [[10, 3], 5] +# expand_rle([[10, 3], 5]) -> [10, 10, 10, 5] +``` + +For a single-element `chunk_shapes` tuple like `(10,)`, `RectilinearChunkGrid.to_dict()` serializes it as a bare integer `10`. Per the rectilinear spec, a bare integer is repeated until the sum >= extent, preserving the full codec buffer size for boundary chunks. + +**Zero-extent handling:** Regular grids serialize zero-extent dimensions without issue (the format encodes only `chunk_shape`, no edges). Rectilinear grids cannot represent zero-extent dimensions because the spec requires at least one positive-integer edge length per axis. + +#### read_chunk_sizes / write_chunk_sizes + +The `read_chunk_sizes` and `write_chunk_sizes` properties provide universal access to per-dimension chunk data sizes, matching the dask `Array.chunks` convention. 
They work for both regular and rectilinear grids: + +- `write_chunk_sizes`: always returns outer (storage) chunk sizes +- `read_chunk_sizes`: returns inner chunk sizes when sharding is used, otherwise same as `write_chunk_sizes` + +```python +>>> arr = zarr.create_array(store, shape=(100, 80), chunks=(30, 40)) +>>> arr.write_chunk_sizes +((30, 30, 30, 10), (40, 40)) + +>>> arr = zarr.create_array(store, shape=(60, 100), chunks=[[10, 20, 30], [50, 50]]) +>>> arr.write_chunk_sizes +((10, 20, 30), (50, 50)) +``` + +The underlying `ChunkGrid.chunk_sizes` property (on the grid, not the array) returns the same as `write_chunk_sizes`. + +#### Resize + +```python +arr.resize((80, 100)) # re-binds extent; FixedDimension stays fixed +arr.resize((200, 100)) # VaryingDimension grows by appending a new chunk +arr.resize((30, 100)) # VaryingDimension shrinks: preserves all edges, re-binds extent +``` + +Resize uses `ChunkGrid.update_shape(new_shape)`, which delegates to each dimension's `.resize()` method: +- `FixedDimension.resize()`: simply re-binds the extent (identical to `with_extent`) +- `VaryingDimension.resize()`: grow past `sum(edges)` appends a chunk covering the gap; shrink or grow within `sum(edges)` preserves all edges and re-binds the extent (the spec allows trailing edges beyond the array extent) + +**Known limitation (deferred):** When growing a `VaryingDimension`, the current implementation always appends a single chunk covering the new region. For example, `[10, 10, 10]` resized from 30 to 45 produces `[10, 10, 10, 15]` instead of the more natural `[10, 10, 10, 10, 10]`. A future improvement should add an optional `chunks` parameter to `resize()` that controls how the new region is partitioned, with a sane default (e.g., repeating the last chunk size). 
This is safely deferrable because: +- `FixedDimension` already handles resize correctly (regular grids stay regular) +- The single-chunk default produces valid state, just suboptimal chunk layout +- Rectilinear arrays are behind an experimental feature flag +- Adding an optional parameter is backwards-compatible + +Open design questions for the `chunks` parameter: +- Does it describe the new region only, or the entire post-resize array? +- Must the overlapping portion agree with existing chunks (no rechunking)? +- What is the type? Same as `chunks` in `create_array`? + +#### from_array + +The `from_array()` function handles both regular and rectilinear source arrays: + +```python +src = zarr.create_array(store, shape=(60, 100), chunks=[[10, 20, 30], [50, 50]]) +new = zarr.from_array(data=src, store=new_store, chunks="keep") +# Preserves rectilinear structure: new.write_chunk_sizes == ((10, 20, 30), (50, 50)) +``` + +When `chunks="keep"`, the logic checks `data._chunk_grid.is_regular`: +- Regular: extracts `data.chunks` (flat tuple) and preserves shards +- Rectilinear: extracts `data.write_chunk_sizes` (nested tuples) and forces shards to None + +### Indexing + +The indexing pipeline is coupled to regular grid assumptions — every per-dimension indexer takes a scalar `dim_chunk_len: int` and uses `//` and `*`: + +```python +dim_chunk_ix = self.dim_sel // self.dim_chunk_len # IntDimIndexer +dim_offset = dim_chunk_ix * self.dim_chunk_len # SliceDimIndexer +``` + +Replace `dim_chunk_len: int` with the dimension object (`FixedDimension | VaryingDimension`). The shared interface means the indexer code structure stays the same — `dim_sel // dim_chunk_len` becomes `dim_grid.index_to_chunk(dim_sel)`. O(1) for regular, binary search for varying. + +### Codec pipeline + +Today, `get_chunk_spec()` returns the same `ArraySpec(shape=chunk_grid.chunk_shape)` for every chunk. 
For rectilinear grids, each chunk has a different codec shape: + +```python +def get_chunk_spec(self, chunk_coords, array_config, prototype) -> ArraySpec: + spec = self._chunk_grid[chunk_coords] + return ArraySpec(shape=spec.codec_shape, ...) +``` + +Note `spec.codec_shape`, not `spec.shape`. For regular grids, `codec_shape` is uniform (preserving current behavior). The boundary clipping flow is unchanged: + +``` +Write: user data → pad to codec_shape with fill_value → encode → store +Read: store → decode to codec_shape → slice via chunk_selection → user data +``` + +### Sharding + +The `ShardingCodec` constructs a `ChunkGrid` per shard using the shard shape as extent and the subchunk shape as `FixedDimension`. Each shard is self-contained — it doesn't need to know whether the outer grid is regular or rectilinear. Validation checks that every unique edge length per dimension is divisible by the inner chunk size, using `dim.unique_edge_lengths` for efficient polymorphic iteration (O(1) for fixed dimensions, lazy-deduplicated for varying). + +``` +Level 1 — Outer chunk grid (shard boundaries): regular or rectilinear +Level 2 — Inner subchunk grid (within each shard): always regular +Level 3 — Shard index: ceil(shard_dim / subchunk_dim) entries per dimension +``` + +[zarr-specs#370](https://github.com/zarr-developers/zarr-specs/pull/370) lifts the requirement that subchunk shapes evenly divide the shard shape. With the proposed `ChunkGrid`, this just means removing the `shard_shape % subchunk_shape == 0` validation — `FixedDimension` already handles boundary clipping via `data_size`. 
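The divisibility validation described above (the check that v1.1 support would remove) can be sketched polymorphically over `unique_edge_lengths`. The function name and signature are illustrative, not the actual API:

```python
from typing import Iterable


def validate_shard_divisibility(
    unique_edge_lengths_per_dim: list[Iterable[int]],
    inner_chunk_shape: tuple[int, ...],
) -> None:
    """Every distinct outer-chunk (shard) edge length must be divisible by
    the inner subchunk size along that dimension. For a fixed dimension this
    is O(1) work, since it exposes exactly one unique edge length."""
    for dim_edges, inner in zip(unique_edge_lengths_per_dim, inner_chunk_shape):
        for edge in dim_edges:
            if edge % inner != 0:
                raise ValueError(
                    f"shard edge {edge} is not divisible by inner chunk size {inner}"
                )


# Regular outer grid: a single unique edge per dimension
validate_shard_divisibility([(40,), (30,)], (10, 15))
# Rectilinear outer grid: each distinct shard edge must still divide evenly
validate_shard_divisibility([(20, 40), (30,)], (10, 15))
```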
+ +| Outer grid | Subchunk divisibility | Required change | +|---|---|---| +| Regular | Evenly divides (v1.0) | None | +| Regular | Non-divisible (v1.1) | Remove divisibility validation | +| Rectilinear | Evenly divides | Remove "sharding incompatible" guard | +| Rectilinear | Non-divisible | Both changes | + +### What this replaces + +| Current | Proposed | +|---|---| +| `ChunkGrid` ABC + `RegularChunkGrid` subclass | Single concrete `ChunkGrid` with `is_regular` | +| `RectilinearChunkGrid` (#3534) | Same `ChunkGrid` class | +| Chunk grid registry + entrypoints (#3735) | Direct name dispatch | +| `arr.chunks` | Retained for regular; `arr.read_chunk_sizes`/`arr.write_chunk_sizes` for general use | +| `get_chunk_shape(shape, coord)` | `grid[coord].codec_shape` or `grid[coord].shape` | + +## Design decisions + +### Why store the extent in ChunkGrid? + +The chunk grid is a concrete arrangement, not an abstract tiling pattern. A finite collection naturally has an extent. Storing it enables `__getitem__`, eliminates `dim_len` parameters from every method, and makes the grid self-describing. + +This does *not* mean `ArrayV3Metadata.shape` should delegate to the grid. The array shape remains an independent field in metadata. The extent is passed into the grid at construction time so it can answer boundary questions without external parameters. It is **not** serialized as part of the chunk grid JSON — it comes from the `shape` field in array metadata and is combined with the chunk grid configuration in `ChunkGrid.from_metadata()`. + +### Why distinguish chunk_size from data_size? + +A chunk in a regular grid has two sizes. `chunk_size` is the buffer size the codec processes — always `size` for `FixedDimension`, even at the boundary (padded with `fill_value`). `data_size` is the valid data region — clipped to `extent % size` at the boundary. The indexing layer uses `data_size` to generate `chunk_selection` slices. 
+ +This matches current zarr-python behavior and matters for: +1. **Backward compatibility.** Existing stores have boundary chunks encoded at full `chunk_shape`. +2. **Codec simplicity.** Codecs assume uniform input shapes for regular grids. +3. **Shard index correctness.** The index assumes `subchunk_dim`-sized entries. + +For `VaryingDimension`, `chunk_size == data_size` when `extent == sum(edges)`. When `extent < sum(edges)` (e.g., after a resize that keeps the last chunk oversized), `data_size` clips the last chunk. This is the fundamental difference: `FixedDimension` has a declared size plus an extent that clips data; `VaryingDimension` has explicit sizes that normally *are* the extent but can also extend past it. + +### Why not a chunk grid registry? + +There is no known chunk grid outside the rectilinear family that retains the tessellation properties zarr-python assumes. A `match` on the grid name is sufficient. + +### Why a single class instead of RegularChunkGrid + RectilinearChunkGrid? + +[Discussed in #3534.](https://github.com/zarr-developers/zarr-python/pull/3534) @d-v-b argued that `RegularChunkGrid` is unnecessary since rectilinear is more general; @dcherian argued that downstream libraries need a fast way to detect regular grids without inspecting potentially millions of chunk edges (see [xarray#9808](https://github.com/pydata/xarray/pull/9808)). + +The resolution: a single `ChunkGrid` class with an `is_regular` property (O(1), cached at construction). This gives downstream code the fast-path detection @dcherian needed without the class hierarchy complexity @d-v-b wanted to avoid. The metadata document's `name` field (`"regular"` vs `"rectilinear"`) is also available for clients who inspect JSON directly. + +A `RegularChunkGrid` deprecation shim preserves `isinstance` checks for existing code — see [Backwards compatibility](#backwards-compatibility). + +### Why is ChunkGrid a concrete class instead of a Protocol/ABC? 
+ +The old design had `ChunkGrid` as an ABC with `RegularChunkGrid` as a subclass. #3534 added `RectilinearChunkGrid` as a second subclass. This branch makes `ChunkGrid` a single concrete class instead. + +All known grids are special cases of rectilinear, so there's no need for a class hierarchy at the grid level. A `ChunkGrid` Protocol/ABC would mean every caller programs against an abstract interface and adding a grid type requires implementing ~15 methods. A single class is simpler. + +Note: the *dimension* types (`FixedDimension`, `VaryingDimension`) do use a `DimensionGrid` Protocol — that's where the polymorphism lives. The grid-level class is concrete; the dimension-level types are polymorphic. If a genuinely novel grid type emerges that can't be expressed as a combination of per-dimension types, a grid-level Protocol can be extracted. + +### Why `.chunks` raises for rectilinear grids + +[Debated in #3534.](https://github.com/zarr-developers/zarr-python/pull/3534) @d-v-b suggested making `.chunks` return `tuple[tuple[int, ...], ...]` (dask-style) for all grids. @dcherian strongly objected: every downstream consumer expects `tuple[int, ...]`, and silently returning a different type would be worse than raising. Materializing O(10M) chunk edges into a Python tuple is also a real performance risk ([xarray#8902](https://github.com/pydata/xarray/issues/8902#issuecomment-2546127373)). + +The resolution: +- `.chunks` is retained for regular grids (returns `tuple[int, ...]` as before) +- `.chunks` raises `NotImplementedError` for rectilinear grids with a message pointing to `.read_chunk_sizes`/`.write_chunk_sizes` +- `.read_chunk_sizes` and `.write_chunk_sizes` return `tuple[tuple[int, ...], ...]` (dask convention) for all grids + +@maxrjones noted in review that deprecating `.chunks` for regular grids was not desirable. The current branch does not deprecate it. 
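The resolved behavior gives downstream code a uniform access path. A sketch of the resulting consumer pattern — `FakeArray` is a stand-in modeling only the surface described above (`.chunks` raising for rectilinear, `.write_chunk_sizes` for all grids), not a real zarr object:

```python
class FakeArray:
    """Stand-in for a zarr Array; models only the documented surface."""

    def __init__(self, write_chunk_sizes, is_regular):
        self.write_chunk_sizes = write_chunk_sizes  # dask-style nested tuples
        self.is_regular = is_regular

    @property
    def chunks(self):
        if not self.is_regular:
            raise NotImplementedError(
                "variable-sized chunks; use .read_chunk_sizes/.write_chunk_sizes"
            )
        # Stand-in behavior: collapse each uniform run to its leading size.
        # (The real property returns the declared chunk shape directly.)
        return tuple(sizes[0] for sizes in self.write_chunk_sizes)


def dask_style_chunks(arr) -> tuple[tuple[int, ...], ...]:
    """Uniform access: works for regular and rectilinear arrays alike."""
    return arr.write_chunk_sizes


regular = FakeArray(((10, 10, 10), (20, 20)), is_regular=True)
variable = FakeArray(((10, 20, 30), (50, 50)), is_regular=False)

print(regular.chunks)               # (10, 20)
print(dask_style_chunks(variable))  # ((10, 20, 30), (50, 50))
```

Code that genuinely needs a flat tuple keeps using `.chunks` and gets a loud failure on rectilinear arrays instead of a silently different type.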
+ +### User control over grid serialization format + +@d-v-b raised in #3534 that users need a way to say "these chunks are regular, but serialize as rectilinear" (e.g., to allow future append/extend workflows without format changes). @jhamman initially made nested-list input always produce `RectilinearChunkGrid`. + +The current branch resolves this via the metadata-layer chunk grid classes. When metadata is deserialized, the original name (from `{"name": "regular"}` or `{"name": "rectilinear"}`) determines which metadata class is instantiated (`RegularChunkGrid` or `RectilinearChunkGrid`), and that class handles serialization via `to_dict()`. Current inference behavior for `create_array`: +- `chunks=(10, 20)` (flat tuple) → infers `"regular"` +- `chunks=[[10, 20], [5, 5]]` (nested lists with varying sizes) → infers `"rectilinear"` +- `chunks=[[10, 10], [20, 20]]` (nested lists with uniform sizes) → `from_rectilinear` collapses to `FixedDimension`, so `is_regular=True` and infers `"regular"` + +**Open question:** Should uniform nested lists preserve `"rectilinear"` to support future append workflows without a format change? This could be addressed by checking the input form before collapsing, or by allowing users to pass `chunk_grid_name` explicitly through the `create_array` API. + +### Deferred: Tiled/periodic chunk patterns + +[#3750 discussion](https://github.com/zarr-developers/zarr-python/issues/3750) identified periodic chunk patterns as a use case not efficiently served by RLE alone. RLE compresses runs of identical values (`np.repeat`), but periodic patterns like days-per-month (`[31, 28, 31, 30, ...]` repeated 30 years) need a tile encoding (`np.tile`). 
Real-world examples include: + +- **Oceanographic models** (ROMS): HPC boundary-padded chunks like `[10, 8, 8, 8, 10]` — handled by RLE +- **Temporal axes**: days-per-month, hours-per-day — need tile encoding for compact metadata +- **Temporal-aware grids**: date/time-aware chunk grids that layer over other axes (raised by @LDeakin) + +A `TiledDimension` prototype was built ([commit 9c0f582](https://github.com/maxrjones/zarr-python/commit/9c0f582f)) demonstrating that the per-dimension design supports this without changes to indexing or the codec pipeline. However, it was intentionally excluded from this release because: + +1. **Metadata format must come first.** Tile encoding requires a new `kind` value in the rectilinear spec (currently only `"inline"` is defined). This should go through [zarr-extensions#25](https://github.com/zarr-developers/zarr-extensions/pull/25), not zarr-python unilaterally. +2. **The per-dimension architecture doesn't preclude it.** A future `TiledDimension` can implement the `DimensionGrid` protocol alongside `FixedDimension` and `VaryingDimension` with no changes to indexing, codecs, or the `ChunkGrid` class. +3. **RLE covers the MVP.** Most real-world variable chunk patterns (HPC boundaries, irregular partitions) are efficiently encoded with RLE. Tile encoding is an optimization for a specific (temporal) subset. + +### Metadata / Array separation (partially implemented) + +An earlier design doc proposed decoupling `ChunkGrid` (behavioral) from `ArrayV3Metadata` (data), so that metadata would store only a plain dict and the array layer would construct the `ChunkGrid`. + +The current implementation partially realizes this separation: + +- **Metadata DTOs** (`RegularChunkGrid`, `RectilinearChunkGrid` in `metadata/v3.py`): Pure data, frozen dataclasses, no array shape. These live on `ArrayV3Metadata.chunk_grid` and represent only what goes into `zarr.json`. 
+- **Behavioral `ChunkGrid`** (`chunk_grids.py`): Shape-bound, supports indexing, iteration, and chunk specs. Lives on `AsyncArray.chunk_grid`, constructed from metadata + `shape` via `ChunkGrid.from_metadata()`. + +This means `ArrayV3Metadata.chunk_grid` is now a `ChunkGridMetadata` (the DTO union type), **not** the behavioral `ChunkGrid`. Code that previously accessed behavioral methods on `metadata.chunk_grid` (e.g., `all_chunk_coords()`, `__getitem__`) must now use the behavioral grid from the array layer instead. + +The name controls serialization format; each metadata DTO class provides its own `to_dict()` method for serialization. The behavioral grid handles all runtime queries. + +## Prior art + +**zarrs (Rust):** Three independent grid types behind a `ChunkGridTraits` trait. Key patterns adopted: Fixed vs Varying per dimension, prefix sums + binary search, `Option` for out-of-bounds, `NonZeroU64` for chunk dimensions, separate subchunk grid per shard, array shape at construction. + +**TensorStore (C++):** Stores only `chunk_shape` — boundary clipping via `valid_data_bounds` at query time. Both `RegularGridRef` and `IrregularGrid` internally. No registry. + +## Migration + +### Backwards compatibility + +A `RegularChunkGrid` deprecation shim preserves the three common usage patterns: + +```python +from zarr.core.chunk_grids import RegularChunkGrid # works (no ImportError) + +# Construction emits DeprecationWarning, returns a real ChunkGrid +grid = RegularChunkGrid(chunk_shape=(10, 20)) + +# isinstance works via __instancecheck__ metaclass +isinstance(grid, RegularChunkGrid) # True for any regular ChunkGrid +``` + +The shim uses `chunk_shape` as extent (matching the old shape-unaware behavior). The deprecation warning directs users to `ChunkGrid.from_regular()`. + +**Known limitation:** Because the shim binds `extent=chunk_shape`, `RegularChunkGrid(chunk_shape=(100,)).get_nchunks()` returns `1` (one chunk of size 100 in a dimension of extent 100). 
This is intentional — the old `RegularChunkGrid` was shape-unaware, and the shim preserves that by using the chunk shape as a stand-in extent. Code that relied on constructing a `RegularChunkGrid` and later querying `nchunks` without binding an array shape must migrate to `ChunkGrid.from_regular(array_shape, chunk_shape)`. + +### Downstream migration + +| Two-class pattern | Unified pattern | +|---|---| +| `isinstance(cg, RegularChunkGrid)` | `cg.is_regular` (or keep `isinstance` — shim handles it) | +| `isinstance(cg, RectilinearChunkGrid)` | `not cg.is_regular` | +| `cg.chunk_shape` | `cg.dimensions[i].size` or `cg[coord].shape` | +| `cg.chunk_shapes` | `tuple(d.edges for d in cg.dimensions)` | +| `RegularChunkGrid(chunk_shape=...)` | `ChunkGrid.from_regular(shape, chunks)` | +| `RectilinearChunkGrid(chunk_shapes=...)` | `ChunkGrid.from_rectilinear(edges, shape)` | +| Feature detection via class import | Version check or `hasattr(ChunkGrid, 'is_regular')` | + +**[xarray#10880](https://github.com/pydata/xarray/pull/10880):** Replace `isinstance` checks with `.is_regular`. Write path simplifies with `chunks=[[...]]` API. + +**[VirtualiZarr#877](https://github.com/zarr-developers/VirtualiZarr/pull/877):** Drop vendored `_is_nested_sequence`. Replace `isinstance` checks. + +**[Icechunk#1338](https://github.com/earth-mover/icechunk/issues/1338):** Minimal impact — format changes driven by spec, not class hierarchy. + +**[cubed#876](https://github.com/cubed-dev/cubed/issues/876):** Switch store creation to `ChunkGrid` API. @tomwhite confirmed in #3534 that rechunking with variable-sized intermediate chunks works. + +**HEALPix use case:** @tinaok demonstrated in #3534 that variable-chunked arrays arise naturally when grouping HEALPix cells by parent pixel — the chunk sizes come from `np.unique(parents, return_counts=True)`. 
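That chunk-size derivation can be sketched with plain NumPy; the synthetic `parents` array below is a stand-in for the parent-pixel ids that `healpix_geo` would produce:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for HEALPix parent-pixel ids at a coarser level:
# 10,000 cells distributed (unevenly, after sparsity) over 8 parent tiles.
parents = np.sort(rng.integers(0, 8, size=10_000))

# One chunk per parent tile; the chunk sizes fall out of the grouping.
_, chunk_sizes = np.unique(parents, return_counts=True)
assert chunk_sizes.sum() == parents.size

# The resulting edge lengths would then be passed as a nested list, e.g.
# zarr.create_array(..., shape=(parents.size,), chunks=[chunk_sizes.tolist()])
print(chunk_sizes)
```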
+ +### Credits + +This implementation builds on prior work: + +- **[#3534](https://github.com/zarr-developers/zarr-python/pull/3534)** (@jhamman) — RLE helpers, validation logic, test cases, and the review discussion that shaped the architecture. +- **[#3737](https://github.com/zarr-developers/zarr-python/pull/3737)** — extent-in-grid idea (adopted per-dimension). +- **[#1483](https://github.com/zarr-developers/zarr-python/pull/1483)** — original variable chunking POC. +- **[#3736](https://github.com/zarr-developers/zarr-python/pull/3736)** — resolved by storing extent per-dimension. + +### Suggested PR sequence + +If the design is accepted, the POC branch can be split into 5 incremental PRs. PRs 1–2 are where the design decisions are reviewed; PRs 3–5 are mechanical consequences. + +**PR 1: Per-dimension types + ChunkSpec** (purely additive) +- `FixedDimension`, `VaryingDimension`, `DimensionGrid` protocol, `ChunkSpec` +- RLE helpers (`_expand_rle`, `_compress_rle`, `_decode_dim_spec`) +- `ChunkGridName` type alias +- Unit tests for all new types +- Zero changes to existing code + +**PR 2: Unified ChunkGrid class + serialization** (replaces hierarchy) +- `ChunkGrid` with `from_regular`, `from_rectilinear`, `from_metadata`, `__getitem__`, `__iter__`, `all_chunk_coords`, `is_regular`, `chunk_shape`, `chunk_sizes`, `unique_edge_lengths` +- `RegularChunkGrid` deprecation shim +- Metadata-layer serialization via `RegularChunkGrid.to_dict()`/`RectilinearChunkGrid.to_dict()` +- Feature flag (`array.rectilinear_chunks`) + +**PR 3: Indexing generalization** +- Replace `dim_chunk_len: int` with `dim_grid: DimensionGrid` in all per-dimension indexers +- Vectorized `indices_to_chunks()` in `IntArrayDimIndexer` and `CoordinateIndexer` + +**PR 4: Array, codec pipeline, and sharding integration** +- Wire `ChunkGrid` into `create_array` / `init_array` +- `get_chunk_spec()` → `grid[chunk_coords].codec_shape` +- Sharding validation via `dim.unique_edge_lengths` +- 
`arr.read_chunk_sizes`, `arr.write_chunk_sizes`, `from_array` with `chunks="keep"`, resize support +- Hypothesis strategies for rectilinear grids + +**PR 5: End-to-end tests + docs** +- Full pipeline tests (create → write → read → verify) +- V2 backwards compatibility regression tests +- Boundary/overflow/edge case tests +- Design doc and user guide updates + +## Open questions + +1. **Resize defaults (deferred):** When growing a rectilinear array, should `resize()` accept an optional `chunks` parameter? See the [Resize section](#resize) for details and open design questions. Regular arrays already stay regular on resize. +2. **`ChunkSpec` complexity:** `ChunkSpec` carries both `slices` and `codec_shape`. Should the grid expose separate methods for codec vs data queries instead? +3. **`__getitem__` with slices:** Should `grid[0, :]` or `grid[0:3, :]` return a sub-grid or an iterator of `ChunkSpec`s? +4. **Uniform nested lists:** Should `chunks=[[10, 10], [20, 20]]` serialize as `"rectilinear"` (preserving user intent for future append) or `"regular"` (current behavior, collapses uniform edges)? See [User control over grid serialization format](#user-control-over-grid-serialization-format). +5. **`zarr.open` with rectilinear:** @tomwhite noted in #3534 that `zarr.open(mode="w")` doesn't support rectilinear chunks directly. This could be addressed in a follow-up. 
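For reference, the prefix-sum plus binary-search lookup adopted from zarrs (and vectorized in PR 3's `indices_to_chunks()`) can be sketched as follows. This `VaryingDimension` is a simplified illustration, not the branch's actual class; it omits out-of-bounds handling, for example:

```python
import numpy as np

class VaryingDimension:
    """Map array indices to chunk indices via prefix sums + binary search."""

    def __init__(self, edges):
        self.edges = np.asarray(edges, dtype=np.int64)
        # offsets[i] is the start of chunk i; offsets[-1] is the extent.
        self.offsets = np.concatenate([[0], np.cumsum(self.edges)])

    def indices_to_chunks(self, indices):
        # Index i lands in chunk j where offsets[j] <= i < offsets[j + 1],
        # found by binary search, vectorized over the whole index batch.
        return np.searchsorted(self.offsets, indices, side="right") - 1

dim = VaryingDimension([10, 20, 30])  # dimension extent 60
print(dim.indices_to_chunks([0, 9, 10, 29, 30, 59]))  # [0 0 1 1 2 2]
```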
+ +## Proofs of concepts + +- Zarr-Python: + - branch - https://github.com/maxrjones/zarr-python/tree/poc/unified-chunk-grid + - diff - https://github.com/zarr-developers/zarr-python/compare/main...maxrjones:zarr-python:poc/unified-chunk-grid?expand=1 +- Xarray: + - branch - https://github.com/maxrjones/xarray/tree/poc/unified-zarr-chunk-grid + - diff - https://github.com/pydata/xarray/compare/main...maxrjones:xarray:poc/unified-zarr-chunk-grid?expand=1 +- VirtualiZarr: + - branch - https://github.com/maxrjones/VirtualiZarr/tree/poc/unified-chunk-grid + - diff - https://github.com/zarr-developers/VirtualiZarr/compare/main...maxrjones:VirtualiZarr:poc/unified-chunk-grid?expand=1 +- Virtual TIFF: + - branch - https://github.com/virtual-zarr/virtual-tiff/tree/poc/unified-chunk-grid + - diff - https://github.com/virtual-zarr/virtual-tiff/compare/main...poc/unified-chunk-grid?expand=1 +- Cubed: + - branch - https://github.com/maxrjones/cubed/tree/poc/unified-chunk-grid +- Microbenchmarks: + - https://github.com/maxrjones/zarr-chunk-grid-tests/tree/unified-chunk-grid diff --git a/docs/user-guide/arrays.md b/docs/user-guide/arrays.md index a44c096b73..e230d7f962 100644 --- a/docs/user-guide/arrays.md +++ b/docs/user-guide/arrays.md @@ -599,6 +599,171 @@ In this example a shard shape of (1000, 1000) and a chunk shape of (100, 100) is This means that `10*10` chunks are stored in each shard, and there are `10*10` shards in total. Without the `shards` argument, there would be 10,000 chunks stored as individual files. +## Rectilinear (variable) chunk grids + +!!! warning "Experimental" + Rectilinear chunk grids are an experimental feature and may change in + future releases. This feature is expected to stabilize in Zarr version 3.3. 
+ + Because the feature is still stabilizing, it is disabled by default and + must be explicitly enabled: + + ```python + import zarr + zarr.config.set({"array.rectilinear_chunks": True}) + ``` + + Or via the environment variable `ZARR_ARRAY__RECTILINEAR_CHUNKS=True`. + + The examples below assume this config has been set. + +By default, Zarr arrays use a regular chunk grid where every chunk along a +given dimension has the same size (except possibly the final boundary chunk). +Rectilinear chunk grids allow each chunk along a dimension to have a different +size. This is useful when the natural partitioning of the data is not uniform — +for example, satellite swaths of varying width, time series with irregular +intervals, or spatial tiles of different extents. + +### Creating arrays with rectilinear chunks + +To create an array with rectilinear chunks, pass a nested list to the `chunks` +parameter where each inner list gives the chunk sizes along one dimension: + +```python exec="true" session="arrays" source="above" result="ansi" +zarr.config.set({"array.rectilinear_chunks": True}) +z = zarr.create_array( + store=zarr.storage.MemoryStore(), + shape=(60, 100), + chunks=[[10, 20, 30], [50, 50]], + dtype='int32', +) +print(z.info) +``` + +In this example the first dimension is split into three chunks of sizes 10, 20, +and 30, while the second dimension is split into two equal chunks of size 50. + +### Reading and writing data + +Rectilinear arrays support the same indexing interface as regular arrays. 
+Reads and writes that cross chunk boundaries of different sizes are handled
+automatically:
+
+```python exec="true" session="arrays" source="above" result="ansi"
+import numpy as np
+data = np.arange(60 * 100, dtype='int32').reshape(60, 100)
+z[:] = data
+# Read a slice that spans the first two chunks (sizes 10 and 20) along axis 0
+print(z[5:25, 0:5])
+```
+
+### Inspecting chunk sizes
+
+The `.write_chunk_sizes` property returns the actual data size of each storage
+chunk along every dimension. It works for both regular and rectilinear arrays
+and returns a tuple of tuples (matching the dask `Array.chunks` convention):
+
+```python exec="true" session="arrays" source="above" result="ansi"
+print(z.write_chunk_sizes)
+```
+
+When sharding is used, `.read_chunk_sizes` returns the inner chunk sizes
+instead.
+
+For regular arrays, the trailing boundary chunk reflects the actual remainder
+(here 10 along the first axis, since 100 = 3 × 30 + 10):
+
+```python exec="true" session="arrays" source="above" result="ansi"
+z_regular = zarr.create_array(
+    store=zarr.storage.MemoryStore(),
+    shape=(100, 80),
+    chunks=(30, 40),
+    dtype='int32',
+)
+print(z_regular.write_chunk_sizes)
+```
+
+Note that the `.chunks` property is only available for regular chunk grids. For
+rectilinear arrays, use `.write_chunk_sizes` (or `.read_chunk_sizes`) instead.
+
+### Resizing and appending
+
+Rectilinear arrays can be resized. When growing past the current edge sum, a
+new chunk is appended covering the additional extent.
When shrinking, the chunk +edges are preserved and the extent is re-bound (chunks beyond the new extent +simply become inactive): + +```python exec="true" session="arrays" source="above" result="ansi" +z = zarr.create_array( + store=zarr.storage.MemoryStore(), + shape=(30,), + chunks=[[10, 20]], + dtype='float64', +) +z[:] = np.arange(30, dtype='float64') +print(f"Before resize: chunk_sizes={z.write_chunk_sizes}") +z.resize((50,)) +print(f"After resize: chunk_sizes={z.write_chunk_sizes}") +``` + +The `append` method also works with rectilinear arrays: + +```python exec="true" session="arrays" source="above" result="ansi" +z.append(np.arange(10, dtype='float64')) +print(f"After append: shape={z.shape}, chunk_sizes={z.write_chunk_sizes}") +``` + +### Compressors and filters + +Rectilinear arrays work with all codecs — compressors, filters, and checksums. +Since each chunk may have a different size, the codec pipeline processes each +chunk independently: + +```python exec="true" session="arrays" source="above" result="ansi" +z = zarr.create_array( + store=zarr.storage.MemoryStore(), + shape=(60, 100), + chunks=[[10, 20, 30], [50, 50]], + dtype='float64', + filters=[zarr.codecs.TransposeCodec(order=(1, 0))], + compressors=[zarr.codecs.BloscCodec(cname='zstd', clevel=3)], +) +z[:] = np.arange(60 * 100, dtype='float64').reshape(60, 100) +np.testing.assert_array_equal(z[:], np.arange(60 * 100, dtype='float64').reshape(60, 100)) +print("Roundtrip OK") +``` + +### Rectilinear shard boundaries + +Rectilinear chunk grids can also be used for shard boundaries when combined +with sharding. In this case, the outer grid (shards) is rectilinear while the +inner chunks remain regular. 
Each shard dimension must be divisible by the +corresponding inner chunk size: + +```python exec="true" session="arrays" source="above" result="ansi" +z = zarr.create_array( + store=zarr.storage.MemoryStore(), + shape=(120, 100), + chunks=(10, 10), + shards=[[60, 40, 20], [50, 50]], + dtype='int32', +) +z[:] = np.arange(120 * 100, dtype='int32').reshape(120, 100) +print(z[50:70, 40:60]) +``` + +Note that rectilinear inner chunks with sharding are not supported — only the +shard boundaries can be rectilinear. + +### Metadata format + +Rectilinear chunk grid metadata uses run-length encoding (RLE) for compact +serialization. When reading metadata, both bare integers and `[value, count]` +pairs are accepted: + +- `[10, 20, 30]` — three chunks with explicit sizes +- `[[10, 3]]` — three chunks of size 10 (RLE shorthand) +- `[[10, 3], 5]` — three chunks of size 10, then one chunk of size 5 + +When writing, Zarr automatically compresses repeated values into RLE format. + ## Missing features in 3.0 The following features have not been ported to 3.0 yet. diff --git a/docs/user-guide/config.md b/docs/user-guide/config.md index 21fe9b5def..113217e097 100644 --- a/docs/user-guide/config.md +++ b/docs/user-guide/config.md @@ -30,6 +30,7 @@ Configuration options include the following: - Default Zarr format `default_zarr_version` - Default array order in memory `array.order` - Whether empty chunks are written to storage `array.write_empty_chunks` +- Enable experimental rectilinear chunks `array.rectilinear_chunks` - Async and threading options, e.g. `async.concurrency` and `threading.max_workers` - Selections of implementations of codecs, codec pipelines and buffers - Enabling GPU support with `zarr.config.enable_gpu()`. See GPU support for more. 
diff --git a/docs/user-guide/examples/rectilinear_chunks.ipynb b/docs/user-guide/examples/rectilinear_chunks.ipynb new file mode 100644 index 0000000000..376cd9ad88 --- /dev/null +++ b/docs/user-guide/examples/rectilinear_chunks.ipynb @@ -0,0 +1,426 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "da9139cc", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-30T13:18:20.792275Z", + "iopub.status.busy": "2026-03-30T13:18:20.792050Z", + "iopub.status.idle": "2026-03-30T13:18:20.801655Z", + "shell.execute_reply": "2026-03-30T13:18:20.797952Z", + "shell.execute_reply.started": "2026-03-30T13:18:20.792253Z" + } + }, + "outputs": [], + "source": [ + "# /// script\n", + "# requires-python = \">=3.12\"\n", + "# dependencies = [\n", + "# \"dask\",\n", + "# \"healpix-geo\",\n", + "# \"matplotlib\",\n", + "# \"numpy\",\n", + "# \"obstore\",\n", + "# \"xarray\",\n", + "# \"zarr\",\n", + "# ]\n", + "#\n", + "# [tool.uv.sources]\n", + "# zarr = { git = \"https://github.com/maxrjones/zarr-python\", branch = \"poc/unified-chunk-grid\" }\n", + "# xarray = { git = \"https://github.com/maxrjones/xarray\", branch = \"poc/unified-zarr-chunk-grid\" }\n", + "# ///" + ] + }, + { + "cell_type": "markdown", + "id": "71gnhfq4pfe", + "metadata": {}, + "source": [ + "# Rectilinear Chunk Grids\n", + "\n", + "This notebook demonstrates the unified chunk grid implementation from [#3802](https://github.com/zarr-developers/zarr-python/pull/3802), which adds support for rectilinear (variable) chunk grids.\n", + "\n", + "Rectilinear grids allow different chunk sizes along each dimension, which is useful for data that doesn't partition evenly. For example, sparse HEALPix cells grouped by parent tile, boundary-padded HPC arrays, or ingesting existing variable-chunked datasets via VirtualiZarr." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "9e9nyjdx06f", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-30T13:18:20.802629Z", + "iopub.status.busy": "2026-03-30T13:18:20.802471Z", + "iopub.status.idle": "2026-03-30T13:18:21.183147Z", + "shell.execute_reply": "2026-03-30T13:18:21.182751Z", + "shell.execute_reply.started": "2026-03-30T13:18:20.802615Z" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import tempfile\n", + "from pathlib import Path\n", + "import json\n", + "\n", + "import numpy as np\n", + "import xarray as xr\n", + "from healpix_geo import nested\n", + "from obstore.store import HTTPStore\n", + "\n", + "import zarr\n", + "from zarr.storage import ObjectStore\n", + "\n", + "zarr.config.set({'async.concurrency': 128}) # Increase concurrency for better performance with obstore\n", + "zarr.config.set({\"array.rectilinear_chunks\": True}) # Opt-in to rectilinear chunks\n" + ] + }, + { + "cell_type": "markdown", + "id": "kj1o9xik9l", + "metadata": {}, + "source": [ + "## 1. Inspect HEALPix dataset\n", + "\n", + "Load the remote Zarr store to understand the data structure before chunking it." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "v6cot74r1gq", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-30T13:18:21.183653Z", + "iopub.status.busy": "2026-03-30T13:18:21.183505Z", + "iopub.status.idle": "2026-03-30T13:18:22.028419Z", + "shell.execute_reply": "2026-03-30T13:18:22.027356Z", + "shell.execute_reply.started": "2026-03-30T13:18:21.183644Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Members: [('cell_ids', ), ('da', )]\n", + "Attrs: {}\n", + "Write chunk sizes: ((55611, 55611, 55611, 55609),)\n" + ] + } + ], + "source": [ + "ob_store = HTTPStore.from_url(\"https://data-taos.ifremer.fr/GRID4EARTH/no_chunk_healpix.zarr\")\n", + "store = ObjectStore(ob_store)\n", + "g = zarr.open_group(store, mode=\"r\", zarr_format=2, use_consolidated=True)\n", + "arr = g['da']\n", + "\n", + "print(\"Members:\", list(g.members()))\n", + "print(\"Attrs:\", dict(g.attrs))\n", + "print(\"Write chunk sizes:\", arr.write_chunk_sizes)" + ] + }, + { + "cell_type": "markdown", + "id": "wmuqi66d46", + "metadata": {}, + "source": [ + "## 2. HEALPix-style variable chunking\n", + "\n", + "Inspired by [this use case](https://github.com/zarr-developers/zarr-python/pull/3534#issuecomment-3848669859): HEALPix grids where cells are grouped by parent tile at a coarser resolution level, producing variable-sized chunks along the cell dimension when accounting for sparsity." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "90bc91b9", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-30T13:18:22.029842Z", + "iopub.status.busy": "2026-03-30T13:18:22.029258Z", + "iopub.status.idle": "2026-03-30T13:18:23.629597Z", + "shell.execute_reply": "2026-03-30T13:18:23.628896Z", + "shell.execute_reply.started": "2026-03-30T13:18:22.029824Z" + } + }, + "outputs": [], + "source": [ + "da = xr.open_zarr(\n", + " store,\n", + " zarr_format=2,\n", + " consolidated=True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "0d7785b0-d72f-4ef8-8a57-91d61f07be96", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-30T13:18:23.630244Z", + "iopub.status.busy": "2026-03-30T13:18:23.629978Z", + "iopub.status.idle": "2026-03-30T13:18:23.633850Z", + "shell.execute_reply": "2026-03-30T13:18:23.632930Z", + "shell.execute_reply.started": "2026-03-30T13:18:23.630232Z" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "10" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "depth = da.cell_ids.attrs['level']\n", + "depth" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "72c80224-dcac-4724-8caf-5717b29a25d5", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-30T13:18:23.634211Z", + "iopub.status.busy": "2026-03-30T13:18:23.634119Z", + "iopub.status.idle": "2026-03-30T13:18:23.642291Z", + "shell.execute_reply": "2026-03-30T13:18:23.641668Z", + "shell.execute_reply.started": "2026-03-30T13:18:23.634203Z" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 25, 645, 1510, 2363, 3203, 74, 769, 3963, 4096, 233, 1603,\n", + " 2450, 4096, 4096, 3327, 4047, 4096, 4096, 1278, 2113, 4096, 3879,\n", + " 4096, 3842, 2173, 983, 4046, 2187, 4095, 1369, 4096, 4096, 4096,\n", + " 4096, 3515, 1395, 4096, 3622, 4096, 4096, 3875, 4096, 4096, 4096,\n", + " 4096, 4096, 2034, 4096, 358, 3991, 
4096, 4096, 4096, 4096, 2714,\n", + " 1210, 4096, 4096, 4096, 4096, 92, 3826, 4096, 2629, 4096, 1438,\n", + " 4096, 353, 4078, 3410, 2407, 226, 132, 2738, 1223, 23])" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "new_depth = depth-6\n", + "parents = nested.zoom_to(da.cell_ids, depth=depth, new_depth=new_depth)\n", + "_, chunk_sizes =np.unique(parents, return_counts=True)\n", + "chunk_sizes" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "a79a281b-ca74-49c3-a467-60490a4ad63e", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-30T13:18:23.642721Z", + "iopub.status.busy": "2026-03-30T13:18:23.642622Z", + "iopub.status.idle": "2026-03-30T13:18:23.649165Z", + "shell.execute_reply": "2026-03-30T13:18:23.648723Z", + "shell.execute_reply.started": "2026-03-30T13:18:23.642712Z" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Frozen({'cell_ids': (25, 645, 1510, 2363, 3203, 74, 769, 3963, 4096, 233, 1603, 2450, 4096, 4096, 3327, 4047, 4096, 4096, 1278, 2113, 4096, 3879, 4096, 3842, 2173, 983, 4046, 2187, 4095, 1369, 4096, 4096, 4096, 4096, 3515, 1395, 4096, 3622, 4096, 4096, 3875, 4096, 4096, 4096, 4096, 4096, 2034, 4096, 358, 3991, 4096, 4096, 4096, 4096, 2714, 1210, 4096, 4096, 4096, 4096, 92, 3826, 4096, 2629, 4096, 1438, 4096, 353, 4078, 3410, 2407, 226, 132, 2738, 1223, 23)})" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "da = da.chunk({\"cell_ids\": tuple(chunk_sizes.tolist())})\n", + "da.chunks" + ] + }, + { + "cell_type": "markdown", + "id": "bsp6y7otkzb", + "metadata": {}, + "source": [ + "## 3. Write as rectilinear Zarr V3\n", + "\n", + "Write the variable-chunked dataset to a local Zarr V3 store with rectilinear chunk grids enabled." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "ribguojdr0s", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-30T13:18:23.649823Z", + "iopub.status.busy": "2026-03-30T13:18:23.649737Z", + "iopub.status.idle": "2026-03-30T13:18:24.089390Z", + "shell.execute_reply": "2026-03-30T13:18:24.088640Z", + "shell.execute_reply.started": "2026-03-30T13:18:23.649815Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Written to: /var/folders/70/hc_nynms54d8lp67z4rsfctc0000gp/T/tmp6dibcrho/healpix_rectilinear.zarr\n" + ] + } + ], + "source": [ + "output_path = Path(tempfile.mkdtemp()) / \"healpix_rectilinear.zarr\"\n", + "\n", + "encoding = {\n", + " \"da\": {\"chunks\": [chunk_sizes.tolist()]},\n", + " \"cell_ids\": {\"chunks\": [chunk_sizes.tolist()]},\n", + "}\n", + "\n", + "da.to_zarr(output_path, zarr_format=3, mode=\"w\", encoding=encoding, consolidated=False)\n", + "\n", + "print(f\"Written to: {output_path}\")" + ] + }, + { + "cell_type": "markdown", + "id": "rbfm1hn63g9", + "metadata": {}, + "source": [ + "## 4. Verify rectilinear metadata\n", + "\n", + "Inspect the output store to confirm the chunk grid is serialized as `\"rectilinear\"` in `zarr.json`,\n", + "following the [rectilinear chunk grid extension spec](https://github.com/zarr-developers/zarr-extensions/tree/main/chunk-grids/rectilinear).\n", + "\n", + "Key things to look for in `chunk_grid`:\n", + "- **`name`**: `\"rectilinear\"` (the extension identifier)\n", + "- **`configuration.kind`**: `\"inline\"` (edge lengths stored directly in metadata)\n", + "- **`configuration.chunk_shapes`**: one entry per dimension — here a single list for the 1D `cell_ids` axis. 
Each element is either:\n", + " - a **bare integer** for a unique edge length (e.g., `25`, `645`)\n", + " - a **`[value, count]` array** using [run-length encoding](https://github.com/zarr-developers/zarr-extensions/tree/main/chunk-grids/rectilinear#run-length-encoding) for consecutive repeated sizes (e.g., `[4096, 4]` means four consecutive chunks of size 4096)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "mpdn5hxp7lp", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-30T13:18:24.090312Z", + "iopub.status.busy": "2026-03-30T13:18:24.090192Z", + "iopub.status.idle": "2026-03-30T13:18:24.093595Z", + "shell.execute_reply": "2026-03-30T13:18:24.092908Z", + "shell.execute_reply.started": "2026-03-30T13:18:24.090303Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'name': 'rectilinear', 'configuration': {'kind': 'inline', 'chunk_shapes': [[25, 645, 1510, 2363, 3203, 74, 769, 3963, 4096, 233, 1603, 2450, [4096, 2], 3327, 4047, [4096, 2], 1278, 2113, 4096, 3879, 4096, 3842, 2173, 983, 4046, 2187, 4095, 1369, [4096, 4], 3515, 1395, 4096, 3622, [4096, 2], 3875, [4096, 5], 2034, 4096, 358, 3991, [4096, 4], 2714, 1210, [4096, 4], 92, 3826, 4096, 2629, 4096, 1438, 4096, 353, 4078, 3410, 2407, 226, 132, 2738, 1223, 23]]}}\n" + ] + } + ], + "source": [ + "\n", + "# Read the zarr.json for the 'da' array\n", + "da_meta_path = output_path / \"da\" / \"zarr.json\"\n", + "meta = json.loads(da_meta_path.read_text())\n", + "print(meta['chunk_grid'])" + ] + }, + { + "cell_type": "markdown", + "id": "inz7s8ugu2c", + "metadata": {}, + "source": [ + "## 5. Round-trip verification\n", + "\n", + "Read the rectilinear store back and confirm the chunk sizes are preserved." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "308gxly6r3j", + "metadata": { + "execution": { + "iopub.execute_input": "2026-03-30T13:18:24.094252Z", + "iopub.status.busy": "2026-03-30T13:18:24.094013Z", + "iopub.status.idle": "2026-03-30T13:18:24.117313Z", + "shell.execute_reply": "2026-03-30T13:18:24.116670Z", + "shell.execute_reply.started": "2026-03-30T13:18:24.094242Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Round-trip chunk sizes: Frozen({'cell_ids': (25, 645, 1510, 2363, 3203, 74, 769, 3963, 4096, 233, 1603, 2450, 4096, 4096, 3327, 4047, 4096, 4096, 1278, 2113, 4096, 3879, 4096, 3842, 2173, 983, 4046, 2187, 4095, 1369, 4096, 4096, 4096, 4096, 3515, 1395, 4096, 3622, 4096, 4096, 3875, 4096, 4096, 4096, 4096, 4096, 2034, 4096, 358, 3991, 4096, 4096, 4096, 4096, 2714, 1210, 4096, 4096, 4096, 4096, 92, 3826, 4096, 2629, 4096, 1438, 4096, 353, 4078, 3410, 2407, 226, 132, 2738, 1223, 23)})\n" + ] + } + ], + "source": [ + "roundtrip = xr.open_zarr(output_path, zarr_format=3, consolidated=False)\n", + "\n", + "print(\"Round-trip chunk sizes:\", roundtrip.chunks)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e8d42341-c242-44f5-ad6a-491370e3ffab", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/mkdocs.yml b/mkdocs.yml index e2c4148e15..ce39fd0f2e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -30,6 +30,7 @@ nav: - user-guide/glossary.md - Examples: - user-guide/examples/custom_dtype.md + - 
user-guide/examples/rectilinear_chunks.ipynb - API Reference: - api/zarr/index.md - api/zarr/array.md @@ -132,6 +133,11 @@ extra_css: plugins: - autorefs - search + - mkdocs-jupyter: + include: ["docs/user-guide/examples/*.ipynb"] + execute: false + ignore_h1_titles: true + show_input: true - markdown-exec - mkdocstrings: enable_inventory: true diff --git a/pyproject.toml b/pyproject.toml index 8277c3f752..442a70fbce 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -114,6 +114,7 @@ docs = [ "mkdocstrings>=0.29.1", "mkdocstrings-python>=1.16.10", "mike>=2.1.3", + "mkdocs-jupyter>=0.25.1", "mkdocs-redirects>=1.2.0", "markdown-exec[ansi]", "griffe-inherited-docstrings", diff --git a/src/zarr/abc/codec.py b/src/zarr/abc/codec.py index 50472e807a..0408e4769e 100644 --- a/src/zarr/abc/codec.py +++ b/src/zarr/abc/codec.py @@ -17,10 +17,10 @@ from zarr.abc.store import ByteGetter, ByteSetter, Store from zarr.core.array_spec import ArraySpec - from zarr.core.chunk_grids import ChunkGrid from zarr.core.dtype.wrapper import TBaseDType, TBaseScalar, ZDType from zarr.core.indexing import SelectorTuple from zarr.core.metadata import ArrayMetadata + from zarr.core.metadata.v3 import ChunkGridMetadata __all__ = [ "ArrayArrayCodec", @@ -146,7 +146,7 @@ def validate( *, shape: tuple[int, ...], dtype: ZDType[TBaseDType, TBaseScalar], - chunk_grid: ChunkGrid, + chunk_grid: ChunkGridMetadata, ) -> None: """Validates that the codec configuration is compatible with the array metadata. Raises errors when the codec configuration is not compatible. 
@@ -157,8 +157,8 @@ def validate( The array shape dtype : np.dtype[Any] The array data type - chunk_grid : ChunkGrid - The array chunk grid + chunk_grid : ChunkGridMetadata + The array chunk grid metadata """ async def _decode_single(self, chunk_data: CO, chunk_spec: ArraySpec) -> CI: @@ -361,7 +361,7 @@ def validate( *, shape: tuple[int, ...], dtype: ZDType[TBaseDType, TBaseScalar], - chunk_grid: ChunkGrid, + chunk_grid: ChunkGridMetadata, ) -> None: """Validates that all codec configurations are compatible with the array metadata. Raises errors when a codec configuration is not compatible. @@ -372,8 +372,8 @@ def validate( The array shape dtype : np.dtype[Any] The array data type - chunk_grid : ChunkGrid - The array chunk grid + chunk_grid : ChunkGridMetadata + The array chunk grid metadata """ ... diff --git a/src/zarr/api/synchronous.py b/src/zarr/api/synchronous.py index 4e718a234e..a865f97646 100644 --- a/src/zarr/api/synchronous.py +++ b/src/zarr/api/synchronous.py @@ -33,6 +33,7 @@ from zarr.core.common import ( JSON, AccessModeLiteral, + ChunksLike, DimensionNamesLike, MemoryOrder, ShapeLike, @@ -822,7 +823,7 @@ def create_array( shape: ShapeLike | None = None, dtype: ZDTypeLike | None = None, data: np.ndarray[Any, np.dtype[Any]] | None = None, - chunks: tuple[int, ...] | Literal["auto"] = "auto", + chunks: ChunksLike | Literal["auto"] = "auto", shards: ShardsLike | None = None, filters: FiltersLike = "auto", compressors: CompressorsLike = "auto", @@ -858,9 +859,13 @@ def create_array( data : np.ndarray, optional Array-like data to use for initializing the array. If this parameter is provided, the ``shape`` and ``dtype`` parameters must be ``None``. - chunks : tuple[int, ...] | Literal["auto"], default="auto" + chunks : tuple[int, ...] | Sequence[Sequence[int]] | Literal["auto"], default="auto" Chunk shape of the array. If chunks is "auto", a chunk shape is guessed based on the shape of the array and the dtype. 
+ A nested list of per-dimension edge sizes creates a rectilinear grid. + Rectilinear chunk grids are experimental and must be explicitly enabled + with ``zarr.config.set({'array.rectilinear_chunks': True})`` while the + feature is stabilizing. shards : tuple[int, ...], optional Shard shape of the array. The default value of ``None`` results in no sharding at all. filters : Iterable[Codec] | Literal["auto"], optional @@ -993,7 +998,7 @@ def from_array( data: AnyArray | npt.ArrayLike, write_data: bool = True, name: str | None = None, - chunks: Literal["auto", "keep"] | tuple[int, ...] = "keep", + chunks: ChunksLike | Literal["auto", "keep"] = "keep", shards: ShardsLike | None | Literal["keep"] = "keep", filters: FiltersLike | Literal["keep"] = "keep", compressors: CompressorsLike | Literal["keep"] = "keep", @@ -1025,13 +1030,17 @@ def from_array( name : str or None, optional The name of the array within the store. If ``name`` is ``None``, the array will be located at the root of the store. - chunks : tuple[int, ...] or "auto" or "keep", optional + chunks : tuple[int, ...] or Sequence[Sequence[int]] or "auto" or "keep", optional Chunk shape of the array. Following values are supported: - "auto": Automatically determine the chunk shape based on the array's shape and dtype. - - "keep": Retain the chunk shape of the data array if it is a zarr Array. - - tuple[int, ...]: A tuple of integers representing the chunk shape. + - "keep": Retain the chunk grid of the data array if it is a zarr Array. + - tuple[int, ...]: A tuple of integers representing the chunk shape (regular grid). + - Sequence[Sequence[int]]: Per-dimension chunk edge lists (rectilinear grid). + Rectilinear chunk grids are experimental and must be explicitly enabled + with ``zarr.config.set({'array.rectilinear_chunks': True})`` while the + feature is stabilizing. If not specified, defaults to "keep" if data is a zarr Array, otherwise "auto". 
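The nested-sequence form of ``chunks`` documented above follows the dask/ZEP0003 convention: one inner sequence per dimension, listing the edge size of each chunk along that axis, with the edges summing to the array extent in that dimension. The sketch below illustrates that invariant in plain Python; it is not part of this diff, and the helper name ``check_rectilinear_chunks`` is hypothetical.

```python
# Illustrative sketch (plain Python, not the zarr-python API): what a nested
# ``chunks`` value like [[10, 20, 30], [50, 50]] means for a (60, 100) array.
# Each inner sequence gives the chunk edge sizes along one dimension; under
# the ZEP0003-style invariant assumed here, the edges along a dimension must
# sum to the array extent in that dimension.

def check_rectilinear_chunks(shape, chunks):
    """Validate dask-style nested chunks against an array shape."""
    if len(chunks) != len(shape):
        raise ValueError("chunks must have one entry per dimension")
    for dim, (extent, edges) in enumerate(zip(shape, chunks)):
        if sum(edges) != extent:
            raise ValueError(
                f"chunk edges {edges} sum to {sum(edges)}, "
                f"expected {extent} in dimension {dim}"
            )
    # Grid shape: number of chunks along each dimension.
    return tuple(len(edges) for edges in chunks)

print(check_rectilinear_chunks((60, 100), [[10, 20, 30], [50, 50]]))  # (3, 2)
```

A uniform grid is the special case where every inner sequence repeats one edge size, which is why the regular grid stays expressible through the same parameter.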
shards : tuple[int, ...], optional diff --git a/src/zarr/codecs/sharding.py b/src/zarr/codecs/sharding.py index 9f26bc57b1..da5cfce3bd 100644 --- a/src/zarr/codecs/sharding.py +++ b/src/zarr/codecs/sharding.py @@ -34,7 +34,7 @@ default_buffer_prototype, numpy_buffer_prototype, ) -from zarr.core.chunk_grids import ChunkGrid, RegularChunkGrid +from zarr.core.chunk_grids import ChunkGrid from zarr.core.common import ( ShapeLike, parse_enum, @@ -53,7 +53,12 @@ get_indexer, morton_order_iter, ) -from zarr.core.metadata.v3 import parse_codecs +from zarr.core.metadata.v3 import ( + ChunkGridMetadata, + RectilinearChunkGrid, + RegularChunkGrid, + parse_codecs, +) from zarr.registry import get_ndbuffer_class, get_pipeline_class from zarr.storage._utils import _normalize_byte_range_index @@ -382,26 +387,30 @@ def validate( *, shape: tuple[int, ...], dtype: ZDType[TBaseDType, TBaseScalar], - chunk_grid: ChunkGrid, + chunk_grid: ChunkGridMetadata, ) -> None: if len(self.chunk_shape) != len(shape): raise ValueError( "The shard's `chunk_shape` and array's `shape` need to have the same number of dimensions." ) - if not isinstance(chunk_grid, RegularChunkGrid): - raise TypeError("Sharding is only compatible with regular chunk grids.") - if not all( - s % c == 0 - for s, c in zip( - chunk_grid.chunk_shape, - self.chunk_shape, - strict=False, + if isinstance(chunk_grid, RegularChunkGrid): + edges_per_dim: tuple[tuple[int, ...], ...] = tuple((s,) for s in chunk_grid.chunk_shape) + elif isinstance(chunk_grid, RectilinearChunkGrid): + edges_per_dim = tuple( + (s,) if isinstance(s, int) else s for s in chunk_grid.chunk_shapes ) - ): - raise ValueError( - f"The array's `chunk_shape` (got {chunk_grid.chunk_shape}) " - f"needs to be divisible by the shard's inner `chunk_shape` (got {self.chunk_shape})." 
+ else: + raise TypeError( + f"Sharding is only compatible with regular and rectilinear chunk grids, " + f"got {type(chunk_grid)}" ) + for i, (edges, inner) in enumerate(zip(edges_per_dim, self.chunk_shape, strict=False)): + for edge in set(edges): + if edge % inner != 0: + raise ValueError( + f"Chunk edge length {edge} in dimension {i} is not " + f"divisible by the shard's inner chunk size {inner}." + ) async def _decode_single( self, @@ -416,7 +425,7 @@ async def _decode_single( indexer = BasicIndexer( tuple(slice(0, s) for s in shard_shape), shape=shard_shape, - chunk_grid=RegularChunkGrid(chunk_shape=chunk_shape), + chunk_grid=ChunkGrid.from_sizes(shard_shape, chunk_shape), ) # setup output array @@ -462,7 +471,7 @@ async def _decode_partial_single( indexer = get_indexer( selection, shape=shard_shape, - chunk_grid=RegularChunkGrid(chunk_shape=chunk_shape), + chunk_grid=ChunkGrid.from_sizes(shard_shape, chunk_shape), ) # setup output array @@ -537,7 +546,7 @@ async def _encode_single( BasicIndexer( tuple(slice(0, s) for s in shard_shape), shape=shard_shape, - chunk_grid=RegularChunkGrid(chunk_shape=chunk_shape), + chunk_grid=ChunkGrid.from_sizes(shard_shape, chunk_shape), ) ) @@ -577,7 +586,9 @@ async def _encode_partial_single( indexer = list( get_indexer( - selection, shape=shard_shape, chunk_grid=RegularChunkGrid(chunk_shape=chunk_shape) + selection, + shape=shard_shape, + chunk_grid=ChunkGrid.from_sizes(shard_shape, chunk_shape), ) ) diff --git a/src/zarr/codecs/transpose.py b/src/zarr/codecs/transpose.py index 609448a59c..5756fba2b4 100644 --- a/src/zarr/codecs/transpose.py +++ b/src/zarr/codecs/transpose.py @@ -14,8 +14,8 @@ from typing import Self from zarr.core.buffer import NDBuffer - from zarr.core.chunk_grids import ChunkGrid from zarr.core.dtype.wrapper import TBaseDType, TBaseScalar, ZDType + from zarr.core.metadata.v3 import ChunkGridMetadata def parse_transpose_order(data: JSON | Iterable[int]) -> tuple[int, ...]: @@ -51,7 +51,7 @@ def validate( 
self, shape: tuple[int, ...], dtype: ZDType[TBaseDType, TBaseScalar], - chunk_grid: ChunkGrid, + chunk_grid: ChunkGridMetadata, ) -> None: if len(self.order) != len(shape): raise ValueError( diff --git a/src/zarr/core/_info.py b/src/zarr/core/_info.py index fef424346a..1503f05b26 100644 --- a/src/zarr/core/_info.py +++ b/src/zarr/core/_info.py @@ -117,7 +117,7 @@ def __repr__(self) -> str: if self._chunk_shape is None: # for non-regular chunk grids - kwargs["chunk_shape"] = "" + kwargs["_chunk_shape"] = "" template += "\nFilters : {_filters}" diff --git a/src/zarr/core/array.py b/src/zarr/core/array.py index b5212656f4..be838f285c 100644 --- a/src/zarr/core/array.py +++ b/src/zarr/core/array.py @@ -3,7 +3,7 @@ import json import warnings from asyncio import gather -from collections.abc import Iterable, Mapping +from collections.abc import Iterable, Mapping, Sequence from dataclasses import dataclass, field, replace from itertools import starmap from logging import getLogger @@ -28,7 +28,7 @@ from zarr.codecs.vlen_utf8 import VLenBytesCodec, VLenUTF8Codec from zarr.codecs.zstd import ZstdCodec from zarr.core._info import ArrayInfo -from zarr.core.array_spec import ArrayConfig, ArrayConfigLike, parse_array_config +from zarr.core.array_spec import ArrayConfig, ArrayConfigLike, ArraySpec, parse_array_config from zarr.core.attributes import Attributes from zarr.core.buffer import ( BufferPrototype, @@ -38,7 +38,11 @@ default_buffer_prototype, ) from zarr.core.buffer.cpu import buffer_prototype as cpu_buffer_prototype -from zarr.core.chunk_grids import RegularChunkGrid, _auto_partition, normalize_chunks +from zarr.core.chunk_grids import ( + ChunkGrid, + _auto_partition, + normalize_chunks, +) from zarr.core.chunk_key_encodings import ( ChunkKeyEncoding, ChunkKeyEncodingLike, @@ -51,6 +55,7 @@ ZARR_JSON, ZARRAY_JSON, ZATTRS_JSON, + ChunksLike, DimensionNamesLike, MemoryOrder, ShapeLike, @@ -113,7 +118,13 @@ parse_compressor, parse_filters, ) -from zarr.core.metadata.v3 
import parse_node_type_array +from zarr.core.metadata.v3 import ( + ChunkGridMetadata, + RectilinearChunkGrid, + RegularChunkGrid, + parse_node_type_array, + resolve_chunks, +) from zarr.core.sync import sync from zarr.errors import ( ArrayNotFoundError, @@ -131,7 +142,7 @@ from zarr.storage._utils import _relativize_path if TYPE_CHECKING: - from collections.abc import Iterator, Sequence + from collections.abc import Iterator from typing import Self import numpy.typing as npt @@ -173,6 +184,18 @@ class DefaultFillValue: DEFAULT_FILL_VALUE = DefaultFillValue() +def _chunk_sizes_from_shape( + array_shape: tuple[int, ...], chunk_shape: tuple[int, ...] +) -> tuple[tuple[int, ...], ...]: + """Compute dask-style chunk sizes from an array shape and uniform chunk shape.""" + result: list[tuple[int, ...]] = [] + for s, c in zip(array_shape, chunk_shape, strict=True): + nchunks = ceildiv(s, c) + sizes = tuple(min(c, s - i * c) for i in range(nchunks)) + result.append(sizes) + return tuple(result) + + def parse_array_metadata(data: Any) -> ArrayMetadata: if isinstance(data, ArrayMetadata): return data @@ -304,6 +327,7 @@ class AsyncArray[T_ArrayMetadata: (ArrayV2Metadata, ArrayV3Metadata)]: metadata: T_ArrayMetadata store_path: StorePath codec_pipeline: CodecPipeline = field(init=False) + _chunk_grid: ChunkGrid = field(init=False) config: ArrayConfig @overload @@ -334,6 +358,7 @@ def __init__( object.__setattr__(self, "metadata", metadata_parsed) object.__setattr__(self, "store_path", store_path) object.__setattr__(self, "config", config_parsed) + object.__setattr__(self, "_chunk_grid", ChunkGrid.from_metadata(metadata_parsed)) object.__setattr__( self, "codec_pipeline", @@ -650,13 +675,11 @@ async def _create( if chunks is not None and chunk_shape is not None: raise ValueError("Only one of chunk_shape or chunks can be provided.") - item_size = 1 - if isinstance(dtype_parsed, HasItemSize): - item_size = dtype_parsed.item_size - if chunks: - _chunks = normalize_chunks(chunks, 
shape, item_size) - else: - _chunks = normalize_chunks(chunk_shape, shape, item_size) + + from zarr.core.chunk_grids import _is_rectilinear_chunks + + _raw_chunks = chunks if chunks is not None else chunk_shape + config_parsed = parse_array_config(config) result: AnyAsyncArray @@ -677,11 +700,14 @@ async def _create( if order is not None: _warn_order_kwarg() + item_size = 1 + if isinstance(dtype_parsed, HasItemSize): + item_size = dtype_parsed.item_size + chunk_grid = resolve_chunks(_raw_chunks, shape, item_size) result = await cls._create_v3( store_path, shape=shape, dtype=dtype_parsed, - chunk_shape=_chunks, fill_value=fill_value, chunk_key_encoding=chunk_key_encoding, codecs=codecs, @@ -689,6 +715,7 @@ async def _create( attributes=attributes, overwrite=overwrite, config=config_parsed, + chunk_grid=chunk_grid, ) elif zarr_format == 2: if codecs is not None: @@ -701,6 +728,16 @@ async def _create( ) if dimension_names is not None: raise ValueError("dimension_names cannot be used for arrays with zarr_format 2.") + if _is_rectilinear_chunks(_raw_chunks): + raise ValueError("Zarr format 2 does not support rectilinear chunk grids.") + + item_size = 1 + if isinstance(dtype_parsed, HasItemSize): + item_size = dtype_parsed.item_size + if chunks: + _chunks = normalize_chunks(chunks, shape, item_size) + else: + _chunks = normalize_chunks(chunk_shape, shape, item_size) if order is None: order_parsed = config_parsed.order @@ -735,16 +772,14 @@ async def _create( def _create_metadata_v3( shape: ShapeLike, dtype: ZDType[TBaseDType, TBaseScalar], - chunk_shape: tuple[int, ...], + chunk_grid: ChunkGridMetadata, fill_value: Any | None = DEFAULT_FILL_VALUE, chunk_key_encoding: ChunkKeyEncodingLike | None = None, codecs: Iterable[Codec | dict[str, JSON]] | None = None, dimension_names: DimensionNamesLike = None, attributes: dict[str, JSON] | None = None, ) -> ArrayV3Metadata: - """ - Create an instance of ArrayV3Metadata. 
- """ + """Create an instance of ArrayV3Metadata.""" filters: tuple[ArrayArrayCodec, ...] compressors: tuple[BytesBytesCodec, ...] @@ -771,11 +806,10 @@ def _create_metadata_v3( else: fill_value_parsed = fill_value - chunk_grid_parsed = RegularChunkGrid(chunk_shape=chunk_shape) return ArrayV3Metadata( shape=shape, data_type=dtype, - chunk_grid=chunk_grid_parsed, + chunk_grid=chunk_grid, chunk_key_encoding=chunk_key_encoding_parsed, fill_value=fill_value_parsed, codecs=codecs_parsed, # type: ignore[arg-type] @@ -790,7 +824,7 @@ async def _create_v3( *, shape: ShapeLike, dtype: ZDType[TBaseDType, TBaseScalar], - chunk_shape: tuple[int, ...], + chunk_grid: ChunkGridMetadata, config: ArrayConfig, fill_value: Any | None = DEFAULT_FILL_VALUE, chunk_key_encoding: ( @@ -822,7 +856,7 @@ async def _create_v3( metadata = cls._create_metadata_v3( shape=shape, dtype=dtype, - chunk_shape=chunk_shape, + chunk_grid=chunk_grid, fill_value=fill_value, chunk_key_encoding=chunk_key_encoding, codecs=codecs, @@ -1041,23 +1075,78 @@ def chunks(self) -> tuple[int, ...]: """Returns the chunk shape of the Array. If sharding is used the inner chunk shape is returned. - Only defined for arrays using using `RegularChunkGrid`. - If array doesn't use `RegularChunkGrid`, `NotImplementedError` is raised. + Only defined for arrays using a regular chunk grid. + If array uses a rectilinear chunk grid, `NotImplementedError` is raised. Returns ------- tuple[int, ...]: The chunk shape of the Array. """ + # TODO: move sharding awareness out of metadata return self.metadata.chunks + @property + def read_chunk_sizes(self) -> tuple[tuple[int, ...], ...]: + """Per-dimension data sizes of chunks used for reading, clipped to the array extent. + + Boundary chunks that extend past the array shape are clipped, so + the last size along a dimension may be smaller than the declared + chunk size. This matches the dask ``Array.chunks`` convention. + + When sharding is used, returns the inner chunk sizes. 
+ Otherwise, returns the outer chunk sizes (same as ``write_chunk_sizes``). + + Returns + ------- + tuple[tuple[int, ...], ...] + One inner tuple per dimension containing the data size of each + chunk (not the encoded buffer size). + + Examples + -------- + >>> arr = zarr.create_array(store, shape=(100, 80), chunks=(30, 40)) + >>> arr.read_chunk_sizes + ((30, 30, 30, 10), (40, 40)) + """ + from zarr.codecs.sharding import ShardingCodec + + codecs: tuple[Codec, ...] = getattr(self.metadata, "codecs", ()) + if len(codecs) == 1 and isinstance(codecs[0], ShardingCodec): + inner_chunk_shape = codecs[0].chunk_shape + return _chunk_sizes_from_shape(self.shape, inner_chunk_shape) + return self._chunk_grid.chunk_sizes + + @property + def write_chunk_sizes(self) -> tuple[tuple[int, ...], ...]: + """Per-dimension data sizes of storage chunks, clipped to the array extent. + + Always returns the outer chunk sizes, regardless of sharding. + Boundary chunks that extend past the array shape are clipped, so + the last size along a dimension may be smaller than the declared + chunk size. This matches the dask ``Array.chunks`` convention. + + Returns + ------- + tuple[tuple[int, ...], ...] + One inner tuple per dimension containing the data size of each + chunk (not the encoded buffer size). + + Examples + -------- + >>> arr = zarr.create_array(store, shape=(100, 80), chunks=(30, 40)) + >>> arr.write_chunk_sizes + ((30, 30, 30, 10), (40, 40)) + """ + return self._chunk_grid.chunk_sizes + @property def shards(self) -> tuple[int, ...] | None: """Returns the shard shape of the Array. Returns None if sharding is not used. - Only defined for arrays using using `RegularChunkGrid`. - If array doesn't use `RegularChunkGrid`, `NotImplementedError` is raised. + Only defined for arrays using a regular chunk grid. + If array uses a rectilinear chunk grid, `NotImplementedError` is raised. Returns ------- @@ -1255,7 +1344,16 @@ def _chunk_grid_shape(self) -> tuple[int, ...]: tuple[int, ...] 
The number of chunks along each dimension. """ - return tuple(starmap(ceildiv, zip(self.shape, self.chunks, strict=True))) + # TODO: refactor — extract a sharding_codec property on ArrayV3Metadata + # to replace the repeated `len == 1 and isinstance` pattern. + from zarr.codecs.sharding import ShardingCodec + + codecs: tuple[Codec, ...] = getattr(self.metadata, "codecs", ()) + if len(codecs) == 1 and isinstance(codecs[0], ShardingCodec): + # When sharding, count inner chunks across the whole array + chunk_shape = codecs[0].chunk_shape + return tuple(starmap(ceildiv, zip(self.shape, chunk_shape, strict=True))) + return self._chunk_grid.grid_shape @property def _shard_grid_shape(self) -> tuple[int, ...]: @@ -1583,6 +1681,7 @@ async def _get_selection( self.metadata, self.codec_pipeline, self.config, + self._chunk_grid, indexer, prototype=prototype, out=out, @@ -1637,6 +1736,7 @@ async def example(): self.metadata, self.codec_pipeline, self.config, + self._chunk_grid, selection, prototype=prototype, ) @@ -1654,6 +1754,7 @@ async def get_orthogonal_selection( self.metadata, self.codec_pipeline, self.config, + self._chunk_grid, selection, out=out, fields=fields, @@ -1673,6 +1774,7 @@ async def get_mask_selection( self.metadata, self.codec_pipeline, self.config, + self._chunk_grid, mask, out=out, fields=fields, @@ -1692,6 +1794,7 @@ async def get_coordinate_selection( self.metadata, self.codec_pipeline, self.config, + self._chunk_grid, selection, out=out, fields=fields, @@ -1717,6 +1820,7 @@ async def _set_selection( self.metadata, self.codec_pipeline, self.config, + self._chunk_grid, indexer, value, prototype=prototype, @@ -1767,6 +1871,7 @@ async def setitem( self.metadata, self.codec_pipeline, self.config, + self._chunk_grid, selection, value, prototype=prototype, @@ -1923,6 +2028,7 @@ async def info_complete(self) -> Any: def _info( self, count_chunks_initialized: int | None = None, count_bytes_stored: int | None = None ) -> Any: + chunk_shape = self.chunks if 
self._chunk_grid.is_regular else None return ArrayInfo( _zarr_format=self.metadata.zarr_format, _data_type=self._zdtype, @@ -1930,7 +2036,7 @@ def _info( _shape=self.shape, _order=self.order, _shard_shape=self.shards, - _chunk_shape=self.chunks, + _chunk_shape=chunk_shape, _read_only=self.read_only, _compressors=self.compressors, _filters=self.filters, @@ -1974,6 +2080,11 @@ def config(self) -> ArrayConfig: """ return self.async_array.config + @property + def _chunk_grid(self) -> ChunkGrid: + """The behavioral chunk grid for this array, bound to the array's shape.""" + return self.async_array._chunk_grid + @classmethod @deprecated("Use zarr.create_array instead.", category=ZarrDeprecationWarning) def create( @@ -2265,8 +2376,8 @@ def chunks(self) -> tuple[int, ...]: """Returns a tuple of integers describing the length of each dimension of a chunk of the array. If sharding is used the inner chunk shape is returned. - Only defined for arrays using using `RegularChunkGrid`. - If array doesn't use `RegularChunkGrid`, `NotImplementedError` is raised. + Only defined for arrays using a regular chunk grid. + If array uses a rectilinear chunk grid, `NotImplementedError` is raised. Returns ------- @@ -2275,13 +2386,61 @@ def chunks(self) -> tuple[int, ...]: """ return self.async_array.chunks + @property + def read_chunk_sizes(self) -> tuple[tuple[int, ...], ...]: + """Per-dimension data sizes of chunks used for reading, clipped to the array extent. + + Boundary chunks that extend past the array shape are clipped, so + the last size along a dimension may be smaller than the declared + chunk size. This matches the dask ``Array.chunks`` convention. + + When sharding is used, returns the inner chunk sizes. + Otherwise, returns the outer chunk sizes (same as ``write_chunk_sizes``). + + Returns + ------- + tuple[tuple[int, ...], ...] + One inner tuple per dimension containing the data size of each + chunk (not the encoded buffer size). 
+ + Examples + -------- + >>> arr = zarr.open_array(store) + >>> arr.read_chunk_sizes + ((30, 30, 30, 10), (40, 40)) + """ + return self.async_array.read_chunk_sizes + + @property + def write_chunk_sizes(self) -> tuple[tuple[int, ...], ...]: + """Per-dimension data sizes of storage chunks, clipped to the array extent. + + Always returns the outer chunk sizes, regardless of sharding. + Boundary chunks that extend past the array shape are clipped, so + the last size along a dimension may be smaller than the declared + chunk size. This matches the dask ``Array.chunks`` convention. + + Returns + ------- + tuple[tuple[int, ...], ...] + One inner tuple per dimension containing the data size of each + chunk (not the encoded buffer size). + + Examples + -------- + >>> arr = zarr.open_array(store) + >>> arr.write_chunk_sizes + ((30, 30, 30, 10), (40, 40)) + """ + return self.async_array.write_chunk_sizes + @property def shards(self) -> tuple[int, ...] | None: """Returns a tuple of integers describing the length of each dimension of a shard of the array. Returns None if sharding is not used. - Only defined for arrays using using `RegularChunkGrid`. - If array doesn't use `RegularChunkGrid`, `NotImplementedError` is raised. + Only defined for arrays using a regular chunk grid. + If array uses a rectilinear chunk grid, `NotImplementedError` is raised. Returns ------- @@ -2670,7 +2829,7 @@ def __array__( raise ValueError(msg) arr = self[...] 
- arr_np: NDArrayLike = np.array(arr, dtype=dtype) + arr_np = np.array(arr, dtype=dtype) if dtype is not None: arr_np = arr_np.astype(dtype) @@ -3061,7 +3220,7 @@ def get_basic_selection( prototype = default_buffer_prototype() return sync( self.async_array._get_selection( - BasicIndexer(selection, self.shape, self.metadata.chunk_grid), + BasicIndexer(selection, self.shape, self._chunk_grid), out=out, fields=fields, prototype=prototype, @@ -3168,7 +3327,7 @@ def set_basic_selection( """ if prototype is None: prototype = default_buffer_prototype() - indexer = BasicIndexer(selection, self.shape, self.metadata.chunk_grid) + indexer = BasicIndexer(selection, self.shape, self._chunk_grid) sync(self.async_array._set_selection(indexer, value, fields=fields, prototype=prototype)) def get_orthogonal_selection( @@ -3296,7 +3455,7 @@ def get_orthogonal_selection( """ if prototype is None: prototype = default_buffer_prototype() - indexer = OrthogonalIndexer(selection, self.shape, self.metadata.chunk_grid) + indexer = OrthogonalIndexer(selection, self.shape, self._chunk_grid) return sync( self.async_array._get_selection( indexer=indexer, out=out, fields=fields, prototype=prototype @@ -3415,7 +3574,7 @@ def set_orthogonal_selection( """ if prototype is None: prototype = default_buffer_prototype() - indexer = OrthogonalIndexer(selection, self.shape, self.metadata.chunk_grid) + indexer = OrthogonalIndexer(selection, self.shape, self._chunk_grid) return sync( self.async_array._set_selection(indexer, value, fields=fields, prototype=prototype) ) @@ -3503,7 +3662,7 @@ def get_mask_selection( if prototype is None: prototype = default_buffer_prototype() - indexer = MaskIndexer(mask, self.shape, self.metadata.chunk_grid) + indexer = MaskIndexer(mask, self.shape, self._chunk_grid) return sync( self.async_array._get_selection( indexer=indexer, out=out, fields=fields, prototype=prototype @@ -3593,7 +3752,7 @@ def set_mask_selection( """ if prototype is None: prototype = 
default_buffer_prototype() - indexer = MaskIndexer(mask, self.shape, self.metadata.chunk_grid) + indexer = MaskIndexer(mask, self.shape, self._chunk_grid) sync(self.async_array._set_selection(indexer, value, fields=fields, prototype=prototype)) def get_coordinate_selection( @@ -3681,7 +3840,7 @@ def get_coordinate_selection( """ if prototype is None: prototype = default_buffer_prototype() - indexer = CoordinateIndexer(selection, self.shape, self.metadata.chunk_grid) + indexer = CoordinateIndexer(selection, self.shape, self._chunk_grid) out_array = sync( self.async_array._get_selection( indexer=indexer, out=out, fields=fields, prototype=prototype @@ -3774,7 +3933,7 @@ def set_coordinate_selection( if prototype is None: prototype = default_buffer_prototype() # setup indexer - indexer = CoordinateIndexer(selection, self.shape, self.metadata.chunk_grid) + indexer = CoordinateIndexer(selection, self.shape, self._chunk_grid) # handle value - need ndarray-like flatten value if not is_scalar(value, self.dtype): @@ -3896,7 +4055,7 @@ def get_block_selection( """ if prototype is None: prototype = default_buffer_prototype() - indexer = BlockIndexer(selection, self.shape, self.metadata.chunk_grid) + indexer = BlockIndexer(selection, self.shape, self._chunk_grid) return sync( self.async_array._get_selection( indexer=indexer, out=out, fields=fields, prototype=prototype @@ -3997,7 +4156,7 @@ def set_block_selection( """ if prototype is None: prototype = default_buffer_prototype() - indexer = BlockIndexer(selection, self.shape, self.metadata.chunk_grid) + indexer = BlockIndexer(selection, self.shape, self._chunk_grid) sync(self.async_array._set_selection(indexer, value, fields=fields, prototype=prototype)) @property @@ -4244,7 +4403,7 @@ class ShardsConfigParam(TypedDict): index_location: ShardingCodecIndexLocation | None -type ShardsLike = tuple[int, ...] | ShardsConfigParam | Literal["auto"] +type ShardsLike = tuple[int, ...] 
| Sequence[Sequence[int]] | ShardsConfigParam | Literal["auto"] async def from_array( @@ -4253,7 +4412,7 @@ async def from_array( data: AnyArray | npt.ArrayLike, write_data: bool = True, name: str | None = None, - chunks: Literal["auto", "keep"] | tuple[int, ...] = "keep", + chunks: ChunksLike | Literal["auto", "keep"] = "keep", shards: ShardsLike | None | Literal["keep"] = "keep", filters: FiltersLike | Literal["keep"] = "keep", compressors: CompressorsLike | Literal["keep"] = "keep", @@ -4285,13 +4444,17 @@ async def from_array( name : str or None, optional The name of the array within the store. If ``name`` is ``None``, the array will be located at the root of the store. - chunks : tuple[int, ...] or "auto" or "keep", optional + chunks : tuple[int, ...] or Sequence[Sequence[int]] or "auto" or "keep", optional Chunk shape of the array. Following values are supported: - "auto": Automatically determine the chunk shape based on the array's shape and dtype. - - "keep": Retain the chunk shape of the data array if it is a zarr Array. - - tuple[int, ...]: A tuple of integers representing the chunk shape. + - "keep": Retain the chunk grid of the data array if it is a zarr Array. + - tuple[int, ...]: A tuple of integers representing the chunk shape (regular grid). + - Sequence[Sequence[int]]: Per-dimension chunk edge lists (rectilinear grid). + Rectilinear chunk grids are experimental and must be explicitly enabled + with ``zarr.config.set({'array.rectilinear_chunks': True})`` while the + feature is stabilizing. If not specified, defaults to "keep" if data is a zarr Array, otherwise "auto". shards : tuple[int, ...], optional @@ -4522,7 +4685,7 @@ async def init_array( store_path: StorePath, shape: ShapeLike, dtype: ZDTypeLike, - chunks: tuple[int, ...] 
| Literal["auto"] = "auto", + chunks: ChunksLike | Literal["auto"] = "auto", shards: ShardsLike | None = None, filters: FiltersLike = "auto", compressors: CompressorsLike = "auto", @@ -4638,14 +4801,52 @@ async def init_array( else: await ensure_no_existing_node(store_path, zarr_format=zarr_format) + # Detect rectilinear (nested list) chunks or shards, e.g. [[10, 20, 30], [25, 25]] + from zarr.core.chunk_grids import _is_rectilinear_chunks + + rectilinear_meta: RectilinearChunkGrid | None = None + rectilinear_shards = _is_rectilinear_chunks(shards) + + if _is_rectilinear_chunks(chunks): + if zarr_format == 2: + raise ValueError("Zarr format 2 does not support rectilinear chunk grids.") + if shards is not None: + raise ValueError( + "Rectilinear chunks with sharding is not supported. " + "Use rectilinear shards instead: " + "chunks=(inner_size, ...), shards=[[shard_sizes], ...]" + ) + rectilinear_meta = RectilinearChunkGrid( + chunk_shapes=tuple(tuple(dim_edges) for dim_edges in chunks) + ) + # Use first chunk size per dim as placeholder for _auto_partition + chunks_flat: tuple[int, ...] | Literal["auto"] = tuple(dim_edges[0] for dim_edges in chunks) + else: + # Normalize scalar int to per-dimension tuple (e.g. chunks=100000 for a 1D array) + if isinstance(chunks, int): + chunks = tuple(chunks for _ in shape_parsed) + chunks_flat = cast("tuple[int, ...] 
| Literal['auto']", chunks) + + # Handle rectilinear shards: shards=[[60, 40, 20], [50, 50]] + # means variable-sized shard boundaries with uniform inner chunks + shards_for_partition: ShardsLike | None = shards + if _is_rectilinear_chunks(shards): + if zarr_format == 2: + raise ValueError("Zarr format 2 does not support rectilinear chunk grids.") + rectilinear_meta = RectilinearChunkGrid( + chunk_shapes=tuple(tuple(dim_edges) for dim_edges in shards) + ) + # Use first shard size per dim as placeholder for _auto_partition + shards_for_partition = tuple(dim_edges[0] for dim_edges in shards) + item_size = 1 if isinstance(zdtype, HasItemSize): item_size = zdtype.item_size shard_shape_parsed, chunk_shape_parsed = _auto_partition( array_shape=shape_parsed, - shard_shape=shards, - chunk_shape=chunks, + shard_shape=shards_for_partition, + chunk_shape=chunks_flat, item_size=item_size, ) chunks_out: tuple[int, ...] @@ -4701,10 +4902,15 @@ async def init_array( sharding_codec = ShardingCodec( chunk_shape=chunk_shape_parsed, codecs=sub_codecs, index_location=index_location ) + # Use rectilinear grid for validation when shards are rectilinear + if rectilinear_shards and rectilinear_meta is not None: + validation_grid: ChunkGridMetadata = rectilinear_meta + else: + validation_grid = RegularChunkGrid(chunk_shape=shard_shape_parsed) sharding_codec.validate( shape=chunk_shape_parsed, dtype=zdtype, - chunk_grid=RegularChunkGrid(chunk_shape=shard_shape_parsed), + chunk_grid=validation_grid, ) codecs_out = (sharding_codec,) chunks_out = shard_shape_parsed @@ -4715,11 +4921,16 @@ async def init_array( if order is not None: _warn_order_kwarg() + grid: ChunkGridMetadata + if rectilinear_meta is not None: + grid = rectilinear_meta + else: + grid = RegularChunkGrid(chunk_shape=chunks_out) meta = AsyncArray._create_metadata_v3( shape=shape_parsed, dtype=zdtype, fill_value=fill_value, - chunk_shape=chunks_out, + chunk_grid=grid, chunk_key_encoding=chunk_key_encoding_parsed, 
codecs=codecs_out, dimension_names=dimension_names, @@ -4738,7 +4949,7 @@ async def create_array( shape: ShapeLike | None = None, dtype: ZDTypeLike | None = None, data: np.ndarray[Any, np.dtype[Any]] | None = None, - chunks: tuple[int, ...] | Literal["auto"] = "auto", + chunks: ChunksLike | Literal["auto"] = "auto", shards: ShardsLike | None = None, filters: FiltersLike = "auto", compressors: CompressorsLike = "auto", @@ -4772,9 +4983,13 @@ async def create_array( data : np.ndarray, optional Array-like data to use for initializing the array. If this parameter is provided, the ``shape`` and ``dtype`` parameters must be ``None``. - chunks : tuple[int, ...] | Literal["auto"], default="auto" + chunks : tuple[int, ...] | Sequence[Sequence[int]] | Literal["auto"], default="auto" Chunk shape of the array. If chunks is "auto", a chunk shape is guessed based on the shape of the array and the dtype. + A nested list of per-dimension edge sizes creates a rectilinear grid. + Rectilinear chunk grids are experimental and must be explicitly enabled + with ``zarr.config.set({'array.rectilinear_chunks': True})`` while the + feature is stabilizing. shards : tuple[int, ...], optional Shard shape of the array. The default value of ``None`` results in no sharding at all. filters : Iterable[Codec] | Literal["auto"], optional @@ -4923,7 +5138,7 @@ async def create_array( def _parse_keep_array_attr( data: AnyArray | npt.ArrayLike, - chunks: Literal["auto", "keep"] | tuple[int, ...], + chunks: ChunksLike | Literal["auto", "keep"], shards: ShardsLike | None | Literal["keep"], filters: FiltersLike | Literal["keep"], compressors: CompressorsLike | Literal["keep"], @@ -4934,7 +5149,7 @@ def _parse_keep_array_attr( chunk_key_encoding: ChunkKeyEncodingLike | None, dimension_names: DimensionNamesLike, ) -> tuple[ - tuple[int, ...] 
| Literal["auto"], + ChunksLike | Literal["auto"], ShardsLike | None, FiltersLike, CompressorsLike, @@ -4947,9 +5162,12 @@ def _parse_keep_array_attr( ]: if isinstance(data, Array): if chunks == "keep": - chunks = data.chunks + if data._chunk_grid.is_regular: + chunks = data.chunks + else: + chunks = data.write_chunk_sizes if shards == "keep": - shards = data.shards + shards = data.shards if data._chunk_grid.is_regular else None if zarr_format is None: zarr_format = data.metadata.zarr_format if filters == "keep": @@ -5001,8 +5219,10 @@ def _parse_keep_array_attr( compressors = "auto" if serializer == "keep": serializer = "auto" + # After resolving "keep" above, chunks is never "keep" at this point. + chunks_out: ChunksLike | Literal["auto"] = chunks # type: ignore[assignment] return ( - chunks, + chunks_out, shards, filters, compressors, @@ -5458,9 +5678,7 @@ def _iter_chunk_regions( A tuple of slice objects representing the region spanned by each shard in the selection. """ - return _iter_regions( - array.shape, array.chunks, origin=origin, selection_shape=selection_shape, trim_excess=True - ) + return array._chunk_grid.iter_chunk_regions(origin=origin, selection_shape=selection_shape) async def _nchunks_initialized( @@ -5533,11 +5751,32 @@ async def _nbytes_stored( return await store_path.store.getsize_prefix(store_path.path) +def _get_chunk_spec( + metadata: ArrayMetadata, + chunk_grid: ChunkGrid, + chunk_coords: tuple[int, ...], + array_config: ArrayConfig, + prototype: BufferPrototype, +) -> ArraySpec: + """Build an ArraySpec for a single chunk using the behavioral ChunkGrid.""" + spec = chunk_grid[chunk_coords] + if spec is None: + raise IndexError(f"Chunk coordinates {chunk_coords} are out of bounds.") + return ArraySpec( + shape=spec.codec_shape, + dtype=metadata.dtype, + fill_value=metadata.fill_value, + config=array_config, + prototype=prototype, + ) + + async def _get_selection( store_path: StorePath, metadata: ArrayMetadata, codec_pipeline: 
CodecPipeline, config: ArrayConfig, + chunk_grid: ChunkGrid, indexer: Indexer, *, prototype: BufferPrototype, @@ -5614,7 +5853,7 @@ async def _get_selection( [ ( store_path / metadata.encode_chunk_key(chunk_coords), - metadata.get_chunk_spec(chunk_coords, _config, prototype=prototype), + _get_chunk_spec(metadata, chunk_grid, chunk_coords, _config, prototype), chunk_selection, out_selection, is_complete_chunk, @@ -5634,6 +5873,7 @@ async def _getitem( metadata: ArrayMetadata, codec_pipeline: CodecPipeline, config: ArrayConfig, + chunk_grid: ChunkGrid, selection: BasicSelection, *, prototype: BufferPrototype | None = None, @@ -5651,6 +5891,8 @@ async def _getitem( The codec pipeline for encoding/decoding. config : ArrayConfig The array configuration. + chunk_grid : ChunkGrid + The behavioral chunk grid. selection : BasicSelection A selection object specifying the subset of data to retrieve. prototype : BufferPrototype, optional @@ -5666,10 +5908,10 @@ async def _getitem( indexer = BasicIndexer( selection, shape=metadata.shape, - chunk_grid=metadata.chunk_grid, + chunk_grid=chunk_grid, ) return await _get_selection( - store_path, metadata, codec_pipeline, config, indexer, prototype=prototype + store_path, metadata, codec_pipeline, config, chunk_grid, indexer, prototype=prototype ) @@ -5678,6 +5920,7 @@ async def _get_orthogonal_selection( metadata: ArrayMetadata, codec_pipeline: CodecPipeline, config: ArrayConfig, + chunk_grid: ChunkGrid, selection: OrthogonalSelection, *, out: NDBuffer | None = None, @@ -5697,6 +5940,8 @@ async def _get_orthogonal_selection( The codec pipeline for encoding/decoding. config : ArrayConfig The array configuration. + chunk_grid : ChunkGrid + The behavioral chunk grid. selection : OrthogonalSelection The orthogonal selection specification. 
out : NDBuffer | None, optional @@ -5713,12 +5958,13 @@ async def _get_orthogonal_selection( """ if prototype is None: prototype = default_buffer_prototype() - indexer = OrthogonalIndexer(selection, metadata.shape, metadata.chunk_grid) + indexer = OrthogonalIndexer(selection, metadata.shape, chunk_grid) return await _get_selection( store_path, metadata, codec_pipeline, config, + chunk_grid, indexer=indexer, out=out, fields=fields, @@ -5731,6 +5977,7 @@ async def _get_mask_selection( metadata: ArrayMetadata, codec_pipeline: CodecPipeline, config: ArrayConfig, + chunk_grid: ChunkGrid, mask: MaskSelection, *, out: NDBuffer | None = None, @@ -5750,6 +5997,8 @@ async def _get_mask_selection( The codec pipeline for encoding/decoding. config : ArrayConfig The array configuration. + chunk_grid : ChunkGrid + The behavioral chunk grid. mask : MaskSelection The boolean mask specifying the selection. out : NDBuffer | None, optional @@ -5766,12 +6015,13 @@ async def _get_mask_selection( """ if prototype is None: prototype = default_buffer_prototype() - indexer = MaskIndexer(mask, metadata.shape, metadata.chunk_grid) + indexer = MaskIndexer(mask, metadata.shape, chunk_grid) return await _get_selection( store_path, metadata, codec_pipeline, config, + chunk_grid, indexer=indexer, out=out, fields=fields, @@ -5784,6 +6034,7 @@ async def _get_coordinate_selection( metadata: ArrayMetadata, codec_pipeline: CodecPipeline, config: ArrayConfig, + chunk_grid: ChunkGrid, selection: CoordinateSelection, *, out: NDBuffer | None = None, @@ -5803,6 +6054,8 @@ async def _get_coordinate_selection( The codec pipeline for encoding/decoding. config : ArrayConfig The array configuration. + chunk_grid : ChunkGrid + The behavioral chunk grid. selection : CoordinateSelection The coordinate selection specification. 
out : NDBuffer | None, optional @@ -5819,12 +6072,13 @@ async def _get_coordinate_selection( """ if prototype is None: prototype = default_buffer_prototype() - indexer = CoordinateIndexer(selection, metadata.shape, metadata.chunk_grid) + indexer = CoordinateIndexer(selection, metadata.shape, chunk_grid) out_array = await _get_selection( store_path, metadata, codec_pipeline, config, + chunk_grid, indexer=indexer, out=out, fields=fields, @@ -5833,7 +6087,7 @@ async def _get_coordinate_selection( if hasattr(out_array, "shape"): # restore shape - out_array = np.array(out_array).reshape(indexer.sel_shape) + out_array = cast("NDArrayLikeOrScalar", np.array(out_array).reshape(indexer.sel_shape)) return out_array @@ -5842,6 +6096,7 @@ async def _set_selection( metadata: ArrayMetadata, codec_pipeline: CodecPipeline, config: ArrayConfig, + chunk_grid: ChunkGrid, indexer: Indexer, value: npt.ArrayLike, *, @@ -5861,6 +6116,8 @@ async def _set_selection( The codec pipeline for encoding/decoding. config : ArrayConfig The array configuration. + chunk_grid : ChunkGrid + The behavioral chunk grid. indexer : Indexer The indexer specifying the selection. value : npt.ArrayLike @@ -5924,7 +6181,7 @@ async def _set_selection( [ ( store_path / metadata.encode_chunk_key(chunk_coords), - metadata.get_chunk_spec(chunk_coords, _config, prototype), + _get_chunk_spec(metadata, chunk_grid, chunk_coords, _config, prototype), chunk_selection, out_selection, is_complete_chunk, @@ -5941,6 +6198,7 @@ async def _setitem( metadata: ArrayMetadata, codec_pipeline: CodecPipeline, config: ArrayConfig, + chunk_grid: ChunkGrid, selection: BasicSelection, value: npt.ArrayLike, prototype: BufferPrototype | None = None, @@ -5958,6 +6216,8 @@ async def _setitem( The codec pipeline for encoding/decoding. config : ArrayConfig The array configuration. + chunk_grid : ChunkGrid + The behavioral chunk grid. selection : BasicSelection The selection defining the region of the array to set. 
value : npt.ArrayLike @@ -5971,10 +6231,17 @@ async def _setitem( indexer = BasicIndexer( selection, shape=metadata.shape, - chunk_grid=metadata.chunk_grid, + chunk_grid=chunk_grid, ) return await _set_selection( - store_path, metadata, codec_pipeline, config, indexer, value, prototype=prototype + store_path, + metadata, + codec_pipeline, + config, + chunk_grid, + indexer, + value, + prototype=prototype, ) @@ -5998,15 +6265,17 @@ async def _resize( """ new_shape = parse_shapelike(new_shape) assert len(new_shape) == len(array.metadata.shape) + new_metadata = array.metadata.update_shape(new_shape) + new_chunk_grid = ChunkGrid.from_metadata(new_metadata) # ensure deletion is only run if array is shrinking as the delete_outside_chunks path is unbounded in memory only_growing = all(new >= old for new, old in zip(new_shape, array.metadata.shape, strict=True)) if delete_outside_chunks and not only_growing: # Remove all chunks outside of the new shape - old_chunk_coords = set(array.metadata.chunk_grid.all_chunk_coords(array.metadata.shape)) - new_chunk_coords = set(array.metadata.chunk_grid.all_chunk_coords(new_shape)) + old_chunk_coords = set(array._chunk_grid.all_chunk_coords()) + new_chunk_coords = set(new_chunk_grid.all_chunk_coords()) async def _delete_key(key: str) -> None: await (array.store_path / key).delete() @@ -6023,8 +6292,9 @@ async def _delete_key(key: str) -> None: # Write new metadata await save_metadata(array.store_path, new_metadata) - # Update metadata (in place) + # Update metadata and chunk_grid (in place) object.__setattr__(array, "metadata", new_metadata) + object.__setattr__(array, "_chunk_grid", new_chunk_grid) async def _append( @@ -6090,6 +6360,7 @@ async def _append( array.metadata, array.codec_pipeline, array.config, + array._chunk_grid, append_selection, data, ) diff --git a/src/zarr/core/chunk_grids.py b/src/zarr/core/chunk_grids.py index c903eba013..dcea33f3bf 100644 --- a/src/zarr/core/chunk_grids.py +++ b/src/zarr/core/chunk_grids.py @@ 
-1,34 +1,544 @@ from __future__ import annotations +import bisect import itertools import math import numbers import operator import warnings -from abc import abstractmethod -from dataclasses import dataclass +from dataclasses import dataclass, field from functools import reduce -from typing import TYPE_CHECKING, Any, Literal +from typing import TYPE_CHECKING, Any, Literal, Protocol, TypeGuard, cast, runtime_checkable import numpy as np +import numpy.typing as npt import zarr -from zarr.abc.metadata import Metadata from zarr.core.common import ( - JSON, - NamedConfig, ShapeLike, ceildiv, - parse_named_configuration, parse_shapelike, ) from zarr.errors import ZarrUserWarning if TYPE_CHECKING: - from collections.abc import Iterator - from typing import Self + from collections.abc import Iterable, Iterator, Sequence from zarr.core.array import ShardsLike + from zarr.core.metadata import ArrayMetadata + + +@dataclass(frozen=True) +class FixedDimension: + """Uniform chunk size. Boundary chunks contain less data but are + encoded at full size by the codec pipeline.""" + + size: int # chunk edge length (>= 0) + extent: int # array dimension length + nchunks: int = field(init=False, repr=False) + ngridcells: int = field(init=False, repr=False) + + def __post_init__(self) -> None: + if self.size < 0: + raise ValueError(f"FixedDimension size must be >= 0, got {self.size}") + if self.extent < 0: + raise ValueError(f"FixedDimension extent must be >= 0, got {self.extent}") + if self.size == 0: + n = 0 + else: + n = ceildiv(self.extent, self.size) + object.__setattr__(self, "nchunks", n) + object.__setattr__(self, "ngridcells", n) + + def index_to_chunk(self, idx: int) -> int: + if idx < 0: + raise IndexError(f"Negative index {idx} is not allowed") + if idx >= self.extent: + raise IndexError(f"Index {idx} is out of bounds for extent {self.extent}") + if self.size == 0: + return 0 + return idx // self.size + + def chunk_offset(self, chunk_ix: int) -> int: + """Byte-aligned start 
position of chunk *chunk_ix* in array coordinates. + + Does not validate *chunk_ix* — callers must ensure it is in + ``[0, nchunks)``. Use ``ChunkGrid.__getitem__`` for safe access. + """ + return chunk_ix * self.size + + def chunk_size(self, chunk_ix: int) -> int: + """Buffer size for codec processing — always uniform. + + Does not validate *chunk_ix* — callers must ensure it is in + ``[0, nchunks)``. Use ``ChunkGrid.__getitem__`` for safe access. + """ + return self.size + + def data_size(self, chunk_ix: int) -> int: + """Valid data region within the buffer — clipped at extent. + + Does not validate *chunk_ix* — callers must ensure it is in + ``[0, nchunks)``. Use ``ChunkGrid.__getitem__`` for safe access. + """ + if self.size == 0: + return 0 + return max(0, min(self.size, self.extent - chunk_ix * self.size)) + + @property + def unique_edge_lengths(self) -> Iterable[int]: + """Distinct chunk edge lengths for this dimension.""" + return (self.size,) + + def indices_to_chunks(self, indices: npt.NDArray[np.intp]) -> npt.NDArray[np.intp]: + if self.size == 0: + return np.zeros_like(indices) + return indices // self.size + + def with_extent(self, new_extent: int) -> FixedDimension: + """Re-bind to *new_extent* without modifying edges. + + Used when constructing a grid from existing metadata where edges + are already correct. Raises on + ``VaryingDimension`` if edges don't cover the new extent. + """ + return FixedDimension(size=self.size, extent=new_extent) + + def resize(self, new_extent: int) -> FixedDimension: + """Adapt for a user-initiated array resize, growing edges if needed. + + For ``FixedDimension`` this is identical to ``with_extent`` since + regular grids don't store explicit edges. + """ + return FixedDimension(size=self.size, extent=new_extent) + + +@dataclass(frozen=True) +class VaryingDimension: + """Explicit per-chunk sizes. 
The last chunk may extend past the array + extent, in which case ``data_size`` clips to the valid region while + ``chunk_size`` returns the full edge length for codec processing.""" + + edges: tuple[int, ...] # per-chunk edge lengths (all > 0) + cumulative: tuple[int, ...] # prefix sums for O(log n) lookup + extent: int # array dimension length (may be < sum(edges) after resize) + nchunks: int = field(init=False, repr=False) # cached at construction + ngridcells: int = field(init=False, repr=False) # cached at construction + + def __init__(self, edges: Sequence[int], extent: int) -> None: + edges_tuple = tuple(edges) + if not edges_tuple: + raise ValueError("VaryingDimension edges must not be empty") + if any(e <= 0 for e in edges_tuple): + raise ValueError(f"All edge lengths must be > 0, got {edges_tuple}") + cumulative = tuple(itertools.accumulate(edges_tuple)) + if extent < 0: + raise ValueError(f"VaryingDimension extent must be >= 0, got {extent}") + if extent > cumulative[-1]: + raise ValueError( + f"VaryingDimension extent {extent} exceeds sum of edges {cumulative[-1]}" + ) + object.__setattr__(self, "edges", edges_tuple) + object.__setattr__(self, "cumulative", cumulative) + object.__setattr__(self, "extent", extent) + # Cache nchunks: number of chunks that overlap [0, extent) + if extent == 0: + n = 0 + else: + n = bisect.bisect_left(cumulative, extent) + 1 + object.__setattr__(self, "nchunks", n) + object.__setattr__(self, "ngridcells", len(edges_tuple)) + + def index_to_chunk(self, idx: int) -> int: + if idx < 0 or idx >= self.extent: + raise IndexError(f"Index {idx} out of bounds for dimension with extent {self.extent}") + return bisect.bisect_right(self.cumulative, idx) + + def chunk_offset(self, chunk_ix: int) -> int: + """Start position of chunk *chunk_ix* in array coordinates. + + Does not validate *chunk_ix* — callers must ensure it is in + ``[0, ngridcells)``. Use ``ChunkGrid.__getitem__`` for safe access. 
+ """ + return self.cumulative[chunk_ix - 1] if chunk_ix > 0 else 0 + + def chunk_size(self, chunk_ix: int) -> int: + """Buffer size for codec processing. + + Does not validate *chunk_ix* — callers must ensure it is in + ``[0, ngridcells)``. Use ``ChunkGrid.__getitem__`` for safe access. + """ + return self.edges[chunk_ix] + + def data_size(self, chunk_ix: int) -> int: + """Valid data region within the buffer — clipped at extent. + + Does not validate *chunk_ix* — callers must ensure it is in + ``[0, ngridcells)``. Use ``ChunkGrid.__getitem__`` for safe access. + """ + offset = self.cumulative[chunk_ix - 1] if chunk_ix > 0 else 0 + return max(0, min(self.edges[chunk_ix], self.extent - offset)) + + @property + def unique_edge_lengths(self) -> Iterable[int]: + """Distinct chunk edge lengths for this dimension (lazily deduplicated).""" + seen: set[int] = set() + for e in self.edges: + if e not in seen: + seen.add(e) + yield e + + def indices_to_chunks(self, indices: npt.NDArray[np.intp]) -> npt.NDArray[np.intp]: + return np.searchsorted(self.cumulative, indices, side="right") + + def with_extent(self, new_extent: int) -> VaryingDimension: + """Re-bind to *new_extent* without modifying edges. + + Used when constructing a grid from existing metadata where edges + are already correct. Raises if the + existing edges don't cover *new_extent*. + """ + edge_sum = self.cumulative[-1] + if edge_sum < new_extent: + raise ValueError( + f"VaryingDimension edge sum {edge_sum} is less than new extent {new_extent}" + ) + return VaryingDimension(self.edges, extent=new_extent) + + def resize(self, new_extent: int) -> VaryingDimension: + """Adapt for a user-initiated array resize, growing edges if needed. + + Unlike ``with_extent``, this never fails — if *new_extent* exceeds + the current edge sum, a new chunk is appended to cover the gap. + Shrinking preserves all edges (the spec allows trailing edges + beyond the array extent). 
+ """ + if new_extent == self.extent: + return self + elif new_extent > self.cumulative[-1]: + expanded_edges = list(self.edges) + [new_extent - self.cumulative[-1]] + return VaryingDimension(expanded_edges, extent=new_extent) + else: + return VaryingDimension(self.edges, extent=new_extent) + + +@runtime_checkable +class DimensionGrid(Protocol): + """Structural interface shared by FixedDimension and VaryingDimension.""" + + @property + def nchunks(self) -> int: ... + @property + def ngridcells(self) -> int: ... + @property + def extent(self) -> int: ... + def index_to_chunk(self, idx: int) -> int: ... + def chunk_offset(self, chunk_ix: int) -> int: ... + def chunk_size(self, chunk_ix: int) -> int: ... + def data_size(self, chunk_ix: int) -> int: ... + def indices_to_chunks(self, indices: npt.NDArray[np.intp]) -> npt.NDArray[np.intp]: ... + @property + def unique_edge_lengths(self) -> Iterable[int]: ... + def with_extent(self, new_extent: int) -> DimensionGrid: ... + def resize(self, new_extent: int) -> DimensionGrid: ... + + +@dataclass(frozen=True) +class ChunkSpec: + """Specification of a single chunk's location and size. + + ``slices`` gives the valid data region in array coordinates. + ``codec_shape`` gives the buffer shape for codec processing. + For interior chunks these are equal. For boundary chunks of a regular + grid, ``codec_shape`` is the full declared chunk size while ``shape`` + is clipped. For rectilinear grids, ``shape == codec_shape`` unless the + last chunk extends past the array extent. + """ + + slices: tuple[slice, ...] + codec_shape: tuple[int, ...] + + @property + def shape(self) -> tuple[int, ...]: + return tuple(s.stop - s.start for s in self.slices) + + @property + def is_boundary(self) -> bool: + return self.shape != self.codec_shape + + +# A single dimension's rectilinear chunk spec: bare int (uniform shorthand), +# list of ints (explicit edges), or mixed RLE (e.g. [[10, 3], 5]). 
+ + +def _is_rectilinear_chunks(chunks: Any) -> TypeGuard[Sequence[Sequence[int]]]: + """Check if chunks is a nested sequence (e.g. [[10, 20], [5, 5]]). + + Returns True for inputs like [[10, 20], [5, 5]] or [(10, 20), (5, 5)]. + Returns False for flat sequences like (10, 10) or [10, 10]. + """ + if isinstance(chunks, (str, int, ChunkGrid)): + return False + if not hasattr(chunks, "__iter__"): + return False + try: + first_elem = next(iter(chunks), None) + if first_elem is None: + return False + return hasattr(first_elem, "__iter__") and not isinstance(first_elem, (str, bytes, int)) + except (TypeError, StopIteration): + return False + + +@dataclass(frozen=True) +class ChunkGrid: + """ + Unified chunk grid supporting both regular and rectilinear chunking. + + A chunk grid is a concrete arrangement of chunks for a specific array. + It stores the extent (array dimension length) per dimension, enabling + ``grid[coords]`` to return a ``ChunkSpec`` without external parameters. + + Internally represents each dimension as either FixedDimension (uniform chunks) + or VaryingDimension (per-chunk edge lengths with prefix sums). + """ + + _dimensions: tuple[DimensionGrid, ...] + _is_regular: bool + + def __init__(self, *, dimensions: tuple[DimensionGrid, ...]) -> None: + object.__setattr__(self, "_dimensions", dimensions) + object.__setattr__( + self, "_is_regular", all(isinstance(d, FixedDimension) for d in dimensions) + ) + + def __repr__(self) -> str: + sizes: list[str] = [] + for d in self._dimensions: + if isinstance(d, FixedDimension): + sizes.append(str(d.size)) + elif isinstance(d, VaryingDimension): + sizes.append(repr(tuple(d.edges))) + shape = tuple(d.extent for d in self._dimensions) + return f"ChunkGrid(chunk_sizes=({', '.join(sizes)}), array_shape={shape})" + + @classmethod + def from_metadata(cls, metadata: ArrayMetadata) -> ChunkGrid: + """Construct a behavioral ChunkGrid from array metadata. + + For v2 metadata, builds from shape and chunks. 
+ For v3 metadata, dispatches on the chunk grid type. + """ + from zarr.core.metadata import ArrayV2Metadata + from zarr.core.metadata.v3 import RectilinearChunkGrid, RegularChunkGrid + + if isinstance(metadata, ArrayV2Metadata): + return cls.from_sizes(metadata.shape, tuple(metadata.chunks)) + chunk_grid_meta = metadata.chunk_grid + if isinstance(chunk_grid_meta, RegularChunkGrid): + return cls.from_sizes(metadata.shape, tuple(chunk_grid_meta.chunk_shape)) + elif isinstance(chunk_grid_meta, RectilinearChunkGrid): + return cls.from_sizes(metadata.shape, chunk_grid_meta.chunk_shapes) + else: + raise TypeError(f"Unknown chunk grid metadata type: {type(chunk_grid_meta)}") + + @classmethod + def from_sizes( + cls, + array_shape: ShapeLike, + chunk_sizes: Sequence[int | Sequence[int]], + ) -> ChunkGrid: + """Create a ChunkGrid from per-dimension chunk size specifications. + + Parameters + ---------- + array_shape + The array shape (one extent per dimension). + chunk_sizes + Per-dimension chunk sizes. Each element is either: + + - An ``int`` — regular (fixed) chunk size for that dimension. + - A ``Sequence[int]`` — explicit per-chunk edge lengths. If all + edges are identical and cover the extent, the dimension is + stored as ``FixedDimension``; otherwise as ``VaryingDimension``. 
+ """ + extents = parse_shapelike(array_shape) + if len(extents) != len(chunk_sizes): + raise ValueError( + f"array_shape has {len(extents)} dimensions but chunk_sizes " + f"has {len(chunk_sizes)} dimensions" + ) + dims: list[DimensionGrid] = [] + for dim_spec, extent in zip(chunk_sizes, extents, strict=True): + if isinstance(dim_spec, int): + dims.append(FixedDimension(size=dim_spec, extent=extent)) + else: + edges_list = list(dim_spec) + if not edges_list: + raise ValueError("Each dimension must have at least one chunk") + edge_sum = sum(edges_list) + if ( + edges_list[0] > 0 + and all(e == edges_list[0] for e in edges_list) + and (extent == edge_sum or len(edges_list) == ceildiv(extent, edges_list[0])) + ): + dims.append(FixedDimension(size=edges_list[0], extent=extent)) + else: + dims.append(VaryingDimension(edges_list, extent=extent)) + return cls(dimensions=tuple(dims)) + + # -- Properties -- + + @property + def ndim(self) -> int: + return len(self._dimensions) + + @property + def is_regular(self) -> bool: + return self._is_regular + + @property + def grid_shape(self) -> tuple[int, ...]: + """Number of chunks per dimension.""" + return tuple(d.nchunks for d in self._dimensions) + + @property + def chunk_shape(self) -> tuple[int, ...]: + """Return the uniform chunk shape. Raises if grid is not regular.""" + if not self.is_regular: + raise ValueError( + "chunk_shape is only available for regular chunk grids. " + "Use grid[coords] for per-chunk sizes." + ) + return tuple(d.size for d in self._dimensions if isinstance(d, FixedDimension)) + + @property + def chunk_sizes(self) -> tuple[tuple[int, ...], ...]: + """Per-dimension chunk sizes, including the final boundary chunk. + + Returns the actual data size of each chunk (clipped at the array + extent), matching the dask ``Array.chunks`` convention. Works for + both regular and rectilinear grids. + + Returns + ------- + tuple[tuple[int, ...], ...] 
+ One inner tuple per dimension, each containing the data size + of every chunk along that dimension. + """ + return tuple(tuple(d.data_size(i) for i in range(d.nchunks)) for d in self._dimensions) + + # -- Collection interface -- + + def __getitem__(self, coords: int | tuple[int, ...]) -> ChunkSpec | None: + """Return the ChunkSpec for a chunk at the given grid position, or None if OOB.""" + if isinstance(coords, int): + coords = (coords,) + if len(coords) != self.ndim: + raise ValueError( + f"Expected {self.ndim} coordinate(s) for a {self.ndim}-d chunk grid, " + f"got {len(coords)}." + ) + slices: list[slice] = [] + codec_shape: list[int] = [] + for dim, ix in zip(self._dimensions, coords, strict=True): + if ix < 0 or ix >= dim.nchunks: + return None + offset = dim.chunk_offset(ix) + slices.append(slice(offset, offset + dim.data_size(ix), 1)) + codec_shape.append(dim.chunk_size(ix)) + return ChunkSpec(tuple(slices), tuple(codec_shape)) + + def __iter__(self) -> Iterator[ChunkSpec]: + """Iterate all chunks, yielding ChunkSpec for each.""" + for coords in itertools.product(*(range(d.nchunks) for d in self._dimensions)): + spec = self[coords] + if spec is not None: + yield spec + + def all_chunk_coords( + self, + *, + origin: Sequence[int] | None = None, + selection_shape: Sequence[int] | None = None, + ) -> Iterator[tuple[int, ...]]: + """Iterate over chunk coordinates, optionally restricted to a subregion. + + Parameters + ---------- + origin : Sequence[int] | None + The first chunk coordinate to return. Defaults to the grid origin. + selection_shape : Sequence[int] | None + The number of chunks per dimension to iterate. Defaults to the + remaining extent from origin. 
+ """ + if origin is None: + origin_parsed = (0,) * self.ndim + else: + origin_parsed = tuple(origin) + if selection_shape is None: + selection_shape_parsed = tuple( + g - o for o, g in zip(origin_parsed, self.grid_shape, strict=True) + ) + else: + selection_shape_parsed = tuple(selection_shape) + ranges = tuple( + range(o, o + s) for o, s in zip(origin_parsed, selection_shape_parsed, strict=True) + ) + return itertools.product(*ranges) + + def iter_chunk_regions( + self, + *, + origin: Sequence[int] | None = None, + selection_shape: Sequence[int] | None = None, + ) -> Iterator[tuple[slice, ...]]: + """Iterate over the data regions (slices) spanned by each chunk. + + Parameters + ---------- + origin : Sequence[int] | None + The first chunk coordinate to return. Defaults to the grid origin. + selection_shape : Sequence[int] | None + The number of chunks per dimension to iterate. Defaults to the + remaining extent from origin. + """ + for coords in self.all_chunk_coords(origin=origin, selection_shape=selection_shape): + spec = self[coords] + if spec is not None: + yield spec.slices + + def get_nchunks(self) -> int: + return reduce(operator.mul, (d.nchunks for d in self._dimensions), 1) + + # -- Resize -- + + def update_shape(self, new_shape: tuple[int, ...]) -> ChunkGrid: + """Return a new ChunkGrid adjusted for *new_shape*. + + For regular (FixedDimension) axes the extent is simply re-bound. + For varying (VaryingDimension) axes: + * **grow**: a new chunk whose size equals the growth is appended. + * **shrink**: trailing chunks that lie entirely beyond *new_shape* are + dropped; the last retained chunk is the one whose cumulative offset + first reaches or exceeds the new extent. + * **no change**: the dimension is kept as-is. + + Raises + ------ + ValueError + If *new_shape* has the wrong number of dimensions. 
+ """ + if len(new_shape) != self.ndim: + raise ValueError( + f"new_shape has {len(new_shape)} dimensions but " + f"chunk grid has {self.ndim} dimensions" + ) + dims = tuple( + dim.resize(new_extent) + for dim, new_extent in zip(self._dimensions, new_shape, strict=True) + ) + return ChunkGrid(dimensions=dims) def _guess_chunks( @@ -156,58 +666,6 @@ def normalize_chunks(chunks: Any, shape: tuple[int, ...], typesize: int) -> tupl return tuple(int(c) for c in chunks) -@dataclass(frozen=True) -class ChunkGrid(Metadata): - @classmethod - def from_dict(cls, data: dict[str, JSON] | ChunkGrid | NamedConfig[str, Any]) -> ChunkGrid: - if isinstance(data, ChunkGrid): - return data - - name_parsed, _ = parse_named_configuration(data) - if name_parsed == "regular": - return RegularChunkGrid._from_dict(data) - raise ValueError(f"Unknown chunk grid. Got {name_parsed}.") - - @abstractmethod - def all_chunk_coords(self, array_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]: - pass - - @abstractmethod - def get_nchunks(self, array_shape: tuple[int, ...]) -> int: - pass - - -@dataclass(frozen=True) -class RegularChunkGrid(ChunkGrid): - chunk_shape: tuple[int, ...] 
- - def __init__(self, *, chunk_shape: ShapeLike) -> None: - chunk_shape_parsed = parse_shapelike(chunk_shape) - - object.__setattr__(self, "chunk_shape", chunk_shape_parsed) - - @classmethod - def _from_dict(cls, data: dict[str, JSON] | NamedConfig[str, Any]) -> Self: - _, configuration_parsed = parse_named_configuration(data, "regular") - - return cls(**configuration_parsed) # type: ignore[arg-type] - - def to_dict(self) -> dict[str, JSON]: - return {"name": "regular", "configuration": {"chunk_shape": tuple(self.chunk_shape)}} - - def all_chunk_coords(self, array_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]: - return itertools.product( - *(range(ceildiv(s, c)) for s, c in zip(array_shape, self.chunk_shape, strict=False)) - ) - - def get_nchunks(self, array_shape: tuple[int, ...]) -> int: - return reduce( - operator.mul, - itertools.starmap(ceildiv, zip(array_shape, self.chunk_shape, strict=True)), - 1, - ) - - def _guess_num_chunks_per_axis_shard( chunk_shape: tuple[int, ...], item_size: int, max_bytes: int, array_shape: tuple[int, ...] 
) -> int: @@ -301,6 +759,6 @@ def _auto_partition( elif isinstance(shard_shape, dict): _shards_out = tuple(shard_shape["shape"]) else: - _shards_out = shard_shape + _shards_out = cast("tuple[int, ...]", shard_shape) return _shards_out, _chunks_out diff --git a/src/zarr/core/codec_pipeline.py b/src/zarr/core/codec_pipeline.py index d8c4cabdf9..a9de9b4dbe 100644 --- a/src/zarr/core/codec_pipeline.py +++ b/src/zarr/core/codec_pipeline.py @@ -28,8 +28,8 @@ from zarr.abc.store import ByteGetter, ByteSetter from zarr.core.array_spec import ArraySpec from zarr.core.buffer import Buffer, BufferPrototype, NDBuffer - from zarr.core.chunk_grids import ChunkGrid from zarr.core.dtype.wrapper import TBaseDType, TBaseScalar, ZDType + from zarr.core.metadata.v3 import ChunkGridMetadata def _unzip2[T, U](iterable: Iterable[tuple[T, U]]) -> tuple[list[T], list[U]]: @@ -136,7 +136,7 @@ def validate( *, shape: tuple[int, ...], dtype: ZDType[TBaseDType, TBaseScalar], - chunk_grid: ChunkGrid, + chunk_grid: ChunkGridMetadata, ) -> None: for codec in self: codec.validate(shape=shape, dtype=dtype, chunk_grid=chunk_grid) diff --git a/src/zarr/core/common.py b/src/zarr/core/common.py index 077f459d3b..a16257df7c 100644 --- a/src/zarr/core/common.py +++ b/src/zarr/core/common.py @@ -37,6 +37,7 @@ BytesLike = bytes | bytearray | memoryview ShapeLike = Iterable[int | np.integer[Any]] | int | np.integer[Any] +ChunksLike = ShapeLike | Sequence[Sequence[int]] | None # For backwards compatibility ChunkCoords = tuple[int, ...] ZarrFormat = Literal[2, 3] @@ -241,3 +242,89 @@ def _warn_order_kwarg() -> None: def _default_zarr_format() -> ZarrFormat: """Return the default zarr_version""" return cast("ZarrFormat", int(zarr_config.get("default_zarr_format", 3))) + + +def expand_rle(data: Sequence[int | list[int]]) -> list[int]: + """Expand a mixed array of bare integers and RLE pairs. 
+ + Per the rectilinear chunk grid spec, each element can be: + - a bare integer (an explicit edge length) + - a two-element array ``[value, count]`` (run-length encoded) + """ + result: list[int] = [] + for item in data: + if isinstance(item, (int, float)) and not isinstance(item, bool): + val = int(item) + if val < 1: + raise ValueError(f"Chunk edge length must be >= 1, got {val}") + result.append(val) + elif isinstance(item, list) and len(item) == 2: + size, count = int(item[0]), int(item[1]) + if size < 1: + raise ValueError(f"Chunk edge length must be >= 1, got {size}") + if count < 1: + raise ValueError(f"RLE repeat count must be >= 1, got {count}") + result.extend([size] * count) + else: + raise ValueError(f"RLE entries must be an integer or [size, count], got {item}") + return result + + +def compress_rle(sizes: Sequence[int]) -> list[int | list[int]]: + """Compress chunk sizes to mixed RLE format per the rectilinear spec. + + Runs of length > 1 are emitted as ``[value, count]`` pairs; runs of + length 1 are emitted as bare integers:: + + [10, 10, 10, 5] -> [[10, 3], 5] + """ + if not sizes: + return [] + result: list[int | list[int]] = [] + current = sizes[0] + count = 1 + for s in sizes[1:]: + if s == current: + count += 1 + else: + result.append([current, count] if count > 1 else current) + current = s + count = 1 + result.append([current, count] if count > 1 else current) + return result + + +def validate_rectilinear_kind(kind: str | None) -> None: + """Validate the ``kind`` field of a rectilinear chunk grid configuration. + + The rectilinear spec requires ``kind: "inline"``. + """ + if kind is None: + raise ValueError( + "Rectilinear chunk grid configuration requires a 'kind' field. " + "Only 'inline' is currently supported." + ) + if kind != "inline": + raise ValueError( + f"Unsupported rectilinear chunk grid kind: {kind!r}. " + "Only 'inline' is currently supported." 
+ ) + + +def validate_rectilinear_edges( + chunk_shapes: Sequence[int | Sequence[int]], array_shape: Sequence[int] +) -> None: + """Validate that rectilinear chunk edges cover the array extent per dimension. + + Bare-int dimensions (regular step) always cover any extent, so they are + skipped. Explicit edge lists must sum to at least the array extent. + """ + for i, (dim_spec, extent) in enumerate(zip(chunk_shapes, array_shape, strict=True)): + if isinstance(dim_spec, int): + continue + edge_sum = sum(dim_spec) + if edge_sum < extent: + raise ValueError( + f"Rectilinear chunk edges for dimension {i} sum to {edge_sum} " + f"but array shape extent is {extent} (edge sum must be >= extent)" + ) diff --git a/src/zarr/core/config.py b/src/zarr/core/config.py index f8f8ea4f5f..fceb3657b2 100644 --- a/src/zarr/core/config.py +++ b/src/zarr/core/config.py @@ -97,6 +97,7 @@ def enable_gpu(self) -> ConfigSet: "order": "C", "write_empty_chunks": False, "target_shard_size_bytes": None, + "rectilinear_chunks": False, }, "async": {"concurrency": 10, "timeout": None}, "threading": {"max_workers": None}, diff --git a/src/zarr/core/group.py b/src/zarr/core/group.py index 760f91722c..b810041e7b 100644 --- a/src/zarr/core/group.py +++ b/src/zarr/core/group.py @@ -40,6 +40,7 @@ ZATTRS_JSON, ZGROUP_JSON, ZMETADATA_V2_JSON, + ChunksLike, DimensionNamesLike, NodeType, ShapeLike, @@ -1020,7 +1021,7 @@ async def create_array( shape: ShapeLike | None = None, dtype: ZDTypeLike | None = None, data: np.ndarray[Any, np.dtype[Any]] | None = None, - chunks: tuple[int, ...] | Literal["auto"] = "auto", + chunks: ChunksLike | Literal["auto"] = "auto", shards: ShardsLike | None = None, filters: FiltersLike = "auto", compressors: CompressorsLike = "auto", @@ -2473,7 +2474,7 @@ def create( shape: ShapeLike | None = None, dtype: ZDTypeLike | None = None, data: np.ndarray[Any, np.dtype[Any]] | None = None, - chunks: tuple[int, ...] 
| Literal["auto"] = "auto", + chunks: ChunksLike | Literal["auto"] = "auto", shards: ShardsLike | None = None, filters: FiltersLike = "auto", compressors: CompressorsLike = "auto", @@ -2617,7 +2618,7 @@ def create_array( shape: ShapeLike | None = None, dtype: ZDTypeLike | None = None, data: np.ndarray[Any, np.dtype[Any]] | None = None, - chunks: tuple[int, ...] | Literal["auto"] = "auto", + chunks: ChunksLike | Literal["auto"] = "auto", shards: ShardsLike | None = None, filters: FiltersLike = "auto", compressors: CompressorsLike = "auto", @@ -3015,7 +3016,7 @@ def array( *, shape: ShapeLike, dtype: npt.DTypeLike, - chunks: tuple[int, ...] | Literal["auto"] = "auto", + chunks: ChunksLike | Literal["auto"] = "auto", shards: tuple[int, ...] | Literal["auto"] | None = None, filters: FiltersLike = "auto", compressors: CompressorsLike = "auto", diff --git a/src/zarr/core/indexing.py b/src/zarr/core/indexing.py index 4461074a64..cb81164209 100644 --- a/src/zarr/core/indexing.py +++ b/src/zarr/core/indexing.py @@ -1,7 +1,6 @@ from __future__ import annotations import itertools -import math import numbers import operator from collections.abc import Iterator, Sequence @@ -36,7 +35,7 @@ if TYPE_CHECKING: from zarr.core.array import AsyncArray from zarr.core.buffer import NDArrayLikeOrScalar - from zarr.core.chunk_grids import ChunkGrid + from zarr.core.chunk_grids import ChunkGrid, DimensionGrid from zarr.types import AnyArray @@ -330,15 +329,6 @@ def is_pure_orthogonal_indexing(selection: Selection, ndim: int) -> TypeGuard[Or ) -def get_chunk_shape(chunk_grid: ChunkGrid) -> tuple[int, ...]: - from zarr.core.chunk_grids import RegularChunkGrid - - assert isinstance(chunk_grid, RegularChunkGrid), ( - "Only regular chunk grid is supported, currently." 
- ) - return chunk_grid.chunk_shape - - def normalize_integer_selection(dim_sel: int, dim_len: int) -> int: # normalize type to int dim_sel = int(dim_sel) @@ -378,35 +368,41 @@ class ChunkDimProjection(NamedTuple): class IntDimIndexer: dim_sel: int dim_len: int - dim_chunk_len: int + dim_grid: DimensionGrid nitems: int = 1 - def __init__(self, dim_sel: int, dim_len: int, dim_chunk_len: int) -> None: + def __init__(self, dim_sel: int, dim_len: int, dim_grid: DimensionGrid) -> None: object.__setattr__(self, "dim_sel", normalize_integer_selection(dim_sel, dim_len)) object.__setattr__(self, "dim_len", dim_len) - object.__setattr__(self, "dim_chunk_len", dim_chunk_len) + object.__setattr__(self, "dim_grid", dim_grid) def __iter__(self) -> Iterator[ChunkDimProjection]: - dim_chunk_ix = self.dim_sel // self.dim_chunk_len - dim_offset = dim_chunk_ix * self.dim_chunk_len + g = self.dim_grid + dim_chunk_ix = g.index_to_chunk(self.dim_sel) + dim_offset = g.chunk_offset(dim_chunk_ix) dim_chunk_sel = self.dim_sel - dim_offset dim_out_sel = None - is_complete_chunk = self.dim_chunk_len == 1 + is_complete_chunk = g.data_size(dim_chunk_ix) == 1 yield ChunkDimProjection(dim_chunk_ix, dim_chunk_sel, dim_out_sel, is_complete_chunk) @dataclass(frozen=True) class SliceDimIndexer: dim_len: int - dim_chunk_len: int nitems: int nchunks: int + dim_grid: DimensionGrid start: int stop: int step: int - def __init__(self, dim_sel: slice, dim_len: int, dim_chunk_len: int) -> None: + def __init__( + self, + dim_sel: slice, + dim_len: int, + dim_grid: DimensionGrid, + ) -> None: # normalize start, stop, step = dim_sel.indices(dim_len) if step < 1: @@ -417,23 +413,25 @@ def __init__(self, dim_sel: slice, dim_len: int, dim_chunk_len: int) -> None: object.__setattr__(self, "step", step) object.__setattr__(self, "dim_len", dim_len) - object.__setattr__(self, "dim_chunk_len", dim_chunk_len) + object.__setattr__(self, "dim_grid", dim_grid) object.__setattr__(self, "nitems", max(0, ceildiv((stop - 
start), step))) - object.__setattr__(self, "nchunks", ceildiv(dim_len, dim_chunk_len)) + object.__setattr__(self, "nchunks", dim_grid.nchunks) def __iter__(self) -> Iterator[ChunkDimProjection]: # figure out the range of chunks we need to visit - dim_chunk_ix_from = 0 if self.start == 0 else self.start // self.dim_chunk_len - dim_chunk_ix_to = ceildiv(self.stop, self.dim_chunk_len) + if self.start >= self.stop: + return # empty slice + g = self.dim_grid + dim_chunk_ix_from = g.index_to_chunk(self.start) if self.start > 0 else 0 + dim_chunk_ix_to = g.index_to_chunk(self.stop - 1) + 1 if self.stop > 0 else 0 # iterate over chunks in range for dim_chunk_ix in range(dim_chunk_ix_from, dim_chunk_ix_to): # compute offsets for chunk within overall array - dim_offset = dim_chunk_ix * self.dim_chunk_len - dim_limit = min(self.dim_len, (dim_chunk_ix + 1) * self.dim_chunk_len) - + dim_offset = g.chunk_offset(dim_chunk_ix) # determine chunk length, accounting for trailing chunk - dim_chunk_len = dim_limit - dim_offset + dim_chunk_len = g.data_size(dim_chunk_ix) + dim_limit = dim_offset + dim_chunk_len if self.start < dim_offset: # selection starts before current chunk @@ -443,7 +441,6 @@ def __iter__(self) -> Iterator[ChunkDimProjection]: dim_chunk_sel_start += self.step - remainder # compute number of previous items, provides offset into output array dim_out_offset = ceildiv((dim_offset - self.start), self.step) - else: # selection starts within current chunk dim_chunk_sel_start = self.start - dim_offset @@ -452,7 +449,6 @@ def __iter__(self) -> Iterator[ChunkDimProjection]: if self.stop > dim_limit: # selection ends after current chunk dim_chunk_sel_stop = dim_chunk_len - else: # selection ends within current chunk dim_chunk_sel_stop = self.stop - dim_offset @@ -465,7 +461,6 @@ def __iter__(self) -> Iterator[ChunkDimProjection]: continue dim_out_sel = slice(dim_out_offset, dim_out_offset + dim_chunk_nitems) - is_complete_chunk = ( dim_chunk_sel_start == 0 and (self.stop >= 
dim_limit) and self.step in [1, None] ) @@ -583,21 +578,19 @@ def __init__( shape: tuple[int, ...], chunk_grid: ChunkGrid, ) -> None: - chunk_shape = get_chunk_shape(chunk_grid) + dim_grids = chunk_grid._dimensions # handle ellipsis selection_normalized = replace_ellipsis(selection, shape) # setup per-dimension indexers dim_indexers: list[IntDimIndexer | SliceDimIndexer] = [] - for dim_sel, dim_len, dim_chunk_len in zip( - selection_normalized, shape, chunk_shape, strict=True - ): + for dim_sel, dim_len, dim_grid in zip(selection_normalized, shape, dim_grids, strict=True): dim_indexer: IntDimIndexer | SliceDimIndexer if is_integer(dim_sel): - dim_indexer = IntDimIndexer(dim_sel, dim_len, dim_chunk_len) + dim_indexer = IntDimIndexer(dim_sel, dim_len, dim_grid) elif is_slice(dim_sel): - dim_indexer = SliceDimIndexer(dim_sel, dim_len, dim_chunk_len) + dim_indexer = SliceDimIndexer(dim_sel, dim_len, dim_grid) else: raise IndexError( @@ -630,7 +623,7 @@ def __iter__(self) -> Iterator[ChunkProjection]: class BoolArrayDimIndexer: dim_sel: npt.NDArray[np.bool_] dim_len: int - dim_chunk_len: int + dim_grid: DimensionGrid nchunks: int chunk_nitems: npt.NDArray[Any] @@ -638,7 +631,12 @@ class BoolArrayDimIndexer: nitems: int dim_chunk_ixs: npt.NDArray[np.intp] - def __init__(self, dim_sel: npt.NDArray[np.bool_], dim_len: int, dim_chunk_len: int) -> None: + def __init__( + self, + dim_sel: npt.NDArray[np.bool_], + dim_len: int, + dim_grid: DimensionGrid, + ) -> None: # check number of dimensions if not is_bool_array(dim_sel, 1): raise IndexError("Boolean arrays in an orthogonal selection must be 1-dimensional only") @@ -649,13 +647,16 @@ def __init__(self, dim_sel: npt.NDArray[np.bool_], dim_len: int, dim_chunk_len: f"Boolean array has the wrong length for dimension; expected {dim_len}, got {dim_sel.shape[0]}" ) + g = dim_grid + nchunks = g.nchunks + # precompute number of selected items for each chunk - nchunks = ceildiv(dim_len, dim_chunk_len) chunk_nitems = 
np.zeros(nchunks, dtype="i8") for dim_chunk_ix in range(nchunks): - dim_offset = dim_chunk_ix * dim_chunk_len + dim_offset = g.chunk_offset(dim_chunk_ix) + chunk_len = g.data_size(dim_chunk_ix) chunk_nitems[dim_chunk_ix] = np.count_nonzero( - dim_sel[dim_offset : dim_offset + dim_chunk_len] + dim_sel[dim_offset : dim_offset + chunk_len] ) chunk_nitems_cumsum = np.cumsum(chunk_nitems) nitems = chunk_nitems_cumsum[-1] @@ -664,7 +665,7 @@ def __init__(self, dim_sel: npt.NDArray[np.bool_], dim_len: int, dim_chunk_len: # store attributes object.__setattr__(self, "dim_sel", dim_sel) object.__setattr__(self, "dim_len", dim_len) - object.__setattr__(self, "dim_chunk_len", dim_chunk_len) + object.__setattr__(self, "dim_grid", dim_grid) object.__setattr__(self, "nchunks", nchunks) object.__setattr__(self, "chunk_nitems", chunk_nitems) object.__setattr__(self, "chunk_nitems_cumsum", chunk_nitems_cumsum) @@ -672,15 +673,19 @@ def __init__(self, dim_sel: npt.NDArray[np.bool_], dim_len: int, dim_chunk_len: object.__setattr__(self, "dim_chunk_ixs", dim_chunk_ixs) def __iter__(self) -> Iterator[ChunkDimProjection]: + g = self.dim_grid + # iterate over chunks with at least one item for dim_chunk_ix in self.dim_chunk_ixs: # find region in chunk - dim_offset = dim_chunk_ix * self.dim_chunk_len - dim_chunk_sel = self.dim_sel[dim_offset : dim_offset + self.dim_chunk_len] - - # pad out if final chunk - if dim_chunk_sel.shape[0] < self.dim_chunk_len: - tmp = np.zeros(self.dim_chunk_len, dtype=bool) + dim_offset = g.chunk_offset(dim_chunk_ix) + chunk_len = g.data_size(dim_chunk_ix) + dim_chunk_sel = self.dim_sel[dim_offset : dim_offset + chunk_len] + + # pad out if boundary chunk (codec buffer may be larger than valid data region) + codec_size = g.chunk_size(dim_chunk_ix) + if dim_chunk_sel.shape[0] < codec_size: + tmp = np.zeros(codec_size, dtype=bool) tmp[: dim_chunk_sel.shape[0]] = dim_chunk_sel dim_chunk_sel = tmp @@ -739,7 +744,7 @@ class IntArrayDimIndexer: """Integer array 
selection against a single dimension.""" dim_len: int - dim_chunk_len: int + dim_grid: DimensionGrid nchunks: int nitems: int order: Order @@ -753,7 +758,7 @@ def __init__( self, dim_sel: npt.NDArray[np.intp], dim_len: int, - dim_chunk_len: int, + dim_grid: DimensionGrid, wraparound: bool = True, boundscheck: bool = True, order: Order = Order.UNKNOWN, @@ -764,7 +769,8 @@ def __init__( raise IndexError("integer arrays in an orthogonal selection must be 1-dimensional only") nitems = len(dim_sel) - nchunks = ceildiv(dim_len, dim_chunk_len) + g = dim_grid + nchunks = g.nchunks # handle wraparound if wraparound: @@ -777,7 +783,7 @@ def __init__( # determine which chunk is needed for each selection item # note: for dense integer selections, the division operation here is the # bottleneck - dim_sel_chunk = dim_sel // dim_chunk_len + dim_sel_chunk = g.indices_to_chunks(dim_sel) # determine order of indices if order == Order.UNKNOWN: @@ -806,7 +812,7 @@ def __init__( # store attributes object.__setattr__(self, "dim_len", dim_len) - object.__setattr__(self, "dim_chunk_len", dim_chunk_len) + object.__setattr__(self, "dim_grid", dim_grid) object.__setattr__(self, "nchunks", nchunks) object.__setattr__(self, "nitems", nitems) object.__setattr__(self, "order", order) @@ -817,6 +823,8 @@ def __init__( object.__setattr__(self, "chunk_nitems_cumsum", chunk_nitems_cumsum) def __iter__(self) -> Iterator[ChunkDimProjection]: + g = self.dim_grid + for dim_chunk_ix in self.dim_chunk_ixs: dim_out_sel: slice | npt.NDArray[np.intp] # find region in output @@ -831,7 +839,7 @@ def __iter__(self) -> Iterator[ChunkDimProjection]: dim_out_sel = self.dim_out_sel[start:stop] # find region in chunk - dim_offset = dim_chunk_ix * self.dim_chunk_len + dim_offset = g.chunk_offset(dim_chunk_ix) dim_chunk_sel = self.dim_sel[start:stop] - dim_offset is_complete_chunk = False # TODO yield ChunkDimProjection(dim_chunk_ix, dim_chunk_sel, dim_out_sel, is_complete_chunk) @@ -891,13 +899,13 @@ def 
oindex_set(a: npt.NDArray[Any], selection: Selection, value: Any) -> None: @dataclass(frozen=True) class OrthogonalIndexer(Indexer): dim_indexers: list[IntDimIndexer | SliceDimIndexer | IntArrayDimIndexer | BoolArrayDimIndexer] + dim_grids: tuple[DimensionGrid, ...] shape: tuple[int, ...] - chunk_shape: tuple[int, ...] is_advanced: bool drop_axes: tuple[int, ...] def __init__(self, selection: Selection, shape: tuple[int, ...], chunk_grid: ChunkGrid) -> None: - chunk_shape = get_chunk_shape(chunk_grid) + dim_grids = chunk_grid._dimensions # handle ellipsis selection = replace_ellipsis(selection, shape) @@ -909,19 +917,19 @@ def __init__(self, selection: Selection, shape: tuple[int, ...], chunk_grid: Chu dim_indexers: list[ IntDimIndexer | SliceDimIndexer | IntArrayDimIndexer | BoolArrayDimIndexer ] = [] - for dim_sel, dim_len, dim_chunk_len in zip(selection, shape, chunk_shape, strict=True): + for dim_sel, dim_len, dim_grid in zip(selection, shape, dim_grids, strict=True): dim_indexer: IntDimIndexer | SliceDimIndexer | IntArrayDimIndexer | BoolArrayDimIndexer if is_integer(dim_sel): - dim_indexer = IntDimIndexer(dim_sel, dim_len, dim_chunk_len) + dim_indexer = IntDimIndexer(dim_sel, dim_len, dim_grid) elif isinstance(dim_sel, slice): - dim_indexer = SliceDimIndexer(dim_sel, dim_len, dim_chunk_len) + dim_indexer = SliceDimIndexer(dim_sel, dim_len, dim_grid) elif is_integer_array(dim_sel): - dim_indexer = IntArrayDimIndexer(dim_sel, dim_len, dim_chunk_len) + dim_indexer = IntArrayDimIndexer(dim_sel, dim_len, dim_grid) elif is_bool_array(dim_sel): - dim_indexer = BoolArrayDimIndexer(dim_sel, dim_len, dim_chunk_len) + dim_indexer = BoolArrayDimIndexer(dim_sel, dim_len, dim_grid) else: raise IndexError( @@ -944,8 +952,8 @@ def __init__(self, selection: Selection, shape: tuple[int, ...], chunk_grid: Chu drop_axes = () object.__setattr__(self, "dim_indexers", dim_indexers) + object.__setattr__(self, "dim_grids", dim_grids) object.__setattr__(self, "shape", shape) - 
object.__setattr__(self, "chunk_shape", chunk_shape) object.__setattr__(self, "is_advanced", is_advanced) object.__setattr__(self, "drop_axes", drop_axes) @@ -972,7 +980,11 @@ def __iter__(self) -> Iterator[ChunkProjection]: # N.B., numpy doesn't support orthogonal indexing directly # for multiple array-indexed dimensions, so we need to # convert the orthogonal selection into coordinate arrays. - chunk_selection = ix_(chunk_selection, self.chunk_shape) + chunk_shape = tuple( + g.chunk_size(p.dim_chunk_ix) + for g, p in zip(self.dim_grids, dim_projections, strict=True) + ) + chunk_selection = ix_(chunk_selection, chunk_shape) # special case for non-monotonic indices if not is_basic_selection(out_selection): @@ -1038,7 +1050,7 @@ class BlockIndexer(Indexer): def __init__( self, selection: BasicSelection, shape: tuple[int, ...], chunk_grid: ChunkGrid ) -> None: - chunk_shape = get_chunk_shape(chunk_grid) + dim_grids = chunk_grid._dimensions # handle ellipsis selection_normalized = replace_ellipsis(selection, shape) @@ -1048,17 +1060,20 @@ def __init__( # setup per-dimension indexers dim_indexers = [] - for dim_sel, dim_len, dim_chunk_size in zip( - selection_normalized, shape, chunk_shape, strict=True - ): - dim_numchunks = int(np.ceil(dim_len / dim_chunk_size)) + for dim_sel, dim_len, dim_grid in zip(selection_normalized, shape, dim_grids, strict=True): + dim_numchunks = dim_grid.nchunks if is_integer(dim_sel): if dim_sel < 0: dim_sel = dim_numchunks + dim_sel - start = dim_sel * dim_chunk_size - stop = start + dim_chunk_size + if dim_sel < 0 or dim_sel >= dim_numchunks: + raise BoundsCheckError( + f"block index out of bounds for dimension with {dim_numchunks} chunk(s)" + ) + + start = dim_grid.chunk_offset(dim_sel) + stop = start + dim_grid.chunk_size(dim_sel) slice_ = slice(start, stop) elif is_slice(dim_sel): @@ -1078,8 +1093,8 @@ def __init__( if stop < 0: stop = dim_numchunks + stop - start *= dim_chunk_size - stop *= dim_chunk_size + start = 
dim_grid.chunk_offset(start) if start < dim_numchunks else dim_len + stop = dim_grid.chunk_offset(stop) if stop < dim_numchunks else dim_len slice_ = slice(start, stop) else: @@ -1088,10 +1103,10 @@ def __init__( f"expected integer or slice, got {type(dim_sel)!r}" ) - dim_indexer = SliceDimIndexer(slice_, dim_len, dim_chunk_size) + dim_indexer = SliceDimIndexer(slice_, dim_len, dim_grid) dim_indexers.append(dim_indexer) - if start >= dim_len or start < 0: + if slice_.start >= dim_len or slice_.start < 0: msg = f"index out of bounds for dimension with length {dim_len}" raise BoundsCheckError(msg) @@ -1159,19 +1174,19 @@ class CoordinateIndexer(Indexer): chunk_rixs: npt.NDArray[np.intp] chunk_mixs: tuple[npt.NDArray[np.intp], ...] shape: tuple[int, ...] - chunk_shape: tuple[int, ...] + dim_grids: tuple[DimensionGrid, ...] drop_axes: tuple[int, ...] def __init__( self, selection: CoordinateSelection, shape: tuple[int, ...], chunk_grid: ChunkGrid ) -> None: - chunk_shape = get_chunk_shape(chunk_grid) + dim_grids = chunk_grid._dimensions cdata_shape: tuple[int, ...] 
if shape == (): cdata_shape = (1,) else: - cdata_shape = tuple(math.ceil(s / c) for s, c in zip(shape, chunk_shape, strict=True)) + cdata_shape = tuple(g.nchunks for g in dim_grids) nchunks = reduce(operator.mul, cdata_shape, 1) # some initial normalization @@ -1201,8 +1216,8 @@ def __init__( # compute chunk index for each point in the selection chunks_multi_index = tuple( - dim_sel // dim_chunk_len - for (dim_sel, dim_chunk_len) in zip(selection_normalized, chunk_shape, strict=True) + g.indices_to_chunks(dim_sel) + for (dim_sel, g) in zip(selection_normalized, dim_grids, strict=True) ) # broadcast selection - this will raise error if array dimensions don't match @@ -1248,7 +1263,7 @@ def __init__( object.__setattr__(self, "chunk_nitems_cumsum", chunk_nitems_cumsum) object.__setattr__(self, "chunk_rixs", chunk_rixs) object.__setattr__(self, "chunk_mixs", chunk_mixs) - object.__setattr__(self, "chunk_shape", chunk_shape) + object.__setattr__(self, "dim_grids", dim_grids) object.__setattr__(self, "shape", shape) object.__setattr__(self, "drop_axes", ()) @@ -1268,8 +1283,8 @@ def __iter__(self) -> Iterator[ChunkProjection]: out_selection = self.sel_sort[start:stop] chunk_offsets = tuple( - dim_chunk_ix * dim_chunk_len - for dim_chunk_ix, dim_chunk_len in zip(chunk_coords, self.chunk_shape, strict=True) + g.chunk_offset(dim_chunk_ix) + for dim_chunk_ix, g in zip(chunk_coords, self.dim_grids, strict=True) ) chunk_selection = tuple( dim_sel[start:stop] - dim_chunk_offset diff --git a/src/zarr/core/metadata/v2.py b/src/zarr/core/metadata/v2.py index f0781e1313..8626d480a7 100644 --- a/src/zarr/core/metadata/v2.py +++ b/src/zarr/core/metadata/v2.py @@ -7,7 +7,6 @@ from zarr.abc.metadata import Metadata from zarr.abc.numcodec import Numcodec, _is_numcodec -from zarr.core.chunk_grids import RegularChunkGrid from zarr.core.dtype import get_data_type_from_json from zarr.core.dtype.common import OBJECT_CODEC_IDS, DTypeSpec_V2 from zarr.errors import ZarrUserWarning @@ -19,6 
+18,7 @@ import numpy.typing as npt from zarr.core.buffer import Buffer, BufferPrototype + from zarr.core.chunk_grids import ChunkGrid from zarr.core.dtype.wrapper import ( TBaseDType, TBaseScalar, @@ -116,8 +116,22 @@ def ndim(self) -> int: return len(self.shape) @cached_property - def chunk_grid(self) -> RegularChunkGrid: - return RegularChunkGrid(chunk_shape=self.chunks) + def chunk_grid(self) -> ChunkGrid: + """Backwards-compatible chunk grid property. + + .. deprecated:: + Access the chunk grid via the array layer instead. + This property will be removed in a future release. + """ + from zarr.core.chunk_grids import ChunkGrid + + warnings.warn( + "ArrayV2Metadata.chunk_grid is deprecated. " + "Use ChunkGrid.from_metadata(metadata) instead.", + DeprecationWarning, + stacklevel=2, + ) + return ChunkGrid.from_sizes(self.shape, tuple(self.chunks)) @property def shards(self) -> tuple[int, ...] | None: diff --git a/src/zarr/core/metadata/v3.py b/src/zarr/core/metadata/v3.py index 2a5da50c7b..1d9018c856 100644 --- a/src/zarr/core/metadata/v3.py +++ b/src/zarr/core/metadata/v3.py @@ -1,30 +1,14 @@ from __future__ import annotations -from collections.abc import Mapping -from typing import TYPE_CHECKING, NotRequired, TypedDict, TypeGuard, cast - -from zarr.abc.metadata import Metadata -from zarr.core.buffer.core import default_buffer_prototype -from zarr.core.dtype import VariableLengthUTF8, ZDType, get_data_type_from_json -from zarr.core.dtype.common import check_dtype_spec_v3 - -if TYPE_CHECKING: - from typing import Self - - from zarr.core.buffer import Buffer, BufferPrototype - from zarr.core.chunk_grids import ChunkGrid - from zarr.core.common import JSON - from zarr.core.dtype.wrapper import TBaseDType, TBaseScalar - - import json -from collections.abc import Iterable +from collections.abc import Iterable, Mapping, Sequence from dataclasses import dataclass, field, replace -from typing import Any, Literal +from typing import TYPE_CHECKING, Any, Literal, 
NotRequired, TypedDict, TypeGuard, cast from zarr.abc.codec import ArrayArrayCodec, ArrayBytesCodec, BytesBytesCodec, Codec +from zarr.abc.metadata import Metadata from zarr.core.array_spec import ArrayConfig, ArraySpec -from zarr.core.chunk_grids import ChunkGrid, RegularChunkGrid +from zarr.core.buffer.core import default_buffer_prototype from zarr.core.chunk_key_encodings import ( ChunkKeyEncoding, ChunkKeyEncodingLike, @@ -33,16 +17,30 @@ from zarr.core.common import ( JSON, ZARR_JSON, + ChunksLike, DimensionNamesLike, NamedConfig, + NamedRequiredConfig, + compress_rle, + expand_rle, parse_named_configuration, parse_shapelike, + validate_rectilinear_edges, + validate_rectilinear_kind, ) from zarr.core.config import config +from zarr.core.dtype import VariableLengthUTF8, ZDType, get_data_type_from_json +from zarr.core.dtype.common import check_dtype_spec_v3 from zarr.core.metadata.common import parse_attributes from zarr.errors import MetadataValidationError, NodeTypeValidationError, UnknownCodecError from zarr.registry import get_codec_class +if TYPE_CHECKING: + from typing import Self + + from zarr.core.buffer import Buffer, BufferPrototype + from zarr.core.dtype.wrapper import TBaseDType, TBaseScalar + def parse_zarr_format(data: object) -> Literal[3]: if data == 3: @@ -174,6 +172,243 @@ def parse_extra_fields( return dict(data) +# JSON type for a single dimension's rectilinear spec: +# bare int (uniform shorthand), or list of ints / [value, count] RLE pairs. +RectilinearDimSpecJSON = int | list[int | list[int]] + + +class RegularChunkGridConfig(TypedDict): + chunk_shape: tuple[int, ...] + + +class RectilinearChunkGridConfig(TypedDict): + kind: Literal["inline"] + chunk_shapes: tuple[RectilinearDimSpecJSON, ...] 
+ + +RegularChunkGridJSON = NamedRequiredConfig[Literal["regular"], RegularChunkGridConfig] +RectilinearChunkGridJSON = NamedRequiredConfig[Literal["rectilinear"], RectilinearChunkGridConfig] + +ChunkGridJSON = RegularChunkGridJSON | RectilinearChunkGridJSON + + +def _parse_chunk_shape(chunk_shape: Iterable[int]) -> tuple[int, ...]: + """Validate and normalize a regular chunk shape. + + Delegates to ``_validate_chunk_shapes`` — a regular chunk shape is just + a sequence of bare ints (one per dimension), each of which must be >= 1. + """ + result = _validate_chunk_shapes(tuple(chunk_shape)) + # Regular grids only have bare ints — cast is safe after validation + return cast(tuple[int, ...], result) + + +def _validate_chunk_shapes( + chunk_shapes: Sequence[int | Sequence[int]], +) -> tuple[int | tuple[int, ...], ...]: + """Validate per-dimension chunk specifications. + + Each element is either a bare ``int`` (regular step size, must be >= 1) + or a sequence of explicit edge lengths (all must be >= 1, non-empty). + """ + result: list[int | tuple[int, ...]] = [] + for dim_idx, dim_spec in enumerate(chunk_shapes): + if isinstance(dim_spec, int): + if dim_spec < 1: + raise ValueError( + f"Dimension {dim_idx}: integer chunk edge length must be >= 1, got {dim_spec}" + ) + result.append(dim_spec) + else: + edges = tuple(dim_spec) + if not edges: + raise ValueError(f"Dimension {dim_idx} has no chunk edges.") + bad = [i for i, e in enumerate(edges) if e < 1] + if bad: + raise ValueError( + f"Dimension {dim_idx} has invalid edge lengths at indices {bad}: " + f"{[edges[i] for i in bad]}" + ) + result.append(edges) + return tuple(result) + + +@dataclass(frozen=True, kw_only=True) +class RegularChunkGrid(Metadata): + """Metadata-only description of a regular chunk grid. + + Stores just the chunk shape — no array extent, no behavioral logic. + This is what lives on ``ArrayV3Metadata.chunk_grid``. + """ + + chunk_shape: tuple[int, ...] 
+ + def __post_init__(self) -> None: + chunk_shape_parsed = _parse_chunk_shape(self.chunk_shape) + object.__setattr__(self, "chunk_shape", chunk_shape_parsed) + + @property + def ndim(self) -> int: + return len(self.chunk_shape) + + def to_dict(self) -> RegularChunkGridJSON: # type: ignore[override] + return { + "name": "regular", + "configuration": {"chunk_shape": self.chunk_shape}, + } + + @classmethod + def from_dict(cls, data: RegularChunkGridJSON) -> Self: # type: ignore[override] + parse_named_configuration(data, "regular") # validate name + configuration = data["configuration"] + return cls(chunk_shape=_parse_chunk_shape(configuration["chunk_shape"])) + + +@dataclass(frozen=True, kw_only=True) +class RectilinearChunkGrid(Metadata): + """Metadata-only description of a rectilinear chunk grid. + + Each element of ``chunk_shapes`` is either: + + - A bare ``int`` — a regular step size that repeats to cover the axis + (the spec's single-integer shorthand). + - A ``tuple[int, ...]`` — explicit per-chunk edge lengths (already + expanded from any RLE encoding). + + This distinction matters for faithful round-tripping: a bare int + serializes back as a bare int, while a single-element tuple serializes + as a list. + """ + + chunk_shapes: tuple[int | tuple[int, ...], ...] + + def __post_init__(self) -> None: + from zarr.core.config import config + + if not config.get("array.rectilinear_chunks"): + raise ValueError( + "Rectilinear chunk grids are experimental and disabled by default. 
" + "Enable them with: zarr.config.set({'array.rectilinear_chunks': True}) " + "or set the environment variable ZARR_ARRAY__RECTILINEAR_CHUNKS=True" + ) + object.__setattr__(self, "chunk_shapes", _validate_chunk_shapes(self.chunk_shapes)) + + @property + def ndim(self) -> int: + return len(self.chunk_shapes) + + def to_dict(self) -> RectilinearChunkGridJSON: # type: ignore[override] + serialized_dims: list[RectilinearDimSpecJSON] = [] + for dim_spec in self.chunk_shapes: + if isinstance(dim_spec, int): + # Bare int shorthand — serialize as-is + serialized_dims.append(dim_spec) + else: + rle = compress_rle(dim_spec) + # Use RLE only if it's actually shorter + if len(rle) < len(dim_spec): + serialized_dims.append(rle) + else: + serialized_dims.append(list(dim_spec)) + return { + "name": "rectilinear", + "configuration": { + "kind": "inline", + "chunk_shapes": tuple(serialized_dims), + }, + } + + def update_shape( + self, old_shape: tuple[int, ...], new_shape: tuple[int, ...] + ) -> RectilinearChunkGrid: + """Return a new RectilinearChunkGrid with edges adjusted for *new_shape*. + + - Bare-int dimensions stay as bare ints (they cover any extent). + - Explicit-edge dimensions: if the new extent exceeds the sum of + edges, a new chunk is appended to cover the additional extent. + Otherwise edges are kept as-is (the spec allows trailing edges + beyond the array extent). 
+ """ + new_chunk_shapes: list[int | tuple[int, ...]] = [] + for dim_spec, new_ext in zip(self.chunk_shapes, new_shape, strict=True): + if isinstance(dim_spec, int): + # Bare int covers any extent — no change needed + new_chunk_shapes.append(dim_spec) + else: + edge_sum = sum(dim_spec) + if new_ext > edge_sum: + new_chunk_shapes.append((*dim_spec, new_ext - edge_sum)) + else: + new_chunk_shapes.append(dim_spec) + return RectilinearChunkGrid(chunk_shapes=tuple(new_chunk_shapes)) + + @classmethod + def from_dict(cls, data: RectilinearChunkGridJSON) -> Self: # type: ignore[override] + parse_named_configuration(data, "rectilinear") # validate name + configuration = data["configuration"] + validate_rectilinear_kind(configuration.get("kind")) + raw_shapes = configuration["chunk_shapes"] + parsed: list[int | tuple[int, ...]] = [] + for dim_spec in raw_shapes: + if isinstance(dim_spec, int): + if dim_spec < 1: + raise ValueError(f"Integer chunk edge length must be >= 1, got {dim_spec}") + parsed.append(dim_spec) + elif isinstance(dim_spec, list): + parsed.append(tuple(expand_rle(dim_spec))) + else: + raise TypeError( + f"Invalid chunk_shapes entry: expected int or list, got {type(dim_spec)}" + ) + return cls(chunk_shapes=tuple(parsed)) + + +ChunkGridMetadata = RegularChunkGrid | RectilinearChunkGrid + + +def resolve_chunks( + chunks: ChunksLike, + shape: tuple[int, ...], + typesize: int, +) -> ChunkGridMetadata: + """Construct a chunk grid from user-facing input (e.g. ``create_array(chunks=...)``). + + Nested sequences like ``[[10, 20], [5, 5]]`` produce a ``RectilinearChunkGrid``. + Flat inputs like ``(10, 10)`` or a scalar ``int`` produce a ``RegularChunkGrid`` + after normalization via :func:`~zarr.core.chunk_grids.normalize_chunks`. + + See Also + -------- + parse_chunk_grid : Deserialize a chunk grid from stored JSON metadata. 
+ """ + from zarr.core.chunk_grids import _is_rectilinear_chunks, normalize_chunks + + if _is_rectilinear_chunks(chunks): + return RectilinearChunkGrid(chunk_shapes=tuple(tuple(c) for c in chunks)) + + return RegularChunkGrid(chunk_shape=normalize_chunks(chunks, shape, typesize)) + + +def parse_chunk_grid( + data: dict[str, JSON] | ChunkGridMetadata | NamedConfig[str, Any], +) -> ChunkGridMetadata: + """Deserialize a chunk grid from stored JSON metadata or pass through an existing instance. + + See Also + -------- + resolve_chunks : Construct a chunk grid from user-facing input. + """ + if isinstance(data, (RegularChunkGrid, RectilinearChunkGrid)): + return data + + name, _ = parse_named_configuration(data) + if name == "regular": + return RegularChunkGrid.from_dict(data) # type: ignore[arg-type] + if name == "rectilinear": + return RectilinearChunkGrid.from_dict(data) # type: ignore[arg-type] + raise ValueError(f"Unknown chunk grid name: {name!r}") + + class ArrayMetadataJSON_V3(TypedDict): """ A typed dictionary model for zarr v3 metadata. @@ -199,7 +434,7 @@ class ArrayMetadataJSON_V3(TypedDict): class ArrayV3Metadata(Metadata): shape: tuple[int, ...] data_type: ZDType[TBaseDType, TBaseScalar] - chunk_grid: ChunkGrid + chunk_grid: ChunkGridMetadata chunk_key_encoding: ChunkKeyEncoding fill_value: Any codecs: tuple[Codec, ...] 
@@ -215,7 +450,7 @@ def __init__(
         *,
         shape: Iterable[int],
         data_type: ZDType[TBaseDType, TBaseScalar],
-        chunk_grid: dict[str, JSON] | ChunkGrid | NamedConfig[str, Any],
+        chunk_grid: dict[str, JSON] | ChunkGridMetadata | NamedConfig[str, Any],
         chunk_key_encoding: ChunkKeyEncodingLike,
         fill_value: object,
         codecs: Iterable[Codec | dict[str, JSON] | NamedConfig[str, Any] | str],
@@ -229,7 +464,7 @@
         """
         shape_parsed = parse_shapelike(shape)
-        chunk_grid_parsed = ChunkGrid.from_dict(chunk_grid)
+        chunk_grid_parsed = parse_chunk_grid(chunk_grid)
         chunk_key_encoding_parsed = parse_chunk_key_encoding(chunk_key_encoding)
         dimension_names_parsed = parse_dimension_names(dimension_names)
         # Note: relying on a type method is numpy-specific
@@ -262,12 +497,10 @@ def __init__(
         self._validate_metadata()

     def _validate_metadata(self) -> None:
-        if isinstance(self.chunk_grid, RegularChunkGrid) and len(self.shape) != len(
-            self.chunk_grid.chunk_shape
-        ):
-            raise ValueError(
-                "`chunk_shape` and `shape` need to have the same number of dimensions."
-            )
+        if len(self.shape) != self.chunk_grid.ndim:
+            raise ValueError("`chunk_grid` and `shape` need to have the same number of dimensions.")
+        if isinstance(self.chunk_grid, RectilinearChunkGrid):
+            validate_rectilinear_edges(self.chunk_grid.chunk_shapes, self.shape)
         if self.dimension_names is not None and len(self.shape) != len(self.dimension_names):
             raise ValueError(
                 "`dimension_names` and `shape` need to have the same number of dimensions."
@@ -285,63 +518,46 @@ def ndim(self) -> int:
     def dtype(self) -> ZDType[TBaseDType, TBaseScalar]:
         return self.data_type

+    # TODO: move these behavioral properties to the Array class.
+    # They require knowledge of codecs (ShardingCodec) and don't belong on a metadata DTO.
+
     @property
     def chunks(self) -> tuple[int, ...]:
-        if isinstance(self.chunk_grid, RegularChunkGrid):
-            from zarr.codecs.sharding import ShardingCodec
+        if not isinstance(self.chunk_grid, RegularChunkGrid):
+            msg = (
+                "The `chunks` attribute is only defined for arrays using regular chunk grids. "
+                "This array has a rectilinear chunk grid. Use `read_chunk_sizes` for general access."
+            )
+            raise NotImplementedError(msg)

-            if len(self.codecs) == 1 and isinstance(self.codecs[0], ShardingCodec):
-                sharding_codec = self.codecs[0]
-                assert isinstance(sharding_codec, ShardingCodec)  # for mypy
-                return sharding_codec.chunk_shape
-            else:
-                return self.chunk_grid.chunk_shape
+        from zarr.codecs.sharding import ShardingCodec

-        msg = (
-            f"The `chunks` attribute is only defined for arrays using `RegularChunkGrid`."
-            f"This array has a {self.chunk_grid} instead."
-        )
-        raise NotImplementedError(msg)
+        if len(self.codecs) == 1 and isinstance(self.codecs[0], ShardingCodec):
+            return self.codecs[0].chunk_shape
+        return self.chunk_grid.chunk_shape

     @property
     def shards(self) -> tuple[int, ...] | None:
-        if isinstance(self.chunk_grid, RegularChunkGrid):
-            from zarr.codecs.sharding import ShardingCodec
-
-            if len(self.codecs) == 1 and isinstance(self.codecs[0], ShardingCodec):
-                return self.chunk_grid.chunk_shape
-            else:
-                return None
-
-        msg = (
-            f"The `shards` attribute is only defined for arrays using `RegularChunkGrid`."
-            f"This array has a {self.chunk_grid} instead."
-        )
-        raise NotImplementedError(msg)
+        from zarr.codecs.sharding import ShardingCodec
+
+        if len(self.codecs) == 1 and isinstance(self.codecs[0], ShardingCodec):
+            if not isinstance(self.chunk_grid, RegularChunkGrid):
+                msg = (
+                    "The `shards` attribute is only defined for arrays using regular chunk grids. "
+                    "This array has a rectilinear chunk grid. Use `write_chunk_sizes` for general access."
+                )
+                raise NotImplementedError(msg)
+            return self.chunk_grid.chunk_shape
+        return None

     @property
     def inner_codecs(self) -> tuple[Codec, ...]:
-        if isinstance(self.chunk_grid, RegularChunkGrid):
-            from zarr.codecs.sharding import ShardingCodec
+        from zarr.codecs.sharding import ShardingCodec

-            if len(self.codecs) == 1 and isinstance(self.codecs[0], ShardingCodec):
-                return self.codecs[0].codecs
+        if len(self.codecs) == 1 and isinstance(self.codecs[0], ShardingCodec):
+            return self.codecs[0].codecs
         return self.codecs

-    def get_chunk_spec(
-        self, _chunk_coords: tuple[int, ...], array_config: ArrayConfig, prototype: BufferPrototype
-    ) -> ArraySpec:
-        assert isinstance(self.chunk_grid, RegularChunkGrid), (
-            "Currently, only regular chunk grid is supported"
-        )
-        return ArraySpec(
-            shape=self.chunk_grid.chunk_shape,
-            dtype=self.dtype,
-            fill_value=self.fill_value,
-            config=array_config,
-            prototype=prototype,
-        )
-
     def encode_chunk_key(self, chunk_coords: tuple[int, ...]) -> str:
         return self.chunk_key_encoding.encode_chunk_key(chunk_coords)

@@ -415,6 +631,8 @@ def to_dict(self) -> dict[str, JSON]:
         extra_fields = out_dict.pop("extra_fields")
         out_dict = out_dict | extra_fields  # type: ignore[operator]

+        out_dict["chunk_grid"] = self.chunk_grid.to_dict()
+
         out_dict["fill_value"] = self.data_type.to_json_scalar(
             self.fill_value, zarr_format=self.zarr_format
         )
@@ -436,7 +654,10 @@ def to_dict(self) -> dict[str, JSON]:
         return out_dict

     def update_shape(self, shape: tuple[int, ...]) -> Self:
-        return replace(self, shape=shape)
+        chunk_grid = self.chunk_grid
+        if isinstance(chunk_grid, RectilinearChunkGrid):
+            chunk_grid = chunk_grid.update_shape(self.shape, shape)
+        return replace(self, shape=shape, chunk_grid=chunk_grid)

     def update_attributes(self, attributes: dict[str, JSON]) -> Self:
         return replace(self, attributes=attributes)
diff --git a/src/zarr/experimental/__init__.py b/src/zarr/experimental/__init__.py
index 3863510c65..f7caaf96a1 100644
--- a/src/zarr/experimental/__init__.py
+++ b/src/zarr/experimental/__init__.py
@@ -1 +1,5 @@
 """The experimental module is a site for exporting new or experimental Zarr features."""
+
+from zarr.core.chunk_grids import ChunkGrid, ChunkSpec
+
+__all__ = ["ChunkGrid", "ChunkSpec"]
diff --git a/src/zarr/metadata/migrate_v3.py b/src/zarr/metadata/migrate_v3.py
index a72939100d..80c50585be 100644
--- a/src/zarr/metadata/migrate_v3.py
+++ b/src/zarr/metadata/migrate_v3.py
@@ -27,7 +27,7 @@
 from zarr.core.dtype.wrapper import TBaseDType, TBaseScalar, ZDType
 from zarr.core.group import GroupMetadata
 from zarr.core.metadata.v2 import ArrayV2Metadata
-from zarr.core.metadata.v3 import ArrayV3Metadata
+from zarr.core.metadata.v3 import ArrayV3Metadata, RegularChunkGrid
 from zarr.core.sync import sync
 from zarr.registry import get_codec_class
 from zarr.storage import StorePath
@@ -211,7 +211,7 @@ def _convert_array_metadata(metadata_v2: ArrayV2Metadata) -> ArrayV3Metadata:
     return ArrayV3Metadata(
         shape=metadata_v2.shape,
         data_type=metadata_v2.dtype,
-        chunk_grid=metadata_v2.chunk_grid,
+        chunk_grid=RegularChunkGrid(chunk_shape=metadata_v2.chunks),
         chunk_key_encoding=chunk_key_encoding,
         fill_value=metadata_v2.fill_value,
         codecs=codecs,
diff --git a/src/zarr/testing/strategies.py b/src/zarr/testing/strategies.py
index 330f220b56..3a0cc58df0 100644
--- a/src/zarr/testing/strategies.py
+++ b/src/zarr/testing/strategies.py
@@ -14,11 +14,11 @@
 from zarr.abc.store import RangeByteRequest, Store
 from zarr.codecs.bytes import BytesCodec
 from zarr.core.array import Array
-from zarr.core.chunk_grids import RegularChunkGrid
 from zarr.core.chunk_key_encodings import DefaultChunkKeyEncoding
 from zarr.core.common import JSON, ZarrFormat
 from zarr.core.dtype import get_data_type_from_native_dtype
 from zarr.core.metadata import ArrayV2Metadata, ArrayV3Metadata
+from zarr.core.metadata.v3 import RegularChunkGrid
 from zarr.core.sync import sync
 from zarr.storage import MemoryStore, StoreLike
 from zarr.storage._common import _dereference_path
@@ -140,7 +140,7 @@ def array_metadata(
     # separator = draw(st.sampled_from(['/', '\\']))
     shape = draw(array_shapes())
     ndim = len(shape)
-    chunk_shape = draw(array_shapes(min_dims=ndim, max_dims=ndim))
+    chunk_shape = draw(array_shapes(min_dims=ndim, max_dims=ndim, min_side=1))
     np_dtype = draw(dtypes())
     dtype = get_data_type_from_native_dtype(np_dtype)
     fill_value = draw(npst.from_dtype(np_dtype))
@@ -194,11 +194,17 @@ def chunk_shapes(draw: st.DrawFn, *, shape: tuple[int, ...]) -> tuple[int, ...]:
     # We want this strategy to shrink towards arrays with smaller number of chunks
     # 1. st.integers() shrinks towards smaller values. So we use that to generate number of chunks
     numchunks = draw(
-        st.tuples(*[st.integers(min_value=0 if size == 0 else 1, max_value=size) for size in shape])
+        st.tuples(
+            *[
+                st.integers(min_value=0 if size == 0 else 1, max_value=max(size, 1))
+                for size in shape
+            ]
+        )
     )
     # 2. and now generate the chunks tuple
+    # Chunk sizes must be >= 1 per spec; for zero-extent dimensions use 1.
     chunks = tuple(
-        size // nchunks if nchunks > 0 else 0
+        max(1, size // nchunks) if nchunks > 0 else 1
         for size, nchunks in zip(shape, numchunks, strict=True)
     )

@@ -228,7 +234,7 @@ def np_array_and_chunks(
     draw: st.DrawFn,
     *,
     arrays: st.SearchStrategy[npt.NDArray[Any]] = numpy_arrays(),  # noqa: B008
-) -> tuple[np.ndarray, tuple[int, ...]]:  # type: ignore[type-arg]
+) -> tuple[np.ndarray[Any, Any], tuple[int, ...]]:
     """A hypothesis strategy to generate small sized random arrays.

     Returns: a tuple of the array and a suitable random chunking for it.
@@ -260,14 +266,14 @@ def arrays(
     nparray = draw(arrays, label="array data")
     chunk_shape = draw(chunk_shapes(shape=nparray.shape), label="chunk shape")
     dim_names: None | list[str | None] = None
-    if zarr_format == 3 and all(c > 0 for c in chunk_shape):
-        shard_shape = draw(
-            st.none() | shard_shapes(shape=nparray.shape, chunk_shape=chunk_shape),
-            label="shard shape",
-        )
+    shard_shape = None
+    if zarr_format == 3:
         dim_names = draw(dimension_names(ndim=nparray.ndim), label="dimension names")
-    else:
-        shard_shape = None
+        if all(s > 0 for s in nparray.shape) and all(c > 0 for c in chunk_shape):
+            shard_shape = draw(
+                st.none() | shard_shapes(shape=nparray.shape, chunk_shape=chunk_shape),
+                label="shard shape",
+            )
     # test that None works too.
     fill_value = draw(st.one_of([st.none(), npst.from_dtype(nparray.dtype)]))
     # compressor = draw(compressors)
@@ -324,6 +330,60 @@ def simple_arrays(
     )


+@st.composite
+def rectilinear_chunks(draw: st.DrawFn, *, shape: tuple[int, ...]) -> list[list[int]]:
+    """Generate valid rectilinear chunk shapes for a given array shape.
+
+    Each dimension is partitioned into 1..min(size, 10) chunks by drawing
+    unique divider points within [1, size-1].
+    """
+    chunk_shapes: list[list[int]] = []
+    for size in shape:
+        assert size > 0
+        max_chunks = min(size, 10)
+        nchunks = draw(st.integers(min_value=1, max_value=max_chunks))
+        if nchunks == 1:
+            chunk_shapes.append([size])
+        else:
+            dividers = sorted(
+                draw(
+                    st.lists(
+                        st.integers(min_value=1, max_value=size - 1),
+                        min_size=nchunks - 1,
+                        max_size=nchunks - 1,
+                        unique=True,
+                    )
+                )
+            )
+            chunk_shapes.append(
+                [a - b for a, b in zip(dividers + [size], [0] + dividers, strict=False)]
+            )
+    return chunk_shapes
+
+
+# Rectilinear arrays need min_side >= 2 so divider generation works
+_rectilinear_shapes = npst.array_shapes(max_dims=3, min_side=2, max_side=20)
+
+
+@st.composite
+def rectilinear_arrays(
+    draw: st.DrawFn,
+    *,
+    shapes: st.SearchStrategy[tuple[int, ...]] = _rectilinear_shapes,
+) -> Any:
+    """Generate a zarr v3 array with rectilinear (variable) chunk grid."""
+    shape = draw(shapes)
+    chunk_shapes = draw(rectilinear_chunks(shape=shape))
+
+    nparray = np.arange(int(np.prod(shape)), dtype="int32").reshape(shape)
+    store = MemoryStore()
+    with zarr.config.set({"array.rectilinear_chunks": True}):
+        a = zarr.create_array(store=store, shape=shape, chunks=chunk_shapes, dtype="int32")
+        a[:] = nparray
+
+    return a
+
+
 def is_negative_slice(idx: Any) -> bool:
     return isinstance(idx, slice) and idx.step is not None and idx.step < 0
diff --git a/tests/conftest.py b/tests/conftest.py
index 86db02f6bf..d4ba254480 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -22,7 +22,7 @@
     _parse_chunk_encoding_v3,
     _parse_chunk_key_encoding,
 )
-from zarr.core.chunk_grids import RegularChunkGrid, _auto_partition
+from zarr.core.chunk_grids import _auto_partition
 from zarr.core.common import (
     JSON,
     DimensionNamesLike,
@@ -37,7 +37,7 @@
 )
 from zarr.core.dtype.common import HasItemSize
 from zarr.core.metadata.v2 import ArrayV2Metadata
-from zarr.core.metadata.v3 import ArrayV3Metadata
+from zarr.core.metadata.v3 import ArrayV3Metadata, RegularChunkGrid
 from zarr.core.sync import sync
 from zarr.storage import FsspecStore, LocalStore, MemoryStore, StorePath, ZipStore
 from zarr.testing.store import LatencyStore
@@ -390,7 +390,7 @@ def create_array_metadata(
     return ArrayV3Metadata(
         shape=shape_parsed,
         data_type=dtype_parsed,
-        chunk_grid=RegularChunkGrid(chunk_shape=chunks_out),
+        chunk_grid={"name": "regular", "configuration": {"chunk_shape": chunks_out}},
         chunk_key_encoding=chunk_key_encoding_parsed,
         fill_value=fill_value,
         codecs=codecs_out,
diff --git a/tests/test_api.py b/tests/test_api.py
index a306ff3dc3..33cd3bd301 100644
--- a/tests/test_api.py
+++ b/tests/test_api.py
@@ -280,6 +280,19 @@ async def test_open_array(memory_store: MemoryStore, zarr_format: ZarrFormat) ->
         zarr.api.synchronous.open(store="doesnotexist", mode="r", zarr_format=zarr_format)


+def test_open_array_rectilinear_chunks(tmp_path: Path) -> None:
+    """zarr.open with rectilinear (dask-style) chunks preserves the chunk grid."""
+    from zarr.core.metadata.v3 import RectilinearChunkGrid
+
+    chunks = ((3, 3, 4), (5, 5))
+    with zarr.config.set({"array.rectilinear_chunks": True}):
+        z = zarr.open(store=tmp_path, shape=(10, 10), dtype="float64", chunks=chunks, mode="w")
+    assert isinstance(z, Array)
+    assert z.shape == (10, 10)
+    assert isinstance(z.metadata.chunk_grid, RectilinearChunkGrid)
+    assert z.read_chunk_sizes == ((3, 3, 4), (5, 5))
+
+
 @pytest.mark.asyncio
 async def test_async_array_open_array_not_found() -> None:
     """Test that AsyncArray.open raises ArrayNotFoundError when array doesn't exist"""
diff --git a/tests/test_array.py b/tests/test_array.py
index bf6f651283..0e9d750608 100644
--- a/tests/test_array.py
+++ b/tests/test_array.py
@@ -786,8 +786,6 @@ def test_resize_growing_skips_chunk_enumeration(
     store: MemoryStore, zarr_format: ZarrFormat
 ) -> None:
     """Growing an array should not enumerate chunk coords for deletion (#3650 mitigation)."""
-    from zarr.core.chunk_grids import RegularChunkGrid
-
     z = zarr.create(
         shape=(10, 10),
         chunks=(5, 5),
@@ -798,11 +796,13 @@ def test_resize_growing_skips_chunk_enumeration(
     )
     z[:] = np.ones((10, 10), dtype="i4")

+    grid_cls = type(z._chunk_grid)
+
     # growth only - ensure no chunk coords are enumerated
     with mock.patch.object(
-        RegularChunkGrid,
+        grid_cls,
         "all_chunk_coords",
-        wraps=z.metadata.chunk_grid.all_chunk_coords,
+        wraps=z._chunk_grid.all_chunk_coords,
     ) as mock_coords:
         z.resize((20, 20))
         mock_coords.assert_not_called()
@@ -813,9 +813,9 @@ def test_resize_growing_skips_chunk_enumeration(

     # shrink - ensure no regression of behaviour
     with mock.patch.object(
-        RegularChunkGrid,
+        grid_cls,
         "all_chunk_coords",
-        wraps=z.metadata.chunk_grid.all_chunk_coords,
+        wraps=z._chunk_grid.all_chunk_coords,
     ) as mock_coords:
         z.resize((5, 5))
         assert mock_coords.call_count > 0
@@ -836,9 +836,9 @@ def test_resize_growing_skips_chunk_enumeration(
     z2[:] = np.ones((10, 10), dtype="i4")

     with mock.patch.object(
-        RegularChunkGrid,
+        grid_cls,
         "all_chunk_coords",
-        wraps=z2.metadata.chunk_grid.all_chunk_coords,
+        wraps=z2._chunk_grid.all_chunk_coords,
     ) as mock_coords:
         z2.resize((20, 5))
         assert mock_coords.call_count > 0
@@ -1576,7 +1576,7 @@ async def test_with_data(impl: Literal["sync", "async"], store: Store) -> None:
     elif impl == "async":
         arr = await create_array(store, name=name, data=data, zarr_format=3)
         stored = await arr._get_selection(
-            BasicIndexer(..., shape=arr.shape, chunk_grid=arr.metadata.chunk_grid),
+            BasicIndexer(..., shape=arr.shape, chunk_grid=arr._chunk_grid),
             prototype=default_buffer_prototype(),
         )
     else:
diff --git a/tests/test_cli/test_migrate_v3.py b/tests/test_cli/test_migrate_v3.py
index 6e169e5f48..7213aada12 100644
--- a/tests/test_cli/test_migrate_v3.py
+++ b/tests/test_cli/test_migrate_v3.py
@@ -16,7 +16,6 @@
 from zarr.codecs.numcodecs import LZMA, Delta
 from zarr.codecs.transpose import TransposeCodec
 from zarr.codecs.zstd import ZstdCodec
-from zarr.core.chunk_grids import RegularChunkGrid
 from zarr.core.chunk_key_encodings import V2ChunkKeyEncoding
 from zarr.core.common import JSON, ZarrFormat
 from zarr.core.dtype.npy.int import UInt8, UInt16
@@ -61,7 +60,7 @@ def test_migrate_array(local_store: LocalStore) -> None:
     expected_metadata = ArrayV3Metadata(
         shape=shape,
         data_type=UInt16(endianness="little"),
-        chunk_grid=RegularChunkGrid(chunk_shape=chunks),
+        chunk_grid={"name": "regular", "configuration": {"chunk_shape": chunks}},
         chunk_key_encoding=V2ChunkKeyEncoding(separator="."),
         fill_value=fill_value,
         codecs=(
diff --git a/tests/test_codec_pipeline.py b/tests/test_codec_pipeline.py
index 8d044c10d7..48e15b0643 100644
--- a/tests/test_codec_pipeline.py
+++ b/tests/test_codec_pipeline.py
@@ -3,6 +3,7 @@
 import pytest

 import zarr
+from zarr.core.array import _get_chunk_spec
 from zarr.core.buffer.core import default_buffer_prototype
 from zarr.core.indexing import BasicIndexer
 from zarr.storage import MemoryStore
@@ -42,7 +43,7 @@ async def test_read_returns_get_results(
     indexer = BasicIndexer(
         read_slice,
         shape=metadata.shape,
-        chunk_grid=metadata.chunk_grid,
+        chunk_grid=async_arr._chunk_grid,
     )

     out_buffer = prototype.nd_buffer.empty(
@@ -55,7 +56,7 @@ async def test_read_returns_get_results(
         [
             (
                 async_arr.store_path / metadata.encode_chunk_key(chunk_coords),
-                metadata.get_chunk_spec(chunk_coords, config, prototype=prototype),
+                _get_chunk_spec(metadata, async_arr._chunk_grid, chunk_coords, config, prototype),
                 chunk_selection,
                 out_selection,
                 is_complete_chunk,
diff --git a/tests/test_codecs/test_sharding.py b/tests/test_codecs/test_sharding.py
index d7cbeb5bdb..43d03caf11 100644
--- a/tests/test_codecs/test_sharding.py
+++ b/tests/test_codecs/test_sharding.py
@@ -1,5 +1,4 @@
 import pickle
-import re
 from typing import Any

 import numpy as np
@@ -489,9 +488,9 @@ def test_invalid_metadata(store: Store) -> None:
 def test_invalid_shard_shape() -> None:
     with pytest.raises(
         ValueError,
-        match=re.escape(
-            "The array's `chunk_shape` (got (16, 16)) needs to be divisible "
-            "by the shard's inner `chunk_shape` (got (9,))."
+        match=(
+            f"Chunk edge length {16} in dimension {0} is not "
+            f"divisible by the shard's inner chunk size {9}\\."
         ),
     ):
         zarr.create_array(
diff --git a/tests/test_config.py b/tests/test_config.py
index c3102e8efe..2704505bc8 100644
--- a/tests/test_config.py
+++ b/tests/test_config.py
@@ -54,6 +54,7 @@ def test_config_defaults_set() -> None:
                 "order": "C",
                 "write_empty_chunks": False,
                 "target_shard_size_bytes": None,
+                "rectilinear_chunks": False,
             },
             "async": {"concurrency": 10, "timeout": None},
             "threading": {"max_workers": None},
diff --git a/tests/test_group.py b/tests/test_group.py
index 6f1f4e68fa..e53b0b9ea0 100644
--- a/tests/test_group.py
+++ b/tests/test_group.py
@@ -1176,9 +1176,7 @@ async def test_asyncgroup_create_array(
     assert subnode.store_path.store == store
     assert subnode.shape == shape
     assert subnode.dtype == dtype
-    # todo: fix the type annotation of array.metadata.chunk_grid so that we get some autocomplete
-    # here.
-    assert subnode.metadata.chunk_grid.chunk_shape == chunk_shape
+    assert subnode._chunk_grid.chunk_shape == chunk_shape
     assert subnode.metadata.zarr_format == zarr_format
diff --git a/tests/test_indexing.py b/tests/test_indexing.py
index 9c734fb0c3..ef98cf3345 100644
--- a/tests/test_indexing.py
+++ b/tests/test_indexing.py
@@ -1236,8 +1236,8 @@ def test_get_block_selection_1d(store: StorePath) -> None:
         _test_get_block_selection(a, z, selection, expected_idx)

     bad_selections = block_selections_1d_bad + [
-        z.metadata.chunk_grid.get_nchunks(z.shape) + 1,  # out of bounds
-        -(z.metadata.chunk_grid.get_nchunks(z.shape) + 1),  # out of bounds
+        z._chunk_grid.get_nchunks() + 1,  # out of bounds
+        -(z._chunk_grid.get_nchunks() + 1),  # out of bounds
     ]

     for selection_bad in bad_selections:
@@ -1950,9 +1950,11 @@ def test_indexing_with_zarr_array(store: StorePath) -> None:


 @pytest.mark.parametrize("store", ["local", "memory"], indirect=["store"])
-@pytest.mark.parametrize("shape", [(0, 2, 3), (0), (3, 0)])
+@pytest.mark.parametrize("shape", [(0, 2, 3), (0,), (3, 0)])
 def test_zero_sized_chunks(store: StorePath, shape: list[int]) -> None:
-    z = zarr.create_array(store=store, shape=shape, chunks=shape, zarr_format=3, dtype="f8")
+    # Chunk sizes must be >= 1 per spec; use 1 for zero-extent dimensions.
+    chunks = tuple(max(1, s) for s in shape)
+    z = zarr.create_array(store=store, shape=shape, chunks=chunks, zarr_format=3, dtype="f8")
     z[...] = 42
     assert_array_equal(z[...], np.zeros(shape, dtype="f8"))
diff --git a/tests/test_properties.py b/tests/test_properties.py
index bab659c976..4b6c151382 100644
--- a/tests/test_properties.py
+++ b/tests/test_properties.py
@@ -25,6 +25,7 @@
     basic_indices,
     numpy_arrays,
     orthogonal_indices,
+    rectilinear_arrays,
     simple_arrays,
     stores,
     zarr_formats,
@@ -111,7 +112,7 @@ def test_array_creates_implicit_groups(array):
 @pytest.mark.filterwarnings("ignore::zarr.core.dtype.common.UnstableSpecificationWarning")
 @given(data=st.data())
 async def test_basic_indexing(data: st.DataObject) -> None:
-    zarray = data.draw(simple_arrays())
+    zarray = data.draw(st.one_of(simple_arrays(), rectilinear_arrays()))
     nparray = zarray[:]
     indexer = data.draw(basic_indices(shape=nparray.shape))
@@ -138,7 +139,12 @@ async def test_basic_indexing(data: st.DataObject) -> None:
 @pytest.mark.filterwarnings("ignore::zarr.core.dtype.common.UnstableSpecificationWarning")
 async def test_oindex(data: st.DataObject) -> None:
     # integer_array_indices can't handle 0-size dimensions.
-    zarray = data.draw(simple_arrays(shapes=npst.array_shapes(max_dims=4, min_side=1)))
+    zarray = data.draw(
+        st.one_of(
+            simple_arrays(shapes=npst.array_shapes(max_dims=4, min_side=1)),
+            rectilinear_arrays(shapes=npst.array_shapes(max_dims=3, min_side=2, max_side=20)),
+        )
+    )
     nparray = zarray[:]

     zindexer, npindexer = data.draw(orthogonal_indices(shape=nparray.shape))
@@ -170,7 +176,12 @@ async def test_oindex(data: st.DataObject) -> None:
 @pytest.mark.filterwarnings("ignore::zarr.core.dtype.common.UnstableSpecificationWarning")
 async def test_vindex(data: st.DataObject) -> None:
     # integer_array_indices can't handle 0-size dimensions.
-    zarray = data.draw(simple_arrays(shapes=npst.array_shapes(max_dims=4, min_side=1)))
+    zarray = data.draw(
+        st.one_of(
+            simple_arrays(shapes=npst.array_shapes(max_dims=4, min_side=1)),
+            rectilinear_arrays(shapes=npst.array_shapes(max_dims=3, min_side=2, max_side=20)),
+        )
+    )
     nparray = zarray[:]
     indexer = data.draw(
         npst.integer_array_indices(
diff --git a/tests/test_unified_chunk_grid.py b/tests/test_unified_chunk_grid.py
new file mode 100644
index 0000000000..92bb1abae9
--- /dev/null
+++ b/tests/test_unified_chunk_grid.py
@@ -0,0 +1,2731 @@
+"""
+Tests for the unified ChunkGrid design (POC).
+
+Tests the core ChunkGrid with FixedDimension/VaryingDimension internals,
+ChunkSpec, serialization round-trips, indexing with rectilinear grids,
+and end-to-end array creation + read/write.
+"""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING, Any
+
+import numpy as np
+import pytest
+
+import zarr
+from zarr.core.chunk_grids import (
+    ChunkGrid,
+    ChunkSpec,
+    FixedDimension,
+    VaryingDimension,
+    _is_rectilinear_chunks,
+)
+from zarr.core.common import compress_rle, expand_rle
+from zarr.core.metadata.v3 import (
+    RectilinearChunkGrid,
+    parse_chunk_grid,
+)
+from zarr.core.metadata.v3 import (
+    RegularChunkGrid as RegularChunkGridMeta,
+)
+from zarr.errors import BoundsCheckError
+from zarr.storage import MemoryStore
+
+if TYPE_CHECKING:
+    from collections.abc import Generator
+    from pathlib import Path
+
+
+@pytest.fixture(autouse=True)
+def _enable_rectilinear_chunks() -> Generator[None, None, None]:
+    """Enable rectilinear chunks for all tests in this module."""
+    with zarr.config.set({"array.rectilinear_chunks": True}):
+        yield
+
+
+def _edges(grid: ChunkGrid, dim: int) -> tuple[int, ...]:
+    """Extract the per-chunk edge lengths for *dim* from a ChunkGrid."""
+    d = grid._dimensions[dim]
+    if isinstance(d, FixedDimension):
+        return tuple(d.size for _ in range(d.nchunks))
+    if isinstance(d, VaryingDimension):
+        return tuple(d.edges)
+    raise TypeError(f"Unexpected dimension type: {type(d)}")
+
+
+# ---------------------------------------------------------------------------
+# Dimension index_to_chunk bounds tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.parametrize(
+    ("dim", "index", "match"),
+    [
+        (VaryingDimension([10, 20, 30], extent=60), 60, "out of bounds"),
+        (VaryingDimension([10, 20, 30], extent=60), 100, "out of bounds"),
+        (FixedDimension(size=10, extent=95), 95, "out of bounds"),
+        (FixedDimension(size=10, extent=95), -1, "Negative"),
+    ],
+    ids=[
+        "varying-at-extent",
+        "varying-past-extent",
+        "fixed-at-extent",
+        "fixed-negative",
+    ],
+)
+def test_dimension_index_to_chunk_bounds(
+    dim: FixedDimension | VaryingDimension, index: int, match: str
+) -> None:
+    """Out-of-bounds or negative indices raise IndexError for both dimension types"""
+    with pytest.raises(IndexError, match=match):
+        dim.index_to_chunk(index)
+
+
+@pytest.mark.parametrize(
+    ("dim", "index", "expected"),
+    [
+        (VaryingDimension([10, 20, 30], extent=60), 59, 2),
+        (FixedDimension(size=10, extent=95), 94, 9),
+    ],
+    ids=["varying-last-valid", "fixed-last-valid"],
+)
+def test_dimension_index_to_chunk_last_valid(
+    dim: FixedDimension | VaryingDimension, index: int, expected: int
+) -> None:
+    """Last valid index maps to the correct chunk for both dimension types"""
+    assert dim.index_to_chunk(index) == expected
+
+
+# ---------------------------------------------------------------------------
+# Rectilinear feature flag tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.parametrize(
+    "action",
+    [
+        lambda: RectilinearChunkGrid(chunk_shapes=((10, 20), (25, 25))),
+        lambda: RectilinearChunkGrid.from_dict(
+            {
+                "name": "rectilinear",
+                "configuration": {"kind": "inline", "chunk_shapes": [[10, 20, 30], [50, 50]]},  # type: ignore[typeddict-item]
+            }
+        ),
+        lambda: zarr.create_array(MemoryStore(), shape=(30,), chunks=[[10, 20]], dtype="int32"),
+    ],
+    ids=["constructor", "from_dict", "create_array"],
+)
+def test_rectilinear_feature_flag_blocked(action: Any) -> None:
+    """Rectilinear chunk operations raise ValueError when the feature flag is disabled"""
+    with zarr.config.set({"array.rectilinear_chunks": False}):
+        with pytest.raises(ValueError, match="experimental and disabled by default"):
+            action()
+
+
+def test_rectilinear_feature_flag_enabled() -> None:
+    """Rectilinear chunk grid construction succeeds when the feature flag is enabled"""
+    with zarr.config.set({"array.rectilinear_chunks": True}):
+        grid = RectilinearChunkGrid(chunk_shapes=((10, 20), (25, 25)))
+        assert grid.ndim == 2
+
+
+# ---------------------------------------------------------------------------
+# FixedDimension tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.parametrize(
+    (
+        "size",
+        "extent",
+        "chunk_ix",
+        "expected_nchunks",
+        "expected_chunk_size",
+        "expected_data_size",
+        "expected_offset",
+    ),
+    [
+        (10, 100, 0, 10, 10, 10, 0),
+        (10, 100, 1, 10, 10, 10, 10),
+        (10, 100, 9, 10, 10, 10, 90),
+        (10, 95, 9, 10, 10, 5, 90),  # boundary chunk
+        (0, 0, None, 0, None, None, None),  # zero-size
+    ],
+    ids=["start", "middle", "end", "boundary", "zero-size"],
+)
+def test_fixed_dimension(
+    size: int,
+    extent: int,
+    chunk_ix: int | None,
+    expected_nchunks: int,
+    expected_chunk_size: int | None,
+    expected_data_size: int | None,
+    expected_offset: int | None,
+) -> None:
+    """FixedDimension properties match expected values for various chunk/extent combinations"""
+    d = FixedDimension(size=size, extent=extent)
+    assert d.nchunks == expected_nchunks
+    if chunk_ix is not None:
+        assert d.chunk_size(chunk_ix) == expected_chunk_size
+        assert d.data_size(chunk_ix) == expected_data_size
+        assert d.chunk_offset(chunk_ix) == expected_offset
+
+
+@pytest.mark.parametrize(
+    ("idx", "expected"),
+    [(0, 0), (9, 0), (10, 1), (25, 2)],
+)
+def test_fixed_dimension_index_to_chunk(idx: int, expected: int) -> None:
+    """FixedDimension.index_to_chunk maps element indices to correct chunk indices"""
+    d = FixedDimension(size=10, extent=100)
+    assert d.index_to_chunk(idx) == expected
+
+
+def test_fixed_dimension_indices_to_chunks() -> None:
+    """FixedDimension.indices_to_chunks vectorizes index-to-chunk mapping over an array"""
+    d = FixedDimension(size=10, extent=100)
+    indices = np.array([0, 5, 10, 15, 99])
+    np.testing.assert_array_equal(d.indices_to_chunks(indices), [0, 0, 1, 1, 9])
+
+
+@pytest.mark.parametrize(
+    ("size", "extent", "match"),
+    [(-1, 100, "must be >= 0"), (10, -1, "must be >= 0")],
+    ids=["negative-size", "negative-extent"],
+)
+def test_fixed_dimension_rejects_negative(size: int, extent: int, match: str) -> None:
+    """FixedDimension raises ValueError for negative size or extent"""
+    with pytest.raises(ValueError, match=match):
+        FixedDimension(size=size, extent=extent)
+
+
+# ---------------------------------------------------------------------------
+# VaryingDimension tests
+# ---------------------------------------------------------------------------
+
+
+def test_varying_dimension_construction() -> None:
+    """VaryingDimension stores edges, cumulative sums, nchunks, and extent correctly"""
+    d = VaryingDimension([10, 20, 30], extent=60)
+    assert d.edges == (10, 20, 30)
+    assert d.cumulative == (10, 30, 60)
+    assert d.nchunks == 3
+    assert d.extent == 60
+
+
+@pytest.mark.parametrize(
+    (
+        "chunk_idx",
+        "expected_offset",
+        "expected_size",
+        "expected_data",
+        "expected_chunk_for_first_idx",
+    ),
+    [
+        (0, 0, 10, 10, 0),
+        (1, 10, 20, 20, 1),
+        (2, 30, 30, 30, 2),
+    ],
+)
+def test_varying_dimension(
+    chunk_idx: int,
+    expected_offset: int,
+    expected_size: int,
+    expected_data: int,
+    expected_chunk_for_first_idx: int,
+) -> None:
+    """VaryingDimension chunk_offset, chunk_size, data_size, and index_to_chunk return correct values"""
+    d = VaryingDimension([10, 20, 30], extent=60)
+    assert d.chunk_offset(chunk_idx) == expected_offset
+    assert d.chunk_size(chunk_idx) == expected_size
+    assert d.data_size(chunk_idx) == expected_data
+    assert d.index_to_chunk(expected_offset) == expected_chunk_for_first_idx
+
+
+def test_varying_dimension_indices_to_chunks() -> None:
+    """VaryingDimension.indices_to_chunks vectorizes index-to-chunk mapping over an array"""
+    d = VaryingDimension([10, 20, 30], extent=60)
+    indices = np.array([0, 9, 10, 29, 30, 59])
+    np.testing.assert_array_equal(d.indices_to_chunks(indices), [0, 0, 1, 1, 2, 2])
+
+
+@pytest.mark.parametrize(
+    ("edges", "extent", "match"),
+    [
+        ([], 0, "must not be empty"),
+        ([10, 0, 5], 15, "must be > 0"),
+    ],
+    ids=["empty", "zero-edge"],
+)
+def test_varying_dimension_rejects_invalid(edges: list[int], extent: int, match: str) -> None:
+    """VaryingDimension raises ValueError for empty edges or zero-length edges"""
+    with pytest.raises(ValueError, match=match):
+        VaryingDimension(edges, extent=extent)
+
+
+# ---------------------------------------------------------------------------
+# ChunkSpec tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.parametrize(
+    ("slices", "codec_shape", "expected_shape", "expected_boundary"),
+    [
+        ((slice(0, 10), slice(0, 20)), (10, 20), (10, 20), False),
+        ((slice(90, 95), slice(0, 20)), (10, 20), (5, 20), True),
+        ((slice(10, 10),), (0,), (0,), False),
+        ((slice(0, 10), slice(0, 5)), (10, 10), (10, 5), True),
+    ],
+    ids=["basic", "boundary", "empty-slices", "multidim-boundary"],
+)
+def test_chunk_spec(
+    slices: tuple[slice, ...],
+    codec_shape: tuple[int, ...],
+    expected_shape: tuple[int, ...],
+    expected_boundary: bool,
+) -> None:
+    """ChunkSpec reports correct shape and boundary status from slices and codec_shape"""
+    spec = ChunkSpec(slices=slices, codec_shape=codec_shape)
+    assert spec.shape == expected_shape
+    assert spec.is_boundary == expected_boundary
+
+
+# ---------------------------------------------------------------------------
+# ChunkGrid construction tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.parametrize(
+    ("array_shape", "chunk_sizes", "expected_regular", "expected_ndim", "expected_chunk_shape"),
+    [
+        ((100, 200), (10, 20), True, 2, (10, 20)),
+        ((), (), True, 0, ()),
+        ((60, 100), [[10, 20, 30], [25, 25, 25, 25]], False, 2, None),
+        ((30, 50), [[10, 10, 10], [25, 25]], True, 2, (10, 25)),  # uniform edges → regular
+    ],
+    ids=["regular", "zero-dim", "rectilinear", "uniform-becomes-regular"],
+)
+def test_chunk_grid_construction(
+    array_shape: tuple[int, ...],
+    chunk_sizes: Any,
+    expected_regular: bool,
+    expected_ndim: int,
+    expected_chunk_shape: tuple[int, ...] | None,
+) -> None:
+    """ChunkGrid.from_sizes produces grids with correct regularity, ndim, and chunk_shape"""
+    g = ChunkGrid.from_sizes(array_shape, chunk_sizes)
+    assert g.is_regular == expected_regular
+    assert g.ndim == expected_ndim
+    if expected_chunk_shape is not None:
+        assert g.chunk_shape == expected_chunk_shape
+    else:
+        with pytest.raises(ValueError, match="only available for regular"):
+            _ = g.chunk_shape
+
+
+def test_chunk_grid_rectilinear_uniform_dim_is_fixed() -> None:
+    """A rectilinear grid with all-same sizes in one dim stores it as Fixed."""
+    g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [25, 25, 25, 25]])
+    assert isinstance(g._dimensions[0], VaryingDimension)
+    assert isinstance(g._dimensions[1], FixedDimension)
+
+
+# ---------------------------------------------------------------------------
+# ChunkGrid query tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.parametrize(
+    ("shape", "chunks", "expected_grid_shape"),
+    [
+        ((100, 200), (10, 20), (10, 10)),
+        ((95, 200), (10, 20), (10, 10)),
+        ((60, 100), [[10, 20, 30], [25, 25, 25, 25]], (3, 4)),
+    ],
+    ids=["regular", "regular-boundary", "rectilinear"],
+)
+def test_chunk_grid_shape(
+    shape: tuple[int, ...],
+    chunks: Any,
+    expected_grid_shape: tuple[int, ...],
+) -> None:
+    """ChunkGrid.grid_shape returns the expected number of chunks per dimension"""
+    g = ChunkGrid.from_sizes(shape, chunks)
+    assert g.grid_shape == expected_grid_shape
+
+
+@pytest.mark.parametrize(
+    (
+        "array_shape",
+        "chunk_sizes",
+        "coords",
+        "expected_shape",
+        "expected_codec_shape",
+        "expected_boundary",
+    ),
+    [
+        # regular interior
+        ((100, 200), (10, 20), (0, 0), (10, 20), (10, 20), False),
+        # regular boundary
+        ((95, 200), (10, 20), (9, 0), (5, 20), (10, 20), True),
+        # rectilinear
+        ((60, 100), [[10, 20, 30], [25, 25, 25, 25]], (0, 0), (10, 25), (10, 25), False),
+        ((60, 100), [[10, 20, 30], [25, 25, 25, 25]], (1, 0), (20, 25), (20, 25), False),
+        ((60, 100), [[10, 20, 30], [25, 25, 25, 25]], (2, 3), (30, 25), (30, 25), False),
+    ],
+    ids=["regular", "regular-boundary", "rectilinear-0,0", "rectilinear-1,0", "rectilinear-2,3"],
+)
+def test_chunk_grid_getitem(
+    array_shape: tuple[int, ...],
+    chunk_sizes: Any,
+    coords: tuple[int, ...],
+    expected_shape: tuple[int, ...],
+    expected_codec_shape: tuple[int, ...],
+    expected_boundary: bool,
+) -> None:
+    """ChunkGrid.__getitem__ returns a ChunkSpec with correct shape, codec_shape, and boundary flag"""
+    g = ChunkGrid.from_sizes(array_shape, chunk_sizes)
+    spec = g[coords]
+    assert spec is not None
+    assert spec.shape == expected_shape
+    assert spec.codec_shape == expected_codec_shape
+    assert spec.is_boundary == expected_boundary
+
+
+@pytest.mark.parametrize(
+    ("array_shape", "chunk_sizes", "coords"),
+    [
+        ((100, 200), (10, 20), (99, 0)),
+        ((60, 100), [[10, 20, 30], [25, 25, 25, 25]], (3, 0)),
+    ],
+    ids=["regular-oob", "rectilinear-oob"],
+)
+def test_chunk_grid_getitem_oob(
+    array_shape: tuple[int, ...], chunk_sizes: Any, coords: tuple[int, ...]
+) -> None: + """Out-of-bounds chunk coordinates return None""" + g = ChunkGrid.from_sizes(array_shape, chunk_sizes) + assert g[coords] is None + + +def test_chunk_grid_getitem_slices() -> None: + """ChunkSpec.slices reflect the correct start/stop for a rectilinear chunk""" + g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [25, 25, 25, 25]]) + spec = g[(1, 2)] + assert spec is not None + assert spec.slices == (slice(10, 30, 1), slice(50, 75, 1)) + + +# -- all_chunk_coords tests -- + + +@pytest.mark.parametrize( + ("array_shape", "chunk_sizes", "origin", "selection_shape", "expected_coords"), + [ + # rectilinear grid + ( + (60, 100), + [[10, 20, 30], [50, 50]], + None, + None, + [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)], + ), + ((60, 100), [[10, 20, 30], [50, 50]], (1, 0), None, [(1, 0), (1, 1), (2, 0), (2, 1)]), + ((60, 100), [[10, 20, 30], [50, 50]], None, (2, 1), [(0, 0), (1, 0)]), + ((60, 100), [[10, 20, 30], [50, 50]], (1, 1), (2, 1), [(1, 1), (2, 1)]), + # regular grid + ((30, 40), (10, 20), (2, 1), None, [(2, 1)]), + ((30, 40), (10, 20), None, (0, 0), []), + ((60, 80), (20, 20), (0, 2), (3, 1), [(0, 2), (1, 2), (2, 2)]), + ], + ids=[ + "all", + "with-origin", + "with-sel-shape", + "origin+sel", + "last-chunk", + "zero-sel", + "single-dim", + ], +) +def test_all_chunk_coords( + array_shape: tuple[int, ...], + chunk_sizes: Any, + origin: tuple[int, ...] | None, + selection_shape: tuple[int, ...] 
| None, + expected_coords: list[tuple[int, ...]], +) -> None: + """all_chunk_coords yields the expected coordinates with optional origin and selection_shape""" + g = ChunkGrid.from_sizes(array_shape, chunk_sizes) + kwargs: dict[str, Any] = {} + if origin is not None: + kwargs["origin"] = origin + if selection_shape is not None: + kwargs["selection_shape"] = selection_shape + assert list(g.all_chunk_coords(**kwargs)) == expected_coords + + +def test_chunk_grid_get_nchunks() -> None: + """get_nchunks returns the total number of chunks across all dimensions""" + g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [50, 50]]) + assert g.get_nchunks() == 6 + + +def test_chunk_grid_iter() -> None: + """Iterating a ChunkGrid yields the correct number of ChunkSpec objects""" + g = ChunkGrid.from_sizes((30, 40), (10, 20)) + specs = list(g) + assert len(specs) == 6 + assert all(isinstance(s, ChunkSpec) for s in specs) + + +# --------------------------------------------------------------------------- +# RLE tests +# --------------------------------------------------------------------------- + + +@pytest.mark.parametrize( + ("compressed", "expected"), + [ + ([[10, 3]], [10, 10, 10]), + ([[10, 2], [20, 1]], [10, 10, 20]), + ], +) +def test_rle_expand(compressed: list[Any], expected: list[int]) -> None: + """RLE-encoded edges expand correctly""" + assert expand_rle(compressed) == expected + + +@pytest.mark.parametrize( + ("original", "expected"), + [ + ([10, 10, 10], [[10, 3]]), + ([10, 10, 20], [[10, 2], 20]), + ([5], [5]), + ([10, 20, 30], [10, 20, 30]), + ], +) +def test_rle_compress(original: list[int], expected: list[Any]) -> None: + """compress_rle produces the expected RLE encoding for various input sequences""" + assert compress_rle(original) == expected + + +def test_rle_roundtrip() -> None: + """compress_rle followed by expand_rle recovers the original sequence""" + original = [10, 10, 10, 20, 20, 30] + compressed = compress_rle(original) + assert expand_rle(compressed) 
== original + + +@pytest.mark.parametrize( + ("rle_input", "match"), + [ + ([0], "Chunk edge length must be >= 1"), + ([-5], "Chunk edge length must be >= 1"), + ([[0, 3]], "Chunk edge length must be >= 1"), + ([[-10, 2]], "Chunk edge length must be >= 1"), + ([[5, 0]], "RLE repeat count must be >= 1"), + ([[5, -1]], "RLE repeat count must be >= 1"), + ], + ids=[ + "zero-edge", + "negative-edge", + "zero-rle-size", + "negative-rle-size", + "zero-rle-count", + "negative-rle-count", + ], +) +def test_rle_expand_rejects_invalid(rle_input: list[Any], match: str) -> None: + """expand_rle raises ValueError for zero/negative edge lengths or repeat counts""" + with pytest.raises(ValueError, match=match): + expand_rle(rle_input) + + +# -- expand_rle handles JSON floats -- + + +def test_expand_rle_bare_integer_floats_accepted() -> None: + """JSON parsers may emit 10.0 for the integer 10; expand_rle should handle it.""" + result = expand_rle([10.0, 20.0]) # type: ignore[list-item] + assert result == [10, 20] + + +def test_expand_rle_pair_with_float_count() -> None: + """expand_rle accepts float repeat counts that are integer-valued""" + result = expand_rle([[10, 3.0]]) # type: ignore[list-item] + assert result == [10, 10, 10] + + +# --------------------------------------------------------------------------- +# _is_rectilinear_chunks tests +# --------------------------------------------------------------------------- + + +@pytest.mark.parametrize( + ("value", "expected"), + [ + ([[10, 20], [5, 5]], True), + (((10, 20), (5, 5)), True), + ((10, 20), False), + ([10, 20], False), + (10, False), + ("auto", False), + ([], False), + ([[]], True), + (ChunkGrid.from_sizes((10,), (5,)), False), + (None, False), + (3.14, False), + ], + ids=[ + "nested-lists", + "nested-tuples", + "flat-tuple", + "flat-list", + "single-int", + "string", + "empty-list", + "empty-nested-list", + "chunk-grid-instance", + "none", + "float", + ], +) +def test_is_rectilinear_chunks(value: Any, expected: bool) 
-> None: + """_is_rectilinear_chunks correctly identifies nested sequences as rectilinear""" + assert _is_rectilinear_chunks(value) is expected + + +# --------------------------------------------------------------------------- +# Serialization tests +# --------------------------------------------------------------------------- + + +def test_serialization_error_non_regular_chunk_shape() -> None: + """Accessing chunk_shape on a non-regular grid raises ValueError.""" + grid = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [25, 25, 25, 25]]) + with pytest.raises(ValueError, match="only available for regular"): + grid.chunk_shape # noqa: B018 + + +def test_serialization_error_zero_extent_rectilinear() -> None: + """RectilinearChunkGrid rejects empty edge tuples.""" + with pytest.raises(ValueError, match="has no chunk edges"): + RectilinearChunkGrid(chunk_shapes=((),)) + + +def test_serialization_unknown_name_parse() -> None: + """Parsing metadata with an unknown chunk grid name raises ValueError""" + with pytest.raises(ValueError, match="Unknown chunk grid"): + parse_chunk_grid({"name": "hexagonal", "configuration": {}}) + + +# --------------------------------------------------------------------------- +# Spec compliance tests +# --------------------------------------------------------------------------- + + +def test_spec_kind_inline_required_on_deserialize() -> None: + """Deserialization requires kind: 'inline'.""" + data: dict[str, Any] = { + "name": "rectilinear", + "configuration": {"chunk_shapes": [[10, 20], [15, 15]]}, + } + with pytest.raises(ValueError, match="requires a 'kind' field"): + parse_chunk_grid(data) + + +def test_spec_kind_unknown_rejected() -> None: + """Unsupported rectilinear chunk grid kind raises ValueError on parse""" + data: dict[str, Any] = { + "name": "rectilinear", + "configuration": {"kind": "reference", "chunk_shapes": [[10, 20], [15, 15]]}, + } + with pytest.raises(ValueError, match="Unsupported rectilinear chunk grid kind"): + 
parse_chunk_grid(data) + + +def test_spec_integer_shorthand_per_dimension() -> None: + """A bare integer in chunk_shapes means repeat until >= extent.""" + data: dict[str, Any] = { + "name": "rectilinear", + "configuration": {"kind": "inline", "chunk_shapes": [4, [1, 2, 3]]}, + } + meta = parse_chunk_grid(data) + g = ChunkGrid.from_sizes((6, 6), meta.chunk_shapes) # type: ignore[union-attr] + assert _edges(g, 0) == (4, 4) + assert _edges(g, 1) == (1, 2, 3) + + +def test_spec_mixed_rle_and_bare_integers() -> None: + """An array can mix bare integers and [value, count] RLE pairs.""" + data: dict[str, Any] = { + "name": "rectilinear", + "configuration": {"kind": "inline", "chunk_shapes": [[[1, 3], 3]]}, + } + meta = parse_chunk_grid(data) + g = ChunkGrid.from_sizes((6,), meta.chunk_shapes) # type: ignore[union-attr] + assert _edges(g, 0) == (1, 1, 1, 3) + + +def test_spec_overflow_chunks_allowed() -> None: + """Edge sum >= extent is valid (overflow chunks permitted).""" + data: dict[str, Any] = { + "name": "rectilinear", + "configuration": {"kind": "inline", "chunk_shapes": [[4, 4, 4]]}, + } + meta = parse_chunk_grid(data) + g = ChunkGrid.from_sizes((6,), meta.chunk_shapes) # type: ignore[union-attr] + assert _edges(g, 0) == (4, 4, 4) + + +def test_spec_example() -> None: + """The full example from the spec README.""" + data: dict[str, Any] = { + "name": "rectilinear", + "configuration": { + "kind": "inline", + "chunk_shapes": [ + 4, + [1, 2, 3], + [[4, 2]], + [[1, 3], 3], + [4, 4, 4], + ], + }, + } + meta = parse_chunk_grid(data) + g = ChunkGrid.from_sizes((6, 6, 6, 6, 6), meta.chunk_shapes) # type: ignore[union-attr] + assert _edges(g, 0) == (4, 4) + assert _edges(g, 1) == (1, 2, 3) + assert _edges(g, 2) == (4, 4) + assert _edges(g, 3) == (1, 1, 1, 3) + assert _edges(g, 4) == (4, 4, 4) + + +# --------------------------------------------------------------------------- +# parse_chunk_grid validation tests +# 
--------------------------------------------------------------------------- + + +def test_parse_chunk_grid_varying_extent_mismatch_raises() -> None: + """Reconstructing a ChunkGrid with mismatched extents raises ValueError""" + g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [50, 50]]) + with pytest.raises(ValueError, match="extent"): + ChunkGrid( + dimensions=tuple( + dim.with_extent(ext) for dim, ext in zip(g._dimensions, (100, 100), strict=True) + ) + ) + + +def test_parse_chunk_grid_varying_extent_match_ok() -> None: + """Reconstructing a ChunkGrid with matching extents succeeds""" + g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [50, 50]]) + g2 = ChunkGrid( + dimensions=tuple( + dim.with_extent(ext) for dim, ext in zip(g._dimensions, (60, 100), strict=True) + ) + ) + assert g2._dimensions[0].extent == 60 + + +@pytest.mark.parametrize( + ("chunk_shapes", "array_shape", "match"), + [ + ([[10, 20, 30], [25, 25]], (100, 50), "extent 100 exceeds sum of edges 60"), + ([[50, 50], [10, 20]], (100, 50), "extent 50 exceeds sum of edges 30"), + ], + ids=["first-dim-mismatch", "second-dim-mismatch"], +) +def test_parse_chunk_grid_rectilinear_extent_mismatch_raises( + chunk_shapes: list[list[int]], array_shape: tuple[int, ...], match: str +) -> None: + """Rectilinear grid raises ValueError when array extent exceeds sum of edges""" + data: dict[str, Any] = { + "name": "rectilinear", + "configuration": {"kind": "inline", "chunk_shapes": chunk_shapes}, + } + meta = parse_chunk_grid(data) + with pytest.raises(ValueError, match=match): + ChunkGrid.from_sizes(array_shape, meta.chunk_shapes) # type: ignore[union-attr] + + +def test_parse_chunk_grid_rectilinear_extent_match_passes() -> None: + """Rectilinear grid with matching extents parses and builds successfully""" + data: dict[str, Any] = { + "name": "rectilinear", + "configuration": {"kind": "inline", "chunk_shapes": [[10, 20, 30], [25, 25]]}, + } + meta = parse_chunk_grid(data) + g = ChunkGrid.from_sizes((60, 50), 
meta.chunk_shapes) # type: ignore[union-attr] + assert g.grid_shape == (3, 2) + + +def test_parse_chunk_grid_rectilinear_ndim_mismatch_raises() -> None: + """Mismatched ndim between array shape and chunk_sizes raises ValueError""" + data: dict[str, Any] = { + "name": "rectilinear", + "configuration": {"kind": "inline", "chunk_shapes": [[10, 20], [25, 25]]}, + } + meta = parse_chunk_grid(data) + with pytest.raises(ValueError, match="3 dimensions but chunk_sizes has 2"): + ChunkGrid.from_sizes((30, 50, 100), meta.chunk_shapes) # type: ignore[union-attr] + + +def test_parse_chunk_grid_rectilinear_rle_extent_validated() -> None: + """RLE-encoded edges are expanded before validation.""" + data: dict[str, Any] = { + "name": "rectilinear", + "configuration": {"kind": "inline", "chunk_shapes": [[[10, 5]], [[25, 2]]]}, + } + meta = parse_chunk_grid(data) + g = ChunkGrid.from_sizes((50, 50), meta.chunk_shapes) # type: ignore[union-attr] + assert g.grid_shape == (5, 2) + with pytest.raises(ValueError, match="extent 100 exceeds sum of edges 50"): + ChunkGrid.from_sizes((100, 50), meta.chunk_shapes) # type: ignore[union-attr] + + +def test_parse_chunk_grid_varying_dimension_extent_mismatch_on_chunkgrid_input() -> None: + """ChunkGrid constructor rejects VaryingDimension with extent exceeding sum of edges""" + g = ChunkGrid.from_sizes((60, 50), [[10, 20, 30], [25, 25]]) + with pytest.raises(ValueError, match="less than"): + ChunkGrid( + dimensions=tuple( + dim.with_extent(ext) for dim, ext in zip(g._dimensions, (100, 50), strict=True) + ) + ) + + +# --------------------------------------------------------------------------- +# Rectilinear indexing tests +# --------------------------------------------------------------------------- + + +def test_basic_indexer_rectilinear() -> None: + """BasicIndexer produces correct projections for a full-slice rectilinear selection""" + from zarr.core.indexing import BasicIndexer + + g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [50, 50]]) 
+ indexer = BasicIndexer( + selection=(slice(None), slice(None)), + shape=(60, 100), + chunk_grid=g, + ) + projections = list(indexer) + assert len(projections) == 6 + + p0 = projections[0] + assert p0.chunk_coords == (0, 0) + assert p0.chunk_selection == (slice(0, 10, 1), slice(0, 50, 1)) + + p1 = projections[2] + assert p1.chunk_coords == (1, 0) + assert p1.chunk_selection == (slice(0, 20, 1), slice(0, 50, 1)) + + +def test_basic_indexer_int_selection() -> None: + """BasicIndexer with integer selection maps to the correct chunk and local offset""" + from zarr.core.indexing import BasicIndexer + + g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [50, 50]]) + indexer = BasicIndexer( + selection=(15, slice(None)), + shape=(60, 100), + chunk_grid=g, + ) + projections = list(indexer) + assert len(projections) == 2 + assert projections[0].chunk_coords == (1, 0) + assert projections[0].chunk_selection == (5, slice(0, 50, 1)) + + +def test_basic_indexer_slice_subset() -> None: + """BasicIndexer with partial slices spans the expected chunk dimensions""" + from zarr.core.indexing import BasicIndexer + + g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [50, 50]]) + indexer = BasicIndexer( + selection=(slice(5, 35), slice(0, 50)), + shape=(60, 100), + chunk_grid=g, + ) + projections = list(indexer) + chunk_coords_dim0 = sorted({p.chunk_coords[0] for p in projections}) + assert chunk_coords_dim0 == [0, 1, 2] + + +def test_orthogonal_indexer_rectilinear() -> None: + """OrthogonalIndexer produces the expected number of projections for a rectilinear grid""" + from zarr.core.indexing import OrthogonalIndexer + + g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [50, 50]]) + indexer = OrthogonalIndexer( + selection=(slice(None), slice(None)), + shape=(60, 100), + chunk_grid=g, + ) + projections = list(indexer) + assert len(projections) == 6 + + +def test_oob_block_raises_bounds_check_error() -> None: + """Out-of-bounds block index should raise BoundsCheckError, not 
IndexError.""" + store = MemoryStore() + a = zarr.create_array(store, shape=(30,), chunks=[[10, 20]], dtype="int32") + with pytest.raises(BoundsCheckError): + a.get_block_selection((2,)) + + +# --------------------------------------------------------------------------- +# End-to-end tests +# --------------------------------------------------------------------------- + + +@pytest.mark.parametrize( + ("shape", "chunks", "expected_regular"), + [ + ((100, 200), (10, 20), True), + ((60, 100), [[10, 20, 30], [50, 50]], False), + ], + ids=["regular", "rectilinear"], +) +def test_e2e_create_array( + tmp_path: Path, shape: tuple[int, ...], chunks: Any, expected_regular: bool +) -> None: + """End-to-end array creation sets correct regularity and ndim on chunk_grid""" + arr = zarr.create_array( + store=tmp_path / "arr.zarr", + shape=shape, + chunks=chunks, + dtype="float32", + ) + assert ChunkGrid.from_metadata(arr.metadata).is_regular == expected_regular + assert ChunkGrid.from_metadata(arr.metadata).ndim == len(shape) + + +@pytest.mark.parametrize( + ("shape", "chunks", "grid_type_name", "grid_name"), + [ + ((100, 200), (10, 20), "RegularChunkGrid", "regular"), + ((60, 100), [[10, 20, 30], [50, 50]], "RectilinearChunkGrid", "rectilinear"), + ], + ids=["regular", "rectilinear"], +) +def test_e2e_chunk_grid_serializes( + tmp_path: Path, shape: tuple[int, ...], chunks: Any, grid_type_name: str, grid_name: str +) -> None: + """Array metadata serializes chunk_grid with the correct type and name""" + from zarr.core.metadata.v3 import ArrayV3Metadata, RectilinearChunkGrid, RegularChunkGrid + + grid_type = RegularChunkGrid if grid_type_name == "RegularChunkGrid" else RectilinearChunkGrid + arr = zarr.create_array( + store=tmp_path / "arr.zarr", + shape=shape, + chunks=chunks, + dtype="float32", + ) + assert isinstance(arr.metadata, ArrayV3Metadata) + assert isinstance(arr.metadata.chunk_grid, grid_type) + d = arr.metadata.to_dict() + chunk_grid_dict = d["chunk_grid"] + assert 
isinstance(chunk_grid_dict, dict)
+    assert chunk_grid_dict["name"] == grid_name
+
+
+def test_e2e_chunk_grid_name_roundtrip_preserves_rectilinear() -> None:
+    """A rectilinear grid with uniform edges stays 'rectilinear' through to_dict/from_dict."""
+    from zarr.core.metadata.v3 import ArrayV3Metadata, RectilinearChunkGrid
+
+    meta_dict: dict[str, Any] = {
+        "zarr_format": 3,
+        "node_type": "array",
+        "shape": [100, 100],
+        "chunk_grid": {
+            "name": "rectilinear",
+            "configuration": {"kind": "inline", "chunk_shapes": [[[50, 2]], [[25, 4]]]},
+        },
+        "chunk_key_encoding": {"name": "default"},
+        "data_type": "float32",
+        "fill_value": 0.0,
+        "codecs": [{"name": "bytes", "configuration": {"endian": "little"}}],
+    }
+    meta = ArrayV3Metadata.from_dict(meta_dict)
+    assert isinstance(meta.chunk_grid, RectilinearChunkGrid)
+    d = meta.to_dict()
+    chunk_grid_dict = d["chunk_grid"]
+    assert isinstance(chunk_grid_dict, dict)
+    assert chunk_grid_dict["name"] == "rectilinear"
+
+
+def test_e2e_chunk_grid_name_regular_from_dict() -> None:
+    """A 'regular' chunk grid name is preserved through from_dict."""
+    from zarr.core.metadata.v3 import ArrayV3Metadata, RegularChunkGrid
+
+    meta_dict: dict[str, Any] = {
+        "zarr_format": 3,
+        "node_type": "array",
+        "shape": [100, 100],
+        "chunk_grid": {
+            "name": "regular",
+            "configuration": {"chunk_shape": [50, 25]},
+        },
+        "chunk_key_encoding": {"name": "default"},
+        "data_type": "float32",
+        "fill_value": 0.0,
+        "codecs": [{"name": "bytes", "configuration": {"endian": "little"}}],
+    }
+    meta = ArrayV3Metadata.from_dict(meta_dict)
+    assert isinstance(meta.chunk_grid, RegularChunkGrid)
+    d = meta.to_dict()
+    chunk_grid_dict = d["chunk_grid"]
+    assert isinstance(chunk_grid_dict, dict)
+    assert chunk_grid_dict["name"] == "regular"
+
+
+# ---------------------------------------------------------------------------
+# Sharding compatibility tests
+# 
---------------------------------------------------------------------------
+
+
+def test_sharding_accepts_rectilinear_outer_grid() -> None:
+    """ShardingCodec.validate should not reject rectilinear outer grids
+    whose shard sizes are all divisible by the inner chunk_shape."""
+    from zarr.codecs.sharding import ShardingCodec
+    from zarr.core.dtype import Float32
+    from zarr.core.metadata.v3 import RectilinearChunkGrid
+
+    codec = ShardingCodec(chunk_shape=(5, 5))
+    grid_meta = RectilinearChunkGrid(chunk_shapes=((10, 20, 30), (50, 50)))
+
+    codec.validate(
+        shape=(60, 100),
+        dtype=Float32(),
+        chunk_grid=grid_meta,
+    )
+
+
+def test_sharding_rejects_non_divisible_rectilinear() -> None:
+    """Rectilinear shard sizes not divisible by inner chunk_shape should raise."""
+    from zarr.codecs.sharding import ShardingCodec
+    from zarr.core.dtype import Float32
+    from zarr.core.metadata.v3 import RectilinearChunkGrid
+
+    codec = ShardingCodec(chunk_shape=(5, 5))
+    grid_meta = RectilinearChunkGrid(chunk_shapes=((10, 20, 17), (50, 50)))
+
+    with pytest.raises(ValueError, match="divisible"):
+        codec.validate(
+            shape=(47, 100),
+            dtype=Float32(),
+            chunk_grid=grid_meta,
+        )
+
+
+# ---------------------------------------------------------------------------
+# Edge cases
+# ---------------------------------------------------------------------------
+
+
+def test_edge_case_chunk_grid_boundary_getitem() -> None:
+    """ChunkGrid with boundary FixedDimension via direct construction."""
+    g = ChunkGrid(dimensions=(FixedDimension(10, 95), FixedDimension(20, 40))) 
+ spec = g[(9, 1)] + assert spec is not None + assert spec.shape == (5, 20) + assert spec.codec_shape == (10, 20) + assert spec.is_boundary + + +def test_edge_case_chunk_grid_boundary_iter() -> None: + """Iterating a boundary grid yields correct boundary ChunkSpecs.""" + g = ChunkGrid(dimensions=(FixedDimension(10, 25),)) + specs = list(g) + assert len(specs) == 3 + assert specs[0].shape == (10,) + assert specs[1].shape == (10,) + assert specs[2].shape == (5,) + assert specs[2].is_boundary + assert not specs[0].is_boundary + + +def test_edge_case_chunk_grid_boundary_shape() -> None: + """shape property with boundary extent.""" + g = ChunkGrid(dimensions=(FixedDimension(10, 95),)) + assert g.grid_shape == (10,) + + +# -- Zero-size and zero-extent -- + + +@pytest.mark.parametrize( + ("size", "extent"), + [(0, 0), (0, 5), (10, 0)], + ids=["zero-size-zero-extent", "zero-size-nonzero-extent", "zero-extent-nonzero-size"], +) +def test_edge_case_zero_size_or_extent(size: int, extent: int) -> None: + """FixedDimension with zero size or extent has zero chunks and getitem returns None""" + d = FixedDimension(size=size, extent=extent) + assert d.nchunks == 0 + g = ChunkGrid(dimensions=(d,)) + assert g[0] is None + + +# -- 0-d grid -- + + +def test_0d_grid_getitem() -> None: + """0-d grid has exactly one chunk at coords ().""" + g = ChunkGrid.from_sizes((), ()) + spec = g[()] + assert spec is not None + assert spec.shape == () + assert spec.codec_shape == () + assert not spec.is_boundary + + +def test_0d_grid_iter() -> None: + """0-d grid iteration yields a single ChunkSpec.""" + g = ChunkGrid.from_sizes((), ()) + specs = list(g) + assert len(specs) == 1 + + +def test_0d_grid_all_chunk_coords() -> None: + """0-d grid has one chunk coord: the empty tuple.""" + g = ChunkGrid.from_sizes((), ()) + coords = list(g.all_chunk_coords()) + assert coords == [()] + + +def test_0d_grid_nchunks() -> None: + """0-d grid reports exactly one chunk""" + g = ChunkGrid.from_sizes((), ()) + 
assert g.get_nchunks() == 1 + + +# -- parse_chunk_grid edge cases -- + + +def test_parse_chunk_grid_preserves_varying_extent() -> None: + """parse_chunk_grid does not overwrite VaryingDimension extent.""" + g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [50, 50]]) + assert isinstance(g._dimensions[0], VaryingDimension) + assert g._dimensions[0].extent == 60 + + g2 = ChunkGrid( + dimensions=tuple( + dim.with_extent(ext) for dim, ext in zip(g._dimensions, (60, 100), strict=True) + ) + ) + assert isinstance(g2._dimensions[0], VaryingDimension) + assert g2._dimensions[0].extent == 60 + + +def test_parse_chunk_grid_rebinds_fixed_extent() -> None: + """parse_chunk_grid updates FixedDimension extent from array shape.""" + g = ChunkGrid.from_sizes((100, 200), (10, 20)) + assert g._dimensions[0].extent == 100 + + g2 = ChunkGrid( + dimensions=tuple( + dim.with_extent(ext) for dim, ext in zip(g._dimensions, (50, 100), strict=True) + ) + ) + assert isinstance(g2._dimensions[0], FixedDimension) + assert g2._dimensions[0].extent == 50 + assert g2.grid_shape == (5, 5) + + +# -- ChunkGrid.__getitem__ validation -- + + +def test_getitem_int_1d_regular() -> None: + """Integer indexing works for 1-d regular grids.""" + g = ChunkGrid.from_sizes((100,), (10,)) + spec = g[0] + assert spec is not None + assert spec.shape == (10,) + assert spec.slices == (slice(0, 10, 1),) + spec = g[9] + assert spec is not None + assert spec.shape == (10,) + + +def test_getitem_int_1d_rectilinear() -> None: + """Integer indexing works for 1-d rectilinear grids.""" + g = ChunkGrid.from_sizes((100,), [[20, 30, 50]]) + spec = g[0] + assert spec is not None + assert spec.shape == (20,) + spec = g[1] + assert spec is not None + assert spec.shape == (30,) + spec = g[2] + assert spec is not None + assert spec.shape == (50,) + + +@pytest.mark.parametrize( + ("shape", "chunks", "match"), + [ + ((), (), "Expected 0 coordinate.*got 1"), + ((100, 200), (10, 20), "Expected 2 coordinate.*got 1"), + ], + ids=["0d", 
"2d"], +) +def test_getitem_int_ndim_mismatch_raises( + shape: tuple[int, ...], chunks: tuple[int, ...], match: str +) -> None: + """Integer indexing on a multi-dim or 0-d grid raises ValueError for ndim mismatch""" + g = ChunkGrid.from_sizes(shape, chunks) + with pytest.raises(ValueError, match=match): + g[0] + + +@pytest.mark.parametrize( + "index", + [(10,), (99,), (-1,)], + ids=["oob-10", "oob-99", "negative"], +) +def test_getitem_oob_returns_none(index: tuple[int, ...]) -> None: + """Out-of-bounds or negative chunk indices return None""" + g = ChunkGrid.from_sizes((100,), (10,)) + assert g[index] is None + + +# -- Rectilinear with zero-nchunks FixedDimension -- + + +def test_zero_nchunks_fixed_dim_in_rectilinear() -> None: + """A rectilinear grid with a 0-extent FixedDimension still has valid size.""" + g = ChunkGrid( + dimensions=( + VaryingDimension([10, 20], extent=30), + FixedDimension(size=10, extent=0), + ) + ) + assert g.grid_shape == (2, 0) + + +# -- VaryingDimension data_size -- + + +def test_varying_dim_data_size_equals_chunk_size() -> None: + """For VaryingDimension, data_size == chunk_size (no padding).""" + d = VaryingDimension([10, 20, 5], extent=35) + for i in range(3): + assert d.data_size(i) == d.chunk_size(i) + + +# --------------------------------------------------------------------------- +# OrthogonalIndexer rectilinear tests +# --------------------------------------------------------------------------- + + +def test_orthogonal_int_array_selection_rectilinear() -> None: + """Integer array selection with rectilinear grid must produce correct + chunk-local selections.""" + from zarr.core.indexing import OrthogonalIndexer + + g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [50, 50]]) + indexer = OrthogonalIndexer( + selection=(np.array([5, 15, 35]), slice(None)), + shape=(60, 100), + chunk_grid=g, + ) + projections = list(indexer) + chunk_coords = [p.chunk_coords for p in projections] + assert chunk_coords == [(0, 0), (0, 1), (1, 0), (1, 
1), (2, 0), (2, 1)] + + +def test_orthogonal_bool_array_selection_rectilinear() -> None: + """Boolean array selection with rectilinear grid produces correct chunk projections.""" + from zarr.core.indexing import OrthogonalIndexer + + g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [50, 50]]) + mask = np.zeros(60, dtype=bool) + mask[5] = True + mask[15] = True + mask[35] = True + indexer = OrthogonalIndexer( + selection=(mask, slice(None)), + shape=(60, 100), + chunk_grid=g, + ) + projections = list(indexer) + assert len(projections) == 6 + chunk_coords = [p.chunk_coords for p in projections] + assert (0, 0) in chunk_coords + assert (1, 0) in chunk_coords + assert (2, 0) in chunk_coords + assert (0, 1) in chunk_coords + assert (1, 1) in chunk_coords + assert (2, 1) in chunk_coords + + +def test_orthogonal_advanced_indexing_produces_correct_projections() -> None: + """Verify OrthogonalIndexer produces correct chunk projections + for advanced indexing with VaryingDimension.""" + from zarr.core.indexing import OrthogonalIndexer + + g = ChunkGrid.from_sizes((60, 100), [[10, 20, 30], [50, 50]]) + indexer = OrthogonalIndexer( + selection=(np.array([5, 15]), slice(None)), + shape=(60, 100), + chunk_grid=g, + ) + projections = list(indexer) + assert len(projections) == 4 + coords = [p.chunk_coords for p in projections] + assert (0, 0) in coords + assert (0, 1) in coords + assert (1, 0) in coords + assert (1, 1) in coords + + +# --------------------------------------------------------------------------- +# Full pipeline rectilinear tests (helpers) +# --------------------------------------------------------------------------- + + +def _make_1d(tmp_path: Path) -> tuple[zarr.Array[Any], np.ndarray[Any, Any]]: + a = np.arange(30, dtype="int32") + z = zarr.create_array( + store=tmp_path / "arr1d.zarr", + shape=(30,), + chunks=[[5, 10, 15]], + dtype="int32", + ) + z[:] = a + return z, a + + +def _make_2d(tmp_path: Path) -> tuple[zarr.Array[Any], np.ndarray[Any, Any]]: + a = 
np.arange(6000, dtype="int32").reshape(60, 100) + z = zarr.create_array( + store=tmp_path / "arr2d.zarr", + shape=(60, 100), + chunks=[[10, 20, 30], [25, 25, 25, 25]], + dtype="int32", + ) + z[:] = a + return z, a + + +# --- Basic selection --- + + +def test_pipeline_basic_selection_1d(tmp_path: Path) -> None: + """1D rectilinear basic selections match numpy for ints, slices, and full-array reads""" + z, a = _make_1d(tmp_path) + sels: list[Any] = [0, 4, 5, 14, 15, 29, -1, slice(None), slice(3, 18), slice(0, 0)] + for sel in sels: + np.testing.assert_array_equal(z[sel], a[sel], err_msg=f"sel={sel}") + + +def test_pipeline_basic_selection_1d_strided(tmp_path: Path) -> None: + """1D rectilinear strided slice selections match numpy""" + z, a = _make_1d(tmp_path) + for sel in [slice(None, None, 2), slice(1, 25, 3), slice(0, 30, 7)]: + np.testing.assert_array_equal(z[sel], a[sel], err_msg=f"sel={sel}") + + +def test_pipeline_basic_selection_2d(tmp_path: Path) -> None: + """2D rectilinear basic selections match numpy across chunk boundaries""" + z, a = _make_2d(tmp_path) + selections: list[Any] = [ + 42, + -1, + (9, 24), + (10, 25), + (30, 50), + (59, 99), + slice(None), + (slice(5, 35), slice(20, 80)), + (slice(0, 10), slice(0, 25)), + (slice(10, 10), slice(None)), + (slice(None, None, 3), slice(None, None, 7)), + ] + for sel in selections: + np.testing.assert_array_equal(z[sel], a[sel], err_msg=f"sel={sel}") + + +# --- Orthogonal selection --- + + +def test_pipeline_orthogonal_selection_1d_bool(tmp_path: Path) -> None: + """1D boolean orthogonal indexing on rectilinear arrays matches numpy""" + z, a = _make_1d(tmp_path) + ix = np.zeros(30, dtype=bool) + ix[[0, 4, 5, 14, 15, 29]] = True + np.testing.assert_array_equal(z.oindex[ix], a[ix]) + + +def test_pipeline_orthogonal_selection_1d_int(tmp_path: Path) -> None: + """1D integer and negative-index orthogonal selection on rectilinear arrays matches numpy""" + z, a = _make_1d(tmp_path) + ix = np.array([0, 4, 5, 14, 15, 
29]) + np.testing.assert_array_equal(z.oindex[ix], a[ix]) + ix_neg = np.array([0, -1, -15, -25]) + np.testing.assert_array_equal(z.oindex[ix_neg], a[ix_neg]) + + +def test_pipeline_orthogonal_selection_2d_bool(tmp_path: Path) -> None: + """2D boolean orthogonal selection on rectilinear arrays matches numpy""" + z, a = _make_2d(tmp_path) + ix0 = np.zeros(60, dtype=bool) + ix0[[0, 9, 10, 29, 30, 59]] = True + ix1 = np.zeros(100, dtype=bool) + ix1[[0, 24, 25, 49, 50, 99]] = True + np.testing.assert_array_equal(z.oindex[ix0, ix1], a[np.ix_(ix0, ix1)]) + + +def test_pipeline_orthogonal_selection_2d_int(tmp_path: Path) -> None: + """2D integer orthogonal selection on rectilinear arrays matches numpy""" + z, a = _make_2d(tmp_path) + ix0 = np.array([0, 9, 10, 29, 30, 59]) + ix1 = np.array([0, 24, 25, 49, 50, 99]) + np.testing.assert_array_equal(z.oindex[ix0, ix1], a[np.ix_(ix0, ix1)]) + + +def test_pipeline_orthogonal_selection_2d_mixed(tmp_path: Path) -> None: + """2D mixed int-array and slice orthogonal selection on rectilinear arrays matches numpy""" + z, a = _make_2d(tmp_path) + ix = np.array([0, 9, 10, 29, 30, 59]) + np.testing.assert_array_equal(z.oindex[ix, slice(25, 75)], a[np.ix_(ix, np.arange(25, 75))]) + np.testing.assert_array_equal( + z.oindex[slice(10, 30), ix[:4]], a[np.ix_(np.arange(10, 30), ix[:4])] + ) + + +# --- Coordinate (vindex) selection --- + + +def test_pipeline_coordinate_selection_1d(tmp_path: Path) -> None: + """1D coordinate (vindex) selection on rectilinear arrays matches numpy""" + z, a = _make_1d(tmp_path) + ix = np.array([0, 4, 5, 14, 15, 29]) + np.testing.assert_array_equal(z.vindex[ix], a[ix]) + + +def test_pipeline_coordinate_selection_2d(tmp_path: Path) -> None: + """2D coordinate (vindex) selection on rectilinear arrays matches numpy""" + z, a = _make_2d(tmp_path) + r = np.array([0, 9, 10, 29, 30, 59]) + c = np.array([0, 24, 25, 49, 50, 99]) + np.testing.assert_array_equal(z.vindex[r, c], a[r, c]) + + +def 
test_pipeline_coordinate_selection_2d_bool_mask(tmp_path: Path) -> None: + """2D boolean mask vindex selection on rectilinear arrays matches numpy""" + z, a = _make_2d(tmp_path) + mask = a > 3000 + np.testing.assert_array_equal(z.vindex[mask], a[mask]) + + +# --- Block selection --- + + +def test_pipeline_block_selection_1d(tmp_path: Path) -> None: + """1D block selection on rectilinear arrays returns correct chunk data""" + z, a = _make_1d(tmp_path) + np.testing.assert_array_equal(z.blocks[0], a[0:5]) + np.testing.assert_array_equal(z.blocks[1], a[5:15]) + np.testing.assert_array_equal(z.blocks[2], a[15:30]) + np.testing.assert_array_equal(z.blocks[-1], a[15:30]) + np.testing.assert_array_equal(z.blocks[0:2], a[0:15]) + np.testing.assert_array_equal(z.blocks[1:3], a[5:30]) + np.testing.assert_array_equal(z.blocks[:], a[:]) + + +def test_pipeline_block_selection_2d(tmp_path: Path) -> None: + """2D block selection on rectilinear arrays returns correct chunk data""" + z, a = _make_2d(tmp_path) + np.testing.assert_array_equal(z.blocks[0, 0], a[0:10, 0:25]) + np.testing.assert_array_equal(z.blocks[1, 2], a[10:30, 50:75]) + np.testing.assert_array_equal(z.blocks[2, 3], a[30:60, 75:100]) + np.testing.assert_array_equal(z.blocks[-1, -1], a[30:60, 75:100]) + np.testing.assert_array_equal(z.blocks[0:2, 1:3], a[0:30, 25:75]) + np.testing.assert_array_equal(z.blocks[:, :], a[:, :]) + + +def test_pipeline_set_block_selection_1d(tmp_path: Path) -> None: + """Writing via 1D block selection on rectilinear arrays persists correctly""" + z, a = _make_1d(tmp_path) + val = np.full(10, -1, dtype="int32") + z.blocks[1] = val + a[5:15] = val + np.testing.assert_array_equal(z[:], a) + + +def test_pipeline_set_block_selection_2d(tmp_path: Path) -> None: + """Writing via 2D block selection on rectilinear arrays persists correctly""" + z, a = _make_2d(tmp_path) + val = np.full((30, 50), -99, dtype="int32") + z.blocks[0:2, 1:3] = val + a[0:30, 25:75] = val + 
np.testing.assert_array_equal(z[:], a) + + +def test_pipeline_block_selection_slice_stop_at_nchunks(tmp_path: Path) -> None: + """Block slice with stop == nchunks exercises the dim_len fallback.""" + z, a = _make_1d(tmp_path) + np.testing.assert_array_equal(z.blocks[1:3], a[5:30]) + np.testing.assert_array_equal(z.blocks[0:10], a[:]) + + +def test_pipeline_block_selection_slice_stop_at_nchunks_2d(tmp_path: Path) -> None: + """Same fallback test for 2D rectilinear arrays.""" + z, a = _make_2d(tmp_path) + np.testing.assert_array_equal(z.blocks[2:3, 3:4], a[30:60, 75:100]) + np.testing.assert_array_equal(z.blocks[0:99, 0:99], a[:, :]) + + +# --- Set coordinate selection --- + + +def test_pipeline_set_coordinate_selection_1d(tmp_path: Path) -> None: + """Writing via 1D coordinate selection on rectilinear arrays persists correctly""" + z, a = _make_1d(tmp_path) + ix = np.array([0, 4, 5, 14, 15, 29]) + val = np.full(len(ix), -7, dtype="int32") + z.vindex[ix] = val + a[ix] = val + np.testing.assert_array_equal(z[:], a) + + +def test_pipeline_set_coordinate_selection_2d(tmp_path: Path) -> None: + """Writing via 2D coordinate selection on rectilinear arrays persists correctly""" + z, a = _make_2d(tmp_path) + r = np.array([0, 9, 10, 29, 30, 59]) + c = np.array([0, 24, 25, 49, 50, 99]) + val = np.full(len(r), -42, dtype="int32") + z.vindex[r, c] = val + a[r, c] = val + np.testing.assert_array_equal(z[:], a) + + +# --- Set selection --- + + +def test_pipeline_set_basic_selection(tmp_path: Path) -> None: + """Writing via basic slice selection on rectilinear arrays persists correctly""" + z, a = _make_2d(tmp_path) + new_data = np.full((20, 50), -1, dtype="int32") + z[5:25, 10:60] = new_data + a[5:25, 10:60] = new_data + np.testing.assert_array_equal(z[:], a) + + +def test_pipeline_set_orthogonal_selection(tmp_path: Path) -> None: + """Writing via orthogonal selection on rectilinear arrays persists correctly""" + z, a = _make_2d(tmp_path) + rows = np.array([0, 10, 30]) + cols = 
np.array([0, 25, 50, 75]) + val = np.full((3, 4), -99, dtype="int32") + z.oindex[rows, cols] = val + a[np.ix_(rows, cols)] = val + np.testing.assert_array_equal(z[:], a) + + +# --- Higher dimensions --- + + +def test_pipeline_3d_array(tmp_path: Path) -> None: + """3D rectilinear array write and read-back match numpy""" + shape = (12, 20, 15) + chunk_shapes = [[4, 8], [5, 5, 10], [5, 10]] + a = np.arange(int(np.prod(shape)), dtype="int32").reshape(shape) + z = zarr.create_array( + store=tmp_path / "arr3d.zarr", + shape=shape, + chunks=chunk_shapes, + dtype="int32", + ) + z[:] = a + np.testing.assert_array_equal(z[:], a) + np.testing.assert_array_equal(z[2:10, 3:18, 4:14], a[2:10, 3:18, 4:14]) + + +def test_pipeline_1d_single_chunk(tmp_path: Path) -> None: + """Single-chunk rectilinear array write and read-back match numpy""" + a = np.arange(20, dtype="int32") + z = zarr.create_array( + store=tmp_path / "arr1c.zarr", + shape=(20,), + chunks=[[20]], + dtype="int32", + ) + z[:] = a + np.testing.assert_array_equal(z[:], a) + + +# --- Persistence roundtrip --- + + +def test_pipeline_persistence_roundtrip(tmp_path: Path) -> None: + """Rectilinear array survives close and reopen with correct data""" + _, a = _make_2d(tmp_path) + z2 = zarr.open_array(store=tmp_path / "arr2d.zarr", mode="r") + assert not ChunkGrid.from_metadata(z2.metadata).is_regular + np.testing.assert_array_equal(z2[:], a) + + +# --- Highly irregular chunks --- + + +def test_pipeline_highly_irregular_chunks(tmp_path: Path) -> None: + """Highly irregular chunk sizes produce correct write and partial-read results""" + shape = (100, 100) + chunk_shapes = [[5, 10, 15, 20, 50], [100]] + a = np.arange(10000, dtype="int32").reshape(shape) + z = zarr.create_array( + store=tmp_path / "irreg.zarr", + shape=shape, + chunks=chunk_shapes, + dtype="int32", + ) + z[:] = a + np.testing.assert_array_equal(z[:], a) + np.testing.assert_array_equal(z[3:97, 10:90], a[3:97, 10:90]) + + +# --- API validation --- + + +def 
test_pipeline_v2_rejects_rectilinear(tmp_path: Path) -> None: + """Creating a rectilinear array with zarr_format=2 raises ValueError""" + with pytest.raises(ValueError, match="Zarr format 2"): + zarr.create_array( + store=tmp_path / "v2.zarr", + shape=(30,), + chunks=[[10, 20]], + dtype="int32", + zarr_format=2, + ) + + +def test_pipeline_sharding_rejects_rectilinear_chunks_with_shards(tmp_path: Path) -> None: + """Combining rectilinear inner chunks with sharding is not supported.""" + with pytest.raises(ValueError, match="Rectilinear chunks with sharding"): + zarr.create_array( + store=tmp_path / "shard.zarr", + shape=(60, 100), + chunks=[[10, 20, 30], [25, 25, 25, 25]], + shards=(30, 50), + dtype="int32", + ) + + +def test_pipeline_rectilinear_shards_roundtrip(tmp_path: Path) -> None: + """Rectilinear shards with uniform inner chunks: full write/read roundtrip.""" + data = np.arange(120 * 100, dtype="int32").reshape(120, 100) + arr = zarr.create_array( + store=tmp_path / "rect_shards.zarr", + shape=(120, 100), + chunks=(10, 10), + shards=[[60, 40, 20], [50, 50]], + dtype="int32", + ) + arr[:] = data + result = arr[:] + np.testing.assert_array_equal(result, data) + + +def test_pipeline_rectilinear_shards_partial_read(tmp_path: Path) -> None: + """Partial reads across rectilinear shard boundaries.""" + data = np.arange(120 * 100, dtype="float64").reshape(120, 100) + arr = zarr.create_array( + store=tmp_path / "rect_shards.zarr", + shape=(120, 100), + chunks=(10, 10), + shards=[[60, 40, 20], [50, 50]], + dtype="float64", + ) + arr[:] = data + result = arr[50:70, 40:60] + np.testing.assert_array_equal(result, data[50:70, 40:60]) + + +def test_pipeline_rectilinear_shards_validates_divisibility(tmp_path: Path) -> None: + """Inner chunk_shape must divide every shard's dimensions.""" + with pytest.raises(ValueError, match="divisible"): + zarr.create_array( + store=tmp_path / "bad.zarr", + shape=(120, 100), + chunks=(10, 10), + shards=[[60, 45, 15], [50, 50]], + dtype="int32", + 
) + + +def test_pipeline_nchunks(tmp_path: Path) -> None: + """Rectilinear array reports the correct total number of chunks""" + z, _ = _make_2d(tmp_path) + assert ChunkGrid.from_metadata(z.metadata).get_nchunks() == 12 + + +def test_pipeline_parse_chunk_grid_regular_from_dict() -> None: + """parse_chunk_grid constructs a regular grid from a metadata dict.""" + d: dict[str, Any] = {"name": "regular", "configuration": {"chunk_shape": [10, 20]}} + meta = parse_chunk_grid(d) + assert isinstance(meta, RegularChunkGridMeta) + g = ChunkGrid.from_sizes((100, 200), tuple(meta.chunk_shape)) + assert g.is_regular + assert g.chunk_shape == (10, 20) + assert g.grid_shape == (10, 10) + assert g.get_nchunks() == 100 + + +# --------------------------------------------------------------------------- +# VaryingDimension boundary tests +# --------------------------------------------------------------------------- + + +@pytest.mark.parametrize( + ("edges", "extent", "chunk_idx", "expected_data_size"), + [ + ([10, 20, 30], 50, 0, 10), + ([10, 20, 30], 50, 1, 20), + ([10, 20, 30], 50, 2, 20), + ([10, 20, 30], 60, 2, 30), + ([10, 20, 30], 31, 0, 10), + ([10, 20, 30], 31, 1, 20), + ([10, 20, 30], 31, 2, 1), + ], + ids=[ + "interior-0", + "interior-1", + "boundary-clipped", + "exact-no-clip", + "single-element-boundary-0", + "single-element-boundary-1", + "single-element-boundary-2", + ], +) +def test_varying_dimension_boundary_data_size( + edges: list[int], extent: int, chunk_idx: int, expected_data_size: int +) -> None: + """VaryingDimension.data_size clips correctly at boundary chunks""" + d = VaryingDimension(edges, extent=extent) + assert d.data_size(chunk_idx) == expected_data_size + + +def test_varying_dimension_boundary_extent_parameter() -> None: + """VaryingDimension preserves extent and full chunk_size even when extent < sum of edges""" + d = VaryingDimension([10, 20, 30], extent=50) + assert d.extent == 50 + assert d.chunk_size(2) == 30 + + +def 
test_varying_dimension_extent_exceeds_sum_rejected() -> None: + """VaryingDimension rejects extent greater than sum of edges""" + with pytest.raises(ValueError, match="exceeds sum of edges"): + VaryingDimension([10, 20], extent=50) + + +def test_varying_dimension_negative_extent_rejected() -> None: + """VaryingDimension rejects negative extent""" + with pytest.raises(ValueError, match="must be >= 0"): + VaryingDimension([10, 20], extent=-1) + + +def test_varying_dimension_boundary_chunk_spec() -> None: + """ChunkGrid with a boundary VaryingDimension produces correct ChunkSpec.""" + g = ChunkGrid(dimensions=(VaryingDimension([10, 20, 30], extent=50),)) + spec = g[(2,)] + assert spec is not None + assert spec.codec_shape == (30,) + assert spec.shape == (20,) + assert spec.is_boundary is True + + +def test_varying_dimension_interior_chunk_spec() -> None: + """Interior VaryingDimension chunk has matching codec_shape and shape with no boundary""" + g = ChunkGrid(dimensions=(VaryingDimension([10, 20, 30], extent=50),)) + spec = g[(0,)] + assert spec is not None + assert spec.codec_shape == (10,) + assert spec.shape == (10,) + assert spec.is_boundary is False + + +# --------------------------------------------------------------------------- +# Multiple overflow chunks tests +# --------------------------------------------------------------------------- + + +def test_overflow_multiple_chunks_past_extent() -> None: + """Edges past extent are structural; nchunks counts active only.""" + g = ChunkGrid.from_sizes((50,), [[10, 20, 30, 40]]) + d = g._dimensions[0] + assert d.ngridcells == 4 + assert d.nchunks == 3 + assert d.data_size(0) == 10 + assert d.data_size(1) == 20 + assert d.data_size(2) == 20 + assert d.chunk_size(2) == 30 + + +def test_overflow_chunk_spec_past_extent_is_oob() -> None: + """Chunk entirely past the extent is out of bounds (not active).""" + g = ChunkGrid.from_sizes((50,), [[10, 20, 30, 40]]) + spec = g[(3,)] + assert spec is None + + +def 
test_overflow_chunk_spec_partial() -> None: + """ChunkSpec for a partially-overflowing chunk clips correctly.""" + g = ChunkGrid.from_sizes((50,), [[10, 20, 30, 40]]) + spec = g[(2,)] + assert spec is not None + assert spec.shape == (20,) + assert spec.codec_shape == (30,) + assert spec.is_boundary is True + assert spec.slices == (slice(30, 50, 1),) + + +def test_overflow_chunk_sizes() -> None: + """chunk_sizes only includes active chunks.""" + g = ChunkGrid.from_sizes((50,), [[10, 20, 30, 40]]) + assert g.chunk_sizes == ((10, 20, 20),) + + +def test_overflow_multidim() -> None: + """Overflow in multiple dimensions simultaneously.""" + g = ChunkGrid.from_sizes((45, 100), [[10, 20, 30], [40, 40, 40]]) + assert g.chunk_sizes == ((10, 20, 15), (40, 40, 20)) + spec = g[(2, 2)] + assert spec is not None + assert spec.shape == (15, 20) + assert spec.codec_shape == (30, 40) + + +def test_overflow_uniform_edges_collapses_to_fixed() -> None: + """Uniform edges where len == ceildiv(extent, edge) collapse to FixedDimension.""" + g = ChunkGrid.from_sizes((35,), [[10, 10, 10, 10]]) + assert isinstance(g._dimensions[0], FixedDimension) + assert g.is_regular + assert g.chunk_sizes == ((10, 10, 10, 5),) + assert g._dimensions[0].nchunks == 4 + + +def test_overflow_index_to_chunk_near_extent() -> None: + """Index lookup near and at the extent boundary.""" + d = VaryingDimension([10, 20, 30, 40], extent=50) + assert d.index_to_chunk(29) == 1 + assert d.index_to_chunk(30) == 2 + assert d.index_to_chunk(49) == 2 + + +# --------------------------------------------------------------------------- +# Boundary indexing tests +# --------------------------------------------------------------------------- + + +@pytest.mark.parametrize( + ( + "dim", + "mask", + "dim_len", + "expected_chunk_ix", + "expected_sel_len", + "expected_first_two", + "expected_third", + ), + [ + ( + FixedDimension(size=5, extent=7), + np.array([False, False, False, False, False, True, True]), + 7, + 1, + 5, + 
(np.True_, np.True_), + np.False_, + ), + ( + VaryingDimension([5, 10], extent=7), + np.array([False, False, False, False, False, True, True]), + 7, + 1, + 10, + (np.True_, np.True_), + np.False_, + ), + ], + ids=["fixed-boundary", "varying-boundary"], +) +def test_bool_indexer_boundary( + dim: FixedDimension | VaryingDimension, + mask: np.ndarray[Any, Any], + dim_len: int, + expected_chunk_ix: int, + expected_sel_len: int, + expected_first_two: tuple[Any, Any], + expected_third: Any, +) -> None: + """BoolArrayDimIndexer pads to codec size for boundary chunks.""" + from zarr.core.indexing import BoolArrayDimIndexer + + indexer = BoolArrayDimIndexer(mask, dim_len, dim) + projections = list(indexer) + assert len(projections) == 1 + p = projections[0] + assert p.dim_chunk_ix == expected_chunk_ix + sel = p.dim_chunk_sel + assert isinstance(sel, np.ndarray) + assert sel.shape[0] == expected_sel_len + assert sel[0] == expected_first_two[0] + assert sel[1] == expected_first_two[1] + assert sel[2] == expected_third + + +def test_bool_indexer_no_padding_interior() -> None: + """No padding needed for interior chunks.""" + from zarr.core.indexing import BoolArrayDimIndexer + + dim = FixedDimension(size=5, extent=10) + mask = np.array([True, False, False, False, False, False, False, False, False, False]) + indexer = BoolArrayDimIndexer(mask, 10, dim) + projections = list(indexer) + assert len(projections) == 1 + p = projections[0] + assert p.dim_chunk_ix == 0 + sel = p.dim_chunk_sel + assert isinstance(sel, np.ndarray) + assert sel.shape[0] == 5 + + +def test_slice_indexer_varying_boundary() -> None: + """SliceDimIndexer clips to data_size at boundary for VaryingDimension.""" + from zarr.core.indexing import SliceDimIndexer + + dim = VaryingDimension([5, 10], extent=7) + indexer = SliceDimIndexer(slice(None), 7, dim) + projections = list(indexer) + assert len(projections) == 2 + assert projections[0].dim_chunk_sel == slice(0, 5, 1) + assert projections[1].dim_chunk_sel == 
slice(0, 2, 1) + + +def test_int_array_indexer_varying_boundary() -> None: + """IntArrayDimIndexer handles indices near boundary correctly.""" + from zarr.core.indexing import IntArrayDimIndexer + + dim = VaryingDimension([5, 10], extent=7) + indices = np.array([6]) + indexer = IntArrayDimIndexer(indices, 7, dim) + projections = list(indexer) + assert len(projections) == 1 + assert projections[0].dim_chunk_ix == 1 + sel = projections[0].dim_chunk_sel + assert isinstance(sel, np.ndarray) + np.testing.assert_array_equal(sel, [1]) + + +@pytest.mark.parametrize( + "dim", + [FixedDimension(size=2, extent=10), VaryingDimension([5, 5], extent=10)], + ids=["fixed", "varying"], +) +def test_slice_indexer_empty_slice_at_boundary(dim: FixedDimension | VaryingDimension) -> None: + """SliceDimIndexer yields no projections for an empty slice at the dimension boundary.""" + from zarr.core.indexing import SliceDimIndexer + + indexer = SliceDimIndexer(slice(10, 10), 10, dim) + projections = list(indexer) + assert len(projections) == 0 + + +def test_orthogonal_indexer_varying_boundary_advanced() -> None: + """OrthogonalIndexer with advanced indexing uses per-chunk chunk_size.""" + from zarr.core.indexing import OrthogonalIndexer + + g = ChunkGrid( + dimensions=( + VaryingDimension([5, 10], extent=7), + FixedDimension(size=4, extent=8), + ) + ) + indexer = OrthogonalIndexer( + selection=(np.array([0, 6]), slice(None)), + shape=(7, 8), + chunk_grid=g, + ) + projections = list(indexer) + assert len(projections) == 4 + coords = {p.chunk_coords for p in projections} + assert coords == {(0, 0), (0, 1), (1, 0), (1, 1)} + + +# --------------------------------------------------------------------------- +# update_shape tests +# --------------------------------------------------------------------------- + + +def test_update_shape_no_change() -> None: + """update_shape with the same shape preserves edges unchanged""" + grid = ChunkGrid.from_sizes((60, 50), [[10, 20, 30], [25, 25]]) + new_grid = 
grid.update_shape((60, 50)) + assert _edges(new_grid, 0) == (10, 20, 30) + assert _edges(new_grid, 1) == (25, 25) + + +def test_update_shape_grow_single_dim() -> None: + """Growing a single dimension appends a new edge chunk""" + grid = ChunkGrid.from_sizes((60, 50), [[10, 20, 30], [25, 25]]) + new_grid = grid.update_shape((80, 50)) + assert _edges(new_grid, 0) == (10, 20, 30, 20) + assert _edges(new_grid, 1) == (25, 25) + + +def test_update_shape_grow_multiple_dims() -> None: + """Growing multiple dimensions appends correctly sized edge chunks""" + grid = ChunkGrid.from_sizes((30, 50), [[10, 20], [20, 30]]) + new_grid = grid.update_shape((45, 65)) + assert _edges(new_grid, 0) == (10, 20, 15) + assert _edges(new_grid, 1) == (20, 30, 15) + + +def test_update_shape_shrink_single_dim() -> None: + """Shrinking a single dimension reduces nchunks while preserving edges""" + grid = ChunkGrid.from_sizes((100, 50), [[10, 20, 30, 40], [25, 25]]) + new_grid = grid.update_shape((35, 50)) + assert _edges(new_grid, 0) == (10, 20, 30, 40) + assert new_grid._dimensions[0].nchunks == 3 + assert _edges(new_grid, 1) == (25, 25) + + +def test_update_shape_shrink_to_single_chunk() -> None: + """Shrinking to fit within the first chunk reduces nchunks to 1""" + grid = ChunkGrid.from_sizes((60, 50), [[10, 20, 30], [25, 25]]) + new_grid = grid.update_shape((5, 50)) + assert _edges(new_grid, 0) == (10, 20, 30) + assert new_grid._dimensions[0].nchunks == 1 + assert _edges(new_grid, 1) == (25, 25) + + +def test_update_shape_shrink_multiple_dims() -> None: + """Shrinking multiple dimensions reduces nchunks in each dimension""" + grid = ChunkGrid.from_sizes((40, 60), [[10, 10, 15, 5], [20, 25, 15]]) + new_grid = grid.update_shape((25, 35)) + assert _edges(new_grid, 0) == (10, 10, 15, 5) + assert new_grid._dimensions[0].nchunks == 3 + assert _edges(new_grid, 1) == (20, 25, 15) + assert new_grid._dimensions[1].nchunks == 2 + + +def test_update_shape_dimension_mismatch_error() -> None: + 
"""update_shape raises ValueError when new shape has different ndim""" + grid = ChunkGrid.from_sizes((30, 70), [[10, 20], [30, 40]]) + with pytest.raises(ValueError, match="dimensions"): + grid.update_shape((30, 70, 100)) + + +def test_update_shape_boundary_cases() -> None: + """update_shape handles grow-one-dim and shrink-both-dims edge cases correctly""" + grid = ChunkGrid.from_sizes((60, 40), [[10, 20, 30], [15, 25]]) + new_grid = grid.update_shape((60, 65)) + assert _edges(new_grid, 0) == (10, 20, 30) + assert _edges(new_grid, 1) == (15, 25, 25) + + grid2 = ChunkGrid.from_sizes((60, 50), [[10, 20, 30], [15, 25, 10]]) + new_grid2 = grid2.update_shape((30, 40)) + assert _edges(new_grid2, 0) == (10, 20, 30) + assert new_grid2._dimensions[0].nchunks == 2 + assert _edges(new_grid2, 1) == (15, 25, 10) + assert new_grid2._dimensions[1].nchunks == 2 + + +def test_update_shape_regular_preserves_extents(tmp_path: Path) -> None: + """Resize a regular array -- chunk_grid extents must match new shape.""" + z = zarr.create_array( + store=tmp_path / "regular.zarr", + shape=(100,), + chunks=(10,), + dtype="int32", + ) + z[:] = np.arange(100, dtype="int32") + z.resize(50) + assert z.shape == (50,) + assert ChunkGrid.from_metadata(z.metadata)._dimensions[0].extent == 50 + + +# --------------------------------------------------------------------------- +# update_shape boundary tests +# --------------------------------------------------------------------------- + + +def test_update_shape_shrink_creates_boundary() -> None: + """Shrinking extent into a chunk creates a boundary with clipped data_size""" + grid = ChunkGrid.from_sizes((60,), [[10, 20, 30]]) + new_grid = grid.update_shape((45,)) + dim = new_grid._dimensions[0] + assert isinstance(dim, VaryingDimension) + assert dim.edges == (10, 20, 30) + assert dim.extent == 45 + assert dim.chunk_size(2) == 30 + assert dim.data_size(2) == 15 + + +def test_update_shape_shrink_to_exact_boundary() -> None: + """Shrinking to an exact chunk 
boundary reduces nchunks without partial data""" + grid = ChunkGrid.from_sizes((60,), [[10, 20, 30]]) + new_grid = grid.update_shape((30,)) + dim = new_grid._dimensions[0] + assert isinstance(dim, VaryingDimension) + assert dim.edges == (10, 20, 30) + assert dim.nchunks == 2 + assert dim.ngridcells == 3 + assert dim.extent == 30 + assert dim.data_size(1) == 20 + + +def test_update_shape_shrink_chunk_spec() -> None: + """After shrink, ChunkSpec reflects boundary correctly.""" + grid = ChunkGrid.from_sizes((60,), [[10, 20, 30]]) + new_grid = grid.update_shape((45,)) + spec = new_grid[(2,)] + assert spec is not None + assert spec.codec_shape == (30,) + assert spec.shape == (15,) + assert spec.is_boundary is True + + +def test_update_shape_parse_chunk_grid_rebinds_extent() -> None: + """parse_chunk_grid re-binds VaryingDimension extent to array shape.""" + g = ChunkGrid.from_sizes((60,), [[10, 20, 30]]) + g2 = ChunkGrid( + dimensions=tuple( + dim.with_extent(ext) for dim, ext in zip(g._dimensions, (50,), strict=True) + ) + ) + dim = g2._dimensions[0] + assert isinstance(dim, VaryingDimension) + assert dim.extent == 50 + assert dim.data_size(2) == 20 + + +# --------------------------------------------------------------------------- +# Resize rectilinear tests +# --------------------------------------------------------------------------- + + +async def test_async_resize_grow() -> None: + """Async resize grow appends new edge chunks and preserves existing data""" + store = zarr.storage.MemoryStore() + arr = await zarr.api.asynchronous.create_array( + store=store, + shape=(30, 40), + chunks=[[10, 20], [20, 20]], + dtype="i4", + zarr_format=3, + ) + data = np.arange(30 * 40, dtype="i4").reshape(30, 40) + await arr.setitem(slice(None), data) + + await arr.resize((50, 60)) + assert arr.shape == (50, 60) + assert _edges(ChunkGrid.from_metadata(arr.metadata), 0) == (10, 20, 20) + assert _edges(ChunkGrid.from_metadata(arr.metadata), 1) == (20, 20, 20) + result = await 
arr.getitem((slice(0, 30), slice(0, 40))) + np.testing.assert_array_equal(result, data) + + +async def test_async_resize_shrink() -> None: + """Async resize shrink truncates data to the new shape""" + store = zarr.storage.MemoryStore() + arr = await zarr.api.asynchronous.create_array( + store=store, + shape=(60, 50), + chunks=[[10, 20, 30], [25, 25]], + dtype="f4", + zarr_format=3, + ) + data = np.arange(60 * 50, dtype="f4").reshape(60, 50) + await arr.setitem(slice(None), data) + + await arr.resize((25, 30)) + assert arr.shape == (25, 30) + result = await arr.getitem(slice(None)) + np.testing.assert_array_equal(result, data[:25, :30]) + + +def test_sync_resize_grow() -> None: + """Sync resize grow expands the array and preserves existing data""" + store = zarr.storage.MemoryStore() + arr = zarr.create_array( + store=store, + shape=(20, 30), + chunks=[[8, 12], [10, 20]], + dtype="u1", + zarr_format=3, + ) + data = np.arange(20 * 30, dtype="u1").reshape(20, 30) + arr[:] = data + arr.resize((35, 45)) + assert arr.shape == (35, 45) + np.testing.assert_array_equal(arr[:20, :30], data) + + +def test_sync_resize_shrink() -> None: + """Sync resize shrink truncates the array and returns correct data""" + store = zarr.storage.MemoryStore() + arr = zarr.create_array( + store=store, + shape=(40, 50), + chunks=[[10, 15, 15], [20, 30]], + dtype="i2", + zarr_format=3, + ) + data = np.arange(40 * 50, dtype="i2").reshape(40, 50) + arr[:] = data + arr.resize((15, 30)) + assert arr.shape == (15, 30) + np.testing.assert_array_equal(arr[:], data[:15, :30]) + + +# --------------------------------------------------------------------------- +# Append rectilinear tests +# --------------------------------------------------------------------------- + + +async def test_append_first_axis() -> None: + """Appending along axis 0 grows the array and concatenates data correctly""" + store = zarr.storage.MemoryStore() + arr = await zarr.api.asynchronous.create_array( + store=store, + shape=(30, 
20), + chunks=[[10, 20], [10, 10]], + dtype="i4", + zarr_format=3, + ) + initial = np.arange(30 * 20, dtype="i4").reshape(30, 20) + await arr.setitem(slice(None), initial) + + append_data = np.arange(30 * 20, 45 * 20, dtype="i4").reshape(15, 20) + await arr.append(append_data, axis=0) + assert arr.shape == (45, 20) + + result = await arr.getitem(slice(None)) + np.testing.assert_array_equal(result, np.vstack([initial, append_data])) + + +async def test_append_second_axis() -> None: + """Appending along axis 1 grows the array and concatenates data correctly""" + store = zarr.storage.MemoryStore() + arr = await zarr.api.asynchronous.create_array( + store=store, + shape=(20, 30), + chunks=[[10, 10], [10, 20]], + dtype="f4", + zarr_format=3, + ) + initial = np.arange(20 * 30, dtype="f4").reshape(20, 30) + await arr.setitem(slice(None), initial) + + append_data = np.arange(20 * 30, 20 * 45, dtype="f4").reshape(20, 15) + await arr.append(append_data, axis=1) + assert arr.shape == (20, 45) + + result = await arr.getitem(slice(None)) + np.testing.assert_array_equal(result, np.hstack([initial, append_data])) + + +def test_sync_append() -> None: + """Sync append grows the array and preserves both initial and appended data""" + store = zarr.storage.MemoryStore() + arr = zarr.create_array( + store=store, + shape=(20, 20), + chunks=[[8, 12], [7, 13]], + dtype="u2", + zarr_format=3, + ) + initial = np.arange(20 * 20, dtype="u2").reshape(20, 20) + arr[:] = initial + + append_data = np.arange(20 * 20, 25 * 20, dtype="u2").reshape(5, 20) + arr.append(append_data, axis=0) + assert arr.shape == (25, 20) + np.testing.assert_array_equal(arr[:20, :], initial) + np.testing.assert_array_equal(arr[20:, :], append_data) + + +async def test_multiple_appends() -> None: + """Multiple sequential appends accumulate data correctly""" + store = zarr.storage.MemoryStore() + arr = await zarr.api.asynchronous.create_array( + store=store, + shape=(10, 10), + chunks=[[3, 7], [4, 6]], + dtype="i4", + 
zarr_format=3,
+    )
+    initial = np.arange(10 * 10, dtype="i4").reshape(10, 10)
+    await arr.setitem(slice(None), initial)
+
+    all_data = [initial]
+    for i in range(3):
+        chunk = np.full((5, 10), i + 100, dtype="i4")
+        await arr.append(chunk, axis=0)
+        all_data.append(chunk)
+
+    assert arr.shape == (25, 10)
+    result = await arr.getitem(slice(None))
+    np.testing.assert_array_equal(result, np.vstack(all_data))
+
+
+async def test_append_with_partial_edge_chunks() -> None:
+    """Appending data that creates partial edge chunks preserves all data."""
+    store = zarr.storage.MemoryStore()
+    arr = await zarr.api.asynchronous.create_array(
+        store=store,
+        shape=(25, 30),
+        chunks=[[10, 15], [12, 18]],
+        dtype="f8",
+        zarr_format=3,
+    )
+    initial = np.random.default_rng(42).random((25, 30))
+    await arr.setitem(slice(None), initial)
+
+    append_data = np.random.default_rng(43).random((10, 30))
+    await arr.append(append_data, axis=0)
+    assert arr.shape == (35, 30)
+
+    result = np.asarray(await arr.getitem(slice(None)))
+    np.testing.assert_array_almost_equal(result, np.vstack([initial, append_data]))
+
+
+async def test_append_small_data() -> None:
+    """Appending data smaller than a single chunk works correctly."""
+    store = zarr.storage.MemoryStore()
+    arr = await zarr.api.asynchronous.create_array(
+        store=store,
+        shape=(20, 20),
+        chunks=[[8, 12], [7, 13]],
+        dtype="i4",
+        zarr_format=3,
+    )
+    data = np.arange(20 * 20, dtype="i4").reshape(20, 20)
+    await arr.setitem(slice(None), data)
+
+    small = np.full((3, 20), 999, dtype="i4")
+    await arr.append(small, axis=0)
+    assert arr.shape == (23, 20)
+    result = await arr.getitem((slice(20, 23), slice(None)))
+    np.testing.assert_array_equal(result, small)
+
+
+# ---------------------------------------------------------------------------
+# V2 regression tests
+# ---------------------------------------------------------------------------
+
+
+def test_v2_create_and_readback(tmp_path: Path) -> None:
+    """Basic V2 array: create, write, read back."""
+    data = np.arange(60, dtype="float64").reshape(6, 10)
+    a = zarr.create_array(
+        store=tmp_path / "v2.zarr",
+        shape=data.shape,
+        chunks=(3, 5),
+        dtype=data.dtype,
+        zarr_format=2,
+    )
+    a[:] = data
+    np.testing.assert_array_equal(a[:], data)
+
+
+def test_v2_chunk_grid_is_regular(tmp_path: Path) -> None:
+    """V2 chunk_grid produces a regular ChunkGrid with FixedDimensions."""
+    a = zarr.create_array(
+        store=tmp_path / "v2.zarr",
+        shape=(20, 30),
+        chunks=(10, 15),
+        dtype="int32",
+        zarr_format=2,
+    )
+    grid = ChunkGrid.from_metadata(a.metadata)
+    assert grid.is_regular
+    assert grid.chunk_shape == (10, 15)
+    assert grid.grid_shape == (2, 2)
+    assert all(isinstance(d, FixedDimension) for d in grid._dimensions)
+
+
+def test_v2_boundary_chunks(tmp_path: Path) -> None:
+    """V2 boundary chunks: codec buffer size stays full, data is clipped."""
+    a = zarr.create_array(
+        store=tmp_path / "v2.zarr",
+        shape=(25,),
+        chunks=(10,),
+        dtype="int32",
+        zarr_format=2,
+    )
+    grid = ChunkGrid.from_metadata(a.metadata)
+    assert grid._dimensions[0].nchunks == 3
+    assert grid._dimensions[0].chunk_size(2) == 10
+    assert grid._dimensions[0].data_size(2) == 5
+
+
+def test_v2_slicing_with_boundary(tmp_path: Path) -> None:
+    """V2 array slicing across boundary chunks returns correct data."""
+    data = np.arange(25, dtype="int32")
+    a = zarr.create_array(
+        store=tmp_path / "v2.zarr",
+        shape=(25,),
+        chunks=(10,),
+        dtype="int32",
+        zarr_format=2,
+    )
+    a[:] = data
+    np.testing.assert_array_equal(a[18:25], data[18:25])
+    np.testing.assert_array_equal(a[:], data)
+
+
+def test_v2_metadata_roundtrip(tmp_path: Path) -> None:
+    """V2 metadata survives store close and reopen."""
+    store_path = tmp_path / "v2.zarr"
+    data = np.arange(12, dtype="float32").reshape(3, 4)
+    a = zarr.create_array(
+        store=store_path,
+        shape=data.shape,
+        chunks=(2, 2),
+        dtype=data.dtype,
+        zarr_format=2,
+    )
+    a[:] = data
+
+    b = zarr.open_array(store=store_path, mode="r")
+    assert b.metadata.zarr_format == 2
+    assert b.chunks == (2, 2)
+    assert ChunkGrid.from_metadata(b.metadata).chunk_shape == (2, 2)
+    np.testing.assert_array_equal(b[:], data)
+
+
+def test_v2_chunk_spec_via_grid(tmp_path: Path) -> None:
+    """ChunkSpec from V2 grid has correct slices and codec_shape."""
+    a = zarr.create_array(
+        store=tmp_path / "v2.zarr",
+        shape=(15, 20),
+        chunks=(10, 10),
+        dtype="int32",
+        zarr_format=2,
+    )
+    grid = ChunkGrid.from_metadata(a.metadata)
+    spec = grid[(0, 0)]
+    assert spec is not None
+    assert spec.shape == (10, 10)
+    assert spec.codec_shape == (10, 10)
+    spec = grid[(1, 1)]
+    assert spec is not None
+    assert spec.shape == (5, 10)
+    assert spec.codec_shape == (10, 10)
+
+
+# ---------------------------------------------------------------------------
+# ChunkSizes tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.parametrize(
+    ("shape", "chunks", "expected"),
+    [
+        ((100, 80), (30, 40), ((30, 30, 30, 10), (40, 40))),
+        ((90, 80), (30, 40), ((30, 30, 30), (40, 40))),
+        ((60, 100), [[10, 20, 30], [50, 50]], ((10, 20, 30), (50, 50))),
+        ((10,), (10,), ((10,),)),
+    ],
+    ids=["regular", "regular-exact", "rectilinear", "single-chunk"],
+)
+def test_chunk_sizes(
+    shape: tuple[int, ...], chunks: Any, expected: tuple[tuple[int, ...], ...]
+) -> None:
+    """chunk_sizes returns the per-dimension tuple of actual data sizes."""
+    grid = ChunkGrid.from_sizes(shape, chunks)
+    assert grid.chunk_sizes == expected
+
+
+def test_array_read_chunk_sizes_regular() -> None:
+    """Regular array exposes correct read_chunk_sizes and write_chunk_sizes."""
+    store = zarr.storage.MemoryStore()
+    arr = zarr.create_array(
+        store=store, shape=(100, 80), chunks=(30, 40), dtype="i4", zarr_format=3
+    )
+    assert arr.read_chunk_sizes == ((30, 30, 30, 10), (40, 40))
+    assert arr.write_chunk_sizes == ((30, 30, 30, 10), (40, 40))
+
+
+def test_array_read_chunk_sizes_rectilinear() -> None:
+    """Rectilinear array exposes correct read_chunk_sizes and write_chunk_sizes."""
+    store = zarr.storage.MemoryStore()
+    arr = zarr.create_array(
+        store=store, shape=(60, 100), chunks=[[10, 20, 30], [50, 50]], dtype="i4", zarr_format=3
+    )
+    assert arr.read_chunk_sizes == ((10, 20, 30), (50, 50))
+    assert arr.write_chunk_sizes == ((10, 20, 30), (50, 50))
+
+
+def test_array_sharded_chunk_sizes() -> None:
+    """Sharded array read_chunk_sizes reflects inner chunks and write_chunk_sizes reflects shards."""
+    store = zarr.storage.MemoryStore()
+    arr = zarr.create_array(
+        store=store,
+        shape=(120, 80),
+        chunks=(60, 40),
+        shards=(120, 80),
+        dtype="i4",
+        zarr_format=3,
+    )
+    assert arr.read_chunk_sizes == ((60, 60), (40, 40))
+    assert arr.write_chunk_sizes == ((120,), (80,))
+
+
+# ---------------------------------------------------------------------------
+# Info display test
+# ---------------------------------------------------------------------------
+
+
+def test_info_display_rectilinear() -> None:
+    """Array.info should not crash for rectilinear grids."""
+    store = zarr.storage.MemoryStore()
+    arr = zarr.create_array(
+        store=store,
+        shape=(30,),
+        chunks=[[10, 20]],
+        dtype="i4",
+        zarr_format=3,
+    )
+    info = arr.info
+    text = repr(info)
+    assert "" in text
+    assert "Array" in text
+
+
+# ---------------------------------------------------------------------------
+# nchunks tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.parametrize(
+    ("shape", "chunks", "expected"),
+    [
+        ((30,), [[10, 20]], 2),
+        ((30, 40), [[10, 20], [15, 25]], 4),
+    ],
+    ids=["1d", "2d"],
+)
+def test_nchunks_rectilinear(
+    shape: tuple[int, ...], chunks: list[list[int]], expected: int
+) -> None:
+    """Array.nchunks reports the correct total chunk count for rectilinear arrays."""
+    store = MemoryStore()
+    a = zarr.create_array(store, shape=shape, chunks=chunks, dtype="int32")
+    assert a.nchunks == expected
+
+
+# ---------------------------------------------------------------------------
+# iter_chunk_regions test
+# ---------------------------------------------------------------------------
+
+
+def test_iter_chunk_regions_rectilinear() -> None:
+    """_iter_chunk_regions should work for rectilinear arrays."""
+    from zarr.core.array import _iter_chunk_regions
+
+    store = MemoryStore()
+    a = zarr.create_array(store, shape=(30,), chunks=[[10, 20]], dtype="int32")
+    regions = list(_iter_chunk_regions(a))
+    assert len(regions) == 2
+    assert regions[0] == (slice(0, 10, 1),)
+    assert regions[1] == (slice(10, 30, 1),)
+
+
+# ---------------------------------------------------------------------------
+# RectilinearChunkGrid metadata object tests (already parametrized)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.parametrize(
+    ("json_input", "expected_chunk_shapes"),
+    [
+        (
+            {
+                "name": "rectilinear",
+                "configuration": {"kind": "inline", "chunk_shapes": [4, 8]},
+            },
+            (4, 8),
+        ),
+        (
+            {
+                "name": "rectilinear",
+                "configuration": {"kind": "inline", "chunk_shapes": [[1, 2, 3], [10, 20]]},
+            },
+            ((1, 2, 3), (10, 20)),
+        ),
+        (
+            {
+                "name": "rectilinear",
+                "configuration": {"kind": "inline", "chunk_shapes": [[[4, 3]], [10, 20]]},
+            },
+            ((4, 4, 4), (10, 20)),
+        ),
+        (
+            {
+                "name": "rectilinear",
+                "configuration": {"kind": "inline", "chunk_shapes": [[[1, 3], 3], [5]]},
+            },
+            ((1, 1, 1, 3), (5,)),
+        ),
+        (
+            {
+                "name": "rectilinear",
+                "configuration": {"kind": "inline", "chunk_shapes": [4, [10, 20]]},
+            },
+            (4, (10, 20)),
+        ),
+    ],
+)
+def test_rectilinear_from_dict(
+    json_input: dict[str, Any], expected_chunk_shapes: tuple[int | tuple[int, ...], ...]
+) -> None:
+    """RectilinearChunkGrid.from_dict correctly parses all spec forms."""
+    grid = RectilinearChunkGrid.from_dict(json_input)  # type: ignore[arg-type]
+    assert grid.chunk_shapes == expected_chunk_shapes
+
+
+@pytest.mark.parametrize(
+    ("chunk_shapes", "expected_json_shapes"),
+    [
+        ((4, 8), [4, 8]),
+        (((4,), (8,)), [[4], [8]]),
+        (((10, 20), (5, 5)), [[10, 20], [[5, 2]]]),
+        (((4, 4, 4), (10, 20)), [[[4, 3]], [10, 20]]),
+        ((4, (10, 20)), [4, [10, 20]]),
+    ],
+)
+def test_rectilinear_to_dict(
+    chunk_shapes: tuple[int | tuple[int, ...], ...],
+    expected_json_shapes: list[Any],
+) -> None:
+    """RectilinearChunkGrid.to_dict serializes back to spec-compliant JSON."""
+    grid = RectilinearChunkGrid(chunk_shapes=chunk_shapes)
+    result = grid.to_dict()
+    assert result["name"] == "rectilinear"
+    assert result["configuration"]["kind"] == "inline"
+    assert list(result["configuration"]["chunk_shapes"]) == expected_json_shapes
+
+
+@pytest.mark.parametrize(
+    "json_input",
+    [
+        {"name": "rectilinear", "configuration": {"kind": "inline", "chunk_shapes": [4, 8]}},
+        {
+            "name": "rectilinear",
+            "configuration": {"kind": "inline", "chunk_shapes": [[1, 2, 3], [10, 20]]},
+        },
+        {
+            "name": "rectilinear",
+            "configuration": {"kind": "inline", "chunk_shapes": [[[4, 3]], [[5, 2]]]},
+        },
+    ],
+)
+def test_rectilinear_roundtrip(json_input: dict[str, Any]) -> None:
+    """from_dict -> to_dict -> from_dict produces the same grid."""
+    grid1 = RectilinearChunkGrid.from_dict(json_input)  # type: ignore[arg-type]
+    grid2 = RectilinearChunkGrid.from_dict(grid1.to_dict())
+    assert grid1.chunk_shapes == grid2.chunk_shapes
+
+
+# ---------------------------------------------------------------------------
+# Hypothesis property tests
+# ---------------------------------------------------------------------------
+
+
+pytest.importorskip("hypothesis")
+
+import hypothesis.strategies as st  # noqa: E402
+from hypothesis import event, given, settings  # noqa: E402
+
+
+@st.composite
+def rectilinear_chunks_st(draw: st.DrawFn, *, shape: tuple[int, ...]) -> list[list[int]]:
+    """Generate valid rectilinear chunk shapes for a given array shape."""
+    chunk_shapes: list[list[int]] = []
+    for size in shape:
+        assert size > 0
+        max_chunks = min(size, 10)
+        nchunks = draw(st.integers(min_value=1, max_value=max_chunks))
+        if nchunks == 1:
+            chunk_shapes.append([size])
+        else:
+            dividers = sorted(
+                draw(
+                    st.lists(
+                        st.integers(min_value=1, max_value=size - 1),
+                        min_size=nchunks - 1,
+                        max_size=nchunks - 1,
+                        unique=True,
+                    )
+                )
+            )
+            chunk_shapes.append(
+                [a - b for a, b in zip(dividers + [size], [0] + dividers, strict=False)]
+            )
+    return chunk_shapes
+
+
+@st.composite
+def rectilinear_arrays_st(draw: st.DrawFn) -> tuple[zarr.Array[Any], np.ndarray[Any, Any]]:
+    """Generate a rectilinear zarr array with random data, shape, and chunks."""
+    from zarr.storage import MemoryStore
+
+    ndim = draw(st.integers(min_value=1, max_value=3))
+    shape = draw(st.tuples(*[st.integers(min_value=2, max_value=20) for _ in range(ndim)]))
+    chunk_shapes = draw(rectilinear_chunks_st(shape=shape))
+    event(f"ndim={ndim}, shape={shape}")
+
+    a = np.arange(int(np.prod(shape)), dtype="int32").reshape(shape)
+    store = MemoryStore()
+    z = zarr.create_array(store=store, shape=shape, chunks=chunk_shapes, dtype="int32")
+    z[:] = a
+    return z, a
+
+
+@settings(deadline=None, max_examples=50)
+@given(data=st.data())
+def test_property_block_indexing_rectilinear(data: st.DataObject) -> None:
+    """Property test: block indexing on rectilinear arrays matches numpy."""
+    z, a = data.draw(rectilinear_arrays_st())
+    grid = ChunkGrid.from_metadata(z.metadata)
+
+    for dim in range(a.ndim):
+        dim_grid = grid._dimensions[dim]
+        block_ix = data.draw(st.integers(min_value=0, max_value=dim_grid.nchunks - 1))
+        sel = [slice(None)] * a.ndim
+        start = dim_grid.chunk_offset(block_ix)
+        stop = start + dim_grid.data_size(block_ix)
+        sel[dim] = slice(start, stop)
+        block_sel: list[slice | int] = [slice(None)] * a.ndim
+        block_sel[dim] = block_ix
+        np.testing.assert_array_equal(
+            z.blocks[tuple(block_sel)],
+            a[tuple(sel)],
+            err_msg=f"dim={dim}, block={block_ix}",
+        )