Partial zarr download/upload (stage: Design)#1816

Draft
yarikoptic wants to merge 1 commit into master from partial-zarr


@yarikoptic yarikoptic commented Mar 2, 2026

Summary

Design document for partial zarr download and upload support, addressing #1462, #1474, and related archive issues.

The design covers five areas:

  1. --zarr TYPE:PATTERN filtering for dandi download — glob, path, and regex filters for selecting entries within zarr assets, with a metadata alias for common zarr metadata files
  2. URL parsing with zarr boundary detection (AssetZarrEntryURL) to handle URLs like dandi://dandi/000108/.../file.ome.zarr/0/0/0
  3. --zarr-mode {full, patch} for dandi upload — patch mode uploads changed files without deleting remote files absent locally
  4. Checksums and manifests — documents that per-directory checksums are computed hierarchically by the zarr_checksum library but are NOT persisted (only the root digest is stored in the DB); legacy .checksum files exist on S3 at zarr-checksums/ for ~72% of older zarrs but are orphaned since Dec 2022
  5. dandi ls for zarr contents — listing files within a zarr asset
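
To make the `--zarr TYPE:PATTERN` idea in item 1 concrete, here is a minimal sketch of filter parsing and matching. Everything in it is an assumption for illustration rather than the design's actual implementation: the names `ZarrFilter`, `glob_to_regex`, `expand_zarr_option`, and `select_entries` are hypothetical, the glob semantics (`**/` spans zero or more directories, `*` stays within one path component) are one possible choice, `path:` is assumed to match an entry or anything below it, and multiple filters are combined with OR, which the design doc still lists as an open question. The `metadata` alias uses the expansion quoted in the review checklist.

```python
import re
from dataclasses import dataclass

FILTER_TYPES = ("glob", "path", "regex")

# Expansion of the "metadata" alias as quoted in the review checklist.
METADATA_ALIAS = ("glob:**/.z*", "glob:**/zarr.json", "glob:**/.zmetadata")


def glob_to_regex(pattern: str) -> str:
    """Translate a glob to a regex (assumed semantics, see lead-in)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")  # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")  # anything within one path component
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)


@dataclass(frozen=True)
class ZarrFilter:
    kind: str
    pattern: str

    @classmethod
    def parse(cls, spec: str) -> "ZarrFilter":
        # Split on the first colon only, so regexes may contain colons.
        kind, sep, pattern = spec.partition(":")
        if not sep or kind not in FILTER_TYPES or not pattern:
            raise ValueError(f"expected TYPE:PATTERN, got {spec!r}")
        return cls(kind, pattern)

    def matches(self, entry_path: str) -> bool:
        if self.kind == "glob":
            return re.fullmatch(glob_to_regex(self.pattern), entry_path) is not None
        if self.kind == "path":
            prefix = self.pattern.rstrip("/")
            return entry_path == prefix or entry_path.startswith(prefix + "/")
        return re.search(self.pattern, entry_path) is not None


def expand_zarr_option(value: str) -> list:
    """Expand the metadata alias, or parse a single TYPE:PATTERN spec."""
    specs = METADATA_ALIAS if value == "metadata" else (value,)
    return [ZarrFilter.parse(spec) for spec in specs]


def select_entries(entries, options):
    """Keep entries matching ANY filter (OR composition is an assumption)."""
    filters = [f for opt in options for f in expand_zarr_option(opt)]
    return [e for e in entries if any(f.matches(e) for f in filters)]
```

For example, `select_entries([".zattrs", "0/.zarray", "0/0/0"], ["metadata"])` keeps the two metadata files and drops the chunk `0/0/0`. Under these assumed glob semantics, `glob:**/.z*` already matches `.zmetadata`, so the third pattern in the alias would be redundant; whether that holds depends on the glob semantics the design settles on.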

Key findings from investigation

  • The zarr_checksum algorithm IS hierarchical (Merkle tree, bottom-up via ZarrChecksumTree)
  • The archive's ingest_zarr_archive task computes checksums entirely in memory and stores only the root digest
  • Per-directory .checksum files on S3 (zarr-checksums/ prefix) were written by ZarrChecksumFileUpdater, which was removed in dandi-archive#1390, "Always clear checksum files during zarr ingestion" (Dec 2022). Legacy files remain for older zarrs, but no API exposes them
  • Subtree checksum verification is not possible today without recomputation from file ETags

Review checklist

Please review the design at doc/design/partial-zarr.md and comment on:

  • --zarr TYPE:PATTERN syntax — is the filter approach right? Are glob/path/regex the right types?
  • metadata alias expansion — does glob:**/.z* + glob:**/zarr.json + glob:**/.zmetadata cover all cases?
  • --zarr-mode patch semantics — is "upload without deleting" the right default for patch? Should subtree cleanup happen?
  • URL parsing — is AssetZarrEntryURL with zarr boundary detection the right approach?
  • Checksum strategy — relying on per-file ETags for partial ops, deferring subtree checksums to future manifests
  • Open questions in the doc (AND vs OR composition, --sync interaction, server-side glob)
  • Should legacy zarr-checksums/ files on S3 be cleaned up as part of this or separately?
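
For the URL parsing question, a rough sketch of what zarr boundary detection could look like on the path portion of a URL. The TODO list names `split_zarr_location()`, but its real signature is not given in this PR, so the signature and return type below are assumptions. Detecting the boundary purely by file-name suffix is also an assumption (the design may consult the server instead), as are the suffix list and the example paths.

```python
from typing import Optional, Tuple

# Assumed set of suffixes that mark a path component as a zarr asset.
ZARR_SUFFIXES = (".zarr", ".ngff")


def split_zarr_location(path: str) -> Tuple[str, Optional[str]]:
    """Split an asset path into (zarr asset path, entry path within the zarr).

    The entry path is None when the path stops at the zarr itself or when
    no component looks like a zarr at all.
    """
    parts = path.strip("/").split("/")
    for i, part in enumerate(parts):
        if part.endswith(ZARR_SUFFIXES):
            asset = "/".join(parts[: i + 1])
            entry = "/".join(parts[i + 1 :]) or None
            return asset, entry
    return "/".join(parts), None
```

With a hypothetical path, `split_zarr_location("sub-01/file.ome.zarr/0/0/0")` returns `("sub-01/file.ome.zarr", "0/0/0")`, while a non-zarr path such as `"sub-01/file.nwb"` comes back unchanged with `None` as the entry.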

TODO (post-review)

  • Implement dandi/zarr_filter.py — filter parsing and matching
  • Implement AssetZarrEntryURL and split_zarr_location() in dandi/dandiarchive.py
  • Add --zarr option to dandi download CLI
  • Modify _download_zarr() for partial download support
  • Add --zarr-mode option to dandi upload CLI
  • Implement patch mode in iter_upload() (dandi/files/zarr.py)
  • Thread zarr_mode through dandi/upload.py
  • dandi ls zarr contents support (may be separate PR)
  • Tests for all of the above
  • Coordinate with dandi-archive on manifest design (#2702) for future subtree checksum support
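
As a reference point for the `--zarr-mode` items above, the difference between full and patch mode can be sketched as a pure planning step. This is an illustration under stated assumptions, not the planned `iter_upload()` change: `plan_zarr_upload` and `UploadPlan` are hypothetical names, and the sketch assumes that a local file digest can be compared directly with the remote per-file ETag.

```python
from typing import Dict, List, NamedTuple


class UploadPlan(NamedTuple):
    upload: List[str]  # entries that are new or whose digest differs
    delete: List[str]  # remote entries to remove


def plan_zarr_upload(
    local: Dict[str, str], remote: Dict[str, str], mode: str = "full"
) -> UploadPlan:
    """Compare {path: digest} maps and decide what to upload and delete."""
    if mode not in ("full", "patch"):
        raise ValueError(f"unknown zarr mode: {mode!r}")
    upload = sorted(p for p, d in local.items() if remote.get(p) != d)
    # Patch mode never deletes remote files that are absent locally.
    delete = sorted(set(remote) - set(local)) if mode == "full" else []
    return UploadPlan(upload, delete)
```

Both modes produce the same upload list; only full mode schedules deletions, which matches the "upload without deleting" semantics under review.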

🤖 Generated with Claude Code

Covers five areas:
- --zarr TYPE:PATTERN filtering for download (glob, path, regex)
- URL parsing with zarr boundary detection (AssetZarrEntryURL)
- --zarr-mode {full, patch} for upload
- Checksums and manifests (per-directory checksums are NOT
  persisted on the archive; legacy .checksum files exist on S3
  under zarr-checksums/ for ~72% of older zarrs but are orphaned)
- dandi ls for zarr contents

Related: #1462, #1474

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yarikoptic yarikoptic marked this pull request as draft March 2, 2026 21:37
@yarikoptic yarikoptic requested review from kabilar and satra March 2, 2026 21:38
@yarikoptic yarikoptic changed the title Design: partial zarr download/upload Partial zarr download/upload (stage: Design) Mar 2, 2026
codecov bot commented Mar 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.11%. Comparing base (017b2b2) to head (dd19691).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1816   +/-   ##
=======================================
  Coverage   75.11%   75.11%           
=======================================
  Files          84       84           
  Lines       11925    11925           
=======================================
  Hits         8958     8958           
  Misses       2967     2967           
Flag Coverage Δ
unittests 75.11% <ø> (ø)


@yarikoptic yarikoptic added the enhancement (New feature or request) and minor (Increment the minor version when merged) labels Mar 3, 2026