xarray-contrib · Melissari1997 · Jun 7, 2026 · Jun 7, 2026
diff --git a/.cursor/rules/backend-parity.mdc b/.cursor/rules/backend-parity.mdc
@@ -0,0 +1,68 @@
+---
+description: "Verify that all implemented backends produce consistent results for a given function or set of functions"
+globs: "*.py"
+---
+
+# Backend Parity: Cross-Backend Consistency Audit
+
+Verify that all implemented backends produce consistent results for a given function or set of functions.
+
+## Step 1 -- Identify targets
+
+1. If the prompt names specific functions (e.g. `slope`, `aspect`), use those.
+2. If the prompt names a category (e.g. `hydrology`, `surface`, `focal`), read `README.md` to find all functions in that category.
+3. If the prompt is empty, scan the full feature matrix in `README.md` and test every function that claims support for 2+ backends.
+4. For each function, read its source file and find the `ArrayTypeFunctionMapping` call to determine which backends are actually implemented.
+
+## Step 2 -- Build test inputs
+
+For each target function, create test rasters at three scales:
+
+| Name    | Size    | Purpose                                           |
+|---------|---------|---------------------------------------------------|
+| tiny    | 8x6     | Fast, easy to inspect cell-by-cell                |
+| medium  | 64x64   | Catches chunk-boundary artifacts in Dask          |
+| large   | 256x256 | Stress test, exposes numerical accumulation drift |
+
+For each size, generate two variants:
+- **Clean:** no NaN, realistic value range for the function
+- **Dirty:** 5-10% random NaN, some extreme values near dtype limits
+
+Use `np.random.default_rng(42)` for reproducibility. Test with at least `float32` and `float64`.
+
+## Step 3 -- Run every backend
+
+1. **NumPy:** `create_test_raster(data, backend='numpy')` -- always the baseline.
+2. **Dask+NumPy:** test with two chunk configurations: even split and ragged remainder.
+3. **CuPy:** `create_test_raster(data, backend='cupy')` -- skip if CUDA unavailable.
+4. **Dask+CuPy:** `create_test_raster(data, backend='dask+cupy')` -- skip if CUDA unavailable.
+
+## Step 4 -- Pairwise comparison
+
+For every non-NumPy result, compare against the NumPy baseline. First extract each result as a NumPy array:
+- Dask: `.data.compute()`
+- CuPy: `.data.get()`
+- Dask+CuPy: `.data.compute().get()`
+
+Compute: absolute difference, relative difference, NaN mask agreement, metadata preservation.
+
+Pass/fail thresholds:
+- NumPy vs Dask+NumPy: rtol=1e-5, atol=0
+- NumPy vs CuPy: rtol=1e-6, atol=1e-6
+- NumPy vs Dask+CuPy: rtol=1e-6, atol=1e-6
+
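+A comparison fails if max_abs > atol AND max_rel > rtol, or if NaN masks disagree. Take max_abs and max_rel over cells that are non-NaN in both results, since any difference involving NaN is itself NaN.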
+
+## Step 5 -- Chunk boundary analysis
+
+For any Dask comparison that fails, identify which cells diverge and map them to chunk boundaries. Report what percentage of divergent cells are at chunk boundaries vs interior.
+
+## Step 6 -- Generate the report
+
+Print a structured report with: functions tested, parity matrix table, failures with root cause analysis, and summary counts.
+
+## General rules
+
+- Do not modify any source or test files. This rule is read-only.
+- Use `create_test_raster` from `general_checks.py` for all raster construction.
+- If CUDA is unavailable, skip CuPy and Dask+CuPy gracefully. Report as SKIPPED, not FAIL.
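
The Step 2 inputs and the Step 4 pass/fail rule in `backend-parity.mdc` can be sketched in plain NumPy. This is a minimal illustration, not the project's implementation: `make_raster` and `compare` are hypothetical helpers, the 7% NaN rate and the value range are arbitrary picks, and the handling of zero-valued baseline cells is an assumption. It works on raw arrays rather than on `create_test_raster`, whose signature is not shown here.

```python
import numpy as np


def make_raster(shape, dtype, dirty, rng):
    """Build a clean or dirty test array (plain NumPy, no project helpers)."""
    data = rng.uniform(0.0, 1000.0, size=shape).astype(dtype)
    if dirty:
        # Step 2 asks for 5-10% NaN; 7% is an arbitrary pick in that range.
        data[rng.random(shape) < 0.07] = np.nan
        # One extreme value near the dtype limit.
        data[0, 0] = np.finfo(dtype).max / 2
    return data


def compare(baseline, other, rtol, atol):
    """Apply the Step 4 pass/fail rule to two NumPy arrays of equal shape."""
    nan_base, nan_other = np.isnan(baseline), np.isnan(other)
    nan_agree = bool(np.array_equal(nan_base, nan_other))

    # Differences are only meaningful where both results hold a number.
    valid = ~nan_base & ~nan_other
    b = baseline[valid].astype(np.float64)
    o = other[valid].astype(np.float64)

    abs_diff = np.abs(o - b)
    with np.errstate(divide="ignore", invalid="ignore"):
        rel_diff = abs_diff / np.abs(b)
    # Assumption: identical cells have zero relative difference even when the
    # baseline is 0; unequal cells over a zero baseline stay at inf.
    rel_diff = np.where(abs_diff == 0, 0.0, rel_diff)

    max_abs = float(abs_diff.max()) if abs_diff.size else 0.0
    max_rel = float(rel_diff.max()) if rel_diff.size else 0.0
    failed = (max_abs > atol and max_rel > rtol) or not nan_agree
    return {"max_abs": max_abs, "max_rel": max_rel,
            "nan_agree": nan_agree, "failed": failed}


rng = np.random.default_rng(42)
base = make_raster((8, 6), np.float32, dirty=True, rng=rng)

same = compare(base, base.copy(), rtol=1e-6, atol=1e-6)
drift = compare(base, base * np.float32(1.01), rtol=1e-5, atol=0.0)
print(same["failed"], drift["failed"])
```

In a real audit, `other` would be the array extracted from a Dask or CuPy result as listed in Step 4, and the tolerances would come from the threshold list for that backend pair.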
diff --git a/.cursor/rules/bench.mdc b/.cursor/rules/bench.mdc
@@ -0,0 +1,51 @@
+---
+description: "Run ASV benchmarks for the current branch against main and report regressions and improvements"
+globs: "benchmarks/**/*.py"
+---
+
+# Bench: Local Performance Comparison
+
+Run ASV benchmarks for the current branch against main and report regressions and improvements.
+
+## Step 1 -- Identify what changed
+
+1. If the prompt names specific benchmark classes or functions, use those directly.
+2. If the prompt is empty or says "auto", run `git diff origin/main --name-only` to find changed source files under `xrspatial/`. Map each changed file to the corresponding benchmark module in `benchmarks/benchmarks/`.
+3. If no benchmark exists for the changed code, note this and suggest whether one should be added.
+
+## Step 2 -- Check prerequisites
+
+1. Verify ASV is installed: `python -c "import asv"`. If missing, tell the user to install it.
+2. Verify the benchmarks directory exists at `benchmarks/`.
+3. Read `benchmarks/asv.conf.json` to confirm the project name and branch settings.
+
+## Step 3 -- Run the comparison
+
+Run ASV in continuous-comparison mode from the `benchmarks/` directory:
+
+```bash
+cd benchmarks && asv continuous origin/main HEAD -b "<regex>" -e
+```
+
+Where `<regex>` is a pattern matching the benchmark classes identified in Step 1.
+
+## Step 4 -- Parse and interpret results
+
+Classify each result:
+- REGRESSION: ratio (HEAD time / main time) > 1.2x
+- IMPROVED: ratio < 0.8x
+- UNCHANGED: ratio between 0.8x and 1.2x, inclusive
+
+## Step 5 -- Generate the report
+
+Print a table with benchmark name, main time, HEAD time, ratio, and status. List regressions with likely causes, improvements, missing benchmarks, and a recommendation.
+
+## Step 6 -- Suggest benchmark additions
+
+If changed functions have no benchmark coverage, describe what a new benchmark should test and ask the user whether to write it.
+
+## General rules
+
+- Always run benchmarks from the `benchmarks/` directory.
+- The regression threshold is 1.2x, matching `.github/workflows/benchmarks.yml`.
+- Do not modify any source, test, or benchmark files unless explicitly asked to write a benchmark.
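
The Step 4 classification in `bench.mdc` amounts to two threshold checks on the HEAD/main time ratio. A minimal sketch follows; `classify` and `format_row` are hypothetical helper names, the benchmark name in the usage line is made up, and values exactly at 0.8x or 1.2x are treated as UNCHANGED, which is an assumption about the boundaries.

```python
def classify(ratio, regress_above=1.2, improve_below=0.8):
    """Map a HEAD/main time ratio to a Step 4 status label."""
    if ratio > regress_above:
        return "REGRESSION"
    if ratio < improve_below:
        return "IMPROVED"
    return "UNCHANGED"


def format_row(name, main_s, head_s):
    """Render one Markdown report row for the Step 5 table."""
    ratio = head_s / main_s
    return (f"| {name} | {main_s:.3g}s | {head_s:.3g}s "
            f"| {ratio:.2f}x | {classify(ratio)} |")


# Hypothetical benchmark name and timings, for illustration only.
print(format_row("example.TimeSuite.time_example", 1.0, 1.5))
```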
diff --git a/.cursor/rules/dask-notebook.mdc b/.cursor/rules/dask-notebook.mdc
@@ -0,0 +1,58 @@
+---
+description: "Create a Jupyter notebook that sets up a Dask distributed LocalCluster and walks through an ETL workflow"
+globs: "*.ipynb"
+---
+
+# Dask ETL Notebook
+
+Create a Jupyter notebook that sets up a Dask distributed LocalCluster and walks through an ETL (Extract, Transform, Load) workflow.
+
+## Notebook structure
+
+1. Title + one-line description
+2. Overview (what the pipeline does, what you'll learn)
+3. Imports
+4. Cluster Setup -- create and inspect a LocalCluster + Client
+5. Extract -- load or generate source data as lazy Dask arrays
+6. Transform -- apply transformations (filtering, rechunking, computation)
+7. Load -- write results to disk (Zarr, Parquet, GeoTIFF)
+8. Cleanup -- close the client and cluster
+9. Summary + next steps
+
+## Cluster Setup
+
+Always use this pattern:
+```python
+from dask.distributed import Client, LocalCluster
+
+cluster = LocalCluster(
+    n_workers=4,
+    threads_per_worker=2,
+    memory_limit="2GB",
+)
+client = Client(cluster)
+client
+```
+
+Include a markdown cell that points to the dashboard link and notes that `n_workers` and `memory_limit` should be tuned to the host machine.
+
+## Code conventions
+
+- **Lazy by default**: build the computation graph before calling `.compute()`.
+- **Chunking**: explain chunk choices. Use explicit `chunks=`.
+- **Avoid full materialization mid-pipeline**: no `.values` or `.compute()` until the Load phase.
+- **Persist when reused**: if an intermediate result is used in multiple downstream steps, call `client.persist()`.
+- **Cleanup**: always close the client and cluster at the end.
+
+## Data handling
+
+- Generate or load data lazily. Wrap NumPy arrays with `da.from_array(..., chunks=...)`.
+- For file-based sources, prefer `xr.open_dataset` with explicit `chunks=`.
+- For the Load phase, prefer Zarr (`to_zarr()`) as the default output format.
+
+## Checklist
+
+1. Pick a data domain from the prompt (or default to geospatial raster).
+2. Write the full cell sequence following the structure.
+3. Verify all code cells are syntactically correct and self-contained.
+4. Ensure the notebook cleans up after itself (cluster closed, temp files noted).
diff --git a/.cursor/rules/deep-sweep.mdc b/.cursor/rules/deep-sweep.mdc
@@ -0,0 +1,49 @@
+---
+description: "Pick one xrspatial module and dispatch every sweep command at it in parallel"
+globs: "*.py"
+---
+
+# Deep Sweep: Run every sweep-* command focused on a single module
+
+Pick one xrspatial module and dispatch every sweep-* command at it in parallel. Required first argument: the module name (e.g. `geotiff`, `slope`, `hydro`).
+
+## Step 0 -- Parse arguments
+
+The first positional token is the module name (required). Parse flags:
+- `--only-sweep s1,s2` -- only dispatch named sweeps
+- `--exclude-sweep s1,s2` -- skip named sweeps
+- `--no-fix` -- audit only, no rockout, no PR
+- `--reset-state` -- delete the target module's row from state CSVs
+
+## Step 1 -- Validate the module
+
+- If `xrspatial/{module}.py` exists, it is a single-file module.
+- Else if `xrspatial/{module}/` is a directory, it is a subpackage.
+- Otherwise, report that the module was not found.
+
+Skip: `__init__`, `_version`, `__main__`, `utils`, `accessor`, `preview`, `dataset_support`, `diagnostics`, `analytics`.
+
+## Step 2 -- Discover sweep commands
+
+List all files in `.cursor/rules/` matching `sweep-*.mdc`. Build the dispatch list in sorted order. Apply `--only-sweep` / `--exclude-sweep` filters.
+
+## Step 3 -- Gather shared module metadata
+
+Collect: module_files, last_modified, total_commits, loc, has_cuda_kernels, has_file_io, has_numba_jit, has_dask_backend, has_cuda_backend, and CUDA availability.
+
+## Step 4 -- Handle --reset-state
+
+If `--reset-state` was passed, remove the target module's row from each state CSV before dispatching.
+
+## Step 5 -- Dispatch one subagent per sweep
+
+Print a dispatch table. Launch one agent per sweep in parallel, each reading its own `.mdc` rule file and auditing the specified module.
+
+## Step 6 -- Wait, collect, and print the summary
+
+Print a results table showing findings, rockout PRs, and state row written for each sweep.
+
+## General rules
+
+- Never modify source files from the parent. All edits happen inside per-sweep worktrees.
+- Keep parent output concise.
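
Steps 0 and 2 of `deep-sweep.mdc` (flag parsing and building the filtered dispatch list) can be sketched with the standard library. This is an illustration under assumptions: `parse_args`, `split_csv`, and `build_dispatch_list` are hypothetical helpers, the sweep names in the usage lines are placeholders, and flag values are assumed to name sweeps without the `sweep-` prefix or `.mdc` suffix.

```python
import argparse


def parse_args(argv):
    """Parse the module name and the four flags listed in Step 0."""
    parser = argparse.ArgumentParser(prog="deep-sweep")
    parser.add_argument("module")
    parser.add_argument("--only-sweep", default="")
    parser.add_argument("--exclude-sweep", default="")
    parser.add_argument("--no-fix", action="store_true")
    parser.add_argument("--reset-state", action="store_true")
    return parser.parse_args(argv)


def split_csv(value):
    """Turn 's1,s2' into ['s1', 's2'], ignoring empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_dispatch_list(rule_files, only, exclude):
    """Sorted sweep names from sweep-*.mdc files, after both filters."""
    prefix, suffix = "sweep-", ".mdc"
    sweeps = sorted(
        name[len(prefix):-len(suffix)]
        for name in rule_files
        if name.startswith(prefix) and name.endswith(suffix)
    )
    if only:
        sweeps = [s for s in sweeps if s in only]
    return [s for s in sweeps if s not in exclude]


args = parse_args(["slope", "--only-sweep", "a,c", "--exclude-sweep", "c"])
# Placeholder file names standing in for a listing of .cursor/rules/.
files = ["sweep-b.mdc", "sweep-a.mdc", "bench.mdc", "sweep-c.mdc"]
dispatch = build_dispatch_list(
    files, split_csv(args.only_sweep), split_csv(args.exclude_sweep)
)
print(args.module, dispatch)
```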
diff --git a/.cursor/rules/efficiency-audit.mdc b/.cursor/rules/efficiency-audit.mdc
@@ -0,0 +1,47 @@
+---
+description: "Analyze source code for performance anti-patterns specific to the NumPy/CuPy/Dask/Numba stack"
+globs: "*.py"
+---
+
+# Efficiency Audit: Compute Waste and Anti-Pattern Detection
+
+Analyze source code for performance anti-patterns specific to the NumPy/CuPy/Dask/Numba stack.
+
+## Step 1 -- Scope the audit
+
+1. If the prompt names specific files or functions, audit only those.
+2. If the prompt names a category, identify all source files in that category.
+3. If the prompt is empty, audit every `.py` file under `xrspatial/` (excluding `tests/`, `datasets/`, `__pycache__/`).
+
+## Step 2 -- Static analysis: Dask anti-patterns
+
+- **Premature materialization (HIGH)**: `.values` on a Dask-backed DataArray, `.compute()` inside a loop, `np.array()` wrapping a Dask or CuPy array.
+- **Chunking issues (MEDIUM)**: `da.stack()` without `.rechunk()`, `map_overlap` with depth >= chunk_size / 2, missing `boundary` argument in `map_overlap`.
+- **Redundant computation (MEDIUM)**: calling the same function twice without caching, building large intermediate arrays that could be fused.
+
+## Step 3 -- Static analysis: GPU anti-patterns
+
+- **Register pressure (HIGH)**: CUDA kernels with >20 float64 locals, thread blocks >16x16 on register-heavy kernels.
+- **Unnecessary transfers (HIGH)**: `.data.get()` followed by CuPy operations, `cupy.asarray(numpy_array)` inside a hot path, mixing NumPy and CuPy ops.
+- **Kernel launch overhead (LOW)**: per-cell kernel launches, small array kernel launches.
+
+## Step 4 -- Static analysis: Numba anti-patterns
+
+- **JIT compilation issues (MEDIUM)**: missing `@ngjit` or `@jit(nopython=True)`, object-mode fallback, type instability.
+- **Memory layout (LOW)**: column-major iteration on row-major arrays.
+
+## Step 5 -- Static analysis: General Python anti-patterns
+
+- **Unnecessary copies (MEDIUM)**: `.copy()` on arrays that are never mutated, `np.zeros_like()` followed by a fill loop.
+- **Inefficient I/O patterns (LOW)**: reading the same file multiple times, writing intermediate results to disk.
+
+## Step 6 -- Generate the report
+
+Print a structured report with: scope, findings grouped by severity (HIGH/MEDIUM/LOW) with file:line, pattern, description, and suggested fix. Include summary counts and top 3-5 prioritized recommendations.
+
+## General rules
+
+- Do not modify source, test, or benchmark files.
+- Only flag patterns actually present in the code.
+- Include exact file path and line number for every finding.
+- False positives are worse than missed issues. If not confident a pattern is harmful, do not flag it.
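
One of the Step 2 patterns in `efficiency-audit.mdc`, `.compute()` inside a loop, can be detected statically with the standard `ast` module. The sketch below is a hypothetical helper, not the audit's actual method. It inspects only statements in the loop body, so a `.compute()` in the `for` iterable (evaluated once) is not flagged, in line with the rule's preference for avoiding false positives. Comprehensions and `while` conditions are not covered.

```python
import ast


def find_compute_in_loops(source):
    """Return sorted line numbers of .compute() calls inside loop bodies."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.For, ast.While)):
            continue
        # Walk the body only: the iterable of a for-loop runs once.
        for stmt in node.body:
            for inner in ast.walk(stmt):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Attribute)
                        and inner.func.attr == "compute"):
                    hits.add(inner.lineno)
    return sorted(hits)


# Made-up source text; line 1 of the string is the blank line after the quotes.
SAMPLE = """
total = arr.sum().compute()
for tile in tiles:
    value = tile.mean().compute()
for x in arr.compute():
    print(x)
"""

print(find_compute_in_loops(SAMPLE))
```

Each reported line number would still need a manual look before it goes into the report, since a name-based match cannot tell a Dask `.compute()` from an unrelated method of the same name.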
diff --git a/.cursor/rules/new-issues.mdc b/.cursor/rules/new-issues.mdc
@@ -0,0 +1,43 @@
+---
+description: "Audit the README feature matrix, identify gaps, and file GitHub issues for the best candidates"
+alwaysApply: true
+---
+
+# New Issues: Feature Gap Analysis and Issue Creation
+
+Audit the README feature matrix, identify gaps and opportunities, and file GitHub issues for the best candidates.
+
+## Step 1 -- Read the feature matrix
+
+1. Read `README.md` and extract every function in the feature matrix tables.
+2. For each function, record: category, backend support (native, fallback, or missing).
+3. Read source files referenced in the matrix to confirm what actually exists.
+
+## Step 2 -- Identify backend gaps
+
+1. List every function where one or more backends show fallback or unsupported.
+2. Prioritize gaps where the function has 3 of 4 backends, the missing backend is CuPy or Dask+CuPy, or the function is commonly used.
+3. Draft 1-3 maintenance issues for the highest-value backend completions.
+
+## Step 3 -- Identify missing features
+
+Consider gaps across categories: surface analysis, hydrology, focal/neighborhood, multispectral, interpolation, zonal, network/connectivity, time series, I/O and interop. Select the 3-5 most impactful suggestions ranked by frequency of need, architectural fit, and uniqueness.
+
+## Step 4 -- Draft the issues
+
+For each candidate, draft a GitHub issue following the `.github/ISSUE_TEMPLATE/feature-proposal.md` template: title, labels, body sections (Reason, Proposal, Stakeholders, Drawbacks, Alternatives, Unresolved Questions).
+
+## Step 5 -- Create the issues
+
+1. Search existing issues to avoid duplicates.
+2. Create each issue with `gh issue create`, passing title, body, and labels.
+3. Record the issue numbers and URLs.
+
+## Step 6 -- Summary
+
+Print a table of all created issues and briefly explain the rationale.
+
+## General rules
+
+- Do not create duplicate issues. Search existing issues first.
+- Prefer fewer, higher-quality issues over a long wishlist.
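
The Step 2 prioritization in `new-issues.mdc` can be sketched as a sort over a parsed feature matrix. Everything concrete here is an assumption: the matrix is a hypothetical dict keyed by placeholder function names, the status strings follow the "native, fallback, or missing" wording of Step 1, and the "commonly used" criterion is left out because it needs human judgment.

```python
BACKENDS = ("numpy", "dask+numpy", "cupy", "dask+cupy")
GPU_BACKENDS = ("cupy", "dask+cupy")


def rank_gaps(matrix):
    """Return (function, missing_backends) pairs, highest priority first."""
    gaps = []
    for func, support in matrix.items():
        missing = [b for b in BACKENDS
                   if support.get(b, "missing") != "native"]
        if not missing:
            continue  # fully covered, nothing to file
        near_complete = len(missing) == 1            # 3 of 4 backends done
        gpu_only = all(b in GPU_BACKENDS for b in missing)
        gaps.append((not near_complete, not gpu_only, func, missing))
    gaps.sort(key=lambda g: g[:3])
    return [(func, missing) for _, _, func, missing in gaps]


# Placeholder matrix; these are not real entries from README.md.
matrix = {
    "func_a": dict.fromkeys(BACKENDS, "native"),
    "func_b": {"numpy": "native", "dask+numpy": "native",
               "cupy": "native", "dask+cupy": "fallback"},
    "func_c": {"numpy": "native"},
    "func_d": {"numpy": "native", "dask+numpy": "fallback",
               "cupy": "native", "dask+cupy": "native"},
}

for func, missing in rank_gaps(matrix):
    print(func, missing)
```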
diff --git a/.cursor/rules/ready-to-merge.mdc b/.cursor/rules/ready-to-merge.mdc
@@ -0,0 +1,45 @@
+---
+description: "Scan open pull requests and report the ones that are ready to merge"
+alwaysApply: true
+---
+
+# Ready to Merge: Surface PRs Safe to Merge
+
+Scan the open pull requests and report the ones that are ready to merge. This rule is read-only -- it does not apply labels, post comments, approve, or merge anything.
+
+## Step 1 -- List the open PRs
+
+```bash
+gh pr list --state open --limit 100 \
+  --json number,title,url,isDraft,headRefName,reviews,mergeable,mergeStateStatus
+```
+
+Drop any PR where `isDraft` is true.
+
+## Step 2 -- Reviewed gate
+
+A PR qualifies as reviewed when it has at least one review in any state (for example APPROVED or COMMENTED). If it has reviews but none is APPROVED, flag it as `(no approving review)`.
+
+## Step 3 -- Merge-conflict gate
+
+Re-fetch mergeable status until it settles (not UNKNOWN). `mergeable == "MERGEABLE"` passes. `mergeable == "CONFLICTING"` or `mergeStateStatus == "DIRTY"` excludes the PR.
+
+## Step 4 -- CI gate, with the Read the Docs exception
+
+Pull the check rollup as JSON. Classify:
+- Any check in bucket `pending` -- exclude the PR with reason `CI still running`
+- Any non-RTD check in bucket `fail` -- exclude the PR with reason `CI failure: <check name>`
+- The RTD check (`docs/readthedocs.org:xarray-spatial`) in bucket `fail` -- tolerated
+- Every check in bucket `pass` or `skipping` -- the PR passes this gate
+
+## Step 5 -- Blockers-addressed gate
+
+For each PR that cleared Steps 2-4, re-run the review to confirm no unresolved blockers remain. Zero blockers means the PR is ready. One or more blockers means excluded with reason `open review blockers (N)`.
+
+## Step 6 -- Report
+
+Print two sections: "Ready to merge" with qualifying PRs, and "Excluded" with every other open PR and the specific reason it did not qualify.
+
+## General rules
+
+- Do not apply labels, comment on any PR, or merge anything. The output is a report for a human to act on.
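
The Step 4 CI gate in `ready-to-merge.mdc` can be sketched as a pure function over the check rollup. The input shape (a list of dicts with `name` and `bucket` keys) is an assumption about the JSON, `ci_gate` is a hypothetical helper, and two choices go beyond the rule text: `pending` is checked before `fail`, following the order of the list, and any bucket the rule does not mention excludes the PR.

```python
RTD_CHECK = "docs/readthedocs.org:xarray-spatial"


def ci_gate(checks):
    """Return (passes, reason) for one PR's list of check results."""
    if any(c["bucket"] == "pending" for c in checks):
        return False, "CI still running"
    for c in checks:
        bucket, name = c["bucket"], c["name"]
        if bucket in ("pass", "skipping"):
            continue
        if bucket == "fail" and name == RTD_CHECK:
            continue  # Read the Docs failures are tolerated
        if bucket == "fail":
            return False, f"CI failure: {name}"
        # Assumption: buckets the rule does not cover block the PR.
        return False, f"unhandled check state: {name} ({bucket})"
    return True, ""


# Made-up check names apart from the RTD check named in the rule.
green = [{"name": "tests", "bucket": "pass"},
         {"name": RTD_CHECK, "bucket": "fail"}]
red = [{"name": "tests", "bucket": "fail"},
       {"name": "lint", "bucket": "pass"}]

print(ci_gate(green))
print(ci_gate(red))
```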