68 changes: 68 additions & 0 deletions .cursor/rules/backend-parity.mdc
@@ -0,0 +1,68 @@
---
description: "Verify that all implemented backends produce consistent results for a given function or set of functions"
globs: "*.py"
---

# Backend Parity: Cross-Backend Consistency Audit

Verify that all implemented backends produce consistent results for a given function or set of functions.

## Step 1 -- Identify targets

1. If the prompt names specific functions (e.g. `slope`, `aspect`), use those.
2. If the prompt names a category (e.g. `hydrology`, `surface`, `focal`), read `README.md` to find all functions in that category.
3. If the prompt is empty, scan the full feature matrix in `README.md` and test every function that claims support for 2+ backends.
4. For each function, read its source file and find the `ArrayTypeFunctionMapping` call to determine which backends are actually implemented.
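One way to carry out item 4 without importing the module is to parse the source and list the keyword arguments passed to `ArrayTypeFunctionMapping`. This is a minimal sketch, assuming backends are passed as keyword arguments; the helper name `mapped_backends` and the keyword and function names in the sample are hypothetical, and the sketch does not detect a keyword that points at a not-implemented stub.

```python
import ast


def mapped_backends(source: str) -> list[set[str]]:
    """Return the keyword names of every ArrayTypeFunctionMapping(...) call."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both `ArrayTypeFunctionMapping(...)` and `mod.ArrayTypeFunctionMapping(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name == "ArrayTypeFunctionMapping":
            found.append({kw.arg for kw in node.keywords if kw.arg is not None})
    return found


# Hypothetical source snippet; the keyword names are assumptions for illustration.
sample = '''
mapper = ArrayTypeFunctionMapping(
    numpy_func=_run_numpy,
    dask_func=_run_dask_numpy,
)
'''
backends = mapped_backends(sample)
```

Each set in the result can then be compared with the backends the README claims for that function.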

## Step 2 -- Build test inputs

For each target function, create test rasters at three scales:

| Name | Size | Purpose |
|---------|---------|--------------------------------------------------|
| tiny | 8x6 | Fast, easy to inspect cell-by-cell |
| medium | 64x64 | Catches chunk-boundary artifacts in dask |
| large | 256x256 | Stress test, exposes numerical accumulation drift |

For each size, generate two variants:
- **Clean:** no NaN, realistic value range for the function
- **Dirty:** 5-10% random NaN, some extreme values near dtype limits

Use `np.random.default_rng(42)` for reproducibility. Test with at least `float32` and `float64`.
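The input matrix above could be generated along these lines. This is a sketch under stated assumptions: `make_inputs` is a hypothetical helper, `8x6` is read as a shape of 8 rows by 6 columns, the 0 to 1000 value range is a placeholder, and the arrays would still need to be wrapped with `create_test_raster` afterwards.

```python
import numpy as np

SIZES = {"tiny": (8, 6), "medium": (64, 64), "large": (256, 256)}


def make_inputs(shape, dtype, rng, nan_fraction=0.05):
    """Return (clean, dirty) arrays of the given shape and dtype."""
    # Placeholder value range; pick one that is realistic for the function under test.
    clean = rng.uniform(0.0, 1000.0, size=shape).astype(dtype)
    dirty = clean.copy()
    # Scatter NaN over roughly nan_fraction of the cells.
    dirty[rng.random(shape) < nan_fraction] = np.nan
    # Plant extreme values near the dtype limits in two corner cells.
    dirty[0, 0] = np.finfo(dtype).max
    dirty[-1, -1] = np.finfo(dtype).min
    return clean, dirty


rng = np.random.default_rng(42)  # fixed seed for reproducibility
inputs = {
    (name, np.dtype(dtype).name): make_inputs(shape, dtype, rng)
    for name, shape in SIZES.items()
    for dtype in (np.float32, np.float64)
}
```

The dictionary is keyed by `(size name, dtype name)`, so every later step can iterate over the same twelve arrays.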

## Step 3 -- Run every backend

1. **NumPy:** `create_test_raster(data, backend='numpy')` -- always the baseline.
2. **Dask+NumPy:** test with two chunk configurations: even split and ragged remainder.
3. **CuPy:** `create_test_raster(data, backend='cupy')` -- skip if CUDA unavailable.
4. **Dask+CuPy:** `create_test_raster(data, backend='dask+cupy')` -- skip if CUDA unavailable.

## Step 4 -- Pairwise comparison

For every non-NumPy result, compare against the NumPy baseline. Extract data:
- Dask: `.data.compute()`
- CuPy: `.data.get()`
- Dask+CuPy: `.data.compute().get()`

Compute: absolute difference, relative difference, NaN mask agreement, metadata preservation.

Pass/fail thresholds:
- NumPy vs Dask+NumPy: rtol=1e-5, atol=0
- NumPy vs CuPy: rtol=1e-6, atol=1e-6
- NumPy vs Dask+CuPy: rtol=1e-6, atol=1e-6

A comparison fails if `max_abs > atol` and `max_rel > rtol` both hold, or if the NaN masks disagree.
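The pass/fail rule can be sketched in plain Python over flattened values. The function name `compare` and the returned dictionary are hypothetical, and treating a nonzero difference against a zero baseline as an infinite relative error is an assumption the rule does not spell out.

```python
import math


def compare(baseline, candidate, rtol, atol):
    """Compare two equal-length flat sequences of floats against the baseline."""
    nan_masks_agree = all(
        math.isnan(b) == math.isnan(c) for b, c in zip(baseline, candidate)
    )
    max_abs = 0.0
    max_rel = 0.0
    for b, c in zip(baseline, candidate):
        if math.isnan(b) or math.isnan(c):
            continue  # NaN cells are judged by the mask check only
        diff = abs(b - c)
        max_abs = max(max_abs, diff)
        if b != 0.0:
            max_rel = max(max_rel, diff / abs(b))
        elif diff > 0.0:
            max_rel = math.inf  # assumption: nonzero difference against a zero baseline
    # Fails only when both tolerances are exceeded, or when the NaN masks disagree.
    failed = (max_abs > atol and max_rel > rtol) or not nan_masks_agree
    return {
        "max_abs": max_abs,
        "max_rel": max_rel,
        "nan_masks_agree": nan_masks_agree,
        "passed": not failed,
    }
```

Note that with `atol=0` any nonzero absolute difference exceeds `atol`, so the NumPy vs Dask+NumPy comparison is decided by `rtol` alone.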

## Step 5 -- Chunk boundary analysis

For any Dask comparison that fails, identify which cells diverge and map them to chunk boundaries. Report what percentage of divergent cells are at chunk boundaries vs interior.
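A sketch of the boundary mapping, assuming a cell counts as "at a chunk boundary" when it is the last cell of one chunk or the first cell of the next along either axis. Both helper names are hypothetical, and a function with a wider stencil may warrant a wider band.

```python
from itertools import accumulate


def boundary_indices(chunk_sizes):
    """Indices of cells that touch an internal chunk edge along one axis."""
    edges = list(accumulate(chunk_sizes))[:-1]  # start index of every chunk but the first
    touching = set()
    for edge in edges:
        touching.update((edge - 1, edge))  # last cell of one chunk, first of the next
    return touching


def boundary_share(divergent_cells, row_chunks, col_chunks):
    """Percentage of divergent (row, col) cells lying on a chunk boundary."""
    if not divergent_cells:
        return 0.0
    rows = boundary_indices(row_chunks)
    cols = boundary_indices(col_chunks)
    on_boundary = sum(1 for r, c in divergent_cells if r in rows or c in cols)
    return 100.0 * on_boundary / len(divergent_cells)


# 64x64 raster split into ragged chunks of 30, 30 and 4 along both axes.
share = boundary_share(
    [(29, 5), (30, 40), (10, 10), (61, 59)], (30, 30, 4), (30, 30, 4)
)
```

In this example three of the four divergent cells sit on a boundary, so `share` is 75.0 and the remaining 25% is interior.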

## Step 6 -- Generate the report

Print a structured report with: functions tested, parity matrix table, failures with root cause analysis, and summary counts.

## General rules

- Do not modify any source or test files. This rule is read-only.
- Use `create_test_raster` from `general_checks.py` for all raster construction.
- If CUDA is unavailable, skip CuPy and Dask+CuPy gracefully. Report as SKIPPED, not FAIL.
51 changes: 51 additions & 0 deletions .cursor/rules/bench.mdc
@@ -0,0 +1,51 @@
---
description: "Run ASV benchmarks for the current branch against main and report regressions and improvements"
globs: "benchmarks/**/*.py"
---

# Bench: Local Performance Comparison

Run ASV benchmarks for the current branch against main and report regressions and improvements.

## Step 1 -- Identify what changed

1. If the prompt names specific benchmark classes or functions, use those directly.
2. If the prompt is empty or says "auto", run `git diff origin/main --name-only` to find changed source files under `xrspatial/`. Map each changed file to the corresponding benchmark module in `benchmarks/benchmarks/`.
3. If no benchmark exists for the changed code, note this and suggest whether one should be added.

## Step 2 -- Check prerequisites

1. Verify ASV is installed: `python -c "import asv"`. If missing, tell the user to install it.
2. Verify the benchmarks directory exists at `benchmarks/`.
3. Read `benchmarks/asv.conf.json` to confirm the project name and branch settings.

## Step 3 -- Run the comparison

Run ASV in continuous-comparison mode from the `benchmarks/` directory:

```bash
cd benchmarks && asv continuous origin/main HEAD -b "<regex>" -e
```

Where `<regex>` is a pattern matching the benchmark classes identified in Step 1.

## Step 4 -- Parse and interpret results

Classify each result by its ratio (HEAD time divided by main time):
- REGRESSION: Ratio > 1.2x
- IMPROVED: Ratio < 0.8x
- UNCHANGED: Between 0.8x and 1.2x
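The classification can be written as a small function. This sketch assumes the ratio is HEAD time divided by main time and that the boundary values 0.8 and 1.2 count as UNCHANGED; the function name `classify` is hypothetical.

```python
def classify(ratio: float) -> str:
    """Map a HEAD/main timing ratio to a status label."""
    if ratio > 1.2:
        return "REGRESSION"
    if ratio < 0.8:
        return "IMPROVED"
    return "UNCHANGED"


# Example: main took 10.0 ms and HEAD took 13.0 ms, so the ratio is 1.3.
status = classify(13.0 / 10.0)
```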

## Step 5 -- Generate the report

Print a table with benchmark name, main time, HEAD time, ratio, and status. List regressions with likely causes, improvements, missing benchmarks, and a recommendation.

## Step 6 -- Suggest benchmark additions

If changed functions have no benchmark coverage, describe what a new benchmark should test and ask the user whether to write it.

## General rules

- Always run benchmarks from the `benchmarks/` directory.
- The regression threshold is 1.2x, matching `.github/workflows/benchmarks.yml`.
- Do not modify any source, test, or benchmark files unless explicitly asked to write a benchmark.
58 changes: 58 additions & 0 deletions .cursor/rules/dask-notebook.mdc
@@ -0,0 +1,58 @@
---
description: "Create a Jupyter notebook that sets up a Dask distributed LocalCluster and walks through an ETL workflow"
globs: "*.ipynb"
---

# Dask ETL Notebook

Create a Jupyter notebook that sets up a Dask distributed LocalCluster and walks through an ETL (Extract, Transform, Load) workflow.

## Notebook structure

1. Title + one-line description
2. Overview (what the pipeline does, what you'll learn)
3. Imports
4. Cluster Setup -- create and inspect a LocalCluster + Client
5. Extract -- load or generate source data as lazy Dask arrays
6. Transform -- apply transformations (filtering, rechunking, computation)
7. Load -- write results to disk (Zarr, Parquet, GeoTIFF)
8. Cleanup -- close the client and cluster
9. Summary + next steps

## Cluster Setup

Always use this pattern:
```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=2,
    memory_limit="2GB",
)
client = Client(cluster)
client
```

Include a markdown cell noting the dashboard link and that `n_workers` and `memory_limit` should be tuned.

## Code conventions

- **Lazy by default**: build the computation graph before calling `.compute()`.
- **Chunking**: explain chunk choices. Use explicit `chunks=`.
- **Avoid full materialization mid-pipeline**: no `.values` or `.compute()` until the Load phase.
- **Persist when reused**: if an intermediate result is used in multiple downstream steps, call `client.persist()`.
- **Cleanup**: always close the client and cluster at the end.

## Data handling

- Generate or load data lazily. Wrap NumPy arrays with `da.from_array(..., chunks=...)`.
- For file-based sources, prefer `xr.open_dataset` with explicit `chunks=`.
- For the Load phase, prefer Zarr (`to_zarr()`) as the default output format.

## Checklist

1. Pick a data domain from the prompt (or default to geospatial raster).
2. Write the full cell sequence following the structure.
3. Verify all code cells are syntactically correct and self-contained.
4. Ensure the notebook cleans up after itself (cluster closed, temp files noted).
49 changes: 49 additions & 0 deletions .cursor/rules/deep-sweep.mdc
@@ -0,0 +1,49 @@
---
description: "Pick one xrspatial module and dispatch every sweep command at it in parallel"
globs: "*.py"
---

# Deep Sweep: Run every sweep-* command focused on a single module

Pick one xrspatial module and dispatch every sweep-* command at it in parallel. Required first argument: the module name (e.g. `geotiff`, `slope`, `hydro`).

## Step 0 -- Parse arguments

The first positional token is the module name (required). Parse flags:
- `--only-sweep s1,s2` -- only dispatch named sweeps
- `--exclude-sweep s1,s2` -- skip named sweeps
- `--no-fix` -- audit only, no rockout, no PR
- `--reset-state` -- delete the target module's row from state CSVs
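If the argument handling were scripted rather than interpreted from the prompt, it might look like the following `argparse` sketch. The program name and the helper `csv_list` are hypothetical; only the positional module name and the four flags come from the rule.

```python
import argparse


def csv_list(value: str) -> list[str]:
    """Split a comma-separated flag value into a list of names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="deep-sweep")
    parser.add_argument("module", help="module name, e.g. geotiff, slope, hydro")
    parser.add_argument("--only-sweep", type=csv_list, default=[])
    parser.add_argument("--exclude-sweep", type=csv_list, default=[])
    parser.add_argument("--no-fix", action="store_true")
    parser.add_argument("--reset-state", action="store_true")
    return parser


args = build_parser().parse_args(["slope", "--only-sweep", "s1,s2", "--no-fix"])
```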

## Step 1 -- Validate the module

- If `xrspatial/{module}.py` exists, it is a single-file module.
- Else if `xrspatial/{module}/` is a directory, it is a subpackage.
- Otherwise, report that the module was not found.

Skip: `__init__`, `_version`, `__main__`, `utils`, `accessor`, `preview`, `dataset_support`, `diagnostics`, `analytics`.
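The validation order can be sketched with `pathlib`. This assumes the skip list is checked before the filesystem lookup, which the rule does not state explicitly; `classify_module` and its return labels are hypothetical.

```python
from pathlib import Path

SKIPPED = {
    "__init__", "_version", "__main__", "utils", "accessor",
    "preview", "dataset_support", "diagnostics", "analytics",
}


def classify_module(package_dir: Path, module: str) -> str:
    """Classify a module name relative to the package directory."""
    if module in SKIPPED:
        return "skipped"
    if (package_dir / f"{module}.py").is_file():
        return "single-file"
    if (package_dir / module).is_dir():
        return "subpackage"
    return "not found"
```

A call such as `classify_module(Path("xrspatial"), "slope")` would then drive the rest of the sweep.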

## Step 2 -- Discover sweep commands

List all files in `.cursor/rules/` matching `sweep-*.mdc`. Build the dispatch list in sorted order. Apply `--only-sweep` / `--exclude-sweep` filters.
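Discovery and filtering can be sketched as follows. The sketch assumes that the names given to `--only-sweep` and `--exclude-sweep` match the rule file stems (for example `sweep-foo` for `sweep-foo.mdc`); the function name `dispatch_list` is hypothetical.

```python
from pathlib import Path


def dispatch_list(rules_dir: Path, only=(), exclude=()):
    """Return sorted sweep names after applying the only/exclude filters."""
    names = sorted(p.stem for p in rules_dir.glob("sweep-*.mdc"))
    if only:
        names = [n for n in names if n in only]
    return [n for n in names if n not in exclude]
```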

## Step 3 -- Gather shared module metadata

Collect: module_files, last_modified, total_commits, loc, has_cuda_kernels, has_file_io, has_numba_jit, has_dask_backend, has_cuda_backend, and CUDA availability.

## Step 4 -- Handle --reset-state

If `--reset-state` was passed, remove the target module's row from each state CSV before dispatching.

## Step 5 -- Dispatch one subagent per sweep

Print a dispatch table. Launch one agent per sweep in parallel, each reading its own `.mdc` rule file and auditing the specified module.

## Step 6 -- Wait, collect, and print the summary

Print a results table showing findings, rockout PRs, and state row written for each sweep.

## General rules

- Never modify source files from the parent. All edits happen inside per-sweep worktrees.
- Keep parent output concise.
47 changes: 47 additions & 0 deletions .cursor/rules/efficiency-audit.mdc
@@ -0,0 +1,47 @@
---
description: "Analyze source code for performance anti-patterns specific to the NumPy/CuPy/Dask/Numba stack"
globs: "*.py"
---

# Efficiency Audit: Compute Waste and Anti-Pattern Detection

Analyze source code for performance anti-patterns specific to the NumPy/CuPy/Dask/Numba stack.

## Step 1 -- Scope the audit

1. If the prompt names specific files or functions, audit only those.
2. If the prompt names a category, identify all source files in that category.
3. If the prompt is empty, audit every `.py` file under `xrspatial/` (excluding `tests/`, `datasets/`, `__pycache__/`).

## Step 2 -- Static analysis: Dask anti-patterns

- **Premature materialization (HIGH)**: `.values` on a Dask-backed DataArray, `.compute()` inside a loop, `np.array()` wrapping a Dask or CuPy array.
- **Chunking issues (MEDIUM)**: `da.stack()` without `.rechunk()`, `map_overlap` with depth >= chunk_size / 2, missing `boundary` argument in `map_overlap`.
- **Redundant computation (MEDIUM)**: calling the same function twice without caching, building large intermediate arrays that could be fused.
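As an illustration of how one of these checks could be automated, the sketch below uses the standard `ast` module to find `.compute()` calls inside loop bodies. It is a hypothetical helper, not part of the rule: it cannot tell whether the receiver is actually a Dask object, and it ignores comprehensions, so each hit still needs a manual look before it is reported.

```python
import ast


def compute_calls_in_loops(source: str) -> list[int]:
    """Return line numbers of `.compute()` calls inside for/while loop bodies."""
    hits = set()
    for loop in ast.walk(ast.parse(source)):
        if not isinstance(loop, (ast.For, ast.While)):
            continue
        # Only the loop body runs repeatedly; the iterable expression runs once.
        for stmt in loop.body:
            for node in ast.walk(stmt):
                if (
                    isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and node.func.attr == "compute"
                ):
                    hits.add(node.lineno)
    return sorted(hits)


sample = """\
total = 0
for block in blocks:
    total += block.sum().compute()
result = total_lazy.compute()
"""
flagged = compute_calls_in_loops(sample)
```

Only the call on line 3 of the sample is flagged; the final `.compute()` sits outside the loop and is left alone.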

## Step 3 -- Static analysis: GPU anti-patterns

- **Register pressure (HIGH)**: CUDA kernels with >20 float64 locals, thread blocks >16x16 on register-heavy kernels.
- **Unnecessary transfers (HIGH)**: `.data.get()` followed by CuPy operations, `cupy.asarray(numpy_array)` inside a hot path, mixing NumPy and CuPy ops.
- **Kernel launch overhead (LOW)**: per-cell kernel launches, small array kernel launches.

## Step 4 -- Static analysis: Numba anti-patterns

- **JIT compilation issues (MEDIUM)**: missing `@ngjit` or `@jit(nopython=True)`, object-mode fallback, type instability.
- **Memory layout (LOW)**: column-major iteration on row-major arrays.

## Step 5 -- Static analysis: General Python anti-patterns

- **Unnecessary copies (MEDIUM)**: `.copy()` on arrays never mutated, `np.zeros_like()` + fill loop.
- **Inefficient I/O patterns (LOW)**: reading the same file multiple times, writing intermediate results to disk.

## Step 6 -- Generate the report

Print a structured report with: scope, findings grouped by severity (HIGH/MEDIUM/LOW) with file:line, pattern, description, and suggested fix. Include summary counts and top 3-5 prioritized recommendations.

## General rules

- Do not modify source, test, or benchmark files.
- Only flag patterns actually present in the code.
- Include exact file path and line number for every finding.
- False positives are worse than missed issues. If not confident a pattern is harmful, do not flag it.
43 changes: 43 additions & 0 deletions .cursor/rules/new-issues.mdc
@@ -0,0 +1,43 @@
---
description: "Audit the README feature matrix, identify gaps, and file GitHub issues for the best candidates"
alwaysApply: true
---

# New Issues: Feature Gap Analysis and Issue Creation

Audit the README feature matrix, identify gaps and opportunities, and file GitHub issues for the best candidates.

## Step 1 -- Read the feature matrix

1. Read `README.md` and extract every function in the feature matrix tables.
2. For each function, record: category, backend support (native, fallback, or missing).
3. Read source files referenced in the matrix to confirm what actually exists.

## Step 2 -- Identify backend gaps

1. List every function where one or more backends show fallback or unsupported.
2. Prioritize gaps where the function has 3 of 4 backends, the missing backend is CuPy or Dask+CuPy, or the function is commonly used.
3. Draft 1-3 maintenance issues for the highest-value backend completions.

## Step 3 -- Identify missing features

Consider gaps across categories: surface analysis, hydrology, focal/neighborhood, multispectral, interpolation, zonal, network/connectivity, time series, I/O and interop. Select the 3-5 most impactful suggestions ranked by frequency of need, architectural fit, and uniqueness.

## Step 4 -- Draft the issues

For each candidate, draft a GitHub issue following the `.github/ISSUE_TEMPLATE/feature-proposal.md` template: title, labels, body sections (Reason, Proposal, Stakeholders, Drawbacks, Alternatives, Unresolved Questions).

## Step 5 -- Create the issues

1. Search existing issues to avoid duplicates.
2. Create each issue with `gh issue create`, passing title, body, and labels.
3. Record the issue numbers and URLs.

## Step 6 -- Summary

Print a table of all created issues and briefly explain the rationale.

## General rules

- Do not create duplicate issues. Search existing issues first.
- Prefer fewer, higher-quality issues over a long wishlist.
45 changes: 45 additions & 0 deletions .cursor/rules/ready-to-merge.mdc
@@ -0,0 +1,45 @@
---
description: "Scan open pull requests and report the ones that are ready to merge"
alwaysApply: true
---

# Ready to Merge: Surface PRs Safe to Merge

Scan the open pull requests and report the ones that are ready to merge. This rule is read-only -- it does not apply labels, post comments, approve, or merge anything.

## Step 1 -- List the open PRs

```bash
gh pr list --state open --limit 100 \
  --json number,title,url,isDraft,headRefName,reviews,mergeable,mergeStateStatus
```

Drop any PR where `isDraft` is true.

## Step 2 -- Reviewed gate

A PR qualifies as reviewed when it has at least one review of any state (for example APPROVED or COMMENTED). If all reviews are COMMENTED with none APPROVED, flag it as `(no approving review)`.
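The reviewed gate could be expressed over one PR object from the JSON output. This sketch assumes each entry in `reviews` carries a `state` field; the function name and the `"not reviewed"` reason string are hypothetical.

```python
def reviewed_gate(pr: dict) -> tuple[bool, str]:
    """Return (qualifies, note) for the reviewed gate."""
    states = [review.get("state") for review in pr.get("reviews", [])]
    if not states:
        return False, "not reviewed"
    if all(state == "COMMENTED" for state in states):
        # Reviewed, but nobody has approved yet.
        return True, "(no approving review)"
    return True, ""
```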

## Step 3 -- Merge-conflict gate

Re-fetch mergeable status until it settles (not UNKNOWN). `mergeable == "MERGEABLE"` passes. `mergeable == "CONFLICTING"` or `mergeStateStatus == "DIRTY"` excludes the PR.
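A sketch of the re-fetch loop, with the fetch step injected as a callable so the logic can be exercised without network access. The retry count, the delay, the reason strings, and the decision to exclude a PR whose status never settles are all assumptions; the rule only says to re-fetch until the status is no longer UNKNOWN.

```python
import time


def merge_conflict_gate(fetch, attempts=5, delay=2.0):
    """fetch() returns a dict with 'mergeable' and 'mergeStateStatus' keys."""
    pr = {}
    for attempt in range(attempts):
        pr = fetch()
        if pr.get("mergeable") != "UNKNOWN":
            break  # status has settled
        if attempt < attempts - 1:
            time.sleep(delay)
    if pr.get("mergeable") == "CONFLICTING" or pr.get("mergeStateStatus") == "DIRTY":
        return False, "merge conflict"
    if pr.get("mergeable") == "MERGEABLE":
        return True, ""
    return False, "mergeable status did not settle"
```

In practice `fetch` would wrap a call that re-reads the PR's `mergeable` and `mergeStateStatus` fields.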

## Step 4 -- CI gate, with the Read the Docs exception

Pull the check rollup as JSON. Classify:
- Any check with bucket `pending` -- exclude with reason `CI still running`
- A check with bucket `fail` on a non-RTD check -- exclude with reason `CI failure: <check name>`
- The RTD check (`docs/readthedocs.org:xarray-spatial`) failing is tolerated
- Every check in bucket `pass` or `skipping` -- the PR passes
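The classification above might be sketched like this, assuming the rollup is a list of objects with `name` and `bucket` fields. The function name is hypothetical, and excluding a PR on a bucket the rule does not mention is an assumption made to stay on the safe side.

```python
RTD_CHECK = "docs/readthedocs.org:xarray-spatial"


def ci_gate(checks: list[dict]) -> tuple[bool, str]:
    """Return (passes, reason) for the CI gate."""
    if any(check["bucket"] == "pending" for check in checks):
        return False, "CI still running"
    for check in checks:
        bucket, name = check["bucket"], check["name"]
        if bucket in ("pass", "skipping"):
            continue
        if bucket == "fail" and name == RTD_CHECK:
            continue  # a Read the Docs failure is tolerated
        if bucket == "fail":
            return False, f"CI failure: {name}"
        # Assumption: any bucket not covered by the rule excludes the PR.
        return False, f"unhandled check bucket: {bucket}"
    return True, ""
```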

## Step 5 -- Blockers-addressed gate

For each PR that cleared Steps 2-4, re-run the review to confirm no unresolved blockers remain. Zero blockers means the PR is ready. One or more blockers means excluded with reason `open review blockers (N)`.

## Step 6 -- Report

Print two sections: "Ready to merge" with qualifying PRs, and "Excluded" with every other open PR and the specific reason it did not qualify.

## General rules

- Do not apply labels, comment on any PR, or merge anything. The output is a report for a human to act on.