
Fixes #190: GPU-enable all classification operations #852

Merged

brendancol merged 8 commits into master from fixes-190-gpu-enabled-classification-ops on Feb 20, 2026

Conversation

brendancol (Contributor) commented Feb 19, 2026

Summary

  • Adds Dask+CuPy backend for equal_interval via dedicated _run_dask_cupy_equal_interval function
  • Replaces quantile Dask+CuPy NotImplementedError with a working implementation that materializes data to CPU for percentile computation
  • Adds CuPy, Dask+NumPy, and Dask+CuPy backends for natural_breaks by extracting a shared _compute_natural_break_bins helper that runs the Jenks algorithm on CPU, then delegates to _bin() for GPU/Dask classification
  • Adds 7 new tests covering all new backend combinations
  • Updates README feature matrix to reflect full backend support

All 5 classification functions (binary, reclassify, quantile, equal_interval, natural_breaks) now support all 4 backends (NumPy, Dask+NumPy, CuPy, Dask+CuPy).
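
A minimal usage sketch (not from the PR): the same call should now work regardless of which of the four backends holds the raster's data. Function and parameter names follow the library's public API; the values are illustrative.

```python
import numpy as np
import xarray as xr
from xrspatial.classify import equal_interval, natural_breaks, quantile

# raster.data could equally be a dask, cupy, or dask+cupy array.
raster = xr.DataArray(np.random.random((100, 100)), dims=("y", "x"))

eq = equal_interval(raster, k=5)                     # 5 equal-width classes
qt = quantile(raster, k=4)                           # 4 quantile classes
nb = natural_breaks(raster, k=5, num_sample=20_000)  # Jenks breaks from a sample
```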

Closes #190

Test plan

  • All 27 classify tests pass (20 existing + 7 new), including GPU backends
  • Verify no regressions in other test modules

…breaks now support all 4 backends (#190)

- Add Dask+CuPy backend for equal_interval via _run_dask_cupy_equal_interval
- Replace quantile Dask+CuPy NotImplementedError with a working implementation that materializes data to CPU for percentile computation
- Add CuPy, Dask+NumPy, and Dask+CuPy backends for natural_breaks by extracting shared _compute_natural_break_bins helper
- Add 7 new tests covering all new backend combinations
- Update README feature matrix to reflect full backend support
- quantile dask+cupy: replace full materialization with map_blocks(cupy.asnumpy)
  to convert chunks to CPU one at a time, then delegate to dask's streaming
  approximate percentile (see the first sketch after this commit list)
- natural_breaks dask backends: sample lazily from the dask array and only
  materialize the sample (default 20k points), not the entire dataset.
  Add a _generate_sample_indices helper that uses O(num_sample) memory via
  RandomState.choice() for large datasets, falling back to the original
  linspace+shuffle for small datasets to preserve determinism with the numpy
  backend (see the second sketch after this commit list)
- Remove unnecessary .ravel() in _run_equal_interval; nanmin/nanmax work on 2D
- Combine the double where(±inf) into a single isinf pass in _run_equal_interval
  and _run_cupy_bin, halving temporary allocations (sketched after this commit list)
- Use dask.compute(min, max) instead of two separate .compute() calls so dask
  reads the data once instead of twice (sketched after this commit list)
- Build cuts as a numpy array for all backends (it was needlessly a dask array
  for only k elements)
- Replace boolean fancy indexing in the dask natural_break functions with
  da.where + da.nanmax to preserve chunk structure (sketched after this commit list)
- Delete _run_dask_cupy_equal_interval; unified _run_equal_interval with
  module=da handles both dask+numpy and dask+cupy
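
A minimal sketch of the quantile dask+cupy path described above, assuming the data arrives as a dask array of cupy chunks; the function name and cut layout are illustrative, not the merged code:

```python
import cupy
import dask.array as da

def _dask_cupy_quantile_cuts(darr, k):
    # Convert each cupy chunk to numpy lazily -- no full materialization.
    cpu = darr.map_blocks(cupy.asnumpy)
    qs = [100.0 * i / k for i in range(1, k)]
    # dask.array.percentile works on 1-D arrays and merges per-chunk
    # summaries, so the data streams through chunk by chunk.
    return da.percentile(cpu.ravel(), qs).compute()
```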
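
A hedged sketch of the sampling-index helper named above; the size threshold, seed, and with-replacement draw are assumptions (the commit only specifies RandomState.choice with O(num_sample) memory and a linspace+shuffle fallback):

```python
import numpy as np

def _generate_sample_indices(data_size, num_sample, seed=1234):
    rs = np.random.RandomState(seed)
    if data_size > 10 * num_sample:
        # Large dataset: draw indices directly; with replacement this needs
        # only O(num_sample) memory rather than an O(data_size) permutation.
        return rs.choice(data_size, size=num_sample, replace=True)
    # Small dataset: keep the original linspace + shuffle so results stay
    # deterministic and match the numpy backend.
    idx = np.linspace(0, data_size - 1, num_sample, dtype=np.uint64)
    rs.shuffle(idx)
    return idx
```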
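
A sketch of the single isinf pass; `xp` stands for whichever array module (numpy, cupy, or dask.array) is in use, and the helper name is illustrative:

```python
import numpy as np

def _replace_inf(data, xp=np):
    # Before: two where() passes, one for +inf and one for -inf.
    #   data = xp.where(data == xp.inf, xp.nan, data)
    #   data = xp.where(data == -xp.inf, xp.nan, data)
    # After: a single isinf mask, halving the temporary allocations.
    return xp.where(xp.isinf(data), xp.nan, data)
```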
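
A sketch of the combined min/max computation; the helper name is illustrative:

```python
import dask
import dask.array as da

def _min_max(darr):
    # One graph traversal instead of da.nanmin(darr).compute() followed by
    # da.nanmax(darr).compute(), so dask reads the data once.
    return dask.compute(da.nanmin(darr), da.nanmax(darr))
```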
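
A sketch of the chunk-preserving where + nanmax pattern; the helper name and call site are assumptions:

```python
import numpy as np
import dask.array as da

def _max_at_or_below(darr, cut):
    # Boolean fancy indexing (darr[darr <= cut]) produces a dask array with
    # unknown chunk sizes; masking with where() keeps the original chunk
    # structure, and nanmax reduces it without materializing a selection.
    return da.nanmax(da.where(darr <= cut, darr, np.nan))
```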
… consistency

- Missing backend: natural_breaks dask+cupy num_sample
- Input mutation: verify all 5 functions don't modify the input DataArray
  (sketched after this list)
- Untested path: natural_breaks with num_sample=None
- Edge cases: equal_interval k=1, all-NaN input for equal_interval and
  natural_breaks
- Name parameter: verify default and custom name on all 5 functions
- Cross-backend: verify natural_breaks cupy and dask match numpy results
  on a separate 10x10 dataset
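
A minimal sketch of the input-mutation check referenced above (test name and fixture are illustrative, not the merged tests):

```python
import numpy as np
import xarray as xr
from xrspatial.classify import equal_interval

def test_equal_interval_does_not_mutate_input():
    data = np.arange(16, dtype="float64").reshape(4, 4)
    agg = xr.DataArray(data.copy(), dims=("y", "x"))
    equal_interval(agg, k=4)
    np.testing.assert_array_equal(agg.data, data)
```
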
Replace np.argpartition with np.argsort(kind='stable') so that
tied gap sizes are broken by index order, consistently selecting
the highest-value gaps.
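
A hedged sketch of the tie-breaking change, assuming the surrounding code picks the k largest gaps between sorted sample values (helper name and call site are assumptions):

```python
import numpy as np

def _largest_gap_indices(gaps, k):
    # argpartition makes no guarantee about which of several equal gaps is
    # picked; a stable argsort orders ties by index, so taking the tail of
    # the order consistently selects the gaps at the higher-value end.
    order = np.argsort(gaps, kind="stable")
    return np.sort(order[-k:])
```
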
brendancol merged commit ebb4872 into master on Feb 20, 2026
9 checks passed