Author of Proposal: @brendancol
Reason or problem
Reviewed the sieve implementation (added in #1159) for accuracy and performance. Found a few things worth fixing.
Accuracy
- Silent convergence limit. The merge loop caps at 50 iterations but never warns if it hits the limit. On pathological inputs (lots of cascading same-value merges), the function just returns a partially-sieved result with no indication anything went wrong.
- Integer nodata gap. Integer rasters can't express NaN, so classified rasters that use sentinel values like -9999 or 255 for nodata get those pixels treated as valid data. Not a bug -- the docstring says "NaN pixels are preserved" -- but worth noting for a future nodata parameter.
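A minimal sketch of the convergence warning, assuming the merge loop can be expressed as a callable that reports whether it changed anything. `merge_until_stable` and `step` are hypothetical names for illustration, not the current implementation; only the 50-iteration cap comes from the existing code.

```python
import warnings


def merge_until_stable(step, max_iterations=50):
    """Call ``step()`` until it reports no change or the cap is reached.

    ``step`` is a hypothetical callable that returns True when it merged
    at least one region. Returns the number of iterations that ran.
    """
    for iteration in range(1, max_iterations + 1):
        if not step():
            return iteration  # converged: this pass changed nothing
    # Every allowed pass still merged something, so tell the caller.
    warnings.warn(
        f"sieve did not converge after {max_iterations} iterations; "
        "the result may still contain regions below the threshold",
        RuntimeWarning,
        stacklevel=2,
    )
    return max_iterations
```

The warning category and message are placeholders; the point is that hitting the cap becomes visible to the caller instead of silently returning a partial result.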
Performance
- Per-value labeling. _label_all_regions calls scipy.ndimage.label once per unique raster value. For a land-cover raster with 50 classes on a 10k x 10k grid, that's 50 separate label passes over 100M pixels. A numba union-find can do this in one pass.
- Python-level adjacency loop. _build_adjacency extracts unique border pairs with np.unique then iterates them in a Python for loop. For large rasters with many region boundaries, this is the bottleneck after labeling.
- Unnecessary re-labeling. The outer loop re-labels the entire raster every iteration even when the inner merge loop didn't create any new same-value adjacencies (which is the only case that changes component structure).
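To illustrate the per-value labeling point, here is a rough sketch of union-find labeling whose cost does not depend on the number of classes: one scan to union equal-valued neighbours, then one scan to assign labels. It assumes 4-connectivity and label 0 for NaN pixels, and it is written in plain Python with NumPy so it runs anywhere; in the project the loops would presumably be compiled with numba. `label_regions`, `_find`, and `_union` are illustrative names.

```python
import numpy as np


def _find(parent, i):
    # Follow parent links to the root, halving the path as we go.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def _union(parent, a, b):
    ra, rb = _find(parent, a), _find(parent, b)
    if ra < rb:
        parent[rb] = ra  # the smaller index stays the root
    elif rb < ra:
        parent[ra] = rb


def label_regions(data):
    """Label 4-connected regions of equal value; NaN pixels get label 0."""
    rows, cols = data.shape
    flat = data.ravel()
    parent = np.arange(flat.size)

    # Scan 1: union each pixel with its left and upper neighbour
    # when they hold the same value. NaN never compares equal.
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            v = flat[i]
            if v != v:  # NaN check that also works for integer dtypes
                continue
            if c > 0 and flat[i - 1] == v:
                _union(parent, i, i - 1)
            if r > 0 and flat[i - cols] == v:
                _union(parent, i, i - cols)

    # Scan 2: give each root a consecutive label, in scan order.
    labels = np.zeros(flat.size, dtype=np.int64)
    root_label = np.zeros(flat.size, dtype=np.int64)
    next_label = 0
    for i in range(flat.size):
        if flat[i] != flat[i]:
            continue
        root = _find(parent, i)
        if root_label[root] == 0:
            next_label += 1
            root_label[root] = next_label
        labels[i] = root_label[root]
    return labels.reshape(rows, cols)
```

Because every pixel is visited a fixed number of times, a 50-class raster costs the same as a 2-class one, which is the property the per-value scipy.ndimage.label approach lacks.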
Proposal
- Warn when the 50-iteration limit is reached
- Replace per-value scipy.ndimage.label calls with a single-pass numba union-find
- Vectorize the adjacency builder (numpy fancy indexing instead of Python loop)
- Track whether merges changed the value topology; skip re-labeling when they didn't
- Add tests for the convergence warning and for larger synthetic rasters
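As one possible shape for the vectorized adjacency builder, the sketch below extracts border pairs with array slicing and then groups neighbours into CSR-style arrays without a Python loop. It assumes a 2D integer label array with labels 1..n and 0 for nodata, and 4-connectivity. Both function names are hypothetical, and the CSR layout is an assumption rather than what _build_adjacency currently returns.

```python
import numpy as np


def adjacent_label_pairs(labels):
    """Return unique (low, high) label pairs sharing a 4-connected border."""
    # Pair every pixel with its right neighbour and with the one below it.
    a = np.concatenate([labels[:, :-1].ravel(), labels[:-1, :].ravel()])
    b = np.concatenate([labels[:, 1:].ravel(), labels[1:, :].ravel()])
    # Keep true borders only: labels differ and neither side is nodata (0).
    keep = (a != b) & (a != 0) & (b != 0)
    a, b = a[keep], b[keep]
    pairs = np.stack([np.minimum(a, b), np.maximum(a, b)], axis=1)
    if pairs.shape[0] == 0:
        return pairs  # no borders, nothing to deduplicate
    return np.unique(pairs, axis=0)


def build_adjacency_csr(pairs, n_labels):
    """Turn unique border pairs into CSR-style neighbour lists.

    Neighbours of label k are ``indices[indptr[k]:indptr[k + 1]]``.
    """
    # Every border counts in both directions.
    src = np.concatenate([pairs[:, 0], pairs[:, 1]])
    dst = np.concatenate([pairs[:, 1], pairs[:, 0]])
    # Sorting by source label groups each label's neighbours together.
    order = np.argsort(src, kind="stable")
    indices = dst[order]
    counts = np.bincount(src, minlength=n_labels + 1)
    indptr = np.concatenate([[0], np.cumsum(counts)])
    return indptr, indices
```

A flat array layout like this also has the advantage that a numba-compiled merge loop can consume it directly, whereas a dict of Python sets cannot be passed into nopython code as easily.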
Value
The sieve function targets noisy classified rasters, which can easily have thousands of small regions. These changes keep it usable on larger inputs without changing the public API.
Drawbacks
Adds numba as a runtime dependency for the labeling path, but the project already uses numba everywhere (@ngjit).
Unresolved questions
Whether to add a nodata parameter for integer sentinel values. Leaving that for a separate issue.
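For reference when that issue is opened, a sketch of how a validity mask might combine NaN handling with an integer sentinel. The `nodata` argument and the `valid_mask` helper are hypothetical and not part of the current API.

```python
import numpy as np


def valid_mask(data, nodata=None):
    """Return True where a pixel holds real data.

    ``nodata`` is a hypothetical parameter used only for illustration.
    """
    mask = np.ones(data.shape, dtype=bool)
    # NaN can only occur in floating-point rasters.
    if np.issubdtype(data.dtype, np.floating):
        mask &= ~np.isnan(data)
    # An explicit sentinel covers integer rasters (and floats, if given).
    if nodata is not None:
        mask &= data != nodata
    return mask
```

With `nodata=None` an integer raster is entirely valid, which matches the current behaviour described above.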