Skip to content

Conversation

@jonas2612
Copy link
Contributor

Possible solution to Issue #1009

@codecov
Copy link

codecov bot commented Oct 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.20%. Comparing base (84db22b) to head (040021b).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1010   +/-   ##
=======================================
  Coverage   92.20%   92.20%           
=======================================
  Files          49       49           
  Lines        7560     7560           
=======================================
  Hits         6971     6971           
  Misses        589      589           
Files with missing lines Coverage Δ
src/spatialdata/models/models.py 88.61% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@melonora
Copy link
Collaborator

melonora commented Oct 30, 2025

Hi @jonas2612, Thanks for the PR. However, before I look further into this, could you please check whether the problem persists when you use the PR that unpins dask, #1006? I had a different issue with partitioning, but it could be connected so want to see if it would be fixed.

Also, in general, cat.as_known in the newer dask versions does not necessarily preserve the order anymore, so it might be better to wait with this PR a tiny bit as we expect to have the unpinning dask PR reviewed soon. It would then be better to branch of from that point given what I just mentioned.

@jonas2612
Copy link
Contributor Author

Hi @melonora. Yes, of course. I'll check it today and get back to you

@melonora
Copy link
Collaborator

Cheers, please also check this https://github.com/melonora/spatialdata/blob/e017ca7d6107623750196606a07fe8e4407c242f/src/spatialdata/_core/operations/rasterize.py#L674-L677.

This might provide some context for what I mentioned in my message above.

@jonas2612
Copy link
Contributor Author

That indeed looks very similar to the bug I get.
Still, I checked it with your PR too, but the error still exists:
image
Here, I expect either 500 or 550 genes, but only receive 17.

I understand the difficulty, although do not understand the background well, why the ordering of the categories is important. But if it's better for you, I can branch off and redo the change at a later time point after the PR is reviewed

@melonora
Copy link
Collaborator

@jonas2612, the reason why it is important is because as_known is basically reassigning a column. If the order is changed, a particular category belonging to a certain index can have changed.

@LucaMarconato
Copy link
Member

LucaMarconato commented Jan 3, 2026

Thanks @jonas2612 for the PR, I will merge it now. It introduces a performance regression but we can address it later.

I did some quick tests and it seems that

data[c] = data[c].cat.as_known()

takes the same time as

data[c] = data[c].cat.set_categories(data[c].compute().cat.categories)

Since there may be some concern around the order of categories with as_known() (@melonora do you have updates from the Dask team?), let's use compute() for now. I opened an issue to remember to explore faster alternatives: #1042.

@LucaMarconato
Copy link
Member

I added a test, which fails before the new change in the pair. As I explain in the test, compute() does not preserve the order of categories (same as as_known()). We may want to just accept it as the new behavior, or follow up with the dask team.

@LucaMarconato
Copy link
Member

I'll skip the CI from the latest push since I just removed a comment.

@LucaMarconato LucaMarconato merged commit 8022f5c into scverse:main Jan 3, 2026
1 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Categorical column in points element becomes filled with NaNs Categories missing with highly partitioned dask dataframes in PointsModel

3 participants