First cut at adding some parallelism in pyfive by bnlawrence · Pull Request #209 · NCAS-CMS/pyfive

bnlawrence · 2026-03-25T13:58:58Z

Description

This pull request addresses issue #208 by introducing a new mixin class for parallel access to chunks.

Checklist

This pull request has a descriptive title and labels
This pull request has a minimal description (most was discussed in the issue, but a two-liner description is still desirable)
Unit tests have been added (if codecov test fails)
Any changed dependencies have been added or removed correctly (if need be)
If you are working on the documentation, please ensure the current build passes
All tests pass

zequihg50 · 2026-03-26T10:43:29Z

My two main concerns with this implementation are:

fsspec cat_ranges - I’m not fully convinced about the efficiency of this approach. From the source, it appears to return a list of bytes objects to the caller, which likely introduces unnecessary allocations and copies. I think the rationale of this is to leverage fsspec’s concurrency (e.g., via an async HTTPStore), but I suspect this may still introduce blocking at some point in the pipeline. A lower-level approach might be more efficient, for example, working with memoryview and implementing custom concurrency. That said, this comes at the cost of increased implementation complexity...
Use of inheritance - I think this design would benefit from composition over inheritance. In particular, it feels unintuitive for a Dataset to inherit from ChunkRead, especially since HDF5 datasets can also be contiguous. A structure where something like ChunkedDataset extends Dataset (or composes chunk-reading behavior) would likely be more appropriate. More broadly, the API of this is far from trivial I would say, since this design could/should support different concurrency models (e.g., async vs threads).
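To make the "memoryview plus custom concurrency" idea from the first concern concrete, here is a minimal sketch using only the standard library. It is not pyfive or fsspec code: the function name `read_ranges_into` and its parameters are hypothetical, and it reads from a local file rather than a remote store. The point it illustrates is that each worker can fill its own slice of one preallocated buffer, so no per-range bytes objects need to be created and later copied.

```python
from concurrent.futures import ThreadPoolExecutor


def read_ranges_into(path, ranges, max_workers=4):
    """Read (start, end) byte ranges from `path` into one shared buffer.

    Hypothetical helper for illustration only. Returns the buffer and a
    list of memoryview slices, one per requested range, in request order.
    """
    sizes = [end - start for start, end in ranges]
    buffer = bytearray(sum(sizes))
    view = memoryview(buffer)

    # Carve the buffer into non-overlapping destination slices.
    slices, offset = [], 0
    for size in sizes:
        slices.append(view[offset:offset + size])
        offset += size

    def fetch(start, dest):
        # Each worker opens its own unbuffered handle so seeks do not interfere.
        with open(path, "rb", buffering=0) as fh:
            fh.seek(start)
            filled = 0
            while filled < len(dest):
                n = fh.readinto(dest[filled:])
                if not n:
                    raise EOFError("range extends past end of file")
                filled += n

    starts = [start for start, _ in ranges]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(fetch, starts, slices))
    return buffer, slices
```

Whether this beats `cat_ranges` in practice would need measuring against a real object store, since the cost there is likely dominated by request latency rather than copies.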

bnlawrence · 2026-03-26T10:53:28Z

> fsspec cat_ranges - I’m not fully convinced about the efficiency of this approach. From the source, it appears to return a list of bytes objects to the caller, which likely introduces unnecessary allocations and copies. I think the rationale of this is to leverage fsspec’s concurrency (e.g., via an async HTTPStore), but I suspect this may still introduce blocking at some point in the pipeline. A lower-level approach might be more efficient, for example, working with memoryview and implementing custom concurrency. That said, this comes at the cost of increased implementation complexity...

It does seem to be the only way to safely exploit asyncio in the context of fsspec, but yes, there is a lot to investigate. I've not actually tested this in anger yet.

bnlawrence · 2026-03-26T10:54:39Z

> Use of inheritance - I think this design would benefit from composition over inheritance. In particular, it feels unintuitive for a Dataset to inherit from ChunkRead, especially since HDF5 datasets can also be contiguous. A structure where something like ChunkedDataset extends Dataset (or composes chunk-reading behavior) would likely be more appropriate. More broadly, the API of this is far from trivial I would say, since this design could/should support different concurrency models (e.g., async vs threads).

Yes, I agree. I started out by wanting to do this. I am not quite sure how I ended up with this. I'm minded to persevere with it until we have sorted the performance issues out, then refactor it.
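For reference when the refactor happens, one possible shape of the composition approach is sketched below. None of these classes exist in pyfive: `ContiguousReader`, `ChunkedReader`, and this toy `Dataset` are hypothetical names, and the `map_fn` parameter is an assumed way of making the concurrency model pluggable (builtin `map` for serial reads, an executor's `map` for threads).

```python
from concurrent.futures import ThreadPoolExecutor


class ContiguousReader:
    """Reads a dataset stored as one contiguous block (hypothetical)."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable returning the whole payload

    def read(self):
        return self._fetch()


class ChunkedReader:
    """Reads a chunked dataset; the mapping strategy is injected (hypothetical)."""

    def __init__(self, chunk_ids, fetch_chunk, map_fn=map):
        self._chunk_ids = list(chunk_ids)
        self._fetch_chunk = fetch_chunk
        self._map = map_fn  # builtin map = serial; executor.map = threaded

    def read(self):
        # Both map() and Executor.map() yield results in input order.
        return b"".join(self._map(self._fetch_chunk, self._chunk_ids))


class Dataset:
    """Owns a reader instead of inheriting chunk-reading behaviour."""

    def __init__(self, name, reader):
        self.name = name
        self._reader = reader

    def read(self):
        return self._reader.read()


# Usage with a toy in-memory chunk store standing in for a file.
store = {0: b"ab", 1: b"cd", 2: b"ef"}
with ThreadPoolExecutor(max_workers=2) as pool:
    chunked = Dataset("chunked", ChunkedReader(store, store.__getitem__, pool.map))
    data = chunked.read()

contiguous = Dataset("contiguous", ContiguousReader(lambda: b"abcdef"))
```

With this layout a contiguous dataset never carries chunk machinery, and an async variant could be added as another reader without touching `Dataset`.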

bnlawrence · 2026-03-26T10:57:12Z

The other thing where we may get benefit is threading around the uncompress, which should also be embarrassingly parallel, and certainly not optimal for async. You'll note that at the moment that's still serial ...
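A minimal sketch of what threading the decompression step could look like, assuming the chunks are zlib/deflate compressed and already in memory. The helper name `decompress_chunks` is hypothetical and this is not the code in the PR. It relies on CPython's `zlib` module releasing the GIL while it inflates, which is my understanding of the implementation but worth confirming with a benchmark on real chunk sizes.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def decompress_chunks(compressed_chunks, max_workers=4):
    """Decompress independent chunks in a thread pool (hypothetical helper).

    Each chunk is decoded on its own, so no coordination between workers
    is needed. Results come back in the same order as the input.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(zlib.decompress, compressed_chunks))


# Toy input: eight compressed chunks of repeated bytes.
raw_chunks = [bytes([i]) * 1000 for i in range(8)]
decoded = decompress_chunks([zlib.compress(c) for c in raw_chunks])
```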

zequihg50 · 2026-03-26T15:55:47Z

> The other thing where we may get benefit is threading around the uncompress, which should also be embarrassingly parallel, and certainly not optimal for async. You'll note that at the moment that's still serial ...

Absolutely. Just as a reminder for my future self, this might be good to implement using two thread pools or async threads that communicate via some queue.
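To pin the idea down, here is a rough sketch of the two-pool variant: one pool fetches raw chunks and puts them on a queue, a second pool pulls from the queue and decodes. Everything here is an assumption for illustration (the name `fetch_then_decode`, the `fetch` and `decode` callables, the worker counts), and the queue is left unbounded for simplicity; a bounded queue would add backpressure but needs more care if a decoder fails.

```python
import queue
import zlib
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel telling a decoder loop to stop


def fetch_then_decode(chunk_ids, fetch, decode=zlib.decompress,
                      io_workers=4, cpu_workers=2):
    """Overlap fetching and decoding with two thread pools and a queue."""
    chunk_ids = list(chunk_ids)
    fetched = queue.Queue()  # unbounded: fetchers never block on put()
    results = {}

    def fetch_one(chunk_id):
        fetched.put((chunk_id, fetch(chunk_id)))

    def decode_loop():
        while True:
            item = fetched.get()
            if item is _DONE:
                return
            chunk_id, raw = item
            results[chunk_id] = decode(raw)

    with ThreadPoolExecutor(max_workers=cpu_workers) as cpu_pool:
        decoders = [cpu_pool.submit(decode_loop) for _ in range(cpu_workers)]
        try:
            with ThreadPoolExecutor(max_workers=io_workers) as io_pool:
                # list() waits for all fetches and re-raises fetch errors.
                list(io_pool.map(fetch_one, chunk_ids))
        finally:
            # Always release the decoders, even if a fetch failed.
            for _ in range(cpu_workers):
                fetched.put(_DONE)
        for future in decoders:
            future.result()  # re-raises any decode error
    return [results[chunk_id] for chunk_id in chunk_ids]
```

Decoding can start as soon as the first chunk arrives instead of waiting for the whole batch, which is the main thing this buys over fetching everything and then decompressing.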

First cut at adding some parallelism in pyfive

9d9afed

bnlawrence requested a review from zequihg50 March 25, 2026 13:59

controlling and logging parallelism

cca72e5

bnlawrence wants to merge 2 commits into main from parallel

2 participants
