Turn ParallelAnalysisBase into dask custom collection#136
Turn ParallelAnalysisBase into dask custom collection#136yuxuanzhuang wants to merge 36 commits intoMDAnalysis:masterfrom
Conversation
|
Hello @yuxuanzhuang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-08-20 10:39:58 UTC |
|
This PR really depends on PR #132 so we should look at that one first. Then you can rebase this one and it will become much cleaner. |
orbeckst
left a comment
There was a problem hiding this comment.
Overall this looks like a really interesting way to move forward. This, together with the notebook, is a good study for how the next version of PMDA could look like.
I have a bunch of initial questions/comments inline.
Also note that we would first need to merge PR #132 before really moving forward here.
We would also need to remove Python 2 as soon as we become dependent on MDA 2.0.0 (but that's for PR #132).
Tests will obviously be needed...
| return self._keys | ||
|
|
||
| # it uses multiprocessing scheduler in default | ||
| __dask_scheduler__ = staticmethod(dask.multiprocessing.get) |
There was a problem hiding this comment.
Even though multiprocessing is the default scheduler, one can still use distributed, right?
There was a problem hiding this comment.
Right, it can either be a global dask config, a context manager, or an arg in self.compute(). (https://docs.dask.org/en/latest/scheduler-overview.html#configuring-the-schedulers)
pmda/parallel.py
Outdated
| np.array([el[5] for el in res])) | ||
|
|
||
| # this is crucial if the analysis does not iterate over | ||
| # the whole trajectory. |
There was a problem hiding this comment.
Why is this crucial? What would happen? Add more comment.
There was a problem hiding this comment.
discussed here
https://github.com/MDAnalysis/pmda/pull/132/files#r455247843
|
|
||
| def __dask_postpersist__(self): | ||
| # we don't need persist implementation. | ||
| raise NotImplementedError |
There was a problem hiding this comment.
Will it not be possible to persist?
Presumably, that would have been possible previously if we had chosen persist in run instead of compute().
pmda/parallel.py
Outdated
| res.append(result_single_frame) | ||
| return res | ||
|
|
||
| def __getstate__(self): |
There was a problem hiding this comment.
Does DaskMixin require the whole class to be picklable?
There was a problem hiding this comment.
Yes...I mean in the old implementation, the whole class has to be picklable as well.
FYI, the code here is not needed anymore after MDAnalysis/mdanalysis#2893 is merged
pmda/parallel.py
Outdated
| """ | ||
| raise NotImplementedError | ||
|
|
||
| def prepare_jobs(self, |
There was a problem hiding this comment.
prepare_jobs sounds confusing to me – what "jobs"? If it's part of the documented workflow then it could just be prepare.
prepare_dask would be more explicit but also a bit pointless because PMDA is fully intertwined with dask so that's the only thing we would be preparing for. create_dask_graph is too long and really talks to much about implementation details.
All in all, I'd just call it prepare and add more docs stating clearly what is being prepared and under which circumstances a user needs to run it.
Fixes #135
Note the only file changes from #132 is
parallel.pyYou can read https://github.com/yuxuanzhuang/pmda/pull/1/files to get the actual changes.Changes made in this Pull Request:
PR Checklist