[Feature]: Restructure frame-parallel execution to include heavier frame-dependent stages

# Feature Request
## Problem / Motivation
The current frame-parallel implementation parallelises the `FrameGraph`/covariance stage, but the recent scaling results suggest that this stage may not dominate the full CodeEntropy runtime as much as expected, especially for the smaller benchmark systems.

Some expensive frame-dependent work still appears to happen before the current frame-parallel section, particularly in the static stage. This may include the dihedral/conformational analysis and neighbour calculation. As a result, the overall workflow scaling may be limited by serial work outside the current Dask frame execution path.

This also means each parallel task currently has a relatively small unit of work, as workers mainly process the covariance pathway for a single frame. A larger frame-based unit of work may reduce per-task overhead and improve scaling.

## Proposed Solution
Add clearer profiling/timing around the main `LevelDAG` stages to identify which parts of the workflow are still dominating runtime. This should include timings for:

* Static setup/stage execution
* Dihedral/conformational analysis
* Neighbour calculation
* `FrameGraph`/covariance execution
* Frame reduction/finalisation
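
As a starting point for the timings listed above, a small context manager could accumulate wall-clock time per stage. This is only a sketch: `timed_stage`, `stage_timings`, `report`, and the stage names are hypothetical and not part of CodeEntropy, and the `sum(range(...))` calls are placeholders for the real stage calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical accumulator: stage name -> total wall-clock seconds.
stage_timings = defaultdict(float)


@contextmanager
def timed_stage(name, timings=stage_timings):
    """Add the wall-clock time spent inside the block to timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start


def report(timings):
    """Return (stage, seconds, fraction of total) rows, slowest first."""
    total = sum(timings.values()) or 1.0
    rows = sorted(timings.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, secs, secs / total) for name, secs in rows]


# Placeholder workloads standing in for the real stages.
with timed_stage("static"):
    sum(range(100_000))
with timed_stage("frame_graph"):
    sum(range(200_000))

for name, secs, frac in report(stage_timings):
    print(f"{name:<12} {secs:.6f} s  {frac:6.1%}")
```

Because the timer accumulates into a dictionary, the same stage can be wrapped in several places (for example once per level) and still show up as one row in the report.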

If profiling confirms that frame-dependent work in the static stage is a significant bottleneck, investigate restructuring the workflow so more of this work is moved into the frame-parallel path.

The longer-term structure would be closer to:

```text
for frame or frame_chunk in selected_frames:
    compute covariance contribution
    compute neighbour contribution
    compute heavy frame-dependent dihedral/conformational contributions
    return compact partial results
```

rather than the current structure where only the covariance path is handled by the frame-parallel `FrameGraph`.
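
A minimal sketch of that shape, using only the standard library so it runs anywhere. Everything here is an assumption for illustration: frames are plain lists of 1D coordinates, `process_chunk`, `merge`, and `covariance` are hypothetical names, the neighbour count is a toy cutoff test, and `ThreadPoolExecutor` stands in for the Dask scheduler. The point is only that each task does all frame-local work for its chunk and returns compact sums that can be merged in any order.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(frames, cutoff=1.5):
    """Hypothetical frame-chunk task returning compact partial sums only."""
    n = len(frames[0])
    sum_x = [0.0] * n
    sum_xx = [[0.0] * n for _ in range(n)]
    pairs = 0
    for x in frames:
        for i in range(n):
            sum_x[i] += x[i]
            for j in range(n):
                sum_xx[i][j] += x[i] * x[j]  # covariance contribution
            for j in range(i + 1, n):
                if abs(x[i] - x[j]) < cutoff:  # toy neighbour contribution
                    pairs += 1
    return {"n_frames": len(frames), "sum_x": sum_x,
            "sum_xx": sum_xx, "neighbour_pairs": pairs}


def merge(a, b):
    """Combine two partial results by element-wise addition."""
    return {
        "n_frames": a["n_frames"] + b["n_frames"],
        "sum_x": [p + q for p, q in zip(a["sum_x"], b["sum_x"])],
        "sum_xx": [[p + q for p, q in zip(ra, rb)]
                   for ra, rb in zip(a["sum_xx"], b["sum_xx"])],
        "neighbour_pairs": a["neighbour_pairs"] + b["neighbour_pairs"],
    }


def covariance(total):
    """Population covariance from merged sums: E[x x^T] - E[x] E[x]^T."""
    n = total["n_frames"]
    mean = [s / n for s in total["sum_x"]]
    size = len(mean)
    return [[total["sum_xx"][i][j] / n - mean[i] * mean[j]
             for j in range(size)] for i in range(size)]


# Synthetic frames: 4 frames, 3 coordinates each.
frames = [[0.0, 1.0, 3.0], [1.0, 2.0, 5.0], [2.0, 2.5, 4.0], [0.5, 0.0, 1.0]]

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(process_chunk, chunked(frames, 2)))

total = reduce(merge, partials)
cov = covariance(total)
```

Because `merge` is plain addition, the reduction gives the same totals regardless of how frames are grouped into chunks (up to floating-point rounding), which is what allows the chunk size to be tuned independently of the analysis.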

For dihedral/conformational analysis, this may require a map-reduce style approach because some parts depend on trajectory-wide information, such as peak/state assignment. For example:

```text
Pass 1:
    workers compute partial dihedral angle/histogram data per frame chunk

Reduce:
    combine partial histograms and identify global peaks/states

Pass 2:
    workers assign conformational states using the global peak/state data

Reduce:
    combine final state counts/populations
```
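
The two passes above can be sketched in plain Python. This is an illustration under stated assumptions, not the existing CodeEntropy algorithm: angles are in degrees, the bin count is arbitrary, peaks are taken as simple local maxima of the circular histogram, and states are assigned to the nearest peak. All function names are hypothetical, and the list comprehensions stand in for tasks submitted to workers.

```python
from collections import Counter

N_BINS = 9                    # hypothetical choice: 40-degree bins
BIN_WIDTH = 360.0 / N_BINS


def partial_histogram(angles):
    """Pass 1 (per chunk): histogram of dihedral angles in degrees."""
    hist = [0] * N_BINS
    for a in angles:
        hist[int((a % 360.0) // BIN_WIDTH) % N_BINS] += 1
    return hist


def combine_histograms(hists):
    """Reduce: bin-wise sum of the per-chunk histograms."""
    return [sum(col) for col in zip(*hists)]


def find_peaks(hist):
    """Bin centres of local maxima, treating the histogram as circular."""
    peaks = []
    for i, count in enumerate(hist):
        left = hist[i - 1]
        right = hist[(i + 1) % len(hist)]
        if count > 0 and count > left and count >= right:
            peaks.append((i + 0.5) * BIN_WIDTH)
    return peaks


def circular_distance(a, b):
    """Shortest angular distance between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def assign_states(angles, peaks):
    """Pass 2 (per chunk): count frames per nearest global peak."""
    counts = Counter()
    for a in angles:
        state = min(range(len(peaks)),
                    key=lambda k: circular_distance(a, peaks[k]))
        counts[state] += 1
    return counts


# Synthetic dihedral angles for one dihedral, split into three frame chunks.
chunks = [[58, 62, 65, 178], [182, 185, 61, 299], [301, 296, 59, 180]]

# Pass 1 + reduce: global histogram and peaks.
global_hist = combine_histograms([partial_histogram(c) for c in chunks])
peaks = find_peaks(global_hist)

# Pass 2 + reduce: global state counts.
state_counts = sum((assign_states(c, peaks) for c in chunks), Counter())
```

Only the histograms, the peak list, and the state counts cross the worker boundary, so the data exchanged stays small regardless of trajectory length. The cost is that the trajectory is read twice, which would need to be weighed against the serial time saved.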

Neighbour calculation may be a simpler first candidate, as it already appears to follow a frame-based structure.

## Alternatives Considered

* Keep the current implementation as covariance-only frame parallelism.

  * This is useful and provides the initial Dask/HPC infrastructure, but may not give the strongest whole-workflow scaling if other serial stages dominate.

* Only optimise individual functions within the static stage.

  * This may improve runtime locally, but would not address the larger issue that expensive frame-dependent work remains outside the frame-parallel execution path.

* Increase the number of Dask workers without changing the task structure.

  * This is unlikely to fully solve the issue if the parallel task size remains small and significant serial work remains outside the parallel path.
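
The last point is essentially Amdahl's law: if only a fraction of the runtime is parallelised, the serial remainder caps the achievable speedup no matter how many workers are added. The sketch below evaluates that bound for a few parallel fractions. The fractions are hypothetical placeholders, not measured CodeEntropy values, and the formula ignores scheduling and communication overhead, so real speedups would be lower.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Ideal speedup when only `parallel_fraction` of the runtime scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Hypothetical parallel fractions; replace with profiled values.
for fraction in (0.4, 0.7, 0.95):
    speedups = [amdahl_speedup(fraction, n) for n in (2, 8, 32)]
    upper_bound = 1.0 / (1.0 - fraction)
    print(f"parallel={fraction:.2f}",
          " ".join(f"{s:5.2f}x" for s in speedups),
          f"(limit {upper_bound:.1f}x)")
```

Once stage timings are available, the measured parallel fraction can be substituted to estimate how much moving a given stage into the frame-parallel path could help.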

## Expected Impact

* Clearer understanding of where CodeEntropy runtime is spent after the initial frame-parallel implementation.
* Better evidence for whether dihedral/conformational analysis, neighbour calculation, or another stage is limiting scaling.
* Potentially stronger Dask/HPC scaling by increasing the amount of useful work done per worker.
* Cleaner long-term parallel structure, closer to an outer frame/chunk loop where all frame-dependent work is grouped together.
* Potential memory improvements by returning compact partial sums, histograms, or counts instead of building larger all-frame objects where possible.
* Better benchmark evidence for future paper edits and performance discussion.

## Additional Context

The current frame-parallel implementation is an important first step because it introduces the explicit frame-local boundary and Dask/HPC execution infrastructure.

Initial profiling with SnakeViz suggested that the `FrameGraph`/covariance pathway was the main runtime cost, which motivated parallelising that section first. However, benchmark scaling suggests that other workflow stages may still be contributing enough serial runtime to limit overall speedup.

This issue is intended as a follow-up investigation and possible restructuring step, rather than a replacement for the current implementation.

