JobGroup.SUPPORTED_EXECUTORS is closed to downstream Executor subclasses #537

@nicgupta-nvidia

Use case

nemo-skills' RayExecutor is a downstream Executor subclass that needs to be passable to JobGroup. Today JobGroup.__post_init__ (around nemo_run/run/job.py:265) asserts isinstance(self.executor, JobGroup.SUPPORTED_EXECUTORS) where SUPPORTED_EXECUTORS = [SlurmExecutor, DockerExecutor, LocalExecutor]. Any downstream package adding a new Executor type — Ray, Kubernetes, etc. — hits this assertion and cannot construct a JobGroup.

Why this matters now

nemo-skills now ships RayExecutor and uses JobGroup for multi-script eval-generation flows (vLLM + sandbox + client co-located). The Ray multi-script path is a separate architectural concern (single Ray submission = single container, vs. the heterogeneous group semantics JobGroup was designed for on Slurm) — but the immediate question is whether downstream Executor subclasses are even a supported extension point.

Current workaround

(Marking this clearly as a workaround, not a proposed PR.) A downstream project patches the assertion at runtime with a class-name string check (`type(self.executor).__name__ == "RayExecutor"`) to avoid a circular import. The check is intentionally narrow, but matching on a class-name string is not idiomatic for upstream.

Proposed designs

Please indicate preference before we send a PR.

1. `SUPPORTED_EXECUTORS` extension hook

Downstream packages register their Executor subclass at import time:

```python
from nemo_run.run.job import JobGroup
# RayExecutor is the downstream Executor subclass, defined in the registering package.
JobGroup.SUPPORTED_EXECUTORS = (*JobGroup.SUPPORTED_EXECUTORS, RayExecutor)
```

  • Pro: explicit, discoverable.
  • Con: requires downstream packages to mutate a class attribute, which feels brittle.

2. Sentinel attribute on the Executor subclass

Downstream marks compatibility:

```python
class RayExecutor(Executor):
    _jobgroup_compatible = True
```

…and `__post_init__` checks `getattr(executor, "_jobgroup_compatible", False)` in addition to the existing `isinstance` check against `SUPPORTED_EXECUTORS`.

  • Pro: no mutation of upstream state.
  • Con: requires JobGroup to know about the sentinel.
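A minimal sketch of how the option 2 check could look, using stand-in classes rather than the real nemo_run types. `is_jobgroup_compatible` and `UnmarkedExecutor` are hypothetical names for illustration; the real check would live in `JobGroup.__post_init__`.

```python
# Stand-ins for illustration only; these are not the real nemo_run classes.
class Executor:
    pass


class SlurmExecutor(Executor):
    pass


class RayExecutor(Executor):
    # Downstream opt-in marker proposed in option 2.
    _jobgroup_compatible = True


class UnmarkedExecutor(Executor):
    pass


# Tuple form, because isinstance() requires a type or a tuple of types.
SUPPORTED_EXECUTORS = (SlurmExecutor,)


def is_jobgroup_compatible(executor: Executor) -> bool:
    """Hypothetical helper: built-in executors pass, others must opt in."""
    return isinstance(executor, SUPPORTED_EXECUTORS) or bool(
        getattr(executor, "_jobgroup_compatible", False)
    )
```

In this sketch a subclass without the marker is still rejected, so the existing behaviour is unchanged for anything that has not opted in.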

3. `Executor.supports_job_group()` classmethod on the base

Defaults to `False`, overridable downstream. Same shape as option 2 but more discoverable in IDEs.
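A rough sketch of option 3, again with stand-in classes instead of the real nemo_run ones. `check_executor` and `PlainExecutor` are hypothetical names; only `supports_job_group()` comes from the proposal above.

```python
class Executor:
    """Stand-in base class, not the real nemo_run Executor."""

    @classmethod
    def supports_job_group(cls) -> bool:
        # Conservative default: subclasses must opt in explicitly.
        return False


class RayExecutor(Executor):
    @classmethod
    def supports_job_group(cls) -> bool:
        return True


class PlainExecutor(Executor):
    pass


def check_executor(executor: Executor) -> None:
    """Hypothetical replacement for the assertion in __post_init__."""
    if not executor.supports_job_group():
        raise TypeError(f"Unsupported executor type: {type(executor).__name__}")
```

With a `False` default, the built-in executors would either need to override the method or stay covered by the existing `SUPPORTED_EXECUTORS` check; which of the two is preferable is an open question for maintainers.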

Note on JobGroup.launch path

Even with the assertion relaxed, `JobGroup.launch` calls `nemo_run.run.torchx_backend.launcher.launch(executor=...)`, which routes through `EXECUTOR_MAPPING` in `torchx_backend/schedulers/api.py:30`. That mapping has no Ray entry, so `get_executor_str(RayExecutor)` raises `KeyError`. Relaxing the assertion is therefore necessary but not sufficient for a Ray multi-script JobGroup to launch end-to-end. The pragmatic answer for the multi-script case is a multi-pool architecture (pre-host components in separate Ray submissions; collapse multi-script to single-script), but in principle the assertion remains too strict for any downstream Executor subclass.
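To illustrate the second failure point, here is a simplified model of a type-keyed lookup with no Ray entry. The mapping shape, the placeholder scheduler string, and the body of `get_executor_str` are assumptions for illustration; the real definitions in `torchx_backend/schedulers/api.py` are not reproduced here.

```python
# Stand-ins for illustration only; not the real nemo_run classes or mapping.
class SlurmExecutor:
    pass


class RayExecutor:
    pass


# Assumed shape: executor class -> scheduler name (placeholder string).
EXECUTOR_MAPPING = {SlurmExecutor: "placeholder-slurm-scheduler"}


def get_executor_str(executor) -> str:
    """Simplified lookup; a missing entry surfaces as KeyError."""
    return EXECUTOR_MAPPING[type(executor)]
```

Under this model, passing the `__post_init__` check does not help a `RayExecutor`: the lookup still fails until the mapping gains an entry for it.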

Reference

Prior PR #410 was the last touch on `nemo_run/run/job.py`. Searching closed issues for "Unsupported executor type" returned 0 hits, so this is a fresh report.

Ask

Which of the three designs (or a fourth) would you accept as a PR? Happy to send code once direction is confirmed.
