feat(dynamo): add target_executorch setting to keep output-allocator ops in PyTorch #4355

Open
shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:target-executorch-setting

@shoumikhin shoumikhin commented Jun 21, 2026

Contributor

Description

Some converters require a TensorRT output allocator because their output shape is
data-dependent (for example aten.nonzero). Not every downstream runtime that executes
the compiled program can consume a TensorRT engine that needs an output allocator.

This adds a target_executorch compile setting (default False). When enabled, every
operator whose converter sets requires_output_allocator is routed to
torch_executed_ops and runs in PyTorch instead of being lowered into a TensorRT
engine. When disabled (the default), behavior is unchanged.
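
The effect of the flag can be illustrated with a minimal, self-contained sketch. It does not import Torch-TensorRT: `SketchSettings` and `route_output_allocator_ops` are hypothetical stand-ins written for this illustration, and the op names are assumed spellings, not taken from the patch.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class SketchSettings:
    """Hypothetical stand-in for the real compilation settings object."""

    target_executorch: bool = False
    torch_executed_ops: Set[str] = field(default_factory=set)


def route_output_allocator_ops(
    settings: SketchSettings, allocator_ops: Set[str]
) -> SketchSettings:
    """Add allocator-requiring ops to torch_executed_ops when the flag is on."""
    if not settings.target_executorch:
        return settings  # flag off: settings are left untouched
    settings.torch_executed_ops = set(settings.torch_executed_ops) | allocator_ops
    return settings


# Assumed op name, used only for illustration.
allocator_ops = {"aten.nonzero.default"}

off = route_output_allocator_ops(SketchSettings(), allocator_ops)
on = route_output_allocator_ops(
    SketchSettings(target_executorch=True, torch_executed_ops={"aten.add.Tensor"}),
    allocator_ops,
)

print(sorted(off.torch_executed_ops))
print(sorted(on.torch_executed_ops))
```

With the flag off the set of PyTorch-executed ops is unchanged; with it on, the allocator-requiring op is added alongside whatever the caller already listed.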

Details

  • Discovery and routing are factored into two small helpers (_output_allocator_ops
    and _route_output_allocator_ops) so both are unit-testable without a GPU. The
    registry walk handles a single converter, a list/tuple, or a priority-keyed dict,
    and is conservative: if any converter for a target needs an allocator, the whole
    target is routed to PyTorch so an allocator engine is never emitted.
  • Wired through compile() and cross_compile_for_windows(); the routing runs in
    compile_module(), which both entry points funnel through. It is intentionally not
    exposed on convert_exported_program_to_serialized_trt_engine(), where a single
    serialized engine cannot contain PyTorch fallbacks.
  • Combining target_executorch with require_full_compilation raises a clear error,
    since routing ops to PyTorch contradicts full compilation.
  • CompilationSettings.__setstate__ defaults the new field so older pickles load.
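
The registry walk and the conflict check described in the list above might look roughly like the following sketch. It is an approximation under stated assumptions: `FakeConverter`, `iter_converters`, `output_allocator_ops`, `check_settings`, the registry layout, the `demo.*` target names, and the use of `ValueError` are all hypothetical; only the `requires_output_allocator` attribute name comes from the description.

```python
from typing import Any, Dict, Iterator, Set


class FakeConverter:
    """Hypothetical converter record; only the allocator flag matters here."""

    def __init__(self, requires_output_allocator: bool = False) -> None:
        self.requires_output_allocator = requires_output_allocator


def iter_converters(entry: Any) -> Iterator[Any]:
    """Yield converters from a single converter, a list/tuple, or a dict."""
    if isinstance(entry, dict):
        for value in entry.values():
            yield from iter_converters(value)
    elif isinstance(entry, (list, tuple)):
        for item in entry:
            yield from iter_converters(item)
    else:
        yield entry


def output_allocator_ops(registry: Dict[str, Any]) -> Set[str]:
    """Conservative discovery: one allocator converter flags the whole target."""
    return {
        target
        for target, entry in registry.items()
        if any(
            getattr(conv, "requires_output_allocator", False)
            for conv in iter_converters(entry)
        )
    }


def check_settings(target_executorch: bool, require_full_compilation: bool) -> None:
    """Reject the contradictory combination (exception type is an assumption)."""
    if target_executorch and require_full_compilation:
        raise ValueError(
            "target_executorch routes ops to PyTorch and cannot be combined "
            "with require_full_compilation"
        )


registry = {
    "demo.plain": FakeConverter(),
    "demo.mixed_list": [FakeConverter(), FakeConverter(True)],
    "demo.priority_dict": {0: FakeConverter(True), 1: FakeConverter()},
}

found = output_allocator_ops(registry)
print(sorted(found))
```

Using `any(...)` over all converters for a target is what makes the walk conservative: a target with even one allocator-requiring converter is treated as requiring an allocator.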

The name is deliberate: it gates ExecuTorch-targeted routing, and further
ExecuTorch-specific behavior can accrete under the same flag.

Tests

tests/py/dynamo/models/test_target_executorch.py:

  • the setting defaults to False and is settable;
  • a state missing the field (older pickle) restores to False;
  • output-allocator converters are discoverable via requires_output_allocator;
  • routing is a no-op when the flag is off;
  • routing adds the op to torch_executed_ops when on (CPU only, no GPU needed);
  • combining with require_full_compilation raises;
  • end to end on GPU, a data-dependent op (nonzero) falls back to PyTorch.
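
The older-pickle case in the list above can be sketched without a GPU or Torch-TensorRT. This is an illustration only: `SketchSettings` and its `debug` field are hypothetical, and the sketch calls `__setstate__` directly with a saved state dict, which is the hook `pickle` invokes when restoring an object.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class SketchSettings:
    """Hypothetical stand-in for a settings class that gained a new field."""

    debug: bool = False  # pre-existing field (assumed, for illustration)
    target_executorch: bool = False  # newly added field

    def __setstate__(self, state: Dict[str, Any]) -> None:
        # Restore whatever was saved, then default the field that
        # older states do not contain.
        self.__dict__.update(state)
        self.__dict__.setdefault("target_executorch", False)


# State as an older version would have saved it: no target_executorch key.
old_state = {"debug": True}
restored_old = SketchSettings.__new__(SketchSettings)
restored_old.__setstate__(old_state)

# State saved by the newer version keeps its explicit value.
new_state = {"debug": False, "target_executorch": True}
restored_new = SketchSettings.__new__(SketchSettings)
restored_new.__setstate__(new_state)

print(restored_old.target_executorch, restored_new.target_executorch)
```

Using `setdefault` keeps an explicitly saved value intact and only fills the field in when the saved state predates it.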

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • My code follows the style guidelines of this project (isort + black)
  • I have added tests that prove my fix/feature works
  • Commit is signed off (DCO)

@meta-cla meta-cla Bot added the cla signed label Jun 21, 2026
@github-actions github-actions Bot added the component: tests, component: core, component: api [Python], and component: dynamo labels Jun 21, 2026
@github-actions github-actions Bot requested a review from cehongwang June 21, 2026 00:28
@shoumikhin shoumikhin force-pushed the target-executorch-setting branch from fb85c0d to 424b09f Compare June 21, 2026 10:28
Signed-off-by: shoumikhin <shoumikhin@meta.com>
@shoumikhin shoumikhin force-pushed the target-executorch-setting branch from 424b09f to 85fa1eb Compare June 22, 2026 02:58