feat(dynamo): add target_executorch setting to keep output-allocator ops in PyTorch #4355
Open
shoumikhin wants to merge 1 commit into
fb85c0d to 424b09f
…ops in PyTorch

Some converters require a TensorRT output allocator because their output shape is data-dependent (for example aten.nonzero). A TensorRT engine that needs an output allocator cannot be consumed by every downstream runtime that executes the compiled program.

This adds a target_executorch compile setting (default False). When enabled, every operator whose converter sets requires_output_allocator is routed to torch_executed_ops and runs in PyTorch instead of being lowered into a TensorRT engine. When disabled (the default), behavior is unchanged.

Details:
- Discovery and routing live in two small helpers (_output_allocator_ops and _route_output_allocator_ops) so both are unit-testable without a GPU. The registry walk handles a single converter, a list/tuple, or a priority-keyed dict, and is conservative: if any converter for a target needs an allocator, the whole target is routed to PyTorch so an allocator engine is never emitted.
- Wired through compile() and cross_compile_for_windows(); the routing runs in compile_module(), which both entry points funnel through. It is intentionally not exposed on convert_exported_program_to_serialized_trt_engine(), where a single serialized engine cannot contain PyTorch fallbacks.
- Combining target_executorch with require_full_compilation raises a clear error, since routing ops to PyTorch contradicts full compilation.
- CompilationSettings.__setstate__ defaults the new field so older pickles load.

The name is deliberate: it gates ExecuTorch-targeted routing, and further ExecuTorch-specific behavior can accrete under the same flag.

Tests (tests/py/dynamo/models/test_target_executorch.py): default value; old-pickle compatibility; output-allocator op discovery; routing is a no-op when disabled; routing adds the op when enabled (CPU only); the require_full_compilation conflict; and an end to end GPU test that a data-dependent op falls back to PyTorch.

Signed-off-by: shoumikhin <shoumikhin@meta.com>
424b09f to 85fa1eb
Description
Some converters require a TensorRT output allocator because their output shape is data-dependent (for example aten.nonzero). A TensorRT engine that needs an output allocator cannot be consumed by every downstream runtime that executes the compiled program.

This adds a target_executorch compile setting (default False). When enabled, every operator whose converter sets requires_output_allocator is routed to torch_executed_ops and runs in PyTorch instead of being lowered into a TensorRT engine. When disabled (the default), behavior is unchanged.
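A minimal sketch of the routing rule described above, written in plain Python so it runs without torch_tensorrt or a GPU. SketchSettings and route_output_allocator_ops are hypothetical stand-ins for illustration, not the PR's code; only the three field names come from the description. Where the conflict check lives and which exception it raises are assumptions (ValueError is used here).

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class SketchSettings:
    """Hypothetical stand-in for the compile settings; only the fields used here."""

    target_executorch: bool = False
    require_full_compilation: bool = False
    torch_executed_ops: Set[str] = field(default_factory=set)


def route_output_allocator_ops(
    settings: SketchSettings, allocator_ops: Iterable[str]
) -> Set[str]:
    """Return the set of ops that should run in PyTorch.

    Disabled (the default): the user's torch_executed_ops come back unchanged.
    Enabled: every op that needs an output allocator is added to that set.
    """
    if not settings.target_executorch:
        return set(settings.torch_executed_ops)
    if settings.require_full_compilation:
        # Assumption: the exception type and message are illustrative only.
        raise ValueError(
            "target_executorch routes ops to PyTorch, which contradicts "
            "require_full_compilation"
        )
    return set(settings.torch_executed_ops) | set(allocator_ops)
```

With the flag off, passing ["aten.nonzero"] leaves the user's set untouched; with the flag on, aten.nonzero is added to whatever the user already listed. The function returns a new set rather than mutating the settings, which keeps the no-op case easy to assert in a CPU-only test.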
Details
- Discovery and routing live in two small helpers (_output_allocator_ops and _route_output_allocator_ops) so both are unit-testable without a GPU. The registry walk handles a single converter, a list/tuple, or a priority-keyed dict, and is conservative: if any converter for a target needs an allocator, the whole target is routed to PyTorch so an allocator engine is never emitted.
- Wired through compile() and cross_compile_for_windows(); the routing runs in compile_module(), which both entry points funnel through. It is intentionally not exposed on convert_exported_program_to_serialized_trt_engine(), where a single serialized engine cannot contain PyTorch fallbacks.
- Combining target_executorch with require_full_compilation raises a clear error, since routing ops to PyTorch contradicts full compilation.
- CompilationSettings.__setstate__ defaults the new field so older pickles load.

The name is deliberate: it gates ExecuTorch-targeted routing, and further ExecuTorch-specific behavior can accrete under the same flag.
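Two of the details above, the conservative registry walk and the __setstate__ default, can be sketched in plain Python. Everything here is hypothetical and for illustration: SketchConverter, the registry layout (a dict from target name to a converter, a list/tuple of converters, or a priority-keyed dict), and SketchPickledSettings are assumptions, not the actual registry or CompilationSettings classes. Only the requires_output_allocator and target_executorch names come from the description.

```python
from typing import Any, Dict, Iterator, Set


class SketchConverter:
    """Hypothetical converter record; only the flag that matters here."""

    def __init__(self, requires_output_allocator: bool = False) -> None:
        self.requires_output_allocator = requires_output_allocator


def _iter_converters(entry: Any) -> Iterator[Any]:
    """Flatten a registry entry: single converter, list/tuple, or keyed dict."""
    if isinstance(entry, dict):
        for value in entry.values():
            yield from _iter_converters(value)
    elif isinstance(entry, (list, tuple)):
        for item in entry:
            yield from _iter_converters(item)
    else:
        yield entry


def output_allocator_ops(registry: Dict[str, Any]) -> Set[str]:
    """Conservative discovery: a target counts if ANY of its converters
    sets requires_output_allocator."""
    return {
        target
        for target, entry in registry.items()
        if any(
            getattr(conv, "requires_output_allocator", False)
            for conv in _iter_converters(entry)
        )
    }


class SketchPickledSettings:
    """Hypothetical settings class showing the old-pickle default."""

    def __init__(self) -> None:
        self.debug = False
        self.target_executorch = False

    def __setstate__(self, state: Dict[str, Any]) -> None:
        # Restore whatever the pickle carried, then fill in the new field
        # if the pickle predates it.
        self.__dict__.update(state)
        self.__dict__.setdefault("target_executorch", False)
```

In this sketch a target whose priority-keyed entry holds one allocator-free converter and one allocator-requiring converter is still reported, which mirrors the "any converter" rule. State saved before the field existed, for example {"debug": True}, restores with target_executorch set to False while a value already present in the state is left alone.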
Tests
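tests/py/dynamo/models/test_target_executorch.py:
- the setting defaults to False and is settable;
- an old pickle loads with the field defaulted to False;
- discovery finds the ops whose converter sets requires_output_allocator;
- routing is a no-op when off and adds the ops to torch_executed_ops when on (CPU only, no GPU needed);
- combining with require_full_compilation raises;
- end-to-end on GPU, a data-dependent op (nonzero) falls back to PyTorch.

Type of change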
Checklist