Skip to content

refactor: move Fabric/USD dispatch from FabricFrameView to FrameView factory#5673

Draft
pv-nvidia wants to merge 1 commit into
isaac-sim:developfrom
pv-nvidia:pv/fabric-view-no-fallback
Draft

refactor: move Fabric/USD dispatch from FabricFrameView to FrameView factory#5673
pv-nvidia wants to merge 1 commit into
isaac-sim:developfrom
pv-nvidia:pv/fabric-view-no-fallback

Conversation

@pv-nvidia
Copy link
Copy Markdown
Contributor

@pv-nvidia pv-nvidia commented May 18, 2026

Problem

FabricFrameView has an internal _use_fabric flag that falls back to UsdFrameView when Fabric is disabled or the device is unsupported. This is a misleading abstraction: the class claims to be a Fabric-accelerated view but sometimes silently delegates everything to a different (slower) implementation. Callers have no way to know which code path actually ran.

Solution

Move the Fabric-enabled + device-supported check from FabricFrameView.__init__ up to the FrameView factory's _get_backend() method. The factory now dispatches to three implementations:

Condition Backend key Class
PhysX + Fabric enabled + supported device physx FabricFrameView
PhysX without Fabric or unsupported device usd UsdFrameView
Newton newton NewtonSiteFrameView

UsdFrameView is eagerly registered on the factory since it lives in isaaclab (not a backend package like isaaclab_physx), so FactoryBase's dynamic import (isaaclab_{backend}.sim.views) can't discover it.

Changes

  • FrameView._get_backend(): checks /physics/fabricEnabled setting + device support via _is_fabric_supported_device(), returns "physx", "usd", or "newton"
  • _is_fabric_supported_device(): accepts any valid cuda:<N> index (not just cuda:0) for multi-GPU support
  • FabricFrameView: stripped of all _use_fabric conditionals, SettingsManager import, _fabric_supported_devices constant — it just does Fabric, nothing else
  • UsdFrameView: eagerly registered as "usd" backend on the factory

Result

Each FrameView implementation now has a single responsibility. FabricFrameView assumes Fabric is available (the factory guarantees this). No runtime behavior change for existing callers — the same code paths execute, just dispatched from a cleaner location.

Multi-GPU setups (cuda:1, cuda:2, etc.) are properly supported by the device check function.

Dependencies

None — this PR applies cleanly to current develop.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

PR #5673: refactor: move Fabric/USD dispatch from FabricFrameView to FrameView factory

Summary

This PR contains 3 commits that introduce a service locator pattern and refactor the FrameView backend dispatch logic:

  1. 596bedb - Add typed ServiceLocator to SimulationContext
  2. dac5a16 - Add unit tests and changelog fragment
  3. 4c71366 - Move Fabric/USD dispatch from FabricFrameView to FrameView factory

✅ What's Good

  1. Clean Service Locator Implementation (service_locator.py)

    • Well-designed subscript API (services[cls] = instance)
    • Proper lifecycle management with close() handling
    • Good defensive coding with _try_close() checking callable
  2. Improved Separation of Concerns

    • Factory pattern in FrameView now handles backend selection logic
    • FabricFrameView is now simpler - only handles Fabric operations
    • Removed conditional checks inside FabricFrameView methods
  3. Comprehensive Tests

    • test_service_locator.py covers edge cases (non-callable close, missing keys, etc.)
    • Good coverage of the service locator API
  4. Proper Integration

    • SimulationContext.clear_instance() now calls services.close_all()
    • Services property exposed for external use

🔍 Review Items

1. _FABRIC_SUPPORTED_DEVICES Duplication (Minor)

# In frame_view.py:
_FABRIC_SUPPORTED_DEVICES = ("cpu", "cuda", "cuda:0")

This was previously in fabric_frame_view.py. Consider:

  • Making this a module-level constant that can be imported
  • Or document why the factory owns this knowledge

2. Device Argument Extraction (frame_view.py:58-60)

device = kwargs.get("device", "cpu")
if len(args) >= 2:
    device = args[1]

This relies on positional argument order. Consider adding a comment documenting the expected signature or using keyword-only enforcement.

3. Return Type Consistency (ServiceLocator.__getitem__)

def __getitem__(self, cls: type[_T]) -> _T | None:
    return self._services.get(cls)  # returns None if missing

Returning None for missing keys differs from standard dict behavior (which raises KeyError). This is documented but worth noting - callers must always handle None.

4. No SettingsManager Import Removed from FabricFrameView

The import of SettingsManager was removed from fabric_frame_view.py since Fabric checks moved to the factory. 👍


📋 Checklist

  • Code follows Isaac Lab style guidelines
  • Type hints present and correct
  • Docstrings provided for public APIs
  • Unit tests added for new functionality
  • Changelog fragment included
  • No obvious regressions

Verdict: Approve

This is a clean refactor that improves code organization by moving dispatch logic to the factory where it belongs. The service locator is a useful addition for managing backend-specific singletons. Minor suggestions above are non-blocking.


Reviewed at commit: 4c71366

@pv-nvidia pv-nvidia force-pushed the pv/fabric-view-no-fallback branch from 4c71366 to 5e6be23 Compare May 20, 2026 14:10
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

PR #5673: refactor: move Fabric/USD dispatch from FabricFrameView to FrameView factory

Summary

This PR refactors the backend dispatch logic by moving Fabric/USD selection from FabricFrameView.__init__ to the FrameView factory's _get_backend() method. This is a clean separation-of-concerns improvement.

Changes:

  • frame_view.py: Factory now handles Fabric-enabled check and device support validation
  • fabric_frame_view.py: Simplified to only handle Fabric operations (assumes Fabric is available)

✅ What's Good

  1. Clean Single Responsibility

    • FabricFrameView no longer pretends to be one thing while sometimes behaving as another
    • Each implementation (Fabric, USD, Newton) now has one clear responsibility
    • Eliminates the Liskov substitution concern mentioned in the PR description
  2. Factory Owns Dispatch Logic

    • _get_backend() now centralizes all backend selection decisions
    • Device support check moved to the right location (factory, not implementation)
    • Warning message preserved for unsupported device configurations
  3. Code Reduction

    • Net deletion of ~35 lines from fabric_frame_view.py
    • Removed conditional branches (if not self._use_fabric) throughout the class
    • Removed the internal _use_fabric flag and related imports
  4. No Behavioral Change for Callers

    • Same code paths execute, just dispatched from a cleaner location
    • Warning message text identical to original

🔍 Findings

1. Behavior Change When SimulationContext is None (Medium)

Location: frame_view.py:56

if ctx is None:
    return "usd"  # Previously returned "physx"

The old _get_backend() returned "physx" when no context existed, which would instantiate FabricFrameView (with its internal fallback). Now it returns "usd" directly.

Risk: Code that instantiates FrameView before SimulationContext exists will now get UsdFrameView instead of FabricFrameView. This is likely the correct behavior (no context = no physics = USD fallback), but worth documenting or testing.

2. Device Argument Extraction Relies on Positional Order (Low)

Location: frame_view.py:70-72

device = kwargs.get("device", "cpu")
if len(args) >= 2:
    device = args[1]

This extraction relies on the assumption that device is the second positional argument. Consider:

  • Adding a brief comment documenting the expected signature: FrameView(prim_path, device, ...)
  • Or using inspect.signature for more robust extraction (though likely overkill for internal code)

3. Multi-GPU Device Support (Informational)

Location: frame_view.py:22

_FABRIC_SUPPORTED_DEVICES = ("cpu", "cuda", "cuda:0")

The original code had a TODO about extending to cuda:N for multi-GPU support. This constant was moved but the limitation persists. Not a regression—just noting this is still a future consideration.

4. Eager Registration Pattern is Good (Positive)

Location: frame_view.py:100-102

# Eagerly register UsdFrameView — it lives in isaaclab, not a backend package
FrameView.register("usd", UsdFrameView)

This is the right approach since FactoryBase's dynamic import can't discover UsdFrameView in the standard location.


📋 Checklist

  • Code follows Isaac Lab style guidelines
  • Type hints present and correct
  • Docstrings updated to reflect new behavior
  • No obvious regressions
  • Changelog fragment (CI reports missing—expected for pure refactors)

Verdict

This is a well-executed refactor that improves code organization by moving dispatch logic to the factory where it belongs. The SimulationContext is None behavior change (#1) is worth considering but is likely correct. Minor suggestion for documenting positional arg extraction (#2) is non-blocking.


Reviewed at commit: 5e6be23


Update (3d5caa6): ✅ Changes fully address the refactor goals and improve multi-GPU support.

Incremental Changes Reviewed

  1. Multi-GPU Device Support Added (Positive)

    • New _is_fabric_supported_device() helper in frame_view.py now accepts any cuda:N device
    • Resolves the TODO mentioned in Finding #3 of original review
    • Proper validation via int(device.split(":", 1)[1]) to reject invalid device strings
  2. Complete Factory Dispatch (Positive)

    • All if not self._use_fabric conditionals removed from FabricFrameView
    • _use_fabric flag and related imports fully eliminated
    • FabricFrameView is now instantiated only when Fabric path is appropriate
  3. Docstring Updates (Positive)

    • Updated FabricFrameView docstring correctly states it only handles Fabric operations
    • Device argument documentation expanded to cover cuda:1, cuda:2, etc.
    • Factory docstring clarifies the three PhysX dispatch paths
  4. Test Coverage (Informational)

    • test_views_xform_prim_fabric.py adds multi-GPU test cases for cuda:1
    • Tests properly guarded by ISAACLAB_TEST_MULTI_GPU env var and torch.cuda.device_count() check

Minor Note

The warning message in the factory differs slightly from the original:

  • Original: "only supported on cpu, cuda, cuda:0" (joined _fabric_supported_devices)
  • New: "only supported on cpu and cuda:<N> devices"

This is fine — the new message is more accurate since any CUDA index is now supported.

Verdict

The incremental changes complete the refactor cleanly. Multi-GPU support is a nice bonus. LGTM.

@isaaclab-review-bot isaaclab-review-bot Bot mentioned this pull request May 20, 2026
7 tasks
kellyguo11 pushed a commit that referenced this pull request May 20, 2026
# Description

Removes the `cuda:0`-only restriction in `FabricFrameView`. USDRT
`SelectPrims` now accepts any CUDA device index, so Fabric acceleration
runs on the simulation device (e.g., `cuda:1`) instead of silently
falling back to the slower USD path. This unblocks distributed training
where each process is pinned to a specific GPU.

Changes:

- **Drop device allowlist.** Removes `_fabric_supported_devices`, the
device guard in `__init__`, and the corresponding assertion in
`_initialize_fabric`. Any CUDA device (or CPU) now works.
- **Multi-GPU test coverage.** Three `cuda:1`-parameterized tests gated
by `ISAACLAB_TEST_MULTI_GPU=1` env var, plus a dedicated CI workflow on
the multi-GPU runner that sets it.
- **Fix deprecated `wp.to_torch()` calls.** Replaced with `.torch`
accessor on ProxyArray (avoids DeprecationWarning).
- **TODOs for follow-up PRs.**: 
   - #5673 
   - #5674 

## Type of change

- New feature (non-breaking change which adds functionality)

`cuda:0` continues to work exactly as before; `cuda:1`+ now also works
instead of silently falling back to USD. No public API surface changed.

## Checklist

- [x] I have read and understood the [contribution
guidelines](https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html)
- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there

> Note: this PR uses a fragment file at
`source/isaaclab_physx/changelog.d/feat-frame-view-enable-mgpu.rst` per
the fragment-based changelog system.

## Test plan

Three new tests gated by `ISAACLAB_TEST_MULTI_GPU=1` and parameterized
with `["cuda:1"]`:

- `test_fabric_cuda1_world_pose_roundtrip` — `set_world_poses` →
`get_world_poses` returns the same values on a non-primary CUDA device.
- `test_fabric_cuda1_no_usd_writeback` — Fabric writes on `cuda:1` do
not write back to USD.
- `test_fabric_cuda1_scales_roundtrip` — covers the `set_scales` write
path on `cuda:1`.

A dedicated CI workflow (`test-fabric-multi-gpu.yaml`) runs on the
`[self-hosted, linux, x64, gpu, multi-gpu]` runner with
`ISAACLAB_TEST_MULTI_GPU=1` set. Pre-flights with `nvidia-smi` and
`torch.cuda.device_count()`, fails loudly if the runner has < 2 GPUs.

To verify locally on a multi-GPU machine:

```bash
ISAACLAB_TEST_MULTI_GPU=1 ./isaaclab.sh -p -m pytest \
    source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v
```

To verify the `cuda:0` path is unchanged (multi-GPU tests auto-skip):

```bash
./isaaclab.sh -p -m pytest \
    source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v
```
…factory

FabricFrameView had an internal _use_fabric flag that fell back to
UsdFrameView when Fabric was disabled or the device was unsupported.
This violated single-responsibility: FabricFrameView pretended to be
one class but sometimes behaved as another.

Now the FrameView factory handles all dispatch:
- PhysX + Fabric enabled + supported device → FabricFrameView
- PhysX without Fabric (or unsupported device) → UsdFrameView
- Newton → NewtonSiteFrameView

FabricFrameView no longer checks _use_fabric or _fabric_supported_devices.
It assumes Fabric is available (the factory guarantees this).

UsdFrameView is eagerly registered on the factory since it lives in
isaaclab (not a backend package), so FactoryBase's dynamic import
(isaaclab_{backend}.sim.views) can't discover it.
@pv-nvidia pv-nvidia force-pushed the pv/fabric-view-no-fallback branch from 5e6be23 to 3d5caa6 Compare May 22, 2026 00:05
@pv-nvidia
Copy link
Copy Markdown
Contributor Author

Superseded by the consolidated PR #5728 (pv/fabric-full-stack).

@pv-nvidia pv-nvidia closed this May 22, 2026
@pv-nvidia pv-nvidia reopened this May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant