[BUG] cancel_node in BeforeNodeCallEvent raises RuntimeError that kills the entire graph on resume

### Checks

- [x] I have updated to the latest minor and patch version of Strands
- [x] I have checked the documentation and this is not expected behavior
- [x] I have searched [./issues](./issues?q=) and there are no duplicates of my issue

### Strands Version

1.32.0

### Python Version

3.13

### Operating System

15.6.1

### Installation Method

pip

### Steps to Reproduce

1. Create a linear graph with 3 nodes: `step_a → step_b → step_c`
2. `step_a` is an INPUT agent that calls `interrupt()` to pause for user input
3. `step_b` has a `BeforeNodeCallEvent` hook that sets `cancel_node = True` based on runtime state
4. `step_c` is a normal agent that should execute after `step_b` is skipped
5. Add a `FileSessionManager` or `S3SessionManager` for persistence
6. **Turn 1**: Call `graph("task")` — `step_a` interrupts, graph pauses. Works fine.
7. **Turn 2**: Resume with `graph(responses, invocation_state={"extracted": {"skip_step_b": True}})` — `step_a` completes, graph reaches `step_b`, hook sets `cancel_node = True`
8. **Result**: `RuntimeError("node cancelled by user")` is raised at `graph.py:~896`, propagates through `_execute_nodes_parallel`, and kills the entire graph. `step_c` never executes. The graph status becomes FAILED instead of continuing.

The `RuntimeError` is raised on a fresh start without interrupts as well, but at that point the graph has not persisted any state, so there is nothing to corrupt. The damage shows up on **resume** (Turn 2): the crash leaves the persisted workflow in a FAILED state with no recovery path.

### Expected Behavior

When `BeforeNodeCallEvent.cancel_node = True` is set:

1. The node should be treated as **successfully completed** (or a new SKIPPED status) for dependency resolution purposes
2. Downstream nodes (`step_c`) should execute normally — the cancelled node should not block the graph
3. The graph should continue to completion or the next interrupt point
4. `execution_order` should either omit the skipped node or include it with a distinguishable status
5. No exception should propagate — `cancel_node` is an intentional control flow decision, not an error

### Actual Behavior

Setting `cancel_node = True` raises `RuntimeError` that terminates the entire graph:

```python
# graph.py, _execute_node(), line ~896
if before_event.cancel_node:
    cancel_message = (
        before_event.cancel_node if isinstance(before_event.cancel_node, str) 
        else "node cancelled by user"
    )
    yield MultiAgentNodeCancelEvent(node.node_id, cancel_message)
    raise RuntimeError(cancel_message)  # ← kills the graph
```

The `RuntimeError` propagates:
- `_execute_node` → `_stream_node_to_queue` (line ~790) → `_execute_nodes_parallel` (line ~752) → `raise event`
- The graph catches this as an unrecoverable failure
- `record.status` becomes `FAILED`
- All downstream nodes are abandoned
- The workflow cannot be resumed — the next user message starts a brand new workflow, losing all accumulated state
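To make the propagation path easier to follow, here is a toy model of it using only `asyncio`. It is not Strands code: `execute_node`, `stream_node_to_queue`, and `run_linear_graph` are hypothetical stand-ins named after the functions mentioned above, and the event tuples are made up for illustration. It only shows how an exception raised inside a node's async generator ends up failing the whole run.

```python
import asyncio


async def execute_node(node_id, cancel):
    """Toy stand-in for _execute_node: emits events, raises when cancelled."""
    if cancel:
        yield ("cancel", node_id)
        raise RuntimeError("node cancelled by user")
    yield ("complete", node_id)


async def stream_node_to_queue(node_id, cancel, queue):
    """Toy stand-in for _stream_node_to_queue: forwards events and errors."""
    try:
        async for event in execute_node(node_id, cancel):
            await queue.put(event)
    except Exception as exc:
        await queue.put(exc)  # the error travels through the queue
    finally:
        await queue.put(None)  # sentinel: this node is done


async def run_linear_graph(node_ids, cancelled):
    """Run nodes in order; any re-raised exception marks the run FAILED."""
    executed = []
    status = "COMPLETED"
    try:
        for node_id in node_ids:
            queue = asyncio.Queue()
            task = asyncio.create_task(
                stream_node_to_queue(node_id, node_id in cancelled, queue)
            )
            try:
                while (event := await queue.get()) is not None:
                    if isinstance(event, Exception):
                        raise event  # mirrors the `raise event` step above
                    if event[0] == "complete":
                        executed.append(node_id)
            finally:
                await task
    except RuntimeError:
        status = "FAILED"
    return status, executed


status, executed = asyncio.run(
    run_linear_graph(["step_a", "step_b", "step_c"], cancelled={"step_b"})
)
print(status, executed)
```

In this model, cancelling `step_b` ends the run as `FAILED` with only `step_a` executed, which matches the behavior described in this report.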

### Additional Context

- The `cancel_node` feature was introduced to support the `BeforeNodeCallEvent` hook, but its current implementation treats cancellation as a fatal error rather than a control flow mechanism.
- This behavior is consistent across versions 1.32.0 through 1.38.0.
- The related feature request #1346 (pass `invocation_state` to edge conditions) would provide an alternative path for conditional routing, but `cancel_node` should still work as a valid skip mechanism since it's exposed as a public API on the event object.
- Our production workaround wraps skippable nodes in a no-op `AgentBase` implementation that checks the condition at call time and returns an empty `AgentResult`. This avoids `cancel_node` entirely but adds complexity and prevents proper skip tracking in `execution_order`.
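For reference, the shape of the workaround is roughly the following. This is a simplified sketch, not our production code and not the Strands API: `StubResult` stands in for an empty `AgentResult`, `SkippableNode` stands in for the `AgentBase` wrapper, and the call signature is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class StubResult:
    """Hypothetical stand-in for an (empty) AgentResult."""
    text: str = ""


class SkippableNode:
    """Wraps an agent-like callable and short-circuits when the predicate says skip."""

    def __init__(
        self,
        inner: Callable[[str], StubResult],
        should_skip: Callable[[dict], bool],
    ) -> None:
        self.inner = inner
        self.should_skip = should_skip

    def __call__(
        self, task: str, invocation_state: Optional[dict[str, Any]] = None
    ) -> StubResult:
        state = invocation_state or {}
        if self.should_skip(state):
            # Return an empty result instead of touching cancel_node.
            return StubResult()
        return self.inner(task)


step_b = SkippableNode(
    inner=lambda task: StubResult(text=f"processed {task}"),
    should_skip=lambda state: state.get("extracted", {}).get("skip_step_b", False),
)

skipped = step_b("task", invocation_state={"extracted": {"skip_step_b": True}})
ran = step_b("task")
```

From the graph's point of view the wrapped node simply completed, which is why the skip cannot be distinguished in `execution_order`.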

### Possible Solution

Replace the `RuntimeError` in `_execute_node()` with graceful completion. In `graph.py` line ~896:

**Current:**
```python
if before_event.cancel_node:
    cancel_message = (
        before_event.cancel_node if isinstance(before_event.cancel_node, str) 
        else "node cancelled by user"
    )
    yield MultiAgentNodeCancelEvent(node.node_id, cancel_message)
    raise RuntimeError(cancel_message)
```

**Proposed:**
```python
if before_event.cancel_node:
    cancel_message = (
        before_event.cancel_node if isinstance(before_event.cancel_node, str) 
        else "node cancelled by user"
    )
    logger.debug("reason=<%s> | skipping node execution", cancel_message)
    yield MultiAgentNodeCancelEvent(node.node_id, cancel_message)
    
    # Mark as completed so downstream nodes can proceed
    node.execution_status = Status.COMPLETED
    
    # Yield a minimal result so the graph can continue
    yield MultiAgentNodeCompleteEvent(
        node_id=node.node_id,
        result=AgentResult(
            stop_reason="end_turn",
            message={"role": "assistant", "content": [{"text": cancel_message}]},
            metrics=EventLoopMetrics(),
            state={},
        ),
    )
    return  # Exit cleanly instead of raising
```

This ensures:
- The cancelled node is treated as completed for dependency resolution
- Downstream nodes execute normally
- `execution_order` includes the node (consumers can check `MultiAgentNodeCancelEvent` to distinguish skipped from executed)
- No `RuntimeError` propagation — the graph continues

An alternative would be adding a `Status.SKIPPED` enum value that the graph treats identically to `COMPLETED` for edge traversal but is distinguishable in `execution_order` for observability.
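A rough sketch of how the `SKIPPED` alternative could behave, written as a standalone toy rather than a patch against `graph.py`. Only `SKIPPED` is the proposed addition; the other enum members, `DEPENDENCY_SATISFIED`, and `ready_nodes` are illustrative names and do not claim to match the real implementation.

```python
from enum import Enum


class Status(Enum):
    """Toy status enum; SKIPPED is the proposed addition."""
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"


# Statuses that satisfy a dependency for edge traversal.
DEPENDENCY_SATISFIED = {Status.COMPLETED, Status.SKIPPED}


def ready_nodes(edges, statuses):
    """Return pending nodes whose upstream nodes are all completed or skipped."""
    ready = []
    for node, status in statuses.items():
        if status is not Status.PENDING:
            continue
        upstream = [src for src, dst in edges if dst == node]
        if all(statuses[src] in DEPENDENCY_SATISFIED for src in upstream):
            ready.append(node)
    return ready


edges = [("step_a", "step_b"), ("step_b", "step_c")]
statuses = {
    "step_a": Status.COMPLETED,
    "step_b": Status.SKIPPED,
    "step_c": Status.PENDING,
}

next_nodes = ready_nodes(edges, statuses)
# The skipped node stays visible, with a distinguishable status.
execution_order = [
    (node, status.value)
    for node, status in statuses.items()
    if status is not Status.PENDING
]
```

Here `step_c` becomes ready even though `step_b` never ran, and `execution_order` still records `step_b` as `skipped` rather than `completed`.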

### Related Issues

_No response_

[BUG] cancel_node in BeforeNodeCallEvent raises RuntimeError that kills the entire graph on resume #2240

Checks

Strands Version

Python Version

Operating System

Installation Method

Steps to Reproduce

Steps to Reproduce

Expected Behavior

Expected Behavior

Actual Behavior

Actual Behavior

Additional Context

Additional Context

Possible Solution

Possible Solution

Related Issues

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[BUG] cancel_node in BeforeNodeCallEvent raises RuntimeError that kills the entire graph on resume #2240

Description

Checks

Strands Version

Python Version

Operating System

Installation Method

Steps to Reproduce

Steps to Reproduce

Expected Behavior

Expected Behavior

Actual Behavior

Actual Behavior

Additional Context

Additional Context

Possible Solution

Possible Solution

Related Issues

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions