[BUG] cancel_node in BeforeNodeCallEvent raises RuntimeError that kills the entire graph on resume #2240

@yananym

Description

Checks

  • I have updated to the latest minor and patch version of Strands
  • I have checked the documentation and this is not expected behavior
  • I have searched ./issues and there are no duplicates of my issue

Strands Version

1.32.0

Python Version

3.13

Operating System

15.6.1

Installation Method

pip

Steps to Reproduce

  1. Create a linear graph with 3 nodes: step_a → step_b → step_c
  2. step_a is an INPUT agent that calls interrupt() to pause for user input
  3. step_b has a BeforeNodeCallEvent hook that sets cancel_node = True based on runtime state
  4. step_c is a normal agent that should execute after step_b is skipped
  5. Add a FileSessionManager or S3SessionManager for persistence
  6. Turn 1: Call graph("task"). step_a interrupts and the graph pauses. This works as expected.
  7. Turn 2: Resume with graph(responses, invocation_state={"extracted": {"skip_step_b": True}}). step_a completes, the graph reaches step_b, and the hook sets cancel_node = True.
  8. Result: RuntimeError("node cancelled by user") is raised at graph.py:~896, propagates through _execute_nodes_parallel, and kills the entire graph. step_c never executes. The graph status becomes FAILED instead of continuing.

The damage only shows up on resume (Turn 2). On a fresh start without interrupts, cancel_node also raises, but the graph has not persisted any state yet, so there is nothing to corrupt. On resume, the crash leaves the workflow in a FAILED state with no recovery path.
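For reference, the hook from step 3 can be sketched roughly as follows. This is a stdlib-only illustration, not Strands code: FakeBeforeNodeCallEvent is a hypothetical stand-in for BeforeNodeCallEvent, and the assumption that the event exposes node_id and invocation_state is mine. resolve_cancel_message mirrors the message selection quoted from graph.py later in this issue.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union


@dataclass
class FakeBeforeNodeCallEvent:
    """Hypothetical stand-in for the real event; only the fields used here."""

    node_id: str
    invocation_state: Dict[str, Any] = field(default_factory=dict)
    cancel_node: Union[bool, str] = False


def skip_step_b_hook(event: FakeBeforeNodeCallEvent) -> None:
    """Set cancel_node when runtime state asks for step_b to be skipped."""
    extracted = event.invocation_state.get("extracted", {})
    if event.node_id == "step_b" and extracted.get("skip_step_b"):
        # A string doubles as the cancel message; True would use the default.
        event.cancel_node = "step_b skipped: skip_step_b flag set"


def resolve_cancel_message(cancel_node: Union[bool, str]) -> str:
    """Pick the cancel message the same way the quoted graph.py branch does."""
    return cancel_node if isinstance(cancel_node, str) else "node cancelled by user"


event = FakeBeforeNodeCallEvent("step_b", {"extracted": {"skip_step_b": True}})
skip_step_b_hook(event)
print(resolve_cancel_message(event.cancel_node))
```

In the real reproduction, the hook is registered for BeforeNodeCallEvent on the graph and the flag arrives through invocation_state on Turn 2.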

Expected Behavior

When BeforeNodeCallEvent.cancel_node = True is set:

  1. The node should be treated as successfully completed (or assigned a new SKIPPED status) for dependency resolution purposes
  2. Downstream nodes (step_c) should execute normally — the cancelled node should not block the graph
  3. The graph should continue to completion or the next interrupt point
  4. execution_order should either omit the skipped node or include it with a distinguishable status
  5. No exception should propagate — cancel_node is an intentional control flow decision, not an error
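The expected semantics can be illustrated with a toy linear executor. This is a minimal sketch under my own assumptions, not the Strands implementation: run_linear and the (node_id, status) tuples in execution_order are hypothetical names chosen for the example.

```python
from typing import Callable, List, Optional, Tuple


def run_linear(
    nodes: List[str],
    should_cancel: Callable[[str], Optional[str]],
) -> List[Tuple[str, str]]:
    """Toy linear executor: a cancelled node is recorded and skipped, never raised."""
    execution_order: List[Tuple[str, str]] = []
    for node_id in nodes:
        reason = should_cancel(node_id)
        if reason is not None:
            # Cancellation is intentional control flow, so just record it and move on.
            execution_order.append((node_id, "skipped"))
            continue
        execution_order.append((node_id, "completed"))
    return execution_order


order = run_linear(
    ["step_a", "step_b", "step_c"],
    lambda node_id: "skip_step_b set" if node_id == "step_b" else None,
)
print(order)
```

Here step_c still runs after step_b is skipped, and a consumer of execution_order can tell the two outcomes apart.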

Actual Behavior

Setting cancel_node = True raises RuntimeError that terminates the entire graph:

# graph.py, _execute_node(), line ~896
if before_event.cancel_node:
    cancel_message = (
        before_event.cancel_node if isinstance(before_event.cancel_node, str) 
        else "node cancelled by user"
    )
    yield MultiAgentNodeCancelEvent(node.node_id, cancel_message)
    raise RuntimeError(cancel_message)  # ← kills the graph

The RuntimeError propagates:

  • _execute_node → _stream_node_to_queue (line ~790) → _execute_nodes_parallel (line ~752) → raise event
  • The graph catches this as an unrecoverable failure
  • record.status becomes FAILED
  • All downstream nodes are abandoned
  • The workflow cannot be resumed — the next user message starts a brand new workflow, losing all accumulated state
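The propagation path above can be modelled with a small asyncio sketch. It is a simplified toy, not the code in graph.py: the function names are borrowed loosely from the issue, and forwarding the exception through a queue before re-raising it is my assumption about the general shape of the mechanism.

```python
import asyncio
from typing import List, Tuple


async def execute_node(node_id: str, cancel: bool):
    """Toy version of the quoted branch: yield a cancel event, then raise."""
    if cancel:
        yield ("cancel", node_id)
        raise RuntimeError("node cancelled by user")
    yield ("complete", node_id)


async def stream_node_to_queue(node_id: str, cancel: bool, queue: asyncio.Queue) -> None:
    """Forward node events to the queue; an exception is forwarded as an item too."""
    try:
        async for event in execute_node(node_id, cancel):
            await queue.put(event)
    except Exception as exc:
        await queue.put(exc)
    finally:
        await queue.put(None)  # sentinel: this node's stream has ended


async def run_graph(plan: List[Tuple[str, bool]]):
    """Run nodes in order and re-raise any exception found on the queue."""
    seen = []
    status = "completed"
    try:
        for node_id, cancel in plan:
            queue: asyncio.Queue = asyncio.Queue()
            await stream_node_to_queue(node_id, cancel, queue)
            while (item := await queue.get()) is not None:
                if isinstance(item, Exception):
                    raise item  # the cancel surfaces as a graph-level failure
                seen.append(item)
    except RuntimeError:
        status = "failed"
    return status, seen


status, seen = asyncio.run(
    run_graph([("step_a", False), ("step_b", True), ("step_c", False)])
)
print(status, seen)
```

In this model the cancel event for step_b is still emitted, but the exception that follows it ends the run before step_c is scheduled, which matches the behaviour reported here.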

Additional Context

  • The cancel_node feature was introduced to support the BeforeNodeCallEvent hook, but its current implementation treats cancellation as a fatal error rather than a control flow mechanism.
  • This behavior is consistent across versions 1.32.0 through 1.38.0.
  • The related feature request #1346 ([FEATURE] Pass invocation_state to edge condition call) would provide an alternative path for conditional routing, but cancel_node should still work as a valid skip mechanism since it's exposed as a public API on the event object.
  • Our production workaround wraps skippable nodes in a no-op AgentBase implementation that checks the condition at call time and returns an empty AgentResult. This avoids cancel_node entirely but adds complexity and prevents proper skip tracking in execution_order.
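The workaround in the last bullet has roughly the following shape. This is a stdlib-only sketch with hypothetical stand-ins: FakeResult replaces AgentResult, EchoAgent replaces the wrapped agent, and SkippableNode models only a synchronous call, whereas a real AgentBase implementation would also need to cover the async entry points.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in for AgentResult; only carries text."""

    text: str = ""


class EchoAgent:
    """Hypothetical stand-in for the agent being wrapped."""

    def __call__(self, task: str, **kwargs: Any) -> FakeResult:
        return FakeResult(text=f"handled: {task}")


class SkippableNode:
    """Delegate to the inner agent unless should_skip(invocation_state) is true."""

    def __init__(self, inner: Callable[..., FakeResult], should_skip: Callable[[Dict[str, Any]], bool]):
        self.inner = inner
        self.should_skip = should_skip

    def __call__(
        self,
        task: str,
        invocation_state: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ) -> FakeResult:
        state = invocation_state or {}
        if self.should_skip(state):
            # Empty result: the node "runs" as a no-op, so the graph keeps going.
            return FakeResult()
        return self.inner(task, invocation_state=state, **kwargs)


node = SkippableNode(
    EchoAgent(),
    lambda state: bool(state.get("extracted", {}).get("skip_step_b")),
)
print(node("task", invocation_state={"extracted": {"skip_step_b": True}}))
print(node("task"))
```

Because the skip happens inside the node, the graph records it as an ordinary completed node, which is the loss of skip tracking mentioned above.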

Possible Solution

Replace the RuntimeError in _execute_node() with graceful completion. In graph.py line ~896:

Current:

if before_event.cancel_node:
    cancel_message = (
        before_event.cancel_node if isinstance(before_event.cancel_node, str) 
        else "node cancelled by user"
    )
    yield MultiAgentNodeCancelEvent(node.node_id, cancel_message)
    raise RuntimeError(cancel_message)

Proposed:

if before_event.cancel_node:
    cancel_message = (
        before_event.cancel_node if isinstance(before_event.cancel_node, str) 
        else "node cancelled by user"
    )
    logger.debug("reason=<%s> | skipping node execution", cancel_message)
    yield MultiAgentNodeCancelEvent(node.node_id, cancel_message)
    
    # Mark as completed so downstream nodes can proceed
    node.execution_status = Status.COMPLETED
    
    # Yield a minimal result so the graph can continue
    yield MultiAgentNodeCompleteEvent(
        node_id=node.node_id,
        result=AgentResult(
            stop_reason="end_turn",
            message={"role": "assistant", "content": [{"text": cancel_message}]},
            metrics=EventLoopMetrics(),
            state={},
        ),
    )
    return  # Exit cleanly instead of raising

This ensures:

  • The cancelled node is treated as completed for dependency resolution
  • Downstream nodes execute normally
  • execution_order includes the node (consumers can check MultiAgentNodeCancelEvent to distinguish skipped from executed)
  • No RuntimeError propagation — the graph continues

An alternative would be adding a Status.SKIPPED enum value that the graph treats identically to COMPLETED for edge traversal but is distinguishable in execution_order for observability.
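A rough sketch of that alternative, assuming a hypothetical NodeStatus enum and ready_nodes helper (neither is the actual Strands Status type or scheduler):

```python
from enum import Enum
from typing import Dict, List, Set


class NodeStatus(Enum):
    """Hypothetical status enum; SKIPPED is the proposed addition."""

    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"


# Statuses that satisfy a dependency for edge traversal.
SATISFIED = frozenset({NodeStatus.COMPLETED, NodeStatus.SKIPPED})


def ready_nodes(deps: Dict[str, Set[str]], status: Dict[str, NodeStatus]) -> List[str]:
    """Return pending nodes whose dependencies are all completed or skipped."""
    return sorted(
        node
        for node, parents in deps.items()
        if status[node] is NodeStatus.PENDING
        and all(status[parent] in SATISFIED for parent in parents)
    )


deps = {"step_a": set(), "step_b": {"step_a"}, "step_c": {"step_b"}}
status = {
    "step_a": NodeStatus.COMPLETED,
    "step_b": NodeStatus.SKIPPED,
    "step_c": NodeStatus.PENDING,
}
print(ready_nodes(deps, status))
```

Traversal treats SKIPPED like COMPLETED, while anything reading the status afterwards can still distinguish a skipped node from one that actually ran.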

Related Issues

No response

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working)
