Skip to content

feat: define child workflow state machine#99

Open
timl3136 wants to merge 7 commits into
cadence-workflow:mainfrom
timl3136:child-wf-1
Open

feat: define child workflow state machine#99
timl3136 wants to merge 7 commits into
cadence-workflow:mainfrom
timl3136:child-wf-1

Conversation

@timl3136
Copy link
Copy Markdown
Member

@timl3136 timl3136 commented May 5, 2026

What changed?
define child workflow state machine

Why?
essential step for implement child wf

How did you test it?
unit tests

Potential risks

Release notes

Documentation Changes

Comment thread cadence/_internal/workflow/statemachine/child_workflow_execution_state_machine.py Outdated
Comment thread cadence/_internal/workflow/statemachine/nondeterminism.py
Comment thread cadence/_internal/workflow/statemachine/child_workflow_execution_state_machine.py Outdated
Comment thread cadence/_internal/workflow/statemachine/child_workflow_execution_state_machine.py Outdated
Comment thread cadence/_internal/workflow/statemachine/child_workflow_execution_state_machine.py Outdated
Comment thread cadence/_internal/workflow/statemachine/child_workflow_execution_state_machine.py Outdated
Comment thread cadence/_internal/workflow/statemachine/child_workflow_execution_state_machine.py Outdated
Comment thread cadence/_internal/workflow/statemachine/child_workflow_execution_state_machine.py Outdated
Comment thread cadence/_internal/workflow/statemachine/nondeterminism.py
Comment thread cadence/error.py Outdated
Comment thread tests/cadence/_internal/workflow/statemachine/test_decision_manager.py Outdated
@timl3136 timl3136 force-pushed the child-wf-1 branch 2 times, most recently from 8d1beff to ef8f7e5 Compare May 11, 2026 22:12
@codecov
Copy link
Copy Markdown

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 95.09202% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...internal/workflow/statemachine/decision_manager.py 78.94% 2 Missing and 2 partials ⚠️
...internal/workflow/statemachine/event_dispatcher.py 75.00% 2 Missing and 2 partials ⚠️
Files with missing lines Coverage Δ
...emachine/child_workflow_execution_state_machine.py 100.00% <100.00%> (ø)
.../_internal/workflow/statemachine/nondeterminism.py 91.77% <100.00%> (+1.05%) ⬆️
...internal/workflow/statemachine/decision_manager.py 90.90% <78.94%> (-0.69%) ⬇️
...internal/workflow/statemachine/event_dispatcher.py 75.86% <75.00%> (+4.75%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread cadence/_internal/workflow/statemachine/child_workflow_execution_state_machine.py Outdated
Comment thread cadence/_internal/workflow/statemachine/event_dispatcher.py
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 11, 2026

Gitar is not allowed to push to this forked PR.

Comment on lines +51 to +60
if self.state is DecisionState.REQUESTED:
return decision.Decision(
start_child_workflow_execution_decision_attributes=self.request
)
if self.state is DecisionState.CANCELED_AFTER_REQUESTED:
return record_immediate_cancel(self.request)
if self.state in (
DecisionState.CANCELED_AFTER_RECORDED,
DecisionState.CANCELED_AFTER_STARTED,
):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use switch in python match

id_for_event = getattr(event_attributes, action.id_attr)
# This may be a reference via the user id or a reference to a previous event.
# Supports dotted paths (e.g. "workflow_execution.workflow_id") for nested fields.
id_for_event = resolve_id_attr(event_attributes, action.id_attr)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a comment: we might want to revisit to simplify the logic. The complexity is to point an event to a statemachine id.
In the past, it's simply resolved by alias logic. Now with childworkflow, the mapping logic becomes more complicated.

)
machine = ChildWorkflowExecutionStateMachine(attrs, execution, result)
self._add_state_machine(machine)
return execution, result
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need the result here?

Comment thread cadence/error.py Outdated
Comment on lines +152 to +183
class ChildWorkflowError(Exception):
"""Base class for all child workflow lifecycle errors."""

pass


class StartChildWorkflowExecutionFailed(ChildWorkflowError):
def __init__(self, message: str, cause: Any, workflow_id: str) -> None:
super().__init__(message)
self.cause = cause
self.workflow_id = workflow_id


class ChildWorkflowExecutionFailed(ChildWorkflowError):
def __init__(self, message: str, failure: Any) -> None:
super().__init__(message)
self.failure = failure


class ChildWorkflowExecutionCanceled(ChildWorkflowError):
def __init__(self, message: str, details: Any) -> None:
super().__init__(message)
self.details = details


class ChildWorkflowExecutionTimedOut(ChildWorkflowError):
def __init__(self, message: str, timeout_type: int) -> None:
super().__init__(message)
self.timeout_type = timeout_type


class ChildWorkflowExecutionTerminated(ChildWorkflowError):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All errors in this file is exposed to Client. We don't have APIs on child workflows and thus we don't need it here. The suggestion would be to move to where the error is needed.

Comment thread idls
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you intended to update this

Copy link
Copy Markdown
Member

@natemort natemort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, +1 to Shijie's comments.

Comment thread cadence/workflow.py
* A method can only be one of @workflow.run, @workflow.signal, or @workflow.query.

Args:
name: The name of the query type. If not provided, use the function name.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: Query decorator docstring contradicts implementation

The docstring for query() states "If not provided, use the function name" (line 391), but the implementation immediately raises ValueError("name is required") when name is None (line 399-400). The docstring should be updated to match the actual behavior, or the implementation should fall back to f.__name__ like the docstring promises.

The signal decorator has the same pattern (requires name), so this is consistent behavior but the docstring is misleading.

Fix 1: Fix docstring to match implementation (name is required)
Args:
    name: The name of the query type. Required.
  • Apply fix
Fix 2: Alternatively, implement the fallback to function name as documented
if name is None:
    def decorator(f: T) -> T:
        f._workflow_query = f.__name__  # type: ignore[attr-defined]
        return f
    return decorator

def decorator(f: T) -> T:
    f._workflow_query = name  # type: ignore[attr-defined]
    return f

return decorator
  • Apply fix

Check a box to apply a fix or reply for a change | Was this helpful? React with 👍 / 👎

timl3136 added 6 commits May 19, 2026 14:38
Signed-off-by: Tim Li <ltim@uber.com>
Signed-off-by: Tim Li <ltim@uber.com>
Signed-off-by: Tim Li <ltim@uber.com>
Signed-off-by: Tim Li <ltim@uber.com>
Signed-off-by: Tim Li <ltim@uber.com>
Signed-off-by: Tim Li <ltim@uber.com>
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 20, 2026

CI failed: The test `test_async_signal_handler_direct_mutation` failed due to a workflow timeout, likely caused by regressions in the new state machine logic introduced in this PR.

Overview

The integration test suite failed during the test_async_signal_handler_direct_mutation test case, resulting in an AssertionError due to a workflow timeout. Out of 502 total tests, 500 passed, 1 was skipped, and 1 failed.

Failures

Workflow Timeout in test_async_signal_handler_direct_mutation (confidence: medium)

  • Type: test
  • Affected jobs: 76862667323
  • Related to change: yes
  • Root cause: The new state machine logic for child workflows appears to be interfering with asynchronous signal processing, causing the workflow to deadlock or fail to complete within the 10-second timeout period.
  • Suggested fix: Investigate the event dispatching logic in cadence/_internal/workflow/statemachine/event_dispatcher.py and decision_manager.py. Verify that the new child_workflow_execution_state_machine.py implementation correctly yields control and does not block signal handler execution.

Summary

  • Change-related failures: 1 integration test failure linked to the new child workflow state machine implementation.
  • Infrastructure/flaky failures: 0
  • Recommended action: Review the changes made to the state machine architecture in this PR, specifically focusing on event loop handling and potential blocking operations within child workflow transitions. Debug the AsyncSignalDirectWorkflow execution to identify if the signal handler is reaching completion.
Code Review 👍 Approved with suggestions 5 resolved / 6 findings

Defines the child workflow state machine with robust event dispatching and replay determinism. Update the query decorator docstring to match the current implementation which requires an explicit name.

💡 Quality: Query decorator docstring contradicts implementation

📄 cadence/workflow.py:391 📄 cadence/workflow.py:399-400

The docstring for query() states "If not provided, use the function name" (line 391), but the implementation immediately raises ValueError("name is required") when name is None (line 399-400). The docstring should be updated to match the actual behavior, or the implementation should fall back to f.__name__ like the docstring promises.

The signal decorator has the same pattern (requires name), so this is consistent behavior but the docstring is misleading.

Fix docstring to match implementation (name is required)
Args:
    name: The name of the query type. Required.
Alternatively, implement the fallback to function name as documented
if name is None:
    def decorator(f: T) -> T:
        f._workflow_query = f.__name__  # type: ignore[attr-defined]
        return f
    return decorator

def decorator(f: T) -> T:
    f._workflow_query = name  # type: ignore[attr-defined]
    return f

return decorator
✅ 5 resolved
Bug: handle_cancel_initiated routing will fail at runtime

📄 cadence/_internal/workflow/statemachine/child_workflow_execution_state_machine.py:168-178
The handle_cancel_initiated handler uses @child_workflow_events.event("workflow_execution") as its id_attr. At dispatch time, DecisionManager.handle_history_event will do getattr(attrs, "workflow_execution") which yields a WorkflowExecution protobuf object. The alias lookup uses (decision_type, id_for_event) as the key, so it will look for (CHILD_WORKFLOW, <WorkflowExecution object>). However, the state machine is registered under (CHILD_WORKFLOW, "child-wf-1") (a string). These won't match, so the event will never be routed to the state machine.

Additionally, since handle_cancel_initiated doesn't set event_id_is_alias=True, the cancel-initiated event's ID is never registered as an alias. This means handle_cancel_failed (which looks up by "initiated_event_id") will also fail to route.

The tests pass because they call sm.handle_cancel_initiated(...) directly, bypassing the dispatcher routing.

Bug: StartChildWorkflowExecutionFailed uses CANCEL expectation incorrectly

📄 cadence/_internal/workflow/statemachine/nondeterminism.py:288-293
In nondeterminism.py, the to_expectation registration for StartChildWorkflowExecutionFailedEventAttributes uses CANCEL (i.e., {"canceled": True}) as its properties. However, CANCEL is semantically used for cancellation operations (e.g., ActivityTaskCancelRequestedEventAttributes, CancelTimerFailedEventAttributes). An initiation failure is not a cancellation — it should probably carry the same properties as the initiation expectation ({"workflow_type": attrs.workflow_type.name}) to properly enforce determinism. If a replayed history has a start-child-failed event but the new code expects a different workflow type, the current CANCEL expectation won't catch that mismatch.

Bug: handle_started crashes if execution future was already cancelled

📄 cadence/_internal/workflow/statemachine/child_workflow_execution_state_machine.py:80-83 📄 cadence/_internal/workflow/statemachine/child_workflow_execution_state_machine.py:111-116
In request_cancel() when state is RECORDED (line 80-83), execution.force_cancel() is called. However, if a ChildWorkflowExecutionStarted event subsequently arrives (which is possible since the cancel decision is sent asynchronously), handle_started (line 114-116) calls self.execution.set_result(...) on the already-cancelled future, which raises asyncio.InvalidStateError.

The scenario: Initiated → RECORDED → user cancels → CANCELED_AFTER_RECORDED (execution force-cancelled) → server already recorded Started event → handle_started tries set_result → crash.

Similarly, if handle_cancel_failed reverts to STARTED and then handle_started is somehow re-dispatched, the same issue would occur.

Quality: Test helper has unused initiated_event_id parameter

📄 tests/cadence/_internal/workflow/statemachine/test_decision_manager.py:279-288
The child_wf_initiated helper function accepts an initiated_event_id keyword argument that is never used in the function body. This is misleading because callers pass values like initiated_event_id=1, suggesting the parameter configures something, but it has no effect. The StartChildWorkflowExecutionInitiatedEventAttributes proto doesn't have an initiated_event_id field — it IS the initiated event, and its event_id is set via the outer HistoryEvent(event_id=...).

Bug: Empty event_id_attr causes AttributeError at dispatch time

📄 cadence/_internal/workflow/statemachine/event_dispatcher.py:35-36 📄 cadence/_internal/workflow/statemachine/event_dispatcher.py:41 📄 cadence/_internal/workflow/statemachine/event_dispatcher.py:47-55 📄 cadence/_internal/workflow/statemachine/decision_manager.py:178
The EventDispatcher.__init__ now defaults default_id_attr to "", and _validate_field is skipped when event_id_attr is falsy. However, the empty string is still stored in Action.id_attr. When DecisionManager.handle_history_event later calls resolve_id_attr(event_attributes, action.id_attr) with an empty string, "".split(".") returns [''], causing getattr(obj, '') which raises AttributeError.

Either resolve_id_attr needs an early return for empty paths, or handle_history_event needs to guard against empty id_attr before calling resolve_id_attr.

🤖 Prompt for agents
Code Review: Defines the child workflow state machine with robust event dispatching and replay determinism. Update the query decorator docstring to match the current implementation which requires an explicit name.

1. 💡 Quality: Query decorator docstring contradicts implementation
   Files: cadence/workflow.py:391, cadence/workflow.py:399-400

   The docstring for `query()` states "If not provided, use the function name" (line 391), but the implementation immediately raises `ValueError("name is required")` when `name is None` (line 399-400). The docstring should be updated to match the actual behavior, or the implementation should fall back to `f.__name__` like the docstring promises.
   
   The `signal` decorator has the same pattern (requires name), so this is consistent behavior but the docstring is misleading.

   Fix (Fix docstring to match implementation (name is required)):
   Args:
       name: The name of the query type. Required.

   Fix (Alternatively, implement the fallback to function name as documented):
   if name is None:
       def decorator(f: T) -> T:
           f._workflow_query = f.__name__  # type: ignore[attr-defined]
           return f
       return decorator
   
   def decorator(f: T) -> T:
       f._workflow_query = name  # type: ignore[attr-defined]
       return f
   
   return decorator

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants