Workflow checkpoints are not restorable across version upgrades #1611

@saikir1994

Workflow checkpoints are not restorable across SDK upgrades — TypeId uses Assembly.FullName (incl. version) for executor type matching

Labels: bug, workflows, checkpointing

Summary

Workflow checkpoint restore fails with:

System.IO.InvalidDataException: The specified checkpoint is not compatible with the workflow associated with this runner.

…whenever the Microsoft.Agents.AI.Workflows (or other executor/port type-owning) assembly version changes between the run that wrote the checkpoint and the run that restores it — e.g. after any package upgrade and redeploy. The workflow topology, executor IDs, and state shape are all unchanged; only the assembly version differs.

Because the agent SDK is in fast-moving preview (we've taken 1.3.0 → 1.6.1 → 1.6.2 → 1.8.0 → 1.9.0 over ~5 weeks), every upgrade silently invalidates all previously persisted checkpoints, breaking in-flight conversations/workflows that resume after a deploy.

Root cause

Checkpoint/workflow compatibility is gated by WorkflowInfo.IsMatch, which compares each executor's type via ExecutorInfo and its TypeId. TypeId identity uses Assembly.FullName, which embeds Version, Culture, and PublicKeyToken:

// TypeId
public TypeId(Type type)
    : this(type.Assembly.FullName, type.FullName) { }   // AssemblyName = "...Version=1.8.0.0, Culture=..., PublicKeyToken=..."

public bool IsMatch(Type type)
{
    if (AssemblyName == type.Assembly.FullName)          // <-- version-sensitive comparison
        return TypeName == type.FullName;
    return false;
}

The runner serializes these TypeIds into the checkpoint and, on restore, re-derives them from the currently loaded assemblies:

// InProcessRunner.RestoreCheckpointCoreAsync
Checkpoint checkpoint = await CheckpointManager.LookupCheckpointAsync(SessionId, checkpointInfo);
if (!CheckWorkflowMatch(checkpoint))                      // checkpoint.Workflow.IsMatch(Workflow)
{
    throw new InvalidDataException(
        "The specified checkpoint is not compatible with the workflow associated with this runner.");
}

So a checkpoint written under ...Version=1.8.0.0 can never match a runner whose executor/port types now resolve to ...Version=1.9.0.0, even though the types (namespace + name) and serialized state are identical.
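
To show the version sensitivity in isolation, here is a minimal sketch that needs no SDK reference. It parses two assembly display names that differ only in Version and compares them the way TypeId.IsMatch does. The version numbers are the ones from this issue; PublicKeyToken=null is a placeholder, not the real token.

```csharp
using System;
using System.Reflection;

// Display names as they would be captured under two SDK versions.
// PublicKeyToken=null is a placeholder for illustration only.
var written = new AssemblyName(
    "Microsoft.Agents.AI.Workflows, Version=1.8.0.0, Culture=neutral, PublicKeyToken=null");
var restored = new AssemblyName(
    "Microsoft.Agents.AI.Workflows, Version=1.9.0.0, Culture=neutral, PublicKeyToken=null");

// Full display name comparison (what Assembly.FullName equality amounts to).
bool fullNameMatch = written.FullName == restored.FullName;

// Simple name comparison (what Assembly.GetName().Name would give).
bool simpleNameMatch = written.Name == restored.Name;

Console.WriteLine($"FullName equal: {fullNameMatch}");      // FullName equal: False
Console.WriteLine($"Simple name equal: {simpleNameMatch}"); // Simple name equal: True
```

The simple name is stable across the upgrade, while the full display name changes with every version bump.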

Steps to reproduce

  1. Build a workflow whose executors are framework-provided (e.g. any agent bound via AsAIAgent(...).WithCheckpointing(...)), run a turn, and persist a checkpoint via an ICheckpointStore/JsonCheckpointStore.
  2. Upgrade Microsoft.Agents.AI.Workflows to any different version (patch/minor/major) — or otherwise change the assembly version.
  3. Reconstruct the same workflow and call RestoreCheckpointAsync (or resume the workflow agent) with the previously stored CheckpointInfo.

Expected: Restore succeeds, since the workflow shape and state are unchanged.
Actual: InvalidDataException: The specified checkpoint is not compatible with the workflow associated with this runner.

Impact

  • Any host that persists workflow checkpoints across process restarts/deploys (the intended durability use case) loses all existing checkpoints on every SDK bump.
  • For interactive multi-turn agents, this surfaces as a hard, unrecoverable error on the first turn after a deploy: the conversation is effectively bricked unless the app matches on the exception message string and resets the conversation.
  • The failure is opaque: it's a generic InvalidDataException with a message string, with no indication that an assembly version mismatch (vs. a genuine topology change) caused it, and no machine-readable detail about which executor/type diverged.
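
For reference, the only host-side workaround we have today looks roughly like the sketch below. TryResumeAsync, FailingRestore, and StartFresh are hypothetical names; the two delegates stand in for the real restore call and for whatever "start a new conversation" means in the host. The filter matches on message text, which is exactly the fragility described above.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Stand-in for the SDK restore call: fails with the exception quoted in this issue.
Task FailingRestore() => Task.FromException(new InvalidDataException(
    "The specified checkpoint is not compatible with the workflow associated with this runner."));

// Stand-in for the host's reset path (e.g. discard the checkpoint, start a new session).
bool resetCalled = false;
Task StartFresh() { resetCalled = true; return Task.CompletedTask; }

bool resumed = await TryResumeAsync(FailingRestore, StartFresh);
Console.WriteLine($"resumed={resumed}, reset={resetCalled}"); // resumed=False, reset=True

// Hypothetical host helper: try to restore, fall back to a fresh start on mismatch.
async Task<bool> TryResumeAsync(Func<Task> restore, Func<Task> startFresh)
{
    try
    {
        await restore();
        return true;
    }
    // String matching is the only discriminator available; any other
    // InvalidDataException is deliberately left to propagate.
    catch (InvalidDataException ex)
        when (ex.Message.Contains("checkpoint is not compatible", StringComparison.Ordinal))
    {
        await startFresh();
        return false;
    }
}
```

This loses the in-flight state and breaks if the message wording ever changes, so it is a stopgap rather than a fix.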

Suggested fixes (in rough priority order)

  1. Don't include assembly version in type identity for matching. Match on Type.FullName (namespace + type name), and optionally Assembly.GetName().Name (simple name) — not Assembly.FullName. This makes checkpoints portable across version-only changes while still distinguishing genuinely different types.
  2. Make type compatibility pluggable. Allow callers to supply an ITypeCompatibilityResolver/comparer (or a TypeId matching policy: Exact vs NameOnly vs NameAndSimpleAssembly) so hosts can opt into version-tolerant restore.
  3. Add a checkpoint compatibility/version envelope with a documented forward/backward-compatibility contract, instead of relying on Assembly.FullName equality as an implicit schema check.
  4. At minimum, fail better. Throw a typed, catchable exception (e.g. WorkflowCheckpointMismatchException) that includes the specific diff (expected vs. actual TypeId/executor id), so hosts can distinguish "incompatible SDK version" from "topology actually changed" and react deterministically rather than string-matching the message.
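
A rough sketch of what fixes 1 and 2 could look like together. SketchTypeId is a hypothetical stand-in, not the SDK's TypeId, and TypeMatchPolicy simply takes the policy names floated in fix 2. An older build is simulated by rewriting the Version on a copy of a loaded assembly's name.

```csharp
using System;
using System.Reflection;

// Simulate a TypeId serialized while an older build of the assembly was loaded:
// same type, same simple assembly name, different Version.
Type current = typeof(string);
AssemblyName older = current.Assembly.GetName(); // GetName() returns a copy
older.Version = new Version(0, 0, 0, 1);
var stored = new SketchTypeId(older.FullName, current.FullName!);

// Today's behavior: a version-only difference is rejected.
Console.WriteLine(stored.IsMatch(current, TypeMatchPolicy.Exact));                 // False
// Version-tolerant: simple assembly name + type full name.
Console.WriteLine(stored.IsMatch(current, TypeMatchPolicy.NameAndSimpleAssembly)); // True
// A genuinely different type is still rejected under the loosest policy.
Console.WriteLine(stored.IsMatch(typeof(Version), TypeMatchPolicy.NameOnly));      // False

// Policy names taken from fix 2; the enum itself is hypothetical.
public enum TypeMatchPolicy { Exact, NameOnly, NameAndSimpleAssembly }

// Hypothetical stand-in for the SDK's TypeId, holding the two serialized strings.
public sealed record SketchTypeId(string AssemblyFullName, string TypeName)
{
    public bool IsMatch(Type type, TypeMatchPolicy policy)
    {
        // The type's full name must match under every policy.
        if (TypeName != type.FullName) return false;

        return policy switch
        {
            TypeMatchPolicy.Exact => AssemblyFullName == type.Assembly.FullName,
            TypeMatchPolicy.NameOnly => true,
            TypeMatchPolicy.NameAndSimpleAssembly =>
                new AssemblyName(AssemblyFullName).Name == type.Assembly.GetName().Name,
            _ => throw new ArgumentOutOfRangeException(nameof(policy)),
        };
    }
}
```

With a policy like NameAndSimpleAssembly, checkpoints that already store the full display name would remain readable, since the simple name can be parsed back out of it.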

Environment

  • Microsoft.Agents.AI.Workflows 1.9.0 (also observed on 1.6.x/1.8.x)
  • Runtime: .NET 10
  • Checkpoint store: custom JsonCheckpointStore (Azure Blob), but the matching logic is store-agnostic
  • OS/host: Linux containers (Azure Container Apps), one process per tenant; fails specifically on the first resume after a deploy that bumps the SDK
