Skip to content

UN-3479 [FEAT] SDK subpackage for org-to-org data migration#15

Open
chandrasekharan-zipstack wants to merge 23 commits into
mainfrom
feat/org-migration
Open

UN-3479 [FEAT] SDK subpackage for org-to-org data migration#15
chandrasekharan-zipstack wants to merge 23 commits into
mainfrom
feat/org-migration

Conversation

@chandrasekharan-zipstack
Copy link
Copy Markdown
Contributor

@chandrasekharan-zipstack chandrasekharan-zipstack commented May 23, 2026

Summary

New unstract.migration sub-package — moves an org's configured resources (adapters, connectors, tags, prompt-studio projects, workflows, tool instances, workflow endpoints, ETL/TASK pipelines, API deployments) into another org via two admin-issued Platform API keys. Works cloud↔cloud, on-prem↔on-prem, or within the same deployment.

Tracked in UN-3479. Backend companion: Zipstack/unstract#1987.

Architecture

  • OrgEndpoint — base URL + org slug + Platform API key + configurable api_path_prefix (api/v1 default).
  • PlatformClient — entity-scoped HTTP wrapper; Bearer auth; DRF-OPTIONS-driven writable-field set via get_post_schema(entity_path).
  • MigrationContext — bundles source/target clients, options, per-run RemapTable (source UUID → target UUID).
  • Phase base + per-entity subclasses; each phase: list → per-id GET → POST (or adopt-by-name).
  • walker.remap_uuids — JSON walker that rewrites embedded UUIDs using RemapTable.resolve_any before POST.
  • MigrationReport — rich table summary + Source → Target UUID Map for audit.
  • click CLI: unstract-migrate migrate --source-url ... --source-org ... --target-url ... --target-org ... with env-var fallback for keys, plus --include / --exclude for phase selection and --dry-run.

Phase order is owned by orchestrator.PHASES — phases never call each other:

adapter → connector → tag → custom_tool → workflow → tool_instance
       → workflow_endpoint → pipeline → api_deployment

Phases

Phase Notes
Adapter Skips is_friction_less=True. Encrypted adapter_metadata_b survives via per-id GET.
Connector Same list→GET→POST pattern. Adopt-by-name idempotent.
Tag Backend now exposes POST/PATCH/DELETE; SDK creates tags on target.
CustomTool Composite — tool + profile_managers + prompts + republish to registry. Resolves profile adapters by name (UUIDs differ across orgs).
Workflow tool_metadata UUID-walk during remap.
ToolInstance ≤1 per workflow; FK-rewrites via remap.
WorkflowEndpoint Rewrites connector + workflow refs.
Pipeline ETL + TASK only. DEFAULT (legacy v1) and APP (Streamlit) skipped. Warns when source has >1 active key.
APIDeployment Adopts by api_name. Same key-warning behaviour. Sequenced after workflow_endpoint (serializer requires configured source+destination endpoints).

Constraints (from design ADRs)

  • Secrets carried verbatim through the SDK (same posture as the FE; ADR 0006).
  • No local state file (ADR 0007); idempotency = name-based GET against target.
  • No UUID preservation; embedded refs rewritten via the walker.
  • No DB / schema migrations.

Carry-forward gotchas

  1. List endpoints often serve stripped payloads (e.g. AdapterListSerializer omits adapter_metadata_b) — every phase does list → per-id GET before POST.
  2. PATH_PREFIX is env-controlled on the backend; --api-prefix flag exposed.
  3. AuditSerializer auto-adds creator to shared_users M2M — relevant for any cleanup scripting.
  4. Service-account is a real User row (is_service_account=True) + OrganizationMember.
  5. Backend auto-provisions one active API key per Pipeline/APIDeployment on POST; key UUIDs are editable=False. SDK does not preserve source key values — operator must rotate. Warning is emitted when source has >1 active key.

Test plan

  • 56 unit tests across all phases (happy / dry-run / adopt / abort / extra-keys-warning paths) + remap table + walker + HTTP client + CLI.
  • Local end-to-end smoke against docker stack (org_Q4qgjLWIbaJlfStsorg_migration_target) — full pipeline creates on first run, adopts idempotently on re-run.
  • Cloud↔cloud smoke against staging before un-drafting.
  • On-prem↔on-prem smoke before un-drafting.

Out of scope (separate work)

  • shared_to_org / shared_users propagation (resource-ownership semantics).
  • Prompt-studio uploaded document blobs (file migration).
  • API-key value preservation (would need backend design change).
  • Pipeline types DEFAULT (legacy) and APP (Streamlit).

Related

  • JIRA: UN-3479
  • Backend PR: Zipstack/unstract#1987 (feat/org-migration-platform-api-gaps)
  • KB (internal): ~/Documents/Obsidian Vault/zipstuff/org-data-migration/

chandrasekharan-zipstack and others added 11 commits May 24, 2026 01:15
…apter)

Adds `unstract.migration` — a CLI + library that drives org-to-org data
migration between Unstract deployments via the existing Platform API
surface. v1 ships the adapter phase end-to-end as the reference impl;
remaining phases (connector, tag, custom_tool, workflow, tool_instance,
workflow_endpoint) follow in subsequent commits.

Design highlights:
- Two admin Platform API keys in, populated target out. No bundle, no
  Django mgmt command, no out-of-band secret step.
- Idempotency via in-memory remap (per run) + name-based GET against
  target (across runs). No state file — target IS the state.
- Fresh target UUIDs on every create; embedded UUID references remapped
  in-memory via `walker.remap_uuids` pre-POST.
- Secrets carried verbatim from source GET to target POST (same surface
  the FE already consumes when an admin opens an adapter card).

Package layout:
- `client.py`      thin Platform API wrapper (one per OrgEndpoint)
- `context.py`     OrgEndpoint, MigrationOptions, MigrationContext, RemapTable
- `walker.py`     `remap_uuids` JSON walker
- `report.py`      rich-rendered MigrationReport + plain-text fallback
- `phases/`        base.Phase ABC + adapter.AdapterPhase (reference impl)
- `orchestrator.py` top-level `migrate()` + phase order
- `cli.py`         click CLI: `unstract-migrate migrate ...`

CLI:
- Entry point `unstract-migrate` (also `python -m unstract.migration`)
- Platform keys via flags or env (UNSTRACT_SRC_PLATFORM_KEY /
  UNSTRACT_TGT_PLATFORM_KEY) to keep them out of shell history
- `--api-prefix` overrides PATH_PREFIX (default api/v1 to match OSS
  docker compose; cloud/on-prem set as needed)
- `--dry-run`, `--include`, `--exclude`, `--on-name-conflict adopt|abort`

Audit:
- Per-line logs include source UUID + target UUID for every adopted/created
  entity (`src=... -> tgt=...`)
- Final report renders a source -> target UUID map table for traceability

Deps gated behind optional `[migration]` extra — core SDK consumers
unaffected. Tests: 13 unit tests (RemapTable, remap_uuids, AdapterPhase
happy path / idempotency / dry-run / abort). Integration smoke verified
locally against docker compose: 9 adapters migrated source -> target,
re-run reports 0 created / 9 adopted.
Per the project's code-comments guidance: comments explain WHY in
generic terms, not 'see file X in repo Y'. Path/class references rot
when files move and don't help a future reader of this package who
may not have the backend repo open.

Behavior unchanged. 13 unit tests + integration smoke still green.
Two independent leaf phases following the AdapterPhase template:
list -> per-id GET -> POST/adopt by name, record source->target remap.

- ConnectorPhase: skips Unstract Cloud Storage rows (catalog id is
  redacted on the wire — target re-provisions per-org). OAuth-backed
  connectors land without refresh tokens; operator re-authorises on
  target.
- TagPhase: simplest entity — name + description, no encryption, no
  list-vs-detail divergence.

Orchestrator PHASES order extended; 22 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the hardcoded *_POST_FIELDS tuples and the UCS catalog-ID constant.
Backend serializer is now the single source of truth for which fields the
SDK posts:

- PlatformClient.get_post_schema(entity_path) issues OPTIONS once per
  path, caches the writable-field set (DRF SimpleMetadata already strips
  read_only fields from actions.POST).
- Each phase fetches its schema in run(); builds the POST body by
  intersecting the source GET payload with the schema.
- ConnectorPhase: replace UCS_CONNECTOR_ID hardcode with an empty-metadata
  signal — backend redacts metadata to {} for auto-provisioned rows, so a
  falsy connector_metadata on the wire is unmigratable. Future redactions
  (any catalog) are covered automatically.

Test FakeClients gain a get_post_schema() that mirrors the writable
subset; UCS test renamed to test_redacted_metadata_connector_skipped.
22/22 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add phases.base.build_post_payload(src, writable) used by every phase:

- Subtracts SERVER_MANAGED from the OPTIONS-derived writable set. DRF
  exposes id/organization/created_by/modified_by/shared_users/timestamps
  as writable on ModelSerializer, but the view's perform_create overrides
  them server-side — posting them is noise (and a source-org value for
  organization/created_by would mismatch the target).
- Skips empty strings as well as None. DRF treats '' on a required field
  as blank and 400s (hit on connector_version='').

Local smoke: 8/8 connectors + 2/2 tags migrated, idempotent re-run
adopts all 10.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrates prompt-studio projects with their ProfileManager + ToolStudioPrompt
children, then republishes PromptStudioRegistry via the backend's export-tool
action (avoids carrying tool_metadata across orgs). Walker-remaps adapter
UUIDs into profile FKs and across embedded JSON fields. Fresh-tool path
deletes the backend's auto-default profile before recreating from source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrates Workflow rows with walker-remapped connector UUIDs in
source_settings + destination_settings JSON blobs. WorkflowEndpoints
are auto-created by the backend on workflow POST; reconciled later by
the dedicated WorkflowEndpoint phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the workflow execution loop on target. CustomToolPhase now records
a prompt_studio_registry remap (src_registry_id -> tgt_registry_id) by
looking up both sides via the newly-filterable registry list endpoint.

ToolInstancePhase walks the workflow remap, creates one bare instance per
target workflow (POST overrides metadata server-side), then PATCHes
metadata so source's adapter selections survive. Source metadata stores
adapters as names, which match across orgs since AdapterPhase preserves
them; the backend resolves names to local UUIDs on PATCH.

WorkflowEndpointPhase PATCHes the SOURCE/DESTINATION endpoints that the
backend auto-created on workflow POST, pairing by endpoint_type and
remapping connector FK + walker-rewriting embedded UUIDs in configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the field-by-field reconcile loop (create_custom_tool +
delete_profile + create_profile + set_default_profile + create_prompt)
in favour of the backend's purpose-built endpoints:

- GET prompt-studio/project-transfer/{id} bundles tool_metadata,
  tool_settings, default_profile_settings, prompts in one shot.
- POST prompt-studio/project-transfer/ creates the tool, default
  profile (wired with target-org adapter ids the SDK supplies), and
  prompts server-side in one call.
- POST prompt-studio/{id}/sync-prompts/ rip-and-replaces prompts on
  an existing target tool for the adopt path.

Adapter ids for the import are resolved from the source's default
ProfileManager via the adapter remap table; missing remap fails the
tool cleanly instead of landing a half-wired profile.

Removes hardcoded PROFILE_WRITABLE / PROMPT_WRITABLE frozensets and
the OPTIONS schema fetch for prompt-studio — the project-transfer
endpoint owns the field shape server-side.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Live ProfileManagerSerializer renders adapter FKs as flat NAME strings
(e.g. "gpt4-o"), not UUIDs. The first project-transfer smoke flagged
"no adapter remap for gpt4-o" for every tool — the phase was looking
up a name in the UUID-keyed adapter remap.

Switch resolution to target-side lookup: read the adapter NAME from
the source default profile and ask target's list_adapters(name=...)
for the target UUID. AdapterPhase preserves names across orgs, so the
lookup hits whenever adapters migrated cleanly.

Smoke-test result against local stack: 5/6 source tools created on
fresh run, all 5 adopted on re-run (sync_prompts), 1 source tool
failed cleanly because it has no profile (test data, not a code bug).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the last gap in the org-to-org data migration: ETL/TASK
pipelines and API deployments.

Both phases mirror the WorkflowPhase shape: get-or-create by name on
target, FK rewrites via the workflow remap table, dry-run + abort
support, idempotent re-run. Phase order extended to:

  ... workflow_endpoint -> pipeline -> api_deployment

`api_deployment` requires endpoints to be configured before the
serializer accepts it, so it must run after WorkflowEndpointPhase.

PipelinePhase scope: ETL + TASK only. DEFAULT is dead v1 code; APP is
a Streamlit-style deployment that doesn't fit the pipeline model.

API key handling: backend auto-provisions one active key per
pipeline/deployment on POST. Extra rotated source keys are NOT
mirrored — UUIDs are server-generated (not settable) and operators
should rotate post-migration anyway. Both phases log a WARNING when
the source had more than one active key so the operator notices.

Client surface added:
- list_pipelines, get_pipeline, create_pipeline, update_pipeline
- list_api_deployments, get_api_deployment, create_api_deployment,
  update_api_deployment
- list_pipeline_keys, list_api_deployment_keys, create_api_key

13 new unit tests; full suite green (43 -> 56).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@chandrasekharan-zipstack chandrasekharan-zipstack changed the title feat(migration): SDK subpackage for org-to-org data migration feat(migration): SDK subpackage for org-to-org data migration [UN-3479] May 24, 2026
@chandrasekharan-zipstack chandrasekharan-zipstack changed the title feat(migration): SDK subpackage for org-to-org data migration [UN-3479] UN-3479 [FEAT] SDK subpackage for org-to-org data migration May 24, 2026
@chandrasekharan-zipstack chandrasekharan-zipstack marked this pull request as ready for review May 24, 2026 07:00
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 24, 2026

Greptile Summary

This PR introduces the unstract.clone sub-package, a CLI and SDK for migrating configured org resources (adapters, connectors, tags, prompt-studio projects, workflows, tool instances, endpoints, pipelines, API deployments) between Unstract orgs via Platform API keys. It's a well-structured, phased migration tool with a solid test suite (56 unit tests) and clear idempotency semantics.

  • Architecture: OrgEndpoint + PlatformClient + CloneContext/RemapTable + per-entity Phase subclasses, orchestrated in strict topological order; remap_uuids walker rewrites embedded UUIDs before each POST.
  • Correctness improvements already landed (in prior commits on this branch): build_post_payload preserves False/0, pipeline/api_deployment phases now do per-id GET before POST, workflow_endpoint skips rather than silently detaches on unresolvable connectors, dry-run adopt path no longer fires export_custom_tool, list_custom_tools hoisted above loop in CustomToolPhase.
  • One gap remains: WorkflowPhase builds its POST payload directly from the list-endpoint summary — the only create-path phase that skips the get_workflow(src_id) detail fetch — making it susceptible to the same field-stripping problem that was already fixed for pipelines and API deployments.

Confidence Score: 4/5

Safe to merge after confirming whether the workflow list serializer includes source_settings/destination_settings; if it does, the current code is correct; if not, workflows would be created with stripped connector settings.

The package is well-designed with a thorough test suite and most previously identified correctness gaps already addressed. The one remaining concern is that WorkflowPhase builds its POST payload from the list-endpoint summary without calling get_workflow(src_id) first — the only create-path phase in the codebase that follows this pattern. Given the PR's own guidance that list serializers strip fields, and that the two flagged fields (source_settings, destination_settings) embed connector UUIDs that also need UUID-walking, a silent miss here would result in workflows created without their connector configuration.

src/unstract/clone/phases/workflow.py — confirm the backend workflow list serializer includes all fields needed for POST (especially source_settings and destination_settings) or add a per-id GET call.

Important Files Changed

Filename Overview
src/unstract/clone/phases/workflow.py Migrate workflows; uses list-response payload directly without a per-id GET, which is inconsistent with the stated pattern and could silently omit fields stripped by the list serializer.
src/unstract/clone/phases/files.py Prompt-Studio document file migration with retry logic; calls list_custom_tools() once per tool inside the main loop for a cosmetic log-only lookup, causing N redundant API requests.
src/unstract/clone/phases/custom_tool.py Prompt-Studio project migration; dry-run adopt path fixed (no longer fires export_custom_tool), target tool list hoisted above loop, adapter fallback UUID dropped.
src/unstract/clone/phases/pipeline.py ETL/TASK pipeline migration; now does per-id GET before POST (fixing earlier list-serializer gap), correctly skips DEFAULT/APP types and warns on extra source API keys.
src/unstract/clone/phases/api_deployment.py API deployment migration; per-id GET added before POST, adopt-by-name idempotency, extra-key warning present.
src/unstract/clone/phases/workflow_endpoint.py Workflow endpoint PATCH migration; fixed to skip instead of silently detach when source connector has no remap, null connection_type no longer coerced to empty string.
src/unstract/clone/phases/base.py Base phase helpers; build_post_payload correctly preserves False and 0 after the prior fix, server-managed fields stripped universally.
src/unstract/clone/client.py Thin HTTP client scoped per org; OPTIONS-driven schema caching, entity-scoped methods, Bearer auth, clean context-manager lifecycle.
src/unstract/clone/walker.py JSON UUID walker; cleanly rewrites UUID-shaped strings via RemapTable, unknown UUIDs pass through, dict keys not walked (correct for JSON field names).
src/unstract/clone/phases/tool_instance.py ToolInstance migration; properly detects broken adapter sentinels, strips source identity fields, two-step POST+PATCH pattern correctly mirrors backend behavior.
src/unstract/clone/orchestrator.py Top-level clone() entry point; strict topological phase order, client lifecycle managed in finally block, CloneError aborts cleanly.
src/unstract/clone/context.py Shared state dataclasses; RemapTable correctly implements resolve_any for the UUID walker, OrgEndpoint is frozen, CloneContext bundles clients and remap.

Sequence Diagram

sequenceDiagram
    participant CLI
    participant Orchestrator
    participant Phase
    participant SrcClient as Source PlatformClient
    participant TgtClient as Target PlatformClient
    participant RemapTable

    CLI->>Orchestrator: clone(src_endpoint, tgt_endpoint, options)
    Orchestrator->>SrcClient: create (Bearer key)
    Orchestrator->>TgtClient: create (Bearer key)

    loop For each Phase in topological order
        Orchestrator->>Phase: run(report)
        Phase->>TgtClient: OPTIONS entity/ → get_post_schema()
        Phase->>SrcClient: list_entities()
        loop For each source entity
            Phase->>SrcClient: get_entity(id) [detail fetch]
            Phase->>TgtClient: "list_entities(name=...) [adopt check]"
            alt Entity exists on target
                Phase->>RemapTable: record(src_id, tgt_id)
            else dry_run
                Note over Phase: skip, no write
            else create
                Phase->>RemapTable: remap_uuids(payload)
                Phase->>TgtClient: create_entity(payload)
                Phase->>RemapTable: record(src_id, tgt_id)
            end
        end
    end
    Orchestrator->>CLI: CloneReport (counts + remap snapshot)
Loading

Fix All in Claude Code

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
src/unstract/clone/phases/workflow.py:81-83
**Workflow payload built from list response, not per-id GET**

`WorkflowPhase._clone_one` applies `remap_uuids` and `build_post_payload` directly to the list-endpoint summary (`src`), skipping the `get_workflow(src_id)` call that every other phase in this package performs for the same reason. The PR's own "carry-forward gotcha #1" states: _"List endpoints often serve stripped payloads … every phase does list → per-id GET before POST."_ If the backend's workflow list serializer strips `source_settings` or `destination_settings` (the two fields called out as non-trivial in this module's own docstring), those blobs — and any connector UUIDs inside them that need UUID-walking — would be silently absent from the POST, creating a misconfigured workflow on the target. `PipelinePhase` and `APIDeploymentPhase` had identical gaps that were fixed in 9391ce3; `WorkflowPhase` appears to have been missed.

### Issue 2 of 2
src/unstract/clone/phases/files.py:70-71
`_lookup_tool_name` fires a full `list_custom_tools()` API call on every loop iteration — one call per tool in `tool_remap`. For N tools this means N round-trips that each fetch the entire tool list, just to produce a name for log messages. The same antipattern was identified and fixed in `CustomToolPhase` (hoisted to before the loop). Caching the list once before the loop removes the redundant requests.

```suggestion
        try:
            _all_tgt_tools = self.ctx.target.list_custom_tools()
        except Exception:
            _all_tgt_tools = []
        _tgt_name_by_id = {t["tool_id"]: t["tool_name"] for t in _all_tgt_tools}

        for src_tool_id, tgt_tool_id in tool_remap.items():
            tool_name = _tgt_name_by_id.get(tgt_tool_id) or src_tool_id
```

Reviews (11): Last reviewed commit: "fix(clone): address PR #15 review feedba..." | Re-trigger Greptile

Comment thread src/unstract/migration/phases/base.py Outdated
Comment thread src/unstract/clone/phases/workflow_endpoint.py
Comment thread src/unstract/clone/phases/custom_tool.py Outdated
Comment thread src/unstract/migration/phases/custom_tool.py Outdated
chandrasekharan-zipstack and others added 3 commits May 24, 2026 13:18
Adds a `files` phase that moves Prompt Studio document files between orgs
using the existing Platform API endpoints — no new BE surface.

- Default mode (`--file-strategy=platform_api`): lists target DM rows
  once per tool for idempotency, downloads each missing source file via
  `fetch_contents_ide`, decodes per mime, and POSTs as multipart through
  `upload_for_ide`.
- Skip mode (`--skip-files`): metadata only; source filenames go into
  `MigrationReport.skipped_files` for operator-driven UI re-upload.
- Mixed-mode reporting: files above `--max-file-size` (default 25MB)
  land in `oversize_files`; mime types the BE endpoint can't round-trip
  losslessly (Excel placeholder, etc.) land in `unsupported_files`.
  Sibling files always continue — phase never aborts on file-level
  issues. Transport-level failures land in `failed_files`.
- Idempotency-only retries on 5xx + transient connection errors.
- Wired after `custom_tool` in the orchestrator (consumes its remap).

Report grows four new typed lists (`uploaded_files`, `skipped_files`,
`oversize_files`, `unsupported_files`, `failed_files`) plus
end-of-phase rendering in both rich and plain modes.

12 new unit tests cover happy path, idempotency skip, oversize,
unsupported mime, skip strategy, dry-run, retry on 5xx, missing
custom_tool remap, per-tool source-list failure isolation, upload
failure capture, and parametrised text/csv + text/plain round-trips.
…nce NOT FOUND

Multiple fixes uncovered by the first local-stack run:

client.py
- list_prompt_documents: mount under prompt-studio/ prefix
  (BE include in urls_v2.py).
- download_prompt_file: use ?document_id=, matching
  fetch_contents_ide serializer (was ?file_name=, BE ignored it
  and returned 400 ValidationError).
- upload_prompt_file: drop trailing slash, BE pattern is
  prompt-studio/file/<uuid:pk> with no slash so POST 404'd.
- add get_custom_tool / update_custom_tool for default-doc PATCH.

phases/files.py
- After upload loop per tool, mirror source's CustomTool.output
  by filename so FE auto-selects on load. Fall back to first
  target doc. Preserve any existing target output (operator
  may have already picked manually on a re-run).

phases/tool_instance.py
- Detect source serializer sentinels ([X NOT FOUND],
  [DELETED ADAPTER ...], [NEEDS UPDATE]) in stored metadata and
  skip the PATCH instead of round-tripping a broken adapter
  reference. ToolInstance row exists with backend defaults;
  operator re-binds in UI.

report.py
- Drop the full source->target UUID map from rendered output
  (noisy on large migrations). Print per-entity counts only;
  full map still in as_dict() and at DEBUG log level.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
base.build_post_payload:
  Previous `value not in (None, "")` dropped booleans False and numeric
  0 along with None/"" — DRF BooleanField False and numeric defaults
  were silently stripped from POST payloads. Switch to explicit
  identity + equality checks. New test_base_helpers.py guards this.

workflow_endpoint._patch_endpoint:
  When source endpoint had a connector but its remap entry is missing
  (e.g. connector phase skipped a row), we previously PATCHed the
  target endpoint with connector_instance_id=None — silently detaching
  it. Now skip the PATCH, increment result.skipped, and append an
  error entry so the operator sees the broken link in the report.
  Existing test rewritten to assert the new skip-and-flag behaviour.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread src/unstract/clone/phases/custom_tool.py
list_custom_tools() was called once per source tool — N target-API
round-trips for the same invariant data. Fetch once before the loop
and append locally on create so adoption lookups stay correct on
re-runs.

Also drop the value.get("id") fallback in _extract_adapter_name —
returning a UUID where the caller expects a name made list_adapters
silently miss with a confusing warning.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- src/unstract/migration -> src/unstract/clone, tests/migration -> tests/clone
- MigrationReport/Context/Options/Error -> Clone*
- CLI subcommand 'migrate' -> 'clone', script 'unstract-migrate' -> 'unstract-clone'
- pyproject extra 'migration' -> 'clone'
- README + docstrings + log lines updated; report title 'Migration Report' -> 'Clone Report'

All 75 unit tests pass.
Copy link
Copy Markdown
Contributor

@jaseemjaskp jaseemjaskp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated multi-agent review of the clone subpackage (Code Reviewer, Silent-Failure Hunter, Type-Design, Test Analyzer, Comment Analyzer, Code Simplifier). Findings below are grouped by file; severity is tagged inline (P1 highest). Findings already covered by existing review comments (e.g. the greptile dry-run write on custom_tool.py:180, the _extract_adapter_name / list_custom_tools / base.build_post_payload / workflow_endpoint items) are intentionally omitted as fixed/duplicate. Tests pass (75 green).

src_ti_id, tgt_ti["id"], tgt_wf_id,
)

# PATCH the metadata regardless of created/adopted — keeps tool config
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 — dry-run performs a real PATCH on the adopt path. The dry_run guard is only the elif at line 132, which is reachable just when not existing. When the target workflow already has a tool instance (the adopt path, if existing: at line 125), dry_run=True still falls through to update_tool_instance_metadata at line 175 — a real write during a dry run. The existing dry-run test seeds an empty target so it never exercises this. This is the same bug class as the greptile finding on custom_tool.py:180.

Fix: short-circuit before the metadata PATCH when dry-run, e.g. once existing is determined add if self.ctx.options.dry_run: result.skipped += 1; return (and don't bump adopted), or guard the PATCH block (lines 158-184) with if not self.ctx.options.dry_run:.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9391ce3. Adopt branch now checks dry_run before falling through to the PATCH. Records the remap (so downstream dry-run output stays coherent) and returns. New test_dry_run_on_adopt_path_does_not_repatch_target exercises this.

for doc in src_docs:
file_name = doc.get("document_name")
src_document_id = doc.get("document_id")
if not file_name or not src_document_id:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 — malformed source DM rows are silently dropped. If a row lacks document_name/document_id (renamed field, partial serializer response), the bare continue skips it with no counter, no errors entry, and no warning — indistinguishable from "nothing to do". The operator sees a green run while a document never migrated.

Fix: increment result.failed (or at least result.skipped) and append to result.errors/report.failed_files, e.g. result.errors.append(f"src DM row for tool {tool_name} missing document_id/name: keys={list(doc)}") before continuing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9391ce3. Malformed source DM rows now bump result.skipped and append an error entry (and warn-level log). New test_malformed_source_dm_row_bumps_skipped_with_error.

"files: unsupported mime tool=%s file=%s mime=%s",
tool_name, file_name, mime,
)
report.unsupported_files.append(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 — unsupported-mime and oversize files are recorded but never counted, so the run still reports success. Both this unsupported_files.append(...) path (line 189) and the oversize_files.append(...) path (line 200) return without touching result.failed/result.skipped. The report's headline verdict (report.py:108-114) is driven solely by totals["failed"], so a run that left every Excel/oversize test doc behind still prints Completed successfully in green — contradicting the unsupported_files/oversize_files sections it also renders.

Fix: result.skipped += 1 on both paths, and/or have render() factor non-empty unsupported_files/oversize_files/failed_files into the final verdict line.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9391ce3. Both unsupported_files and oversize_files paths now bump result.skipped so the phase counters surface them. Existing oversize/unsupported tests extended with explicit result.skipped == 1 assertions. Note: I chose skipped over failed because these are known limitations the operator chose (cap setting, mime support), not transport failures — the per-file report list still carries the details.

src_regs = self.ctx.source.list_registries(custom_tool=src_tool_id)
tgt_regs = self.ctx.target.list_registries(custom_tool=tgt_tool_id)
except Exception as e:
logger.warning(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 — registry-remap lookup failure is swallowed at WARNING with no counter. A failed list_registries on source/target logs a warning and returns, leaving the prompt_studio_registry remap unrecorded. Downstream ToolInstancePhase then hits if not tgt_tool_id (tool_instance.py:108) and silently skips the tool instance — the workflow lands with no tool wired, and both phases report zero failures. A genuine API/network error is thereby indistinguishable from the legitimate "tool unpublished on source" case.

Fix: append to result.errors (and consider result.failed += 1) here so the lookup failure is visible in the structured report, distinct from the unpublished-tool path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9391ce3. The exception branch now bumps result.failed and appends registry remap lookup {tool_name}: {e} so the operator sees the partial-success in the phase counters.

if isinstance(obj, list):
return [remap_uuids(v, remap) for v in obj]
if isinstance(obj, str) and UUID_RE.match(obj):
return remap.resolve_any(obj) or obj
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 — resolve_any can silently rewrite a UUID to an unrelated target id. remap_uuids resolves any UUID-shaped string against all entity maps and returns the first hit (context.py:78-83). A UUID embedded in a JSON blob (tool_instance metadata, workflow settings, endpoint config) that is not actually a foreign key — or that collides across entity maps — gets rewritten to an unrelated target UUID, producing a structurally valid payload pointing at the wrong row, with no log at all.

Fix: at minimum logger.debug each rewrite (which entity matched, src→tgt). Better: have known-FK fields pass an expected entity type, and have resolve_any warn when a UUID matches in more than one entity map.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged but deferring. UUID collisions across entity tables are astronomically unlikely in practice (UUIDv4 birthday bound), and the walker's contract is documented as 'first hit wins'. The right fix is to make the walker entity-aware at each call site rather than scoping resolve_any — that's a broader refactor I'd like to do in a follow-up rather than as part of this review. Tracking as future work.

]


def clone(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test gap (P1) — clone() is never exercised end-to-end. Untested: phase ordering, opts.includes() include/exclude → skipped_phases, the CloneError → abort/break path (does an abort actually halt subsequent phases?), and remap_snapshot population on the returned report. This is the integration seam the whole package hangs on.

Suggest a test that monkeypatches PHASES with two trivial fake phases (one raising NameConflictError) and asserts the second never runs, report.aborted is True, abort_reason is set, and --exclude lands a name in skipped_phases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9391ce3. New tests/clone/test_orchestrator.py (6 tests) covers: phase ordering, include/exclude → skipped_phases, CloneError abort + remaining phases skipped, non-CloneError propagation, client close on both paths, remap snapshot on the report.

abort_reason: str | None = None
# Files-phase artifacts. Each entry carries enough context for an
# operator to act on it without cross-referencing the run log.
uploaded_files: list[dict[str, Any]] = field(default_factory=list)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type design (P2) — the five file-row buckets are loose list[dict[str, Any]] with divergent, implicit schemas. Each is built ad hoc in files.py with a different key set (uploaded_files has size_bytes+mime_type; oversize_files adds cap_bytes; failed_files has error), and the reader _describe_file_row defensively .get()s keys that may or may not exist. A typo in a producer key surfaces as "?" in the report rather than failing anywhere.

Suggest a small frozen dataclass FileOutcome(tool_id, file_name, tool_name=None, size_bytes=None, mime_type=None, error=None) for these fields; as_dict() becomes asdict(x) and the defensive .get()s in the renderer go away.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. Dataclasses for the five buckets would tighten the contract, but the report layer is read-only (built once, rendered once) and the loose dict shape isn't currently a foot-gun anywhere. Tracking as a follow-up if/when a consumer outside the renderer starts depending on these structures.

dry_run: bool = False
include: tuple[str, ...] | None = None
exclude: tuple[str, ...] = ()
on_name_conflict: str = "adopt" # "adopt" | "abort"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type design (P3) — closed option sets use magic strings. on_name_conflict ("adopt"/"abort") and file_strategy ("platform_api"/"skip") are compared by string literal far away (adapter.py:74 == "abort", files.py:84 == "skip"), so a typo on either side is undetectable until runtime and the CLI must re-validate by hand.

Suggest class NameConflict(str, Enum) / class FileStrategy(str, Enum)str-based keeps CLI/JSON ergonomics while making illegal values unconstructable and the comparison sites self-documenting.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Click's type=Choice(...) already validates at the CLI boundary and the option-set is small (4 strings across 2 fields). A StrEnum would tighten the in-process checks but the readability win is marginal at this scale. Deferring.

}


class Phase(ABC):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplification (P2) — six phases duplicate the same run() preamble and adopt/create skeleton. adapter, connector, tag, workflow, pipeline, api_deployment each repeat: try get_post_schema(PATH) → on error append OPTIONS <name> and return; try list_<x>() → on error append list source <x> and return; loop. The adopt/dry-run/create branch (with NameConflictError raise + remap.record) is likewise near-identical, differing only in the create call and entity label.

Suggest hoisting _fetch_writable(path, result, label) and _adopt_or_create(...) helpers onto this base class — removes ~15 duplicated lines per module across six phases while preserving the (self.name-derived) error strings. Lower priority: _warn_if_extra_source_keys is byte-identical in pipeline.py and api_deployment.py and could also move here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. The per-phase divergence (CustomTool composite, ToolInstance two-step POST+PATCH, Files DM-row enumeration, WorkflowEndpoint PATCH-not-POST) makes a single run() template fragile — three of the ten phases would need template hooks. I'd rather wait until the divergent phases stabilize before extracting. Tracking for a future cleanup.

and rejects on required fields).
"""
keys = writable - SERVER_MANAGED
# Equality with `(None, "")` matched False and 0 too (Python: False == 0,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment accuracy (P3) — this explanatory comment is self-contradicting and hard to parse. "0 in (None, "") is False, but 0 not in (...) falsely returns True" — those two statements are consistent, not contradictory, so the comment muddles the reasoning behind the (correct) code below it.

Suggest: "Use is not None and != '' (not membership in (None, '')) so meaningful falsy values like False and 0 survive — membership tests against (None, '') are easy to get subtly wrong."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 2b80d2a0 — the test now covers 9 behavioral scenarios: confirm-path (resolve via searchPersons → flatten → POST bulk-assign with the correct persona ids), already-tagged-filter (drops personas carrying the target tag), already-assigned-set filter, all-pre-tagged short-circuit (no POST + friendly toast + onComplete not called), partial-failure pass-through to onComplete, bulk-assign 500 error path, and search 500 error path. Uses the same vi.mock approach the SearchBuilderDrawer test uses for PersonSearchPicker — mocks the drawer to expose onConfirm deterministically without coupling to its internal filter/review wiring.

Comment on lines +139 to +143
payload: dict[str, Any] = {
"connection_type": src_ep.get("connection_type") or "",
"configuration": remap_uuids(src_ep.get("configuration") or {}, self.ctx.remap),
"connector_instance_id": tgt_conn_id,
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Using or "" coerces a None connection_type to an empty string. If the backend's connection_type is a DRF ChoiceField (typical values are "filesystem", "api", etc.), sending "" will be rejected with a 400 error because allow_blank is False by default. Pass None when the field is absent so the target can apply its own default, or omit the key entirely from the PATCH.

Suggested change
payload: dict[str, Any] = {
"connection_type": src_ep.get("connection_type") or "",
"configuration": remap_uuids(src_ep.get("configuration") or {}, self.ctx.remap),
"connector_instance_id": tgt_conn_id,
}
payload: dict[str, Any] = {
"configuration": remap_uuids(src_ep.get("configuration") or {}, self.ctx.remap),
"connector_instance_id": tgt_conn_id,
}
if src_ep.get("connection_type"):
payload["connection_type"] = src_ep["connection_type"]
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/unstract/clone/phases/workflow_endpoint.py
Line: 139-143

Comment:
Using `or ""` coerces a `None` `connection_type` to an empty string. If the backend's `connection_type` is a DRF `ChoiceField` (typical values are `"filesystem"`, `"api"`, etc.), sending `""` will be rejected with a 400 error because `allow_blank` is `False` by default. Pass `None` when the field is absent so the target can apply its own default, or omit the key entirely from the PATCH.

```suggestion
        payload: dict[str, Any] = {
            "configuration": remap_uuids(src_ep.get("configuration") or {}, self.ctx.remap),
            "connector_instance_id": tgt_conn_id,
        }
        if src_ep.get("connection_type"):
            payload["connection_type"] = src_ep["connection_type"]
```

How can I resolve this? If you propose a fix, please make it concise.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9391ce3. Dropped the or "" coercion; connection_type is now only added to the PATCH payload when source has a non-None value. New test_endpoint_with_null_connection_type_omits_key_in_payload locks the regression.

logger.exception("Failed to list target tools: %s", e)
result.failed += 1
result.errors.append(f"list target tools: {e}")
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Bare return breaks the PhaseResult contract

Line 83 returns None from a method declared -> PhaseResult. Every other early-exit in this file (e.g. the matching return result on line 70) and in every other run() method throughout the codebase returns result. If the target list call fails, the orchestrator gets None back — and any caller that relies on the return value would receive the wrong type. The fix is return result.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/unstract/clone/phases/custom_tool.py
Line: 83

Comment:
**Bare `return` breaks the `PhaseResult` contract**

Line 83 returns `None` from a method declared `-> PhaseResult`. Every other early-exit in this file (e.g. the matching `return result` on line 70) and in every other `run()` method throughout the codebase returns `result`. If the target list call fails, the orchestrator gets `None` back — and any caller that relies on the return value would receive the wrong type. The fix is `return result`.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same root cause as #3296606829 — fixed in 9391ce3 by returning result from the phase-init error branch.

Comment thread src/unstract/clone/phases/pipeline.py Outdated
Comment on lines +109 to +111
remapped = remap_uuids(src, self.ctx.remap)
payload = build_post_payload(remapped, self._writable)
payload["workflow"] = tgt_wf_id
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Pipeline and API-deployment phases skip the per-id GET, risking incomplete payloads

The PR's own "carry-forward gotcha #1" documents that list endpoints serve stripped payloads and that every phase should follow the list → per-id GET → POST pattern. AdapterPhase and ConnectorPhase both call get_adapter/get_connector before building the POST payload; PipelinePhase and APIDeploymentPhase build the payload directly from the list summary using remap_uuids(src, ...) / build_post_payload(remapped, ...). Fields absent from the list serializer (e.g. cron_string for TASK pipelines, or detailed scheduler config) would be silently omitted from the POST, creating a misconfigured entity on the target. The client already exposes get_pipeline and get_api_deployment — they should be called before constructing the payload, mirroring the adapter/connector pattern.

The same issue applies at api_deployment.py line 106–108.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/unstract/clone/phases/pipeline.py
Line: 109-111

Comment:
**Pipeline and API-deployment phases skip the per-id GET, risking incomplete payloads**

The PR's own "carry-forward gotcha #1" documents that list endpoints serve stripped payloads and that every phase should follow the `list → per-id GET → POST` pattern. `AdapterPhase` and `ConnectorPhase` both call `get_adapter`/`get_connector` before building the POST payload; `PipelinePhase` and `APIDeploymentPhase` build the payload directly from the list summary using `remap_uuids(src, ...)` / `build_post_payload(remapped, ...)`. Fields absent from the list serializer (e.g. `cron_string` for TASK pipelines, or detailed scheduler config) would be silently omitted from the POST, creating a misconfigured entity on the target. The client already exposes `get_pipeline` and `get_api_deployment` — they should be called before constructing the payload, mirroring the adapter/connector pattern.

The same issue applies at `api_deployment.py` line 106–108.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9391ce3. Both PipelinePhase and APIDeploymentPhase now do get_pipeline/get_api_deployment per source id before remapping + posting, so list-only serializer omissions can't strip required fields. New test_create_uses_per_id_get_not_stripped_list_payload in the pipeline tests proves the path.

P1 fixes:
- tool_instance: dry-run no longer PATCHes adopted targets.
- custom_tool: dry-run no longer republishes the registry on adopt.
- custom_tool: registry remap-lookup failure now counts as failed.
- custom_tool: phase-init failure returns `result`, not `None`.
- files: malformed DM rows + unsupported-mime + oversize bump skipped
  with an error entry so the run no longer reports green when items
  needed manual attention.
- files: tighten `_lookup_tool_name` exceptions to PlatformAPIError
  and transport errors.
- workflow_endpoint: drop `connection_type or ""` coercion that
  could turn None into a blank DRF would reject; omit the key.
- pipeline/api_deployment: do per-id GET before POST so list-only
  serializer fields don't get stripped from the create payload.

P2/P3 quick wins:
- PlatformClient: add `close()` + context manager; orchestrator
  closes both clients in a finally block.
- pipeline/api_deployment: promote DEBUG to WARNING when the
  source key-list call fails (operator-facing).
- cli: distinguish `--max-file-size 0` from unparseable; preserve 0.
- files: drop the dead `docs/internal/...` reference.

Tests:
- New test_client.py + test_orchestrator.py + test_cli.py cover
  HTTP-layer + orchestrator paths that previously had no coverage.
- Regression tests added to existing phase tests for each P1 fix.
- 105 tests pass (was 75).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants