UN-3479 [FEAT] SDK subpackage for org-to-org data migration by chandrasekharan-zipstack · Pull Request #15 · Zipstack/unstract-python-client

chandrasekharan-zipstack · 2026-05-23T20:15:42Z

Summary

New unstract.migration sub-package — moves an org's configured resources (adapters, connectors, tags, prompt-studio projects, workflows, tool instances, workflow endpoints, ETL/TASK pipelines, API deployments) into another org via two admin-issued Platform API keys. Works cloud↔cloud, on-prem↔on-prem, or within the same deployment.

Tracked in UN-3479. Backend companion: Zipstack/unstract#1987.

Architecture

OrgEndpoint — base URL + org slug + Platform API key + configurable api_path_prefix (api/v1 default).
PlatformClient — entity-scoped HTTP wrapper; Bearer auth; DRF-OPTIONS-driven writable-field set via get_post_schema(entity_path).
MigrationContext — bundles source/target clients, options, per-run RemapTable (source UUID → target UUID).
Phase base + per-entity subclasses; each phase: list → per-id GET → POST (or adopt-by-name).
walker.remap_uuids — JSON walker that rewrites embedded UUIDs using RemapTable.resolve_any before POST.
MigrationReport — rich table summary + Source → Target UUID Map for audit.
click CLI: unstract-migrate migrate --source-url ... --source-org ... --target-url ... --target-org ... with env-var fallback for keys, plus --include / --exclude for phase selection and --dry-run.

Phase order is owned by orchestrator.PHASES — phases never call each other:

adapter → connector → tag → custom_tool → workflow → tool_instance
       → workflow_endpoint → pipeline → api_deployment

Phases

Phase	Notes
Adapter	Skips `is_friction_less=True`. Encrypted `adapter_metadata_b` survives via per-id GET.
Connector	Same list→GET→POST pattern. Adopt-by-name idempotent.
Tag	Backend now exposes POST/PATCH/DELETE; SDK creates tags on target.
CustomTool	Composite — tool + profile_managers + prompts + republish to registry. Resolves profile adapters by name (UUIDs differ across orgs).
Workflow	`tool_metadata` UUID-walk during remap.
ToolInstance	≤1 per workflow; FK-rewrites via remap.
WorkflowEndpoint	Rewrites connector + workflow refs.
Pipeline	ETL + TASK only. `DEFAULT` (legacy v1) and `APP` (Streamlit) skipped. Warns when source has >1 active key.
APIDeployment	Adopts by `api_name`. Same key-warning behaviour. Sequenced after `workflow_endpoint` (serializer requires configured source+destination endpoints).

Constraints (from design ADRs)

Secrets carried verbatim through the SDK (same posture as the FE; ADR 0006).
No local state file (ADR 0007); idempotency = name-based GET against target.
No UUID preservation; embedded refs rewritten via the walker.
No DB / schema migrations.

Carry-forward gotchas

List endpoints often serve stripped payloads (e.g. AdapterListSerializer omits adapter_metadata_b) — every phase does list → per-id GET before POST.
PATH_PREFIX is env-controlled on the backend; --api-prefix flag exposed.
AuditSerializer auto-adds creator to shared_users M2M — relevant for any cleanup scripting.
Service-account is a real User row (is_service_account=True) + OrganizationMember.
Backend auto-provisions one active API key per Pipeline/APIDeployment on POST; key UUIDs are editable=False. SDK does not preserve source key values — operator must rotate. Warning is emitted when source has >1 active key.

Test plan

56 unit tests across all phases (happy / dry-run / adopt / abort / extra-keys-warning paths) + remap table + walker + HTTP client + CLI.
Local end-to-end smoke against docker stack (org_Q4qgjLWIbaJlfSts → org_migration_target) — full pipeline creates on first run, adopts idempotently on re-run.
Cloud↔cloud smoke against staging before un-drafting.
On-prem↔on-prem smoke before un-drafting.

Out of scope (separate work)

shared_to_org / shared_users propagation (resource-ownership semantics).
Prompt-studio uploaded document blobs (file migration).
API-key value preservation (would need backend design change).
Pipeline types DEFAULT (legacy) and APP (Streamlit).

…apter) Adds `unstract.migration` — a CLI + library that drives org-to-org data migration between Unstract deployments via the existing Platform API surface. v1 ships the adapter phase end-to-end as the reference impl; remaining phases (connector, tag, custom_tool, workflow, tool_instance, workflow_endpoint) follow in subsequent commits. Design highlights: - Two admin Platform API keys in, populated target out. No bundle, no Django mgmt command, no out-of-band secret step. - Idempotency via in-memory remap (per run) + name-based GET against target (across runs). No state file — target IS the state. - Fresh target UUIDs on every create; embedded UUID references remapped in-memory via `walker.remap_uuids` pre-POST. - Secrets carried verbatim from source GET to target POST (same surface the FE already consumes when an admin opens an adapter card). Package layout: - `client.py` thin Platform API wrapper (one per OrgEndpoint) - `context.py` OrgEndpoint, MigrationOptions, MigrationContext, RemapTable - `walker.py` `remap_uuids` JSON walker - `report.py` rich-rendered MigrationReport + plain-text fallback - `phases/` base.Phase ABC + adapter.AdapterPhase (reference impl) - `orchestrator.py` top-level `migrate()` + phase order - `cli.py` click CLI: `unstract-migrate migrate ...` CLI: - Entry point `unstract-migrate` (also `python -m unstract.migration`) - Platform keys via flags or env (UNSTRACT_SRC_PLATFORM_KEY / UNSTRACT_TGT_PLATFORM_KEY) to keep them out of shell history - `--api-prefix` overrides PATH_PREFIX (default api/v1 to match OSS docker compose; cloud/on-prem set as needed) - `--dry-run`, `--include`, `--exclude`, `--on-name-conflict adopt|abort` Audit: - Per-line logs include source UUID + target UUID for every adopted/created entity (`src=... -> tgt=...`) - Final report renders a source -> target UUID map table for traceability Deps gated behind optional `[migration]` extra — core SDK consumers unaffected. Tests: 13 unit tests (RemapTable, remap_uuids, AdapterPhase happy path / idempotency / dry-run / abort). Integration smoke verified locally against docker compose: 9 adapters migrated source -> target, re-run reports 0 created / 9 adopted.

Per the project's code-comments guidance: comments explain WHY in generic terms, not 'see file X in repo Y'. Path/class references rot when files move and don't help a future reader of this package who may not have the backend repo open. Behavior unchanged. 13 unit tests + integration smoke still green.

Two independent leaf phases following the AdapterPhase template: list -> per-id GET -> POST/adopt by name, record source->target remap. - ConnectorPhase: skips Unstract Cloud Storage rows (catalog id is redacted on the wire — target re-provisions per-org). OAuth-backed connectors land without refresh tokens; operator re-authorises on target. - TagPhase: simplest entity — name + description, no encryption, no list-vs-detail divergence. Orchestrator PHASES order extended; 22 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops the hardcoded *_POST_FIELDS tuples and the UCS catalog-ID constant. Backend serializer is now the single source of truth for which fields the SDK posts: - PlatformClient.get_post_schema(entity_path) issues OPTIONS once per path, caches the writable-field set (DRF SimpleMetadata already strips read_only fields from actions.POST). - Each phase fetches its schema in run(); builds the POST body by intersecting the source GET payload with the schema. - ConnectorPhase: replace UCS_CONNECTOR_ID hardcode with an empty-metadata signal — backend redacts metadata to {} for auto-provisioned rows, so a falsy connector_metadata on the wire is unmigratable. Future redactions (any catalog) are covered automatically. Test FakeClients gain a get_post_schema() that mirrors the writable subset; UCS test renamed to test_redacted_metadata_connector_skipped. 22/22 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add phases.base.build_post_payload(src, writable) used by every phase: - Subtracts SERVER_MANAGED from the OPTIONS-derived writable set. DRF exposes id/organization/created_by/modified_by/shared_users/timestamps as writable on ModelSerializer, but the view's perform_create overrides them server-side — posting them is noise (and a source-org value for organization/created_by would mismatch the target). - Skips empty strings as well as None. DRF treats '' on a required field as blank and 400s (hit on connector_version=''). Local smoke: 8/8 connectors + 2/2 tags migrated, idempotent re-run adopts all 10. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Migrates prompt-studio projects with their ProfileManager + ToolStudioPrompt children, then republishes PromptStudioRegistry via the backend's export-tool action (avoids carrying tool_metadata across orgs). Walker-remaps adapter UUIDs into profile FKs and across embedded JSON fields. Fresh-tool path deletes the backend's auto-default profile before recreating from source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Migrates Workflow rows with walker-remapped connector UUIDs in source_settings + destination_settings JSON blobs. WorkflowEndpoints are auto-created by the backend on workflow POST; reconciled later by the dedicated WorkflowEndpoint phase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes the workflow execution loop on target. CustomToolPhase now records a prompt_studio_registry remap (src_registry_id -> tgt_registry_id) by looking up both sides via the newly-filterable registry list endpoint. ToolInstancePhase walks the workflow remap, creates one bare instance per target workflow (POST overrides metadata server-side), then PATCHes metadata so source's adapter selections survive. Source metadata stores adapters as names, which match across orgs since AdapterPhase preserves them; the backend resolves names to local UUIDs on PATCH. WorkflowEndpointPhase PATCHes the SOURCE/DESTINATION endpoints that the backend auto-created on workflow POST, pairing by endpoint_type and remapping connector FK + walker-rewriting embedded UUIDs in configuration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop the field-by-field reconcile loop (create_custom_tool + delete_profile + create_profile + set_default_profile + create_prompt) in favour of the backend's purpose-built endpoints: - GET prompt-studio/project-transfer/{id} bundles tool_metadata, tool_settings, default_profile_settings, prompts in one shot. - POST prompt-studio/project-transfer/ creates the tool, default profile (wired with target-org adapter ids the SDK supplies), and prompts server-side in one call. - POST prompt-studio/{id}/sync-prompts/ rip-and-replaces prompts on an existing target tool for the adopt path. Adapter ids for the import are resolved from the source's default ProfileManager via the adapter remap table; missing remap fails the tool cleanly instead of landing a half-wired profile. Removes hardcoded PROFILE_WRITABLE / PROMPT_WRITABLE frozensets and the OPTIONS schema fetch for prompt-studio — the project-transfer endpoint owns the field shape server-side. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Live ProfileManagerSerializer renders adapter FKs as flat NAME strings (e.g. "gpt4-o"), not UUIDs. The first project-transfer smoke flagged "no adapter remap for gpt4-o" for every tool — the phase was looking up a name in the UUID-keyed adapter remap. Switch resolution to target-side lookup: read the adapter NAME from the source default profile and ask target's list_adapters(name=...) for the target UUID. AdapterPhase preserves names across orgs, so the lookup hits whenever adapters migrated cleanly. Smoke-test result against local stack: 5/6 source tools created on fresh run, all 5 adopted on re-run (sync_prompts), 1 source tool failed cleanly because it has no profile (test data, not a code bug). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Closes the last gap in the org-to-org data migration: ETL/TASK pipelines and API deployments. Both phases mirror the WorkflowPhase shape: get-or-create by name on target, FK rewrites via the workflow remap table, dry-run + abort support, idempotent re-run. Phase order extended to: ... workflow_endpoint -> pipeline -> api_deployment `api_deployment` requires endpoints to be configured before the serializer accepts it, so it must run after WorkflowEndpointPhase. PipelinePhase scope: ETL + TASK only. DEFAULT is dead v1 code; APP is a Streamlit-style deployment that doesn't fit the pipeline model. API key handling: backend auto-provisions one active key per pipeline/deployment on POST. Extra rotated source keys are NOT mirrored — UUIDs are server-generated (not settable) and operators should rotate post-migration anyway. Both phases log a WARNING when the source had more than one active key so the operator notices. Client surface added: - list_pipelines, get_pipeline, create_pipeline, update_pipeline - list_api_deployments, get_api_deployment, create_api_deployment, update_api_deployment - list_pipeline_keys, list_api_deployment_keys, create_api_key 13 new unit tests; full suite green (43 -> 56). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps · 2026-05-24T07:16:50Z

Greptile Summary

This PR introduces the unstract.clone sub-package, a CLI and SDK for migrating configured org resources (adapters, connectors, tags, prompt-studio projects, workflows, tool instances, endpoints, pipelines, API deployments) between Unstract orgs via Platform API keys. It's a well-structured, phased migration tool with a solid test suite (56 unit tests) and clear idempotency semantics.

Architecture: OrgEndpoint + PlatformClient + CloneContext/RemapTable + per-entity Phase subclasses, orchestrated in strict topological order; remap_uuids walker rewrites embedded UUIDs before each POST.
Correctness improvements already landed (in prior commits on this branch): build_post_payload preserves False/0, pipeline/api_deployment phases now do per-id GET before POST, workflow_endpoint skips rather than silently detaches on unresolvable connectors, dry-run adopt path no longer fires export_custom_tool, list_custom_tools hoisted above loop in CustomToolPhase.
One gap remains: WorkflowPhase builds its POST payload directly from the list-endpoint summary — the only create-path phase that skips the get_workflow(src_id) detail fetch — making it susceptible to the same field-stripping problem that was already fixed for pipelines and API deployments.

Confidence Score: 4/5

Safe to merge after confirming whether the workflow list serializer includes source_settings/destination_settings; if it does, the current code is correct; if not, workflows would be created with stripped connector settings.

The package is well-designed with a thorough test suite and most previously identified correctness gaps already addressed. The one remaining concern is that WorkflowPhase builds its POST payload from the list-endpoint summary without calling get_workflow(src_id) first — the only create-path phase in the codebase that follows this pattern. Given the PR's own guidance that list serializers strip fields, and that the two flagged fields (source_settings, destination_settings) embed connector UUIDs that also need UUID-walking, a silent miss here would result in workflows created without their connector configuration.

src/unstract/clone/phases/workflow.py — confirm the backend workflow list serializer includes all fields needed for POST (especially source_settings and destination_settings) or add a per-id GET call.

Important Files Changed

Filename	Overview
src/unstract/clone/phases/workflow.py	Migrate workflows; uses list-response payload directly without a per-id GET, which is inconsistent with the stated pattern and could silently omit fields stripped by the list serializer.
src/unstract/clone/phases/files.py	Prompt-Studio document file migration with retry logic; calls list_custom_tools() once per tool inside the main loop for a cosmetic log-only lookup, causing N redundant API requests.
src/unstract/clone/phases/custom_tool.py	Prompt-Studio project migration; dry-run adopt path fixed (no longer fires export_custom_tool), target tool list hoisted above loop, adapter fallback UUID dropped.
src/unstract/clone/phases/pipeline.py	ETL/TASK pipeline migration; now does per-id GET before POST (fixing earlier list-serializer gap), correctly skips DEFAULT/APP types and warns on extra source API keys.
src/unstract/clone/phases/api_deployment.py	API deployment migration; per-id GET added before POST, adopt-by-name idempotency, extra-key warning present.
src/unstract/clone/phases/workflow_endpoint.py	Workflow endpoint PATCH migration; fixed to skip instead of silently detach when source connector has no remap, null connection_type no longer coerced to empty string.
src/unstract/clone/phases/base.py	Base phase helpers; build_post_payload correctly preserves False and 0 after the prior fix, server-managed fields stripped universally.
src/unstract/clone/client.py	Thin HTTP client scoped per org; OPTIONS-driven schema caching, entity-scoped methods, Bearer auth, clean context-manager lifecycle.
src/unstract/clone/walker.py	JSON UUID walker; cleanly rewrites UUID-shaped strings via RemapTable, unknown UUIDs pass through, dict keys not walked (correct for JSON field names).
src/unstract/clone/phases/tool_instance.py	ToolInstance migration; properly detects broken adapter sentinels, strips source identity fields, two-step POST+PATCH pattern correctly mirrors backend behavior.
src/unstract/clone/orchestrator.py	Top-level clone() entry point; strict topological phase order, client lifecycle managed in finally block, CloneError aborts cleanly.
src/unstract/clone/context.py	Shared state dataclasses; RemapTable correctly implements resolve_any for the UUID walker, OrgEndpoint is frozen, CloneContext bundles clients and remap.

Sequence Diagram

sequenceDiagram
    participant CLI
    participant Orchestrator
    participant Phase
    participant SrcClient as Source PlatformClient
    participant TgtClient as Target PlatformClient
    participant RemapTable

    CLI->>Orchestrator: clone(src_endpoint, tgt_endpoint, options)
    Orchestrator->>SrcClient: create (Bearer key)
    Orchestrator->>TgtClient: create (Bearer key)

    loop For each Phase in topological order
        Orchestrator->>Phase: run(report)
        Phase->>TgtClient: OPTIONS entity/ → get_post_schema()
        Phase->>SrcClient: list_entities()
        loop For each source entity
            Phase->>SrcClient: get_entity(id) [detail fetch]
            Phase->>TgtClient: "list_entities(name=...) [adopt check]"
            alt Entity exists on target
                Phase->>RemapTable: record(src_id, tgt_id)
            else dry_run
                Note over Phase: skip, no write
            else create
                Phase->>RemapTable: remap_uuids(payload)
                Phase->>TgtClient: create_entity(payload)
                Phase->>RemapTable: record(src_id, tgt_id)
            end
        end
    end
    Orchestrator->>CLI: CloneReport (counts + remap snapshot)

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
src/unstract/clone/phases/workflow.py:81-83
**Workflow payload built from list response, not per-id GET**

`WorkflowPhase._clone_one` applies `remap_uuids` and `build_post_payload` directly to the list-endpoint summary (`src`), skipping the `get_workflow(src_id)` call that every other phase in this package performs for the same reason. The PR's own "carry-forward gotcha #1" states: _"List endpoints often serve stripped payloads … every phase does list → per-id GET before POST."_ If the backend's workflow list serializer strips `source_settings` or `destination_settings` (the two fields called out as non-trivial in this module's own docstring), those blobs — and any connector UUIDs inside them that need UUID-walking — would be silently absent from the POST, creating a misconfigured workflow on the target. `PipelinePhase` and `APIDeploymentPhase` had identical gaps that were fixed in 9391ce3; `WorkflowPhase` appears to have been missed.

### Issue 2 of 2
src/unstract/clone/phases/files.py:70-71
`_lookup_tool_name` fires a full `list_custom_tools()` API call on every loop iteration — one call per tool in `tool_remap`. For N tools this means N round-trips that each fetch the entire tool list, just to produce a name for log messages. The same antipattern was identified and fixed in `CustomToolPhase` (hoisted to before the loop). Caching the list once before the loop removes the redundant requests.

```suggestion
        try:
            _all_tgt_tools = self.ctx.target.list_custom_tools()
        except Exception:
            _all_tgt_tools = []
        _tgt_name_by_id = {t["tool_id"]: t["tool_name"] for t in _all_tgt_tools}

        for src_tool_id, tgt_tool_id in tool_remap.items():
            tool_name = _tgt_name_by_id.get(tgt_tool_id) or src_tool_id
```

_{Reviews (11): Last reviewed commit: "fix(clone): address PR #15 review feedba..." | Re-trigger Greptile}

Adds a `files` phase that moves Prompt Studio document files between orgs using the existing Platform API endpoints — no new BE surface. - Default mode (`--file-strategy=platform_api`): lists target DM rows once per tool for idempotency, downloads each missing source file via `fetch_contents_ide`, decodes per mime, and POSTs as multipart through `upload_for_ide`. - Skip mode (`--skip-files`): metadata only; source filenames go into `MigrationReport.skipped_files` for operator-driven UI re-upload. - Mixed-mode reporting: files above `--max-file-size` (default 25MB) land in `oversize_files`; mime types the BE endpoint can't round-trip losslessly (Excel placeholder, etc.) land in `unsupported_files`. Sibling files always continue — phase never aborts on file-level issues. Transport-level failures land in `failed_files`. - Idempotency-only retries on 5xx + transient connection errors. - Wired after `custom_tool` in the orchestrator (consumes its remap). Report grows four new typed lists (`uploaded_files`, `skipped_files`, `oversize_files`, `unsupported_files`, `failed_files`) plus end-of-phase rendering in both rich and plain modes. 12 new unit tests cover happy path, idempotency skip, oversize, unsupported mime, skip strategy, dry-run, retry on 5xx, missing custom_tool remap, per-tool source-list failure isolation, upload failure capture, and parametrised text/csv + text/plain round-trips.

…nce NOT FOUND Multiple fixes uncovered by the first local-stack run: client.py - list_prompt_documents: mount under prompt-studio/ prefix (BE include in urls_v2.py). - download_prompt_file: use ?document_id=, matching fetch_contents_ide serializer (was ?file_name=, BE ignored it and returned 400 ValidationError). - upload_prompt_file: drop trailing slash, BE pattern is prompt-studio/file/<uuid:pk> with no slash so POST 404'd. - add get_custom_tool / update_custom_tool for default-doc PATCH. phases/files.py - After upload loop per tool, mirror source's CustomTool.output by filename so FE auto-selects on load. Fall back to first target doc. Preserve any existing target output (operator may have already picked manually on a re-run). phases/tool_instance.py - Detect source serializer sentinels ([X NOT FOUND], [DELETED ADAPTER ...], [NEEDS UPDATE]) in stored metadata and skip the PATCH instead of round-tripping a broken adapter reference. ToolInstance row exists with backend defaults; operator re-binds in UI. report.py - Drop the full source->target UUID map from rendered output (noisy on large migrations). Print per-entity counts only; full map still in as_dict() and at DEBUG log level. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

base.build_post_payload: Previous `value not in (None, "")` dropped booleans False and numeric 0 along with None/"" — DRF BooleanField False and numeric defaults were silently stripped from POST payloads. Switch to explicit identity + equality checks. New test_base_helpers.py guards this. workflow_endpoint._patch_endpoint: When source endpoint had a connector but its remap entry is missing (e.g. connector phase skipped a row), we previously PATCHed the target endpoint with connector_instance_id=None — silently detaching it. Now skip the PATCH, increment result.skipped, and append an error entry so the operator sees the broken link in the report. Existing test rewritten to assert the new skip-and-flag behaviour. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

list_custom_tools() was called once per source tool — N target-API round-trips for the same invariant data. Fetch once before the loop and append locally on create so adoption lookups stay correct on re-runs. Also drop the value.get("id") fallback in _extract_adapter_name — returning a UUID where the caller expects a name made list_adapters silently miss with a confusing warning. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>

- src/unstract/migration -> src/unstract/clone, tests/migration -> tests/clone - MigrationReport/Context/Options/Error -> Clone* - CLI subcommand 'migrate' -> 'clone', script 'unstract-migrate' -> 'unstract-clone' - pyproject extra 'migration' -> 'clone' - README + docstrings + log lines updated; report title 'Migration Report' -> 'Clone Report' All 75 unit tests pass.

jaseemjaskp

Automated multi-agent review of the clone subpackage (Code Reviewer, Silent-Failure Hunter, Type-Design, Test Analyzer, Comment Analyzer, Code Simplifier). Findings below are grouped by file; severity is tagged inline (P1 highest). Findings already covered by existing review comments (e.g. the greptile dry-run write on custom_tool.py:180, the _extract_adapter_name / list_custom_tools / base.build_post_payload / workflow_endpoint items) are intentionally omitted as fixed/duplicate. Tests pass (75 green).

jaseemjaskp · 2026-05-25T07:09:43Z

+                src_ti_id, tgt_ti["id"], tgt_wf_id,
+            )
+
+        # PATCH the metadata regardless of created/adopted — keeps tool config


P1 — dry-run performs a real PATCH on the adopt path. The dry_run guard is only the elif at line 132, which is reachable just when not existing. When the target workflow already has a tool instance (the adopt path, if existing: at line 125), dry_run=True still falls through to update_tool_instance_metadata at line 175 — a real write during a dry run. The existing dry-run test seeds an empty target so it never exercises this. This is the same bug class as the greptile finding on custom_tool.py:180.

Fix: short-circuit before the metadata PATCH when dry-run, e.g. once existing is determined add if self.ctx.options.dry_run: result.skipped += 1; return (and don't bump adopted), or guard the PATCH block (lines 158-184) with if not self.ctx.options.dry_run:.

Fixed in 9391ce3. Adopt branch now checks dry_run before falling through to the PATCH. Records the remap (so downstream dry-run output stays coherent) and returns. New test_dry_run_on_adopt_path_does_not_repatch_target exercises this.

jaseemjaskp · 2026-05-25T07:09:43Z

+        for doc in src_docs:
+            file_name = doc.get("document_name")
+            src_document_id = doc.get("document_id")
+            if not file_name or not src_document_id:


P1 — malformed source DM rows are silently dropped. If a row lacks document_name/document_id (renamed field, partial serializer response), the bare continue skips it with no counter, no errors entry, and no warning — indistinguishable from "nothing to do". The operator sees a green run while a document never migrated.

Fix: increment result.failed (or at least result.skipped) and append to result.errors/report.failed_files, e.g. result.errors.append(f"src DM row for tool {tool_name} missing document_id/name: keys={list(doc)}") before continuing.

Fixed in 9391ce3. Malformed source DM rows now bump result.skipped and append an error entry (and warn-level log). New test_malformed_source_dm_row_bumps_skipped_with_error.

jaseemjaskp · 2026-05-25T07:09:43Z

+                "files: unsupported mime tool=%s file=%s mime=%s",
+                tool_name, file_name, mime,
+            )
+            report.unsupported_files.append(


P1 — unsupported-mime and oversize files are recorded but never counted, so the run still reports success. Both this unsupported_files.append(...) path (line 189) and the oversize_files.append(...) path (line 200) return without touching result.failed/result.skipped. The report's headline verdict (report.py:108-114) is driven solely by totals["failed"], so a run that left every Excel/oversize test doc behind still prints Completed successfully in green — contradicting the unsupported_files/oversize_files sections it also renders.

Fix: result.skipped += 1 on both paths, and/or have render() factor non-empty unsupported_files/oversize_files/failed_files into the final verdict line.

Fixed in 9391ce3. Both unsupported_files and oversize_files paths now bump result.skipped so the phase counters surface them. Existing oversize/unsupported tests extended with explicit result.skipped == 1 assertions. Note: I chose skipped over failed because these are known limitations the operator chose (cap setting, mime support), not transport failures — the per-file report list still carries the details.

jaseemjaskp · 2026-05-25T07:09:43Z

+            src_regs = self.ctx.source.list_registries(custom_tool=src_tool_id)
+            tgt_regs = self.ctx.target.list_registries(custom_tool=tgt_tool_id)
+        except Exception as e:
+            logger.warning(


P2 — registry-remap lookup failure is swallowed at WARNING with no counter. A failed list_registries on source/target logs a warning and returns, leaving the prompt_studio_registry remap unrecorded. Downstream ToolInstancePhase then hits if not tgt_tool_id (tool_instance.py:108) and silently skips the tool instance — the workflow lands with no tool wired, and both phases report zero failures. A genuine API/network error is thereby indistinguishable from the legitimate "tool unpublished on source" case.

Fix: append to result.errors (and consider result.failed += 1) here so the lookup failure is visible in the structured report, distinct from the unpublished-tool path.

Fixed in 9391ce3. The exception branch now bumps result.failed and appends registry remap lookup {tool_name}: {e} so the operator sees the partial-success in the phase counters.

jaseemjaskp · 2026-05-25T07:09:43Z

+    if isinstance(obj, list):
+        return [remap_uuids(v, remap) for v in obj]
+    if isinstance(obj, str) and UUID_RE.match(obj):
+        return remap.resolve_any(obj) or obj


P2 — resolve_any can silently rewrite a UUID to an unrelated target id. remap_uuids resolves any UUID-shaped string against all entity maps and returns the first hit (context.py:78-83). A UUID embedded in a JSON blob (tool_instance metadata, workflow settings, endpoint config) that is not actually a foreign key — or that collides across entity maps — gets rewritten to an unrelated target UUID, producing a structurally valid payload pointing at the wrong row, with no log at all.

Fix: at minimum logger.debug each rewrite (which entity matched, src→tgt). Better: have known-FK fields pass an expected entity type, and have resolve_any warn when a UUID matches in more than one entity map.

Acknowledged but deferring. UUID collisions across entity tables are astronomically unlikely in practice (UUIDv4 birthday bound), and the walker's contract is documented as 'first hit wins'. The right fix is to make the walker entity-aware at each call site rather than scoping resolve_any — that's a broader refactor I'd like to do in a follow-up rather than as part of this review. Tracking as future work.

jaseemjaskp · 2026-05-25T07:09:44Z

+]
+
+
+def clone(


Test gap (P1) — clone() is never exercised end-to-end. Untested: phase ordering, opts.includes() include/exclude → skipped_phases, the CloneError → abort/break path (does an abort actually halt subsequent phases?), and remap_snapshot population on the returned report. This is the integration seam the whole package hangs on.

Suggest a test that monkeypatches PHASES with two trivial fake phases (one raising NameConflictError) and asserts the second never runs, report.aborted is True, abort_reason is set, and --exclude lands a name in skipped_phases.

Addressed in 9391ce3. New tests/clone/test_orchestrator.py (6 tests) covers: phase ordering, include/exclude → skipped_phases, CloneError abort + remaining phases skipped, non-CloneError propagation, client close on both paths, remap snapshot on the report.

jaseemjaskp · 2026-05-25T07:09:44Z

+    abort_reason: str | None = None
+    # Files-phase artifacts. Each entry carries enough context for an
+    # operator to act on it without cross-referencing the run log.
+    uploaded_files: list[dict[str, Any]] = field(default_factory=list)


Type design (P2) — the five file-row buckets are loose list[dict[str, Any]] with divergent, implicit schemas. Each is built ad hoc in files.py with a different key set (uploaded_files has size_bytes+mime_type; oversize_files adds cap_bytes; failed_files has error), and the reader _describe_file_row defensively .get()s keys that may or may not exist. A typo in a producer key surfaces as "?" in the report rather than failing anywhere.

Suggest a small frozen dataclass FileOutcome(tool_id, file_name, tool_name=None, size_bytes=None, mime_type=None, error=None) for these fields; as_dict() becomes asdict(x) and the defensive .get()s in the renderer go away.

Fair point. Dataclasses for the five buckets would tighten the contract, but the report layer is read-only (built once, rendered once) and the loose dict shape isn't currently a foot-gun anywhere. Tracking as a follow-up if/when a consumer outside the renderer starts depending on these structures.

jaseemjaskp · 2026-05-25T07:09:44Z

+    dry_run: bool = False
+    include: tuple[str, ...] | None = None
+    exclude: tuple[str, ...] = ()
+    on_name_conflict: str = "adopt"  # "adopt" | "abort"


Type design (P3) — closed option sets use magic strings. on_name_conflict ("adopt"/"abort") and file_strategy ("platform_api"/"skip") are compared by string literal far away (adapter.py:74 == "abort", files.py:84 == "skip"), so a typo on either side is undetectable until runtime and the CLI must re-validate by hand.

Suggest class NameConflict(str, Enum) / class FileStrategy(str, Enum) — str-based keeps CLI/JSON ergonomics while making illegal values unconstructable and the comparison sites self-documenting.

Click's type=Choice(...) already validates at the CLI boundary and the option-set is small (4 strings across 2 fields). A StrEnum would tighten the in-process checks but the readability win is marginal at this scale. Deferring.

jaseemjaskp · 2026-05-25T07:09:44Z

+    }
+
+
+class Phase(ABC):


Simplification (P2) — six phases duplicate the same run() preamble and adopt/create skeleton. adapter, connector, tag, workflow, pipeline, api_deployment each repeat: try get_post_schema(PATH) → on error append OPTIONS <name> and return; try list_<x>() → on error append list source <x> and return; loop. The adopt/dry-run/create branch (with NameConflictError raise + remap.record) is likewise near-identical, differing only in the create call and entity label.

Suggest hoisting _fetch_writable(path, result, label) and _adopt_or_create(...) helpers onto this base class — removes ~15 duplicated lines per module across six phases while preserving the (self.name-derived) error strings. Lower priority: _warn_if_extra_source_keys is byte-identical in pipeline.py and api_deployment.py and could also move here.

Acknowledged. The per-phase divergence (CustomTool composite, ToolInstance two-step POST+PATCH, Files DM-row enumeration, WorkflowEndpoint PATCH-not-POST) makes a single run() template fragile — three of the ten phases would need template hooks. I'd rather wait until the divergent phases stabilize before extracting. Tracking for a future cleanup.

jaseemjaskp · 2026-05-25T07:09:44Z

+    and rejects on required fields).
+    """
+    keys = writable - SERVER_MANAGED
+    # Equality with `(None, "")` matched False and 0 too (Python: False == 0,


Comment accuracy (P3) — this explanatory comment is self-contradicting and hard to parse. "0 in (None, "") is False, but 0 not in (...) falsely returns True" — those two statements are consistent, not contradictory, so the comment muddles the reasoning behind the (correct) code below it.

Suggest: "Use is not None and != '' (not membership in (None, '')) so meaningful falsy values like False and 0 survive — membership tests against (None, '') are easy to get subtly wrong."

Addressed in 2b80d2a0 — the test now covers 9 behavioral scenarios: confirm-path (resolve via searchPersons → flatten → POST bulk-assign with the correct persona ids), already-tagged-filter (drops personas carrying the target tag), already-assigned-set filter, all-pre-tagged short-circuit (no POST + friendly toast + onComplete not called), partial-failure pass-through to onComplete, bulk-assign 500 error path, and search 500 error path. Uses the same vi.mock approach the SearchBuilderDrawer test uses for PersonSearchPicker — mocks the drawer to expose onConfirm deterministically without coupling to its internal filter/review wiring.

greptile-apps · 2026-05-26T06:07:48Z

+        payload: dict[str, Any] = {
+            "connection_type": src_ep.get("connection_type") or "",
+            "configuration": remap_uuids(src_ep.get("configuration") or {}, self.ctx.remap),
+            "connector_instance_id": tgt_conn_id,
+        }


Using or "" coerces a None connection_type to an empty string. If the backend's connection_type is a DRF ChoiceField (typical values are "filesystem", "api", etc.), sending "" will be rejected with a 400 error because allow_blank is False by default. Pass None when the field is absent so the target can apply its own default, or omit the key entirely from the PATCH.

Suggested change

payload: dict[str, Any] = {

"connection_type": src_ep.get("connection_type") or "",

"configuration": remap_uuids(src_ep.get("configuration") or {}, self.ctx.remap),

"connector_instance_id": tgt_conn_id,

}

payload: dict[str, Any] = {

"configuration": remap_uuids(src_ep.get("configuration") or {}, self.ctx.remap),

"connector_instance_id": tgt_conn_id,

}

if src_ep.get("connection_type"):

payload["connection_type"] = src_ep["connection_type"]

Prompt To Fix With AI

This is a comment left during a code review. Path: src/unstract/clone/phases/workflow_endpoint.py Line: 139-143 Comment: Using `or ""` coerces a `None` `connection_type` to an empty string. If the backend's `connection_type` is a DRF `ChoiceField` (typical values are `"filesystem"`, `"api"`, etc.), sending `""` will be rejected with a 400 error because `allow_blank` is `False` by default. Pass `None` when the field is absent so the target can apply its own default, or omit the key entirely from the PATCH. ```suggestion payload: dict[str, Any] = { "configuration": remap_uuids(src_ep.get("configuration") or {}, self.ctx.remap), "connector_instance_id": tgt_conn_id, } if src_ep.get("connection_type"): payload["connection_type"] = src_ep["connection_type"] ``` How can I resolve this? If you propose a fix, please make it concise.

Fixed in 9391ce3. Dropped the or "" coercion; connection_type is now only added to the PATCH payload when source has a non-None value. New test_endpoint_with_null_connection_type_omits_key_in_payload locks the regression.

greptile-apps · 2026-05-26T07:25:35Z

+            logger.exception("Failed to list target tools: %s", e)
+            result.failed += 1
+            result.errors.append(f"list target tools: {e}")
+            return


Bare return breaks the PhaseResult contract

Line 83 returns None from a method declared -> PhaseResult. Every other early-exit in this file (e.g. the matching return result on line 70) and in every other run() method throughout the codebase returns result. If the target list call fails, the orchestrator gets None back — and any caller that relies on the return value would receive the wrong type. The fix is return result.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/unstract/clone/phases/custom_tool.py Line: 83 Comment: **Bare `return` breaks the `PhaseResult` contract** Line 83 returns `None` from a method declared `-> PhaseResult`. Every other early-exit in this file (e.g. the matching `return result` on line 70) and in every other `run()` method throughout the codebase returns `result`. If the target list call fails, the orchestrator gets `None` back — and any caller that relies on the return value would receive the wrong type. The fix is `return result`. How can I resolve this? If you propose a fix, please make it concise.

Same root cause as #3296606829 — fixed in 9391ce3 by returning result from the phase-init error branch.

greptile-apps · 2026-05-26T07:25:36Z

+            remapped = remap_uuids(src, self.ctx.remap)
+            payload = build_post_payload(remapped, self._writable)
+            payload["workflow"] = tgt_wf_id


Pipeline and API-deployment phases skip the per-id GET, risking incomplete payloads

The PR's own "carry-forward gotcha #1" documents that list endpoints serve stripped payloads and that every phase should follow the list → per-id GET → POST pattern. AdapterPhase and ConnectorPhase both call get_adapter/get_connector before building the POST payload; PipelinePhase and APIDeploymentPhase build the payload directly from the list summary using remap_uuids(src, ...) / build_post_payload(remapped, ...). Fields absent from the list serializer (e.g. cron_string for TASK pipelines, or detailed scheduler config) would be silently omitted from the POST, creating a misconfigured entity on the target. The client already exposes get_pipeline and get_api_deployment — they should be called before constructing the payload, mirroring the adapter/connector pattern.

The same issue applies at api_deployment.py line 106–108.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/unstract/clone/phases/pipeline.py Line: 109-111 Comment: **Pipeline and API-deployment phases skip the per-id GET, risking incomplete payloads** The PR's own "carry-forward gotcha #1" documents that list endpoints serve stripped payloads and that every phase should follow the `list → per-id GET → POST` pattern. `AdapterPhase` and `ConnectorPhase` both call `get_adapter`/`get_connector` before building the POST payload; `PipelinePhase` and `APIDeploymentPhase` build the payload directly from the list summary using `remap_uuids(src, ...)` / `build_post_payload(remapped, ...)`. Fields absent from the list serializer (e.g. `cron_string` for TASK pipelines, or detailed scheduler config) would be silently omitted from the POST, creating a misconfigured entity on the target. The client already exposes `get_pipeline` and `get_api_deployment` — they should be called before constructing the payload, mirroring the adapter/connector pattern. The same issue applies at `api_deployment.py` line 106–108. How can I resolve this? If you propose a fix, please make it concise.

Fixed in 9391ce3. Both PipelinePhase and APIDeploymentPhase now do get_pipeline/get_api_deployment per source id before remapping + posting, so list-only serializer omissions can't strip required fields. New test_create_uses_per_id_get_not_stripped_list_payload in the pipeline tests proves the path.

P1 fixes: - tool_instance: dry-run no longer PATCHes adopted targets. - custom_tool: dry-run no longer republishes the registry on adopt. - custom_tool: registry remap-lookup failure now counts as failed. - custom_tool: phase-init failure returns `result`, not `None`. - files: malformed DM rows + unsupported-mime + oversize bump skipped with an error entry so the run no longer reports green when items needed manual attention. - files: tighten `_lookup_tool_name` exceptions to PlatformAPIError and transport errors. - workflow_endpoint: drop `connection_type or ""` coercion that could turn None into a blank DRF would reject; omit the key. - pipeline/api_deployment: do per-id GET before POST so list-only serializer fields don't get stripped from the create payload. P2/P3 quick wins: - PlatformClient: add `close()` + context manager; orchestrator closes both clients in a finally block. - pipeline/api_deployment: promote DEBUG to WARNING when the source key-list call fails (operator-facing). - cli: distinguish `--max-file-size 0` from unparseable; preserve 0. - files: drop the dead `docs/internal/...` reference. Tests: - New test_client.py + test_orchestrator.py + test_cli.py cover HTTP-layer + orchestrator paths that previously had no coverage. - Regression tests added to existing phase tests for each P1 fix. - 105 tests pass (was 75).

chandrasekharan-zipstack and others added 11 commits May 24, 2026 01:15

chandrasekharan-zipstack changed the title ~~feat(migration): SDK subpackage for org-to-org data migration~~ feat(migration): SDK subpackage for org-to-org data migration [UN-3479] May 24, 2026

chandrasekharan-zipstack changed the title ~~feat(migration): SDK subpackage for org-to-org data migration [UN-3479]~~ UN-3479 [FEAT] SDK subpackage for org-to-org data migration May 24, 2026

chandrasekharan-zipstack marked this pull request as ready for review May 24, 2026 07:00

chandrasekharan-zipstack self-assigned this May 24, 2026

greptile-apps Bot reviewed May 24, 2026

View reviewed changes

Comment thread src/unstract/migration/phases/base.py Outdated

Comment thread src/unstract/clone/phases/workflow_endpoint.py

Comment thread src/unstract/clone/phases/custom_tool.py Outdated

Comment thread src/unstract/migration/phases/custom_tool.py Outdated

chandrasekharan-zipstack and others added 3 commits May 24, 2026 13:18

greptile-apps Bot reviewed May 24, 2026

View reviewed changes

Comment thread src/unstract/clone/phases/custom_tool.py

chandrasekharan-zipstack requested review from Deepak-Kesavan, hari-kuriakose, jaseemjaskp and muhammad-ali-e May 24, 2026 12:00

chandrasekharan-zipstack and others added 4 commits May 24, 2026 17:37

Delete docs/internal/files-migration-plan.md

0811d31

Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>

docs(migration): rewrite README for end users

b73556f

docs(migration): add sample report + cross-deployment note

d2cc32f

docs(migration): slim README, defer to public docs

d8e6490

jaseemjaskp reviewed May 25, 2026

View reviewed changes

docs(clone): top-of-README note on users + install instructions

da07247

greptile-apps Bot reviewed May 26, 2026

View reviewed changes

docs(clone): update public docs URL to /cloning-orgs/

836962c

greptile-apps Bot reviewed May 26, 2026

View reviewed changes

		]


		def clone(

Conversation

chandrasekharan-zipstack commented May 23, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture

Phases

Constraints (from design ADRs)

Carry-forward gotchas

Test plan

Out of scope (separate work)

Related

Uh oh!

greptile-apps Bot commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jaseemjaskp left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

chandrasekharan-zipstack commented May 23, 2026 •

edited by atlassian Bot

Loading

greptile-apps Bot commented May 24, 2026 •

edited

Loading