diff --git a/AGENTS.md b/AGENTS.md index 18278981..18db0056 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -34,23 +34,25 @@ Use the **codegraph MCP** as the first tool for browsing code. `codegraph_explor ## 2. Public API surface -Everything users import lives in `src/index.ts`. The surface is **infrastructure-only** — no built-in transformers or pipelines are re-exported. The package ships two built-in presets: **`v5-to-v6-ddb`** (full DDB + S3 migration) and **`v5-to-v6-os`** (OpenSearch companion table migration). `PresetLoader` scans `src/presets/` (resolved relative to its own `import.meta.url`, works from source or `node_modules/`) — convention is **filename = preset name**, drop a `.ts` file in there and it ships, no other code change. The authoring reference lives in `templates/presets/example.ts` (scaffolded into user projects by `init`). +Everything users import lives in `src/index.ts`. The surface is primarily infrastructure, plus stable widely-useful transformers. The package ships five built-in presets: **`v5-to-v6-ddb`** (full DDB + S3 migration), **`v5-to-v6-os`** (OpenSearch companion table migration), **`copy-ddb`** (verbatim DDB + S3 copy), **`copy-os`** (verbatim OpenSearch copy), and **`copy-files`** (S3-only file copy). `PresetLoader` scans `src/presets/` (resolved relative to its own `import.meta.url`, works from source or `node_modules/`) — convention is **filename = preset name**, drop a `.ts` file in there and it ships, no other code change. The authoring reference lives in `templates/presets/example.ts` (scaffolded into user projects by `init`). - **Config builder:** `createConfig` — single unified builder; replaces the old `createDdbConfig` / `createOsConfig` (both deleted 2026-05-10). DDB + S3 always required; `source.opensearch` / `target.opensearch` optional. -- **Env helpers:** `loadEnv` (dotenv loader), `fromEnv(name, default?)` (required string env, throws on missing), `numberFromEnv(name, default?)` (typed numeric, throws on parse failure). Empty string counts as missing in both — `.env`'s `KEY=` is almost always a forgotten value, not an intentional empty override. -- **AWS credential helpers:** re-exports from `@aws-sdk/credential-providers` so users don't need the direct dep. `fromAwsProfile` (= `fromIni`) binds an explicit profile from `~/.aws/credentials` — best for local dev where a stray env var shouldn't hijack auth. `fromAwsCredentialChain` (= `fromNodeProviderChain`) runs the AWS SDK default chain (env → ini → SSO → EC2/ECS IAM) — best for CI / cloud. `credentials` in config also accepts a literal `{accessKeyId, secretAccessKey, sessionToken?}`; the union is schema-validated at `createDdbConfig` / `createOsConfig` time. +- **Env helpers:** `loadEnv` (dotenv loader), `fromEnv(name)` (required string env, throws on missing), `fromEnv(name, default)` (returns default when missing), `fromEnv(name, null)` (returns `string | null` — use for optional config sections), `numberFromEnv(name, default?)` (typed numeric, throws on parse failure). Empty string counts as missing in both — `.env`'s `KEY=` is almost always a forgotten value, not an intentional empty override. +- **AWS credential helpers:** re-exports from `@aws-sdk/credential-providers` so users don't need the direct dep. `fromAwsProfile` (= `fromIni`) binds an explicit profile from `~/.aws/credentials` — best for local dev where a stray env var shouldn't hijack auth. `fromAwsCredentialChain` (= `fromNodeProviderChain`) runs the AWS SDK default chain (env → ini → SSO → EC2/ECS IAM) — best for CI / cloud. `credentials` in config also accepts a literal `{accessKeyId, secretAccessKey, sessionToken?}`; the union is schema-validated at `createConfig` time. - **Snapshot (debugging):** `config.debug.snapshot` (boolean or `{dir?, compress?}`) dumps per-record JSONL files at `//segment-.{source,post-transform,commands}.jsonl[.gz]` + `/dropped/segment-.jsonl[.gz]`. Default dir: `.transfer//snapshot`, gzipped. Opt-in, no-op when disabled — PipelineRunner depends on SnapshotWriter unconditionally so the hot path has no branching. - **Log file (debugging):** `config.debug.logFile` (boolean or string). `true` → each process writes raw pino JSONL to `.transfer//logs/.log` (per-process files so parallel appends can't interleave). String → shared path across all processes. Bootstrap resolves the path; `detectProcessKind()` reads `--segment N` from argv to distinguish workers from the orchestrator. - **Transformer factories:** `createTransformer`, `createDdbTransformer`, `createOsTransformer` -- **Built-in transformers (public):** `copyFileToTarget` — emits a verbatim S3 copy for file records (`ctx.copyFile(key, key)` on `text@key`). Handles both raw v5 (`record.values["text@key"]`) and post-`wrapInData` (`record.data.values["text@key"]`) shapes. Requires `S3Processor` in the pipeline. Use with `isFmFile` for a verbatim DDB + S3 file copy. +- **Built-in transformers (public):** `copyFileToTarget` — emits a verbatim S3 copy for file records (`ctx.copyFile(key, key)` on `text@key`). Handles both raw v5 (`record.values["text@key"]`) and post-`wrapInData` (`record.data.values["text@key"]`) shapes. Requires `S3Processor` in the pipeline. Use with `isFmFile` for a verbatim DDB + S3 file copy. `replaceFileUrls` — rewrites file-manager URLs in CMS rich-text / long-text fields from the source domain to the target domain. Requires `fileUrls: { source, target }` in the config root. - **Filter factory:** `createFilter` + `Filter` type -- **Built-in filter predicates (public):** all predicates from `src/domain/transform/filters.ts` are re-exported: `byType`, `byTypePrefix`, `isCmsGroup`, `isCmsModel`, `isCmsEntry`, `byIncludesModelId`, `isAcoSearchRecord`, `isBackgroundTask`, `isFmFile`, `isFlpRecord`, `isBuiltInSecurityRole`, `isSecurityTeam`, `isOsBackgroundTask`, `isOsMailerSettings`, `isAuditLogEntry`, `isMigrationRecord`. All handle both raw v5 and post-`wrapInData` record shapes. Pass them to `createFilter(predicate)` or compose inline. +- **Built-in filter predicates (public):** all predicates from `src/domain/transform/filters.ts` are re-exported: `byType`, `byTypePrefix`, `isCmsGroup`, `isCmsModel`, `isCmsEntry`, `byIncludesModelId`, `isAcoSearchRecord`, `isBackgroundTask`, `isFmFile`, `isFlpRecord`, `isBuiltInSecurityRole`, `isSecurityTeam`, `isOsBackgroundTask`, `isOsMailerSettings`, `isAuditLogEntry`, `isMigrationRecord`, `isFormBuilderRecord`. All handle both raw v5 and post-`wrapInData` record shapes. Pass them to `createFilter(predicate)` or compose inline. - **Scanner implementations:** `DdbScanner`, `OsScanner` — both share `Symbol("Core/Scanner")`, same as processors share `Symbol("Core/Processor")`. - **Processor implementations:** `DdbProcessor`, `OsProcessor`, `S3Processor`, `AuditLogProcessor` (slice-merging; see below). All share `Symbol("Core/Processor")` — no per-processor abstraction tokens. - **Processor abstraction:** `Processor` — users implementing custom processors use this. - **Pipeline construction:** `PipelineBuilderFactory` — injected into `preset.configure({...})` as `pipelineBuilderFactory`. +- **Preset helper:** `createTransferPreset` — typed identity helper for authoring preset files; use instead of raw `MigrationPreset` annotation. - **MigrationPreset type** + `PresetConfigureContext` (the `{runner, pipelineBuilderFactory, container}` arg bag). -- **Context types:** `BaseTransformContext`, `DdbTransformContext`, `OsTransformContext` (type aliases = base ∩ processor slices; see below) +- **Config schema:** `migrationConfigSchema` (Zod schema) + `MigrationConfiguration` (inferred type) — for users who need schema-level validation or type inference. +- **Context types:** `BaseTransformContext`, `DdbTransformContext`, `DdbCoreTransformContext` (= Base ∧ DdbProcessorSlice, no S3 — for DDB-only transformers), `OsTransformContext` (type aliases = base ∩ processor slices; see below) - **Transformer type:** `Transformer` (namespace with `.Interface`) - **Utility types:** `NonEmptyArray` (for typed processor arrays) - **Setup helper:** `initDataTransfer` + `InitDataTransferContext` (user-side custom DI wiring — see "setup.ts" below) @@ -102,7 +104,7 @@ src/ ├── domain/ │ ├── pipeline/ # Pipeline abstractions │ │ ├── abstractions/ -│ │ │ ├── Processor.ts # extendContext? + onEnd? + execute + afterShard?; slice type parameter. +│ │ │ ├── Processor.ts # checkAccess + extendContext? + onEnd? + execute + afterShard?; slice type parameter. │ │ │ ├── Scanner.ts # Scanner.Interface │ │ │ ├── Hook.ts # per-merge-group hook │ │ │ └── Transformer.ts @@ -135,11 +137,11 @@ src/ │ ├── DdbExecutor/ # Shared primitive: PutRecord[] → TargetDynamoDbClient.batchPut. │ │ # DdbProcessor + OsProcessor both compose this. │ ├── TouchedIndexes/ # per-worker singleton: index → original refresh_interval -│ ├── PipelineRunner/ # register(...) + run() + getProcessors(); per-record slice merge + onEnd; shard-end execute +│ ├── PipelineRunner/ # register(...) + run() + getProcessors() + getShardStats(); per-record slice merge + onEnd; shard-end execute │ ├── PipelineBuilderFactory/ # Injects all Processor + Scanner instances (multiple: true deps); .create({name, scanner, processors}) │ │ # finds each instance by constructor identity → PipelineBuilder (carries instances) │ ├── TransformContext/ # Single BaseTransformContextFactory; factory returns { ctx, commands } -│ ├── MigrationConfig/ # createDdbConfig / createOsConfig (Zod-validated) +│ ├── MigrationConfig/ # createConfig (Zod-validated, unified) │ ├── ModelProvider/ # Loads CMS model definitions from DB + modelsDir JSON files. │ │ # Accepted JSON shapes (auto-detected, mixed OK in same dir): │ │ # single model: { modelId, fields: [...], ... } @@ -148,9 +150,14 @@ src/ │ │ # Disambiguation guard: object must have fields[] to be treated │ │ # as a model definition (CMS entry records also have modelId but │ │ # no fields[] — this prevents entries from being loaded as models). +│ ├── SnapshotWriter/ # Per-record JSONL debug dumps (opt-in via config.debug.snapshot) +│ ├── DroppedRecordLog/ # Writes segment-N-unmatched.log + segment-N-blackholed.log +│ ├── TransferredRecordLog/ # Writes segment-N-transferred.log +│ ├── AccessChecker/ # Aggregates checkAccess() across all processors; AccessCheck.Status +│ ├── PresetLifecycle/ # BeforeLoadPresetHook / AfterLoadPresetHook composites + ModelPreloaderHook │ ├── TenantLocales/ PresetLoader/ WorkerSpawner/ -│ └── TransferLifecycle/ # BeforeTransferHook / AfterTransferHook composites -├── transformers/ # 21 built-in transformers (user-land examples) +│ └── TransferLifecycle/ # BeforeTransferHookComposite / AfterTransferHookComposite +├── transformers/ # ~30 built-in transformers (user-land examples) │ ├── createTransformer.ts createDdbTransformer.ts createOsTransformer.ts │ ├── global/ cms/ file-manager/ folders/ mailer/ security/ │ │ └── (cms/ also has fieldUtils.ts, fieldVisitor.ts, lexicalRenderer.ts, @@ -166,7 +173,9 @@ src/ │ # (filename = preset name). │ # v5-to-v6-ddb: full DDB + S3 Webiny migration. │ # v5-to-v6-os: OpenSearch companion table migration. -│ # example.ts excluded from discovery by exact filename match. +│ # copy-ddb: verbatim DDB + S3 copy. +│ # copy-os: verbatim OpenSearch copy. +│ # copy-files: S3-only file copy. └── utils/ ├── load-env.ts # loadEnv(import.meta.url) — dotenv loader, public API └── fromEnv.ts # fromEnv + numberFromEnv — public API, used in user configs @@ -228,9 +237,8 @@ src/features/FeatureName/ - `record: TRecord` — mutable, transformers change this. - `original: Readonly` — **frozen snapshot of the pre-transform record, always present**. Users may consume it for gate-checks, audits, etc. — do NOT remove even if no built-in code uses it. - `addCommand(cmd: Command)` — push a command to the (internal) bag. Canonical primitive; slice helpers are sugar over it. -- `modelProvider`, `cache`, `logger` — shared singletons. Use `ctx.logger` instead of `console.*` inside transformers — it's bound to the current worker and respects the configured log level. +- `modelProvider`, `cache`, `logger`, `compressionHandler` — shared singletons. Use `ctx.logger` instead of `console.*` inside transformers — it's bound to the current worker and respects the configured log level. `compressionHandler` is used by rich-text and compressed-field transformers. - `replace(newRecord)` — replaces `ctx.record`. -- `queryRecord(pk, sk?)` — source-table lookup, generic return type, Promise-returning. **Raw `commands` bag is NOT on the public ctx** — `addCommand` is the only public push path. The bag still exists internally for `Processor.execute(commands)` at shard end. `Commands.unclaimedKeys()` tracks keys whose commands nobody drained, used by the runner to warn-once. @@ -238,8 +246,8 @@ src/features/FeatureName/ Slice inventory: -- **`DdbProcessor` slice**: `putRecord(record)` → emits PutRecord targeting DDB primary. -- **`OsProcessor` slice**: `putRecord(record)` → emits PutRecord targeting OS DDB table (same key as DdbProcessor → mutually exclusive in one pipeline). +- **`DdbProcessor` slice**: `putRecord(record)`, `querySourceRecord(pk, sk?)`, `queryTargetRecord(pk, sk?)` → emits PutRecord targeting DDB primary; source/target table lookups. +- **`OsProcessor` slice**: `putRecord(record)`, `querySourceRecord(pk, sk?)`, `queryTargetRecord(pk, sk?)` → emits PutRecord targeting OS DDB table (same slice keys as DdbProcessor → mutually exclusive in one pipeline). - **`S3Processor` slice**: `copyFile(src, tgt)`, `getFile(key)` → PushQueue S3Copy / sync read from source bucket. Type aliases `DdbTransformContext` (= Base ∧ DdbProcessorSlice ∧ S3ProcessorSlice) and `OsTransformContext` (= Base ∧ OsProcessorSlice) are exported for transformer authors who want typed ctx parameters. @@ -249,8 +257,8 @@ Type aliases `DdbTransformContext` (= Base ∧ DdbProcessorSlice ∧ S3Processor ### Scanner / Processor / Executor - **Scanner** = source iterator. Yields records per shard. `DynamoDbClient.scan` is generic so scanners can narrow the raw row type. -- **Processor** = per-command-type unit implementing `Processor.Interface`. Has optional `extendContext(base) → slice` (context helpers), optional `onEnd(ctx) → void | Promise` (per-record terminal hook, replaces legacy auto-put magic), `execute(commands) → Promise` (drains its command keys), optional `afterShard({ segment, totalSegments }) → void | Promise` (per-shard persistence hook for processors that carry state across the worker→orchestrator boundary — only OsProcessor implements it today, to write `-indexes.json`). -- Per-record orchestration: runner builds base ctx → spreads each processor's slice → applies filters → runs transformers → runs each processor's `onEnd?` SEQUENTIALLY IN ARRAY ORDER. +- **Processor** = per-command-type unit implementing `Processor.Interface`. Required: `checkAccess() → Promise` (pre-flight access validation), `execute(commands) → Promise` (drains its command keys). Optional: `extendContext(base) → slice` (context helpers), `onEnd(ctx) → void | Promise` (per-record terminal hook, replaces legacy auto-put magic), `afterShard({ segment, totalSegments }) → void | Promise` (per-shard persistence hook for processors that carry state across the worker→orchestrator boundary — only OsProcessor implements it today, to write `-indexes.json`). +- Per-record orchestration: filters run first (`pipeline.accepts(record)` in the shard loop) → runner builds base ctx → spreads each processor's slice → runs transformers → runs each processor's `onEnd?` SEQUENTIALLY IN ARRAY ORDER. - Per-shard orchestration: each processor's `execute()` runs SEQUENTIALLY IN ARRAY ORDER. After all processors drain, runner checks `Commands.unclaimedKeys()` and warns once per unmatched key ("transformer pushed X but no processor drained X"). - **`DdbExecutor`** is a SHARED primitive (not a Processor) — `batchPut` against a target DDB table. Both `DdbProcessor.execute` and `OsProcessor.execute` compose it. OS adds gzip + ensureIndex preamble before delegating. - **"Record carries everything"** is a house invariant — do NOT add pre-transform snapshot queues, metadata side-channels, or "executor derives X" logic. If transformers destroyed something a processor needs, users write a transformer that preps it. @@ -262,12 +270,22 @@ Optional `tuning` section on `MigrationConfig`: ```typescript tuning?: { flushEvery?: number; // records per shard flush (default 500); bounds peak memory - ddb?: { maxRetries?: number; initialBackoffMs?: number }; - s3?: { concurrency?: number; maxRetries?: number; initialBackoffMs?: number }; + ddb?: { maxRetries?: number; initialBackoffMs?: number; requestTimeoutMs?: number }; + s3?: { concurrency?: number; maxRetries?: number; initialBackoffMs?: number; requestTimeoutMs?: number }; os?: { maxRetries?: number; retryScheduleMs?: number[]; gzipConcurrency?: number }; } ``` +**Debug options** on `MigrationConfig`: + +```typescript +debug?: { + logLevel?: "debug" | "info" | "warn" | "error"; // default "info"; also overridable via --log-level CLI flag + logFile?: boolean | string; // true → per-process JSONL; string → shared path + snapshot?: boolean | { dir?: string; compress?: boolean }; +} +``` + Fields flow to the respective client/executor; absent = module-level defaults. `BATCH_SIZE = 25` in DDB is AWS-enforced, NOT a user knob. `flushEvery` caps peak per-shard memory: at default 500 × 10 KB avg = ~5 MB/shard. For tables with very large records (approaching the 400 KB DDB max) lower this to 100. Set via `tuning: { flushEvery: numberFromEnv("FLUSH_EVERY", 500) }` in the config. @@ -328,7 +346,7 @@ These are one-line summaries. Each links to a spec or PR if fuller context is ne - **Zero transformers must work** — infra supports pure data-transfer (prod→dev seeding). `PipelineBuilder.build()` never throws for missing `.filter()`; if the pipeline includes a processor with `onEnd` (e.g. `DdbProcessor`), the terminal put fires via that hook for every matching record. - **Record carries everything** — processors + executors trust `ctx.record` at execute time; no side-channel queues or pre-transform snapshot passing. The OS refactor on 2026-04-19 made this explicit. - **`ctx.original` always present** — frozen pre-transform snapshot, on every context, permanently. Don't remove even if no built-in code consumes it. -- **Transformers + presets are user-land** — the `src/transformers/` files are examples. They will be revisited when the core infra is stable. Don't design the infra around them; if a refactor breaks them, update the examples or flag for rewrite. The authoring reference lives in `templates/presets/example.ts`. The package ships one built-in preset (`v5-to-v6-ddb`); users may otherwise pass a path to their own preset file. +- **Transformers + presets are user-land** — the `src/transformers/` files are examples. They will be revisited when the core infra is stable. Don't design the infra around them; if a refactor breaks them, update the examples or flag for rewrite. The authoring reference lives in `templates/presets/example.ts`. The package ships five built-in presets (see section 2); users may additionally author their own. - **First-match-wins + scanner-keyed merge groups** — registration order is semantic. More-specific pipelines before catch-alls. Different scanners = different merge groups. - **Impl-class-as-lookup-key** — `pipelineBuilderFactory.create({ scanner: DdbScanner, processors: [DdbProcessor, S3Processor] })` accepts Implementation classes. The type system infers `TRecord` from the scanner and the effective context from all processor slices. At runtime the factory resolves both scanner and processor instances by `x.constructor === implClass` against the arrays injected via `{ multiple: true }` — pipelines store resolved instances, not tokens. All processors share `Symbol("Core/Processor")`; all scanners share `Symbol("Core/Scanner")`; constructor identity is the discriminator. Don't reintroduce per-type abstraction tokens or "abstraction-only" signatures. - **PutRecord target is baked in by the processor** — `ctx.putRecord(record)` (slice helper contributed by `DdbProcessor` or `OsProcessor`) emits a PutRecord command with the target table resolved by that processor's config. Transformers shouldn't need to know table names. @@ -336,11 +354,11 @@ These are one-line summaries. Each links to a spec or PR if fuller context is ne - **OS `ensureIndex` fails the transfer on retry-exhaustion** — the old swallow-and-continue path masked real schema / mapping bugs. If index prep exhausts retries, the whole run aborts so the user sees and fixes it. - **`@webiny/aws-sdk` wrapper** — AWS imports come from `@webiny/aws-sdk/client-{dynamodb,s3}` + helpers `getDocumentClient`, `createS3Client`. Don't import `@aws-sdk/client-*` directly. One exception: `QueryCommand` still comes from `@aws-sdk/lib-dynamodb` because the wrapper's re-export expects pre-marshalled AttributeValues — flagged for Webiny team to fix. - **Slice-merging processors** (2026-04-20) — pipelines take `processors: NonEmptyArray`. Each processor contributes a **slice** of context helpers (via `extendContext(base)`), owns a **terminal hook** (`onEnd?`), and **drains its own commands** (`execute(commands)`). Slice-key collision = mutually exclusive in a pipeline (DdbProcessor + OsProcessor both contribute `putRecord` → TS rejects); `DisjointKeys<>` catches at compile time. Slice + execute run sequentially in array order (don't hammer services). `Commands.unclaimedKeys()` reports commands no processor drained. **No more god-processors**; `DdbProcessor` writes DDB records, `S3Processor` copies S3 objects, `OsProcessor` writes OS records (gzip + ensureIndex + delegate to shared `DdbExecutor`). Adding a new command type = new processor file + add to relevant pipelines. Shared primitive `DdbExecutor` (the raw batchPut) is composed, not a Processor. - - **Command-key coupling**: `DdbProcessor` and `OsProcessor` both drain `PutRecord.key`. If both ever land in one pipeline they'd double-write (same record to DDB and OS). Prevented at compile time via `DisjointKeys<>` (both contribute the `putRecord` slice key); at runtime via a `storage`-mode guard inside each processor's `extendContext` (throws if the wrong mode). The coupling is documented on `src/domain/transform/commands/PutRecord.ts` so future command-sharing scenarios remain visible. -- **Pipeline construction lives in a dedicated factory** — `PipelineBuilderFactory.create({ name, scanner, processors })` is the only entry point. Originally lived on the runner (`runner.pipeline(...)`), extracted 2026-04-20 because construction isn't runner state. Runner's public surface shrank to `register(...) + run(opts?) + getProcessors()`. `.create()` infers `TRecord` from scanner + `EffectiveContext = BaseCtx ∧ MergeSlices` from processors. `.build()` takes no args. The factory is a DI singleton injected into `preset.configure({...})`. Don't reintroduce a `pipeline()` method on the runner. + - **Command-key coupling**: `DdbProcessor` and `OsProcessor` both drain `PutRecord.key`. If both ever land in one pipeline they'd double-write (same record to DDB and OS). Prevented at compile time via `DisjointKeys<>` (both contribute the `putRecord` slice key). The coupling is documented on `src/domain/transform/commands/PutRecord.ts` so future command-sharing scenarios remain visible. +- **Pipeline construction lives in a dedicated factory** — `PipelineBuilderFactory.create({ name, scanner, processors })` is the only entry point. Originally lived on the runner (`runner.pipeline(...)`), extracted 2026-04-20 because construction isn't runner state. Runner's public surface shrank to `register(...) + run(opts?) + getProcessors() + getShardStats()`. `.create()` infers `TRecord` from scanner + `EffectiveContext = BaseCtx ∧ MergeSlices` from processors. `.build()` takes no args. The factory is a DI singleton injected into `preset.configure({...})`. Don't reintroduce a `pipeline()` method on the runner. - **`preset.configure` takes an object arg bag** — signature is `configure({ runner, pipelineBuilderFactory, container }): void | Promise`. Async returns allowed. `container` exposed so users can resolve custom services they registered in `setup.ts`. Object shape is forward-compat — add fields without breaking existing presets. - **User-side custom DI via `setup.ts`** — CLI looks for `setup.ts` sibling of the config file; loads `await fn({ container })` BEFORE `preset.configure({...})`. Use the `initDataTransfer` typed helper. Optional — pure-config users skip it. Canonical location for registering user-authored processors, transformers, or overriding defaults. Don't reintroduce auto-registration-via-inspection magic. -- **Built-in presets are auto-discovered** — `PresetLoader` scans `src/presets/` (relative to its own `import.meta.url`, so dev / installed layouts both work). Convention: **filename === preset name**. `example.ts` is excluded by exact filename match. Adding a built-in is a file drop, not a code change. Don't reintroduce a hardcoded `BUILT_IN_PRESETS` map or a "register your preset here" registry. +- **Built-in presets are auto-discovered** — `PresetLoader` scans `src/presets/` (relative to its own `import.meta.url`, so dev / installed layouts both work). Convention: **filename === preset name**. Adding a built-in is a file drop, not a code change. Don't reintroduce a hardcoded `BUILT_IN_PRESETS` map or a "register your preset here" registry. - **`v5-to-v6-os` pipeline ordering is load-bearing** — `BackgroundTasks` and `MailerSettings` are blackholed and registered BEFORE `CmsEntries` because both are CMS entries in the OS table (same `TYPE` prefix `cms.entry.*`) and would otherwise be claimed by the catch-all. `FileManagerFiles` must also precede `CmsEntries` for the same reason. Mailer settings are blackholed because v6 stores them in the KV store — the DDB preset handles that migration; the OS record has no v6 target. - **DDB parallel scan guarantees same-PK records land in the same segment** — the scan divides by hash range, so all revisions of the same CMS entry (L, P, REV#...) always go to the same worker. This means an in-process `ctx.cache` keyed by PK is sufficient for per-entry deduplication — no cross-worker shared cache is needed. Queries for sibling records within the same entry are deduplicated by the cache; the first record encountered does the query, subsequent siblings hit the cache. - **`addLiveField` cache+sentinel pattern** — the transformer uses `ctx.cache` keyed by `ctx.original.PK`. Sentinel value `-1` means "queried, no published revision found" — avoids re-querying. P records skip the query entirely (they ARE the published revision) and populate the cache for siblings. The sentinel must be non-zero (versions start at 1) and truthy (so `if (cached)` correctly identifies a prior miss). Don't use `null` or `undefined` as the sentinel — those are cache misses. @@ -352,20 +370,17 @@ These are one-line summaries. Each links to a spec or PR if fuller context is ne ## 7. Known open work (in priority order) -### Branch `bruno/feat/di-features` (unmerged) - -The slice-merging-processors refactor landed here in April 2026 plus follow-ups (afterShard hook, ctx-by-reference runner fix, unified process-segment command, dynalite-backed integration suite, v5-to-v6-ddb golden-file preset test). Tests green, ts-check clean, oxfmt clean. Ready to merge but NOT yet on `main`. - -### Branch `bruno/feat/os-transfer` (unmerged) +### Merged branches (for historical context) -Built on top of `bruno/feat/di-features`. Adds: `v5-to-v6-os` built-in preset (`OsScanner` + `OsProcessor`, 4 pipelines with correct first-match-wins ordering), `addLiveField` transformer (DDB source query + `ctx.cache` + `-1` sentinel), `updateOsIndex` transformer (uses `configurations.es` from `@webiny/api-headless-cms-ddb-es`), `osCmsEntryTransformers` stack, OS-specific filters (`isOsBackgroundTask`, `isOsMailerSettings`), `ModelProvider` multi-format JSON support (Webiny export / array / single model), `fakeContext` fixes (record now cloned on create; default real `Map`-backed cache; `makeFakeOsContext`). Tests green, ts-check clean, oxfmt clean. NOT yet on `main`. +- `bruno/feat/di-features` — slice-merging-processors refactor, afterShard hook, dynalite integration suite, golden-file preset test. **Merged to `main`.** +- `bruno/feat/os-transfer` — `v5-to-v6-os` preset, `OsScanner` + `OsProcessor`, `addLiveField`, OS transformers, `ModelProvider` multi-format JSON. **Merged to `main`.** -### Broader open work +### Open work 1. **npm publish story** — the package isn't on npm yet. Needs version strategy, publish script, CI. `npx @webiny/data-transfer init` in the README won't work until this lands. 2. **Init scaffolding smoke** — `init` scaffolds from `templates/`. Scaffold output: `config.ts`, `presets/example.ts`, optional `setup.ts`. Do a smoke run to verify a scaffolded project compiles + runs against a live sandbox. 3. **End-to-end AWS smoke** — no test has ever run against real AWS. Day-long sandbox exercise. Catches real issues mocks can't. -4. **Public API audit pass (post-refactor)** — `src/index.ts` grew with `Processor`, `NonEmptyArray`, `InitDataTransferContext`, `BaseTransformContext`, `DdbTransformContext`, `OsTransformContext`, `initDataTransfer`. Re-audit before publish to confirm the surface matches user-authoring intent (e.g., should `DdbTransformContext` stay as-is or split into the narrower `BaseTransformContext & DdbProcessorSlice` for users who don't include S3Processor?). +4. **Public API audit pass (post-refactor)** — `src/index.ts` grew organically. Re-audit before publish to confirm the surface matches user-authoring intent. `DdbCoreTransformContext` (= Base ∧ DdbProcessorSlice) was added as the narrower alternative to `DdbTransformContext`. --- diff --git a/README.md b/README.md index d0988c1c..d0bdd9b2 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ A generic data-transfer tool for Webiny environments. Copies DynamoDB + S3 (or O - **Prod → dev seeding** — zero transformers, just copy. - **Custom transfers** — write your own transformers + pipelines + preset for bespoke data moves. -The package ships four built-in presets (`v5-to-v6-ddb`, `v5-to-v6-os`, `copy-ddb`, `copy-files`) plus full authoring support for your own. +The package ships five built-in presets (`v5-to-v6-ddb`, `v5-to-v6-os`, `copy-ddb`, `copy-os`, `copy-files`) plus full authoring support for your own. ## Quick start @@ -120,6 +120,7 @@ export default createConfig({ - **`fromEnv(name)`** — required string; throws if unset or empty (empty string counts as missing). - **`fromEnv(name, default)`** — falls back to `default` when unset. +- **`fromEnv(name, null)`** — returns `string | null`; returns `null` (instead of throwing) when unset. Use for optional config sections (e.g. `fromEnv("SOURCE_OS_TABLE", null)`). - **`numberFromEnv(name, default?)`** — typed numeric; throws on parse failure (`SEGMENTS=four` fails immediately with a named error). ### Credentials @@ -190,8 +191,8 @@ JSON models override DB-loaded models when both exist. ```typescript tuning: { flushEvery: numberFromEnv("FLUSH_EVERY", 500), // records per shard flush — bounds peak memory - ddb: { maxRetries: 3, initialBackoffMs: 100 }, - s3: { concurrency: 10, maxRetries: 3, initialBackoffMs: 100 }, + ddb: { maxRetries: 3, initialBackoffMs: 100, requestTimeoutMs: 5000 }, + s3: { concurrency: 10, maxRetries: 3, initialBackoffMs: 100, requestTimeoutMs: 10000 }, os: { maxRetries: 3, retryScheduleMs: [5000, 10000, 20000], gzipConcurrency: 16 } } ``` @@ -206,8 +207,9 @@ Add a `debug` block to your config to opt into diagnostics: ```typescript debug: { - snapshot: true, // or: { dir: "./my-snapshot", compress: false } - logFile: true // or: "./my-transfer.log" + logLevel: "debug", // "debug" | "info" | "warn" | "error" (default "info"); also overridable via --log-level CLI flag + snapshot: true, // or: { dir: "./my-snapshot", compress: false } + logFile: true // or: "./my-transfer.log" } ``` @@ -339,6 +341,7 @@ import { | `isOsMailerSettings` | OS mailer settings records | | `isAuditLogEntry` | Audit log entry records | | `isMigrationRecord` | Migration tracking records | +| `isFormBuilderRecord` | Form Builder records (forms + submissions) | Multiple `.filter()` calls on the same pipeline AND-compose — a record must pass all of them. Register more-specific filters before catch-alls. @@ -390,8 +393,9 @@ Select by name when the wizard asks "Which preset do you want to run?": - **`"v5-to-v6-ddb"`** — full Webiny v5 → v6 migration of the primary DynamoDB table (CMS entries, file manager, security, mailer, folder permissions, etc.). - **`"v5-to-v6-os"`** — migration of the OpenSearch companion DynamoDB table. Run **after** `v5-to-v6-ddb`. -- **`"copy-ddb"`** — verbatim DynamoDB-only copy (no transformations). -- **`"copy-files"`** — verbatim DynamoDB + S3 file copy. +- **`"copy-ddb"`** — verbatim DynamoDB + S3 copy (no transformations). +- **`"copy-os"`** — verbatim OpenSearch companion table copy (no transformations). +- **`"copy-files"`** — S3-only file copy. Custom presets placed in your `presetsDir` are listed alongside built-ins. @@ -516,6 +520,27 @@ export default createTransferPreset({ **Requires:** pipeline must include `S3Processor`. **Do not use** when you need a new key path (e.g. the v5→v6 `tenants//files/` migration) — use the internal `createMetadata` transformer instead. +#### `replaceFileUrls` + +Rewrites file-manager URLs in CMS rich-text and long-text fields from the source domain to the target domain. Requires a `fileUrls` block in your config root: + +```typescript +export default createConfig({ + // ...source, target, pipeline as usual... + fileUrls: { + source: "https://old-cdn.example.com", + target: "https://new-cdn.example.com" + } +}); +``` + +```typescript +import { replaceFileUrls } from "@webiny/data-transfer"; + +// In your preset: +.use(replaceFileUrls) +``` + ### Built-in processors | Processor | Slice helpers | Notes | diff --git a/docs/aws-transfer-setup.md b/docs/aws-transfer-setup.md index c6483c45..c332c248 100644 --- a/docs/aws-transfer-setup.md +++ b/docs/aws-transfer-setup.md @@ -45,7 +45,7 @@ region = us-east-1 ### Run ```bash -SOURCE_PROFILE=source TARGET_PROFILE=target yarn start +SOURCE_PROFILE=source TARGET_PROFILE=target yarn transfer ``` ### When local works @@ -211,7 +211,7 @@ sudo dnf install -y nodejs git git clone cd yarn install -SOURCE_ROLE_ARN="arn:aws:iam:::role/CrossAccountTransferReadRole" yarn start +SOURCE_ROLE_ARN="arn:aws:iam:::role/CrossAccountTransferReadRole" yarn transfer ``` ### 6. Tear down when done @@ -220,15 +220,14 @@ Terminate the instance. IAM roles can be kept for future transfers or deleted. ## Script Configuration -The script reads from environment variables: +The tool reads credentials and table/bucket names from a `config.ts` file, which loads values from `.env` via `fromEnv()`. The env var names below are **conventions used in the reference config** — the tool itself does not hardcode them. Wire them in your `config.ts` via `fromEnv("VAR_NAME")`. | Variable | Description | Required when | | --- | --- | --- | -| `SOURCE_PROFILE` | AWS profile name for source account | Running locally | -| `TARGET_PROFILE` | AWS profile name for target account | Running locally | -| `SOURCE_ROLE_ARN` | ARN of source role to assume | Running on EC2 | -| `AWS_REGION` | Target region | Always | -| `SOURCE_REGION` | Source region (if different) | If source ≠ target region | +| `SOURCE_PROFILE` | AWS profile name for source account | Running locally with `fromAwsProfile` | +| `TARGET_PROFILE` | AWS profile name for target account | Running locally with `fromAwsProfile` | +| `SOURCE_ROLE_ARN` | ARN of source role to assume | Running on EC2 with `fromAwsCredentialChain` | +| `SOURCE_REGION` / `TARGET_REGION` | AWS regions | Always (in config.ts) | ## Permissions diff --git a/docs/ddb-es-migration-reference.md b/docs/ddb-es-migration-reference.md index 684ecd47..e52ca9d9 100644 --- a/docs/ddb-es-migration-reference.md +++ b/docs/ddb-es-migration-reference.md @@ -1,5 +1,7 @@ # DDB-ES Migration Pattern Reference +> **Legacy reference only.** This document describes patterns from an external Webiny migration codebase (`.sample/migration/ddb-es`) that is NOT part of this repository. The patterns informed the design of `@webiny/data-transfer` but the file paths, class names, and APIs below do not exist in this project. Refer to AGENTS.md for the current architecture. + Reference documentation for DynamoDB + Elasticsearch migration patterns from `.sample/migration/ddb-es`. ## Architecture Overview diff --git a/docs/pino-logger-implementation.md b/docs/pino-logger-implementation.md index ebc77b68..a7177549 100644 --- a/docs/pino-logger-implementation.md +++ b/docs/pino-logger-implementation.md @@ -2,141 +2,61 @@ ## Overview -Replaced the simple console-based logger with Pino, a fast and structured JSON logger with pretty printing support. +The project uses Pino for structured JSON logging with pretty-printing support. -## Changes Made +## Current architecture -### 1. Dependencies Added +The logger is a DI-managed service, not a standalone utility: -```bash -yarn add pino pino-pretty -``` +- **Abstraction**: `Logger` (`src/tools/Logger/abstractions/Logger.ts`) +- **Implementation**: `PinoLogger` (`src/tools/Logger/PinoLogger.ts`) +- **Feature**: `LoggerFeature.register(container, { logLevel, json, logFile? })` -### 2. Logger Utility (`src/utils/logger.ts`) +Consumers resolve via the DI container: -**New API**: -```typescript -import { createLogger, Logger } from "./utils/logger.ts"; +```ts +const logger = container.resolve(Logger); +logger.info("message"); +logger.child("[segment 0]"); // per-worker prefix +``` -// Create logger with optional configuration -const logger = createLogger({ - level: "info", // Optional: trace, debug, info, warn, error - msgPrefix: "[segment #0] " // Optional: prefix for all messages -}); +There is no `createLogger` function — the logger is always resolved from the container. -// Usage -logger.info("Simple message"); -logger.info("Message with data", { key: "value" }); -logger.warn({ error }, "Warning message"); -logger.error({ error }, "Error message"); -``` +### Log level -**Environment Variable**: -- `LOG_LEVEL`: Set log level (trace, debug, info, warn, error) -- Default: `info` +Configured via: -### 3. Run ID Generation (`src/cli.ts`) +1. `config.debug.logLevel` in the user's `config.ts` (`"debug" | "info" | "warn" | "error"`, default `"info"`) +2. `--log-level` CLI flag (overrides config) -Each migration run now gets a unique ID: -```typescript -const runId = String(Date.now()); -logger.info(`Run ID: ${runId}`); -``` +There is no `LOG_LEVEL` environment variable — log level flows through the config and CLI flag, not env vars. -The runId is: -- Generated once in the main process -- Passed to all worker processes -- Can be used for log file naming and correlation +### Per-worker prefixes -### 4. Segment Prefixes (`src/process-segment.ts`) +Each worker process creates a child logger with a segment prefix: -Each worker process logs with a segment prefix: -```typescript -const logger = createLogger({ - msgPrefix: `[segment #${options.segment}] ` -}); +```ts +// src/commands/processSegment/handler.ts +const childLogger = logger.child("[segment 0]"); ``` -**Output Example**: +**Output example:** ``` [20:31:50.954] INFO: [segment #0] Starting segment 0 of 4 (0%) [20:31:51.123] INFO: [segment #1] Starting segment 1 of 4 (25%) -[20:31:51.245] INFO: [segment #2] Starting segment 2 of 4 (50%) -[20:31:51.367] INFO: [segment #3] Starting segment 3 of 4 (75%) -``` - -### 5. Structured Logging - -Pino uses structured logging with separate fields for metadata: - -**Before** (console-based): -```typescript -logger.error("Migration failed:", error); -``` - -**After** (Pino): -```typescript -logger.error({ error }, "Migration failed"); ``` -This allows: -- Better log parsing and filtering -- Easier debugging with structured data -- JSON output for log aggregation systems +### Log files -## Benefits +Controlled by `config.debug.logFile`: -1. **Performance**: Pino is one of the fastest Node.js loggers -2. **Structured**: JSON-based logging for easy parsing -3. **Pretty Printing**: Human-readable output during development -4. **Prefixes**: Easy identification of which segment is logging -5. **Run Correlation**: Run ID allows tracking all logs from a single migration run -6. **Log Levels**: Configurable verbosity via environment variable +- `true` → each process writes to `.transfer//logs/.log` (per-process files, no interleaving) +- String → all processes append to the given path +- Absent → no log file, stdout only -## Usage Examples +### Run ID -### Basic Logging -```typescript -logger.info("Processing records"); -logger.warn("Found duplicate record"); -logger.error({ recordId: "123" }, "Failed to process record"); -``` - -### With Context -```typescript -logger.info({ - recordsProcessed: 1000, - recordsMigrated: 950, - recordsSkipped: 50 -}, "Batch completed"); -``` - -### With Errors -```typescript -try { - await processRecord(record); -} catch (error) { - logger.error({ error, recordId: record.id }, "Failed to process record"); -} -``` - -## Configuration - -### Log Levels -Set via environment variable: -```bash -LOG_LEVEL=trace yarn transfer ... -LOG_LEVEL=debug yarn transfer ... -LOG_LEVEL=info yarn transfer ... # Default -LOG_LEVEL=warn yarn transfer ... -LOG_LEVEL=error yarn transfer ... -``` - -### Pretty Printing -The logger is configured with: -- Colorized output -- Hidden `pid` and `hostname` fields -- Human-readable timestamps (`HH:MM:ss.l`) +Generated in `src/commands/run/handler.ts` (not `src/cli.ts`). Passed to all worker processes. ## Gotchas @@ -171,18 +91,12 @@ level filtering itself and the stream receives whatever pino decides to emit. **Rule of thumb:** whenever you add an entry to `pino.multistream`, always set `level` explicitly. The omitted-level default of `info` is almost never what you want. -## Future Enhancements +## Structured logging -Based on the DDB-ES migration reference, these features can be added: - -1. **Log Files**: Write logs to temporary directory - ```typescript - const logFilePath = path.join( - os.tmpdir(), - `v5-to-v6-migration-log-${runId}-${segment}.log` - ); - ``` +Pino uses structured logging with separate fields for metadata: -2. **Statistics Tracking**: Write stats to JSON files per segment -3. **Log Aggregation**: Collect all segment logs at the end -4. **Metrics**: Track and report migration statistics +```ts +// Context objects first, message string last +logger.error({ error }, "Migration failed"); +logger.info({ recordsProcessed: 1000, recordsSkipped: 50 }, "Batch completed"); +``` diff --git a/docs/webiny-di-guide.md b/docs/webiny-di-guide.md index 66a1c95c..43d99a27 100644 --- a/docs/webiny-di-guide.md +++ b/docs/webiny-di-guide.md @@ -211,12 +211,13 @@ import { PinoLogger } from "./PinoLogger.ts"; export interface LoggerOptions { logLevel: "debug" | "info" | "warn" | "error"; json: boolean; + logFile?: string; } export const LoggerFeature = createFeature({ name: "Core/LoggerFeature", register(container, options) { - container.registerInstance(Logger, new PinoLoggerImpl(options)); + container.registerInstance(Logger, new PinoLogger(options)); // or: container.register(PinoLogger).inSingletonScope(); } }); diff --git a/templates/.claude/skills/writing-data-transfer-preset/SKILL.md b/templates/.claude/skills/writing-data-transfer-preset/SKILL.md index 89af3e6a..d1a50d2a 100644 --- a/templates/.claude/skills/writing-data-transfer-preset/SKILL.md +++ b/templates/.claude/skills/writing-data-transfer-preset/SKILL.md @@ -1,11 +1,11 @@ --- name: writing-data-transfer-preset -description: Use when writing or editing a @webiny/data-transfer preset file (the one referenced by pipeline.preset in a config). Covers createTransferPreset, pipelineBuilderFactory.create, filter/use/hook composition, first-match-wins record dispatch, unmatched-record drop semantics, writing transformers (createDdbTransformer / createOsTransformer), registering pipelines with the runner, and the onEnd auto-put behavior for DdbProcessor / OsProcessor. +description: Use when writing or editing a @webiny/data-transfer preset file (selected at runtime by the wizard or --preset flag). Covers createTransferPreset, pipelineBuilderFactory.create, filter/use/hook composition, first-match-wins record dispatch, unmatched-record drop semantics, writing transformers (createDdbTransformer / createOsTransformer), registering pipelines with the runner, and the onEnd auto-put behavior for DdbProcessor / OsProcessor. --- # Writing a `@webiny/data-transfer` preset -A preset is a `.ts` file that `export default`s a `createTransferPreset({...})` call. The config points at it via `pipeline.preset: "./presets/my-preset.ts"` (relative path) OR a built-in name. +A preset is a `.ts` file that `export default`s a `createTransferPreset({...})` call. It is selected at runtime by the `TransferWizard` (interactive prompt) or passed as `--preset ` on the CLI. Built-in presets live in `src/presets/`; user presets go in the project's `presetsDir`. The preset's job: register one or more **pipelines** — each pipeline is a `{scanner, processors, filter, transformers}` quadruple that processes matching records. @@ -168,7 +168,7 @@ export const stampMigratedAt = createDdbTransformer( | Member | Description | | -------------------------- | --------------------------------------------------------------------------------------- | | `ctx.copyFile(src, tgt)` | Emit an S3 copy command (source bucket → target bucket). | -| `ctx.getFile(key)` | Read a file from the SOURCE bucket. Returns `Buffer`. | +| `ctx.getFile(key)` | Read a file from the SOURCE bucket. Returns `Buffer \| null`. | **OsProcessor slice** — available when pipeline includes `OsProcessor`: @@ -216,9 +216,10 @@ Use them for index preparation, schema migration, cache warm-up, etc. ## Built-in transformer stacks -`src/transformers/index.ts` defines two pre-built transformer arrays used by the built-in presets. They are **not exported from the `@webiny/data-transfer` public API** — custom presets that need them must import via the source path: +`src/transformers/index.ts` defines two pre-built transformer arrays used by the built-in presets. They are **not exported from the `@webiny/data-transfer` public API** — they are internal to the package. Custom presets running from within the monorepo can import via the source path, but installed-package users must replicate the stack inline: ```ts +// Only works from within the data-transfer monorepo, not from an installed package: import { cmsEntryTransformers } from "../../src/transformers/index.ts"; import { osCmsEntryTransformers } from "../../src/transformers/index.ts"; ``` @@ -230,10 +231,13 @@ import { osCmsEntryTransformers } from "../../src/transformers/index.ts"; ## Built-in presets -Two built-in presets ship with the package (pass by name in `config.pipeline.preset`): +Five built-in presets ship with the package (selected at runtime via wizard or `--preset`): - **`v5-to-v6-ddb`** — full Webiny v5→v6 DDB + S3 migration. Pipelines (registration order): AuditLogs (blackholed when no audit log table), AcoSearchRecordsPage (blackhole), ContentModelGroups, BackgroundTasks (blackhole), FileManagerSettings, FileManagerFiles, MailerSettings, SecurityGroups, SecurityTeams, CmsModels, FolderPermissions, CmsEntries. AuditLogs MUST be registered before AcoSearchRecordsPage and CmsEntries (audit log records share the acoSearchRecord modelId prefix). - **`v5-to-v6-os`** — OpenSearch companion table migration. Pipelines: BackgroundTasks (blackhole), MailerSettings (blackhole), FileManagerFiles, CmsEntries. Uses `OsScanner` + `OsProcessor`. Registration order is load-bearing: blackhole pipelines BEFORE CmsEntries because background tasks and mailer settings ARE CMS entries in the OS table. +- **`copy-ddb`** — verbatim DDB + S3 copy (zero transformers). +- **`copy-os`** — verbatim OpenSearch copy (zero transformers). +- **`copy-files`** — S3-only file copy. ## `addLiveField` — querying siblings via cache diff --git a/templates/AGENTS.md b/templates/AGENTS.md index 14263a83..68fdd723 100644 --- a/templates/AGENTS.md +++ b/templates/AGENTS.md @@ -6,16 +6,15 @@ This project uses [`@webiny/data-transfer`](https://www.npmjs.com/package/@webin ``` projects// - ddb.transfer.config.ts # DDB + S3 transfer - os.transfer.config.ts # OpenSearch transfer - custom.transfer.config.ts # Same as ddb but points at a local preset + config.ts # Unified config (DDB + S3 + optional OpenSearch) setup.ts # Optional custom DI wiring models/ # Optional CMS model overrides + presets/ # User preset files (auto-discovered by presetsDir) .env # Your credentials + table/bucket names (gitignored) .env.example # Template to copy from transformers/ # Your custom record transformers -presets/ # Your custom pipeline presets +presets/ # Shared presets (across projects) features/ # Custom DI features (advanced) ``` @@ -25,12 +24,12 @@ Duplicate the `projects//` folder for each environment — each has it Two Claude skills ship with this project (`.claude/skills/`) and activate automatically when you ask Claude for help: -- **`writing-data-transfer-config`** — when editing any `*.transfer.config.ts`. +- **`writing-data-transfer-config`** — when editing a `config.ts`. - **`writing-data-transfer-preset`** — when editing a preset file under `presets/`. Both include: -- The exact shapes `createDdbConfig` / `createOsConfig` / `createTransferPreset` accept. +- The exact shapes `createConfig` / `createTransferPreset` accept. - `fromEnv` / `numberFromEnv` / `fromAwsProfile` usage. - Source/target collision + whitespace-trimming rules (both built into the Zod validators). - Pipeline filter order, first-match-wins semantics, silent-drop behavior for unmatched records. @@ -63,18 +62,17 @@ After reviewing the written `.env`, run `yarn transfer` again. With `.env` prese ### Direct run (skip wizard) ```bash -# DDB transfer first -yarn transfer --config=./projects/example/ddb.transfer.config.ts - -# OS transfer second (if applicable) -yarn transfer --config=./projects/example/os.transfer.config.ts +# One config covers everything; the preset determines which storage operations run. +# The wizard selects the preset at runtime, or pass --preset directly: +yarn transfer --config=./projects/example/config.ts --preset=v5-to-v6-ddb +yarn transfer --config=./projects/example/config.ts --preset=v5-to-v6-os ``` Always run DDB before OS — OS depends on models + tenants written by the DDB transfer. ## Verifying configs before a real run -A misconfigured transfer can destroy production data. The Zod schema in `createDdbConfig` / `createOsConfig` already rejects: +A misconfigured transfer can destroy production data. The Zod schema in `createConfig` already rejects: - Same S3 bucket on both sides (would overwrite source files). - Same region + same DDB/OS-DDB table name (would read and write to the same table). diff --git a/templates/README.md b/templates/README.md index fe038de9..ec35dcb2 100644 --- a/templates/README.md +++ b/templates/README.md @@ -40,19 +40,19 @@ Mixed formats are allowed. Re-run `yarn transfer` after placing the files. After the wizard writes your `.env`, run `yarn transfer` again. The wizard skips setup and prompts for the config to run. -Or skip the wizard and pass `--config` directly: +Or skip the wizard and pass `--config` + `--preset` directly: ```bash -yarn transfer --config=./projects/my-project/ddb.transfer.config.ts # DDB + S3 first -yarn transfer --config=./projects/my-project/os.transfer.config.ts # OS second (if applicable) +yarn transfer --config=./projects/my-project/config.ts --preset=v5-to-v6-ddb # DDB + S3 first +yarn transfer --config=./projects/my-project/config.ts --preset=v5-to-v6-os # OS second (if applicable) ``` -Always run the DDB transfer first, then the OS transfer. +One `config.ts` covers all storage types — the preset determines which operations run. Always run the DDB transfer first, then the OS transfer. ## Project Structure ``` -projects/ Per-project configs and .env files +projects/ Per-project config.ts and .env files transformers/ Custom record transformers presets/ Custom pipeline presets features/ Custom DI features @@ -65,12 +65,10 @@ Duplicate the `projects/example/` folder for each environment you want to transf ``` projects/ production/ - ddb.transfer.config.ts - os.transfer.config.ts + config.ts .env staging/ - ddb.transfer.config.ts - os.transfer.config.ts + config.ts .env ``` @@ -78,27 +76,21 @@ Each project has its own `.env` file so credentials are isolated. ## Configuration -### DDB Transfer +### Unified Config -The DDB config transfers all DynamoDB records (CMS entries, models, security, file manager, settings) and S3 files. +One `config.ts` covers DDB, S3, and optional OpenSearch. The preset (selected at runtime by the wizard or via `--preset`) determines which storage operations run. -See `projects/example/ddb.transfer.config.ts` for the full template. - -### OS Transfer - -The OS config transfers CMS entries from the OpenSearch DynamoDB table. It decompresses gzipped records, applies transformations, and writes to the target OS DynamoDB table. - -See `projects/example/os.transfer.config.ts` for the full template. +See `projects/example/config.ts` for the full template. ### Pipeline Options -- `preset` - File path to your preset, resolved relative to the config file's directory (e.g. `"./presets/my-preset.ts"` or `"../../presets/example.ts"`). No built-in presets ship with the package — author your own. - `segments` - Number of parallel workers for scanning (default: 1) - `modelsDir` - Path to a directory with custom CMS model JSON files (optional). Resolved relative to the config file's directory. +- `presetsDir` - Directory of user preset files. The wizard discovers them automatically. ### Debug Options -Opt-in via a top-level `debug: { ... }` block on your config. See the commented block in `projects/example/ddb.transfer.config.ts` for the full shape. +Opt-in via a top-level `debug: { ... }` block on your config. See the commented block in `projects/example/config.ts` for the full shape. - `debug.snapshot` — dump every source/post-transform/command record to local JSONL files under `.transfer//snapshot/` (gzipped by default). Use `true` for defaults, or `{ dir, compress }` to override. Great for diffing exactly what a transformer did to a specific record without re-scanning AWS. - `debug.logFile` — write the runner's pino log to disk alongside stdout. `true` → `.transfer//logs/.log` (one file per process, safe under worker parallelism). String → all processes append to the path you provide. Replay with `cat .transfer//logs/*.log | pino-pretty`. @@ -110,7 +102,7 @@ Both outputs land under `.transfer/` by default, which is gitignored in this rep If a run finishes with only some shards failing, you can re-drive just those indices instead of the whole table: ```bash -yarn transfer --config=./projects/my-project/ddb.transfer.config.ts --segments=1,3 +yarn transfer --config=./projects/my-project/config.ts --preset=v5-to-v6-ddb --segments=1,3 ``` Workers still receive the full `--total` (from `pipeline.segments`), so each shard scans the exact same slice it would in a fresh run. Out-of-range indices fail fast before any worker spawns. @@ -121,21 +113,20 @@ The package ships with starter files so you can compose your own transfer: - `transformers/stampMigratedAt.ts` — a minimal custom transformer (plain function mutating `ctx.record`). - `presets/example.ts` — a minimal preset that registers one pipeline using the transformer. -- `projects/example/custom.transfer.config.ts` — a config pointing at the custom preset above. +- `projects/example/config.ts` — a config whose `presetsDir` points at the custom preset above. -Run it: +Run it (the wizard will discover your preset, or pass `--preset` directly): ```bash -yarn transfer --config=./projects/example/custom.transfer.config.ts +yarn transfer --config=./projects/example/config.ts --preset=example ``` A preset is an object: ```typescript -import type { MigrationPreset } from "@webiny/data-transfer"; -import { DdbScanner, DdbProcessor, S3Processor } from "@webiny/data-transfer"; +import { createTransferPreset, DdbScanner, DdbProcessor, S3Processor } from "@webiny/data-transfer"; -const preset: MigrationPreset = { +export default createTransferPreset({ name: "my", description: "...", configure({ runner, pipelineBuilderFactory }) { @@ -152,9 +143,7 @@ const preset: MigrationPreset = { runner.register(myPipeline); } -}; - -export default preset; +}); ``` The `configure` callback also receives `container` — the DI container — if you diff --git a/templates/internal-project/README.md b/templates/internal-project/README.md index 1acaae7a..4a258a69 100644 --- a/templates/internal-project/README.md +++ b/templates/internal-project/README.md @@ -30,12 +30,12 @@ cp .env.example .env Run DDB transfer first, then OS. They are independent and don't share state. ```bash -# From the repo root — guided (wizard selects config): +# From the repo root — guided (wizard selects config + preset): yarn transfer # Or direct: -yarn transfer --config=./projects/{{PROJECT_NAME}}/ddb.transfer.config.ts -yarn transfer --config=./projects/{{PROJECT_NAME}}/os.transfer.config.ts +yarn transfer --config=./projects/{{PROJECT_NAME}}/config.ts --preset=v5-to-v6-ddb +yarn transfer --config=./projects/{{PROJECT_NAME}}/config.ts --preset=v5-to-v6-os ``` ## Config notes @@ -44,7 +44,7 @@ yarn transfer --config=./projects/{{PROJECT_NAME}}/os.transfer.config.ts The `target.auditLog` field is required. It defaults to `null`, which means audit log records are dropped during transfer. To transfer them, uncomment the line in -`ddb.transfer.config.ts` and set `TARGET_AUDIT_LOGS_TABLE` in `.env`: +`config.ts` and set `TARGET_AUDIT_LOGS_TABLE` in `.env`: ``` auditLog: { dynamodb: { tableName: fromEnv("TARGET_AUDIT_LOGS_TABLE") } } @@ -54,12 +54,8 @@ The audit log table must be different from the main target DDB table. ### `presetsDir` -Both configs have `presetsDir: "./presets"` pre-wired. Drop `.ts` preset files -into `presets/` and reference them by filename (without extension) in `pipeline.preset`: - -```ts -pipeline: { preset: "my-preset", presetsDir: "./presets" } -``` +The config has `presetsDir: "./presets"` pre-wired. Drop `.ts` preset files +into `presets/` and the wizard will discover them at runtime. Or pass `--preset my-preset` directly. ### `modelsDir`