Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
79 changes: 47 additions & 32 deletions AGENTS.md

Large diffs are not rendered by default.

39 changes: 32 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ A generic data-transfer tool for Webiny environments. Copies DynamoDB + S3 (or O
- **Prod → dev seeding** — zero transformers, just copy.
- **Custom transfers** — write your own transformers + pipelines + preset for bespoke data moves.

The package ships four built-in presets (`v5-to-v6-ddb`, `v5-to-v6-os`, `copy-ddb`, `copy-files`) plus full authoring support for your own.
The package ships five built-in presets (`v5-to-v6-ddb`, `v5-to-v6-os`, `copy-ddb`, `copy-os`, `copy-files`) plus full authoring support for your own.

## Quick start

Expand Down Expand Up @@ -120,6 +120,7 @@ export default createConfig({

- **`fromEnv(name)`** — required string; throws if unset or empty (empty string counts as missing).
- **`fromEnv(name, default)`** — falls back to `default` when unset.
- **`fromEnv(name, null)`** — returns `string | null`; returns `null` (instead of throwing) when unset. Use for optional config sections (e.g. `fromEnv("SOURCE_OS_TABLE", null)`).
- **`numberFromEnv(name, default?)`** — typed numeric; throws on parse failure (`SEGMENTS=four` fails immediately with a named error).

### Credentials
Expand Down Expand Up @@ -190,8 +191,8 @@ JSON models override DB-loaded models when both exist.
```typescript
tuning: {
flushEvery: numberFromEnv("FLUSH_EVERY", 500), // records per shard flush — bounds peak memory
ddb: { maxRetries: 3, initialBackoffMs: 100 },
s3: { concurrency: 10, maxRetries: 3, initialBackoffMs: 100 },
ddb: { maxRetries: 3, initialBackoffMs: 100, requestTimeoutMs: 5000 },
s3: { concurrency: 10, maxRetries: 3, initialBackoffMs: 100, requestTimeoutMs: 10000 },
os: { maxRetries: 3, retryScheduleMs: [5000, 10000, 20000], gzipConcurrency: 16 }
}
```
Expand All @@ -206,8 +207,9 @@ Add a `debug` block to your config to opt into diagnostics:

```typescript
debug: {
snapshot: true, // or: { dir: "./my-snapshot", compress: false }
logFile: true // or: "./my-transfer.log"
logLevel: "debug", // "debug" | "info" | "warn" | "error" (default "info"); also overridable via --log-level CLI flag
snapshot: true, // or: { dir: "./my-snapshot", compress: false }
logFile: true // or: "./my-transfer.log"
}
```

Expand Down Expand Up @@ -339,6 +341,7 @@ import {
| `isOsMailerSettings` | OS mailer settings records |
| `isAuditLogEntry` | Audit log entry records |
| `isMigrationRecord` | Migration tracking records |
| `isFormBuilderRecord` | Form Builder records (forms + submissions) |

Multiple `.filter()` calls on the same pipeline AND-compose — a record must pass all of them. Register more-specific filters before catch-alls.

Expand Down Expand Up @@ -390,8 +393,9 @@ Select by name when the wizard asks "Which preset do you want to run?":

- **`"v5-to-v6-ddb"`** — full Webiny v5 → v6 migration of the primary DynamoDB table (CMS entries, file manager, security, mailer, folder permissions, etc.).
- **`"v5-to-v6-os"`** — migration of the OpenSearch companion DynamoDB table. Run **after** `v5-to-v6-ddb`.
- **`"copy-ddb"`** — verbatim DynamoDB-only copy (no transformations).
- **`"copy-files"`** — verbatim DynamoDB + S3 file copy.
- **`"copy-ddb"`** — verbatim DynamoDB + S3 copy (no transformations).
- **`"copy-os"`** — verbatim OpenSearch companion table copy (no transformations).
- **`"copy-files"`** — S3-only file copy.

Custom presets placed in your `presetsDir` are listed alongside built-ins.

Expand Down Expand Up @@ -516,6 +520,27 @@ export default createTransferPreset({

**Requires:** pipeline must include `S3Processor`. **Do not use** when you need a new key path (e.g. the v5→v6 `tenants/<id>/files/<key>` migration) — use the internal `createMetadata` transformer instead.

#### `replaceFileUrls`

Rewrites file-manager URLs in CMS rich-text and long-text fields from the source domain to the target domain. Requires a `fileUrls` block in your config root:

```typescript
export default createConfig({
// ...source, target, pipeline as usual...
fileUrls: {
source: "https://old-cdn.example.com",
target: "https://new-cdn.example.com"
}
});
```

```typescript
import { replaceFileUrls } from "@webiny/data-transfer";

// In your preset:
.use(replaceFileUrls)
```

### Built-in processors

| Processor | Slice helpers | Notes |
Expand Down
15 changes: 7 additions & 8 deletions docs/aws-transfer-setup.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ region = us-east-1
### Run

```bash
SOURCE_PROFILE=source TARGET_PROFILE=target yarn start
SOURCE_PROFILE=source TARGET_PROFILE=target yarn transfer
```

### When local works
Expand Down Expand Up @@ -211,7 +211,7 @@ sudo dnf install -y nodejs git
git clone <repo-url>
cd <repo>
yarn install
SOURCE_ROLE_ARN="arn:aws:iam::<SOURCE_ACCOUNT_ID>:role/CrossAccountTransferReadRole" yarn start
SOURCE_ROLE_ARN="arn:aws:iam::<SOURCE_ACCOUNT_ID>:role/CrossAccountTransferReadRole" yarn transfer
```

### 6. Tear down when done
Expand All @@ -220,15 +220,14 @@ Terminate the instance. IAM roles can be kept for future transfers or deleted.

## Script Configuration

The script reads from environment variables:
The tool reads credentials and table/bucket names from a `config.ts` file, which loads values from `.env` via `fromEnv()`. The env var names below are **conventions used in the reference config** — the tool itself does not hardcode them. Wire them in your `config.ts` via `fromEnv("VAR_NAME")`.

| Variable | Description | Required when |
| --- | --- | --- |
| `SOURCE_PROFILE` | AWS profile name for source account | Running locally |
| `TARGET_PROFILE` | AWS profile name for target account | Running locally |
| `SOURCE_ROLE_ARN` | ARN of source role to assume | Running on EC2 |
| `AWS_REGION` | Target region | Always |
| `SOURCE_REGION` | Source region (if different) | If source ≠ target region |
| `SOURCE_PROFILE` | AWS profile name for source account | Running locally with `fromAwsProfile` |
| `TARGET_PROFILE` | AWS profile name for target account | Running locally with `fromAwsProfile` |
| `SOURCE_ROLE_ARN` | ARN of source role to assume | Running on EC2 with `fromAwsCredentialChain` |
| `SOURCE_REGION` / `TARGET_REGION` | AWS regions | Always (in config.ts) |

## Permissions

Expand Down
2 changes: 2 additions & 0 deletions docs/ddb-es-migration-reference.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
# DDB-ES Migration Pattern Reference

> **Legacy reference only.** This document describes patterns from an external Webiny migration codebase (`.sample/migration/ddb-es`) that is NOT part of this repository. The patterns informed the design of `@webiny/data-transfer` but the file paths, class names, and APIs below do not exist in this project. Refer to AGENTS.md for the current architecture.

Reference documentation for DynamoDB + Elasticsearch migration patterns from `.sample/migration/ddb-es`.

## Architecture Overview
Expand Down
162 changes: 38 additions & 124 deletions docs/pino-logger-implementation.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,141 +2,61 @@

## Overview

Replaced the simple console-based logger with Pino, a fast and structured JSON logger with pretty printing support.
The project uses Pino for structured JSON logging with pretty-printing support.

## Changes Made
## Current architecture

### 1. Dependencies Added
The logger is a DI-managed service, not a standalone utility:

```bash
yarn add pino pino-pretty
```
- **Abstraction**: `Logger` (`src/tools/Logger/abstractions/Logger.ts`)
- **Implementation**: `PinoLogger` (`src/tools/Logger/PinoLogger.ts`)
- **Feature**: `LoggerFeature.register(container, { logLevel, json, logFile? })`

### 2. Logger Utility (`src/utils/logger.ts`)
Consumers resolve via the DI container:

**New API**:
```typescript
import { createLogger, Logger } from "./utils/logger.ts";
```ts
const logger = container.resolve(Logger);
logger.info("message");
logger.child("[segment 0]"); // per-worker prefix
```

// Create logger with optional configuration
const logger = createLogger({
level: "info", // Optional: trace, debug, info, warn, error
msgPrefix: "[segment #0] " // Optional: prefix for all messages
});
There is no `createLogger` function — the logger is always resolved from the container.

// Usage
logger.info("Simple message");
logger.info("Message with data", { key: "value" });
logger.warn({ error }, "Warning message");
logger.error({ error }, "Error message");
```
### Log level

**Environment Variable**:
- `LOG_LEVEL`: Set log level (trace, debug, info, warn, error)
- Default: `info`
Configured via:

### 3. Run ID Generation (`src/cli.ts`)
1. `config.debug.logLevel` in the user's `config.ts` (`"debug" | "info" | "warn" | "error"`, default `"info"`)
2. `--log-level` CLI flag (overrides config)

Each migration run now gets a unique ID:
```typescript
const runId = String(Date.now());
logger.info(`Run ID: ${runId}`);
```
There is no `LOG_LEVEL` environment variable — log level flows through the config and CLI flag, not env vars.

The runId is:
- Generated once in the main process
- Passed to all worker processes
- Can be used for log file naming and correlation
### Per-worker prefixes

### 4. Segment Prefixes (`src/process-segment.ts`)
Each worker process creates a child logger with a segment prefix:

Each worker process logs with a segment prefix:
```typescript
const logger = createLogger({
msgPrefix: `[segment #${options.segment}] `
});
```ts
// src/commands/processSegment/handler.ts
const childLogger = logger.child("[segment 0]");
```

**Output Example**:
**Output example:**
```
[20:31:50.954] INFO: [segment #0] Starting segment 0 of 4 (0%)
[20:31:51.123] INFO: [segment #1] Starting segment 1 of 4 (25%)
[20:31:51.245] INFO: [segment #2] Starting segment 2 of 4 (50%)
[20:31:51.367] INFO: [segment #3] Starting segment 3 of 4 (75%)
```

### 5. Structured Logging

Pino uses structured logging with separate fields for metadata:

**Before** (console-based):
```typescript
logger.error("Migration failed:", error);
```

**After** (Pino):
```typescript
logger.error({ error }, "Migration failed");
```

This allows:
- Better log parsing and filtering
- Easier debugging with structured data
- JSON output for log aggregation systems
### Log files

## Benefits
Controlled by `config.debug.logFile`:

1. **Performance**: Pino is one of the fastest Node.js loggers
2. **Structured**: JSON-based logging for easy parsing
3. **Pretty Printing**: Human-readable output during development
4. **Prefixes**: Easy identification of which segment is logging
5. **Run Correlation**: Run ID allows tracking all logs from a single migration run
6. **Log Levels**: Configurable verbosity via environment variable
- `true` → each process writes to `.transfer/<runId>/logs/<orchestrator|segment-N>.log` (per-process files, no interleaving)
- String → all processes append to the given path
- Absent → no log file, stdout only

## Usage Examples
### Run ID

### Basic Logging
```typescript
logger.info("Processing records");
logger.warn("Found duplicate record");
logger.error({ recordId: "123" }, "Failed to process record");
```

### With Context
```typescript
logger.info({
recordsProcessed: 1000,
recordsMigrated: 950,
recordsSkipped: 50
}, "Batch completed");
```

### With Errors
```typescript
try {
await processRecord(record);
} catch (error) {
logger.error({ error, recordId: record.id }, "Failed to process record");
}
```

## Configuration

### Log Levels
Set via environment variable:
```bash
LOG_LEVEL=trace yarn transfer ...
LOG_LEVEL=debug yarn transfer ...
LOG_LEVEL=info yarn transfer ... # Default
LOG_LEVEL=warn yarn transfer ...
LOG_LEVEL=error yarn transfer ...
```

### Pretty Printing
The logger is configured with:
- Colorized output
- Hidden `pid` and `hostname` fields
- Human-readable timestamps (`HH:MM:ss.l`)
Generated in `src/commands/run/handler.ts` (not `src/cli.ts`). Passed to all worker processes.

## Gotchas

Expand Down Expand Up @@ -171,18 +91,12 @@ level filtering itself and the stream receives whatever pino decides to emit.
**Rule of thumb:** whenever you add an entry to `pino.multistream`, always set `level` explicitly.
The omitted-level default of `info` is almost never what you want.

## Future Enhancements
## Structured logging

Based on the DDB-ES migration reference, these features can be added:

1. **Log Files**: Write logs to temporary directory
```typescript
const logFilePath = path.join(
os.tmpdir(),
`v5-to-v6-migration-log-${runId}-${segment}.log`
);
```
Pino uses structured logging with separate fields for metadata:

2. **Statistics Tracking**: Write stats to JSON files per segment
3. **Log Aggregation**: Collect all segment logs at the end
4. **Metrics**: Track and report migration statistics
```ts
// Context objects first, message string last
logger.error({ error }, "Migration failed");
logger.info({ recordsProcessed: 1000, recordsSkipped: 50 }, "Batch completed");
```
3 changes: 2 additions & 1 deletion docs/webiny-di-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -211,12 +211,13 @@ import { PinoLogger } from "./PinoLogger.ts";
export interface LoggerOptions {
logLevel: "debug" | "info" | "warn" | "error";
json: boolean;
logFile?: string;
}

export const LoggerFeature = createFeature<LoggerOptions>({
name: "Core/LoggerFeature",
register(container, options) {
container.registerInstance(Logger, new PinoLoggerImpl(options));
container.registerInstance(Logger, new PinoLogger(options));
// or: container.register(PinoLogger).inSingletonScope();
}
});
Expand Down
14 changes: 9 additions & 5 deletions templates/.claude/skills/writing-data-transfer-preset/SKILL.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
---
name: writing-data-transfer-preset
description: Use when writing or editing a @webiny/data-transfer preset file (the one referenced by pipeline.preset in a config). Covers createTransferPreset, pipelineBuilderFactory.create, filter/use/hook composition, first-match-wins record dispatch, unmatched-record drop semantics, writing transformers (createDdbTransformer / createOsTransformer), registering pipelines with the runner, and the onEnd auto-put behavior for DdbProcessor / OsProcessor.
description: Use when writing or editing a @webiny/data-transfer preset file (selected at runtime by the wizard or --preset flag). Covers createTransferPreset, pipelineBuilderFactory.create, filter/use/hook composition, first-match-wins record dispatch, unmatched-record drop semantics, writing transformers (createDdbTransformer / createOsTransformer), registering pipelines with the runner, and the onEnd auto-put behavior for DdbProcessor / OsProcessor.
---

# Writing a `@webiny/data-transfer` preset

A preset is a `.ts` file that `export default`s a `createTransferPreset({...})` call. The config points at it via `pipeline.preset: "./presets/my-preset.ts"` (relative path) OR a built-in name.
A preset is a `.ts` file that `export default`s a `createTransferPreset({...})` call. It is selected at runtime by the `TransferWizard` (interactive prompt) or passed as `--preset <name>` on the CLI. Built-in presets live in `src/presets/`; user presets go in the project's `presetsDir`.

The preset's job: register one or more **pipelines** — each pipeline is a `{scanner, processors, filter, transformers}` quadruple that processes matching records.

Expand Down Expand Up @@ -168,7 +168,7 @@ export const stampMigratedAt = createDdbTransformer(
| Member | Description |
| -------------------------- | --------------------------------------------------------------------------------------- |
| `ctx.copyFile(src, tgt)` | Emit an S3 copy command (source bucket → target bucket). |
| `ctx.getFile(key)` | Read a file from the SOURCE bucket. Returns `Buffer`. |
| `ctx.getFile(key)` | Read a file from the SOURCE bucket. Returns `Buffer \| null`. |

**OsProcessor slice** — available when pipeline includes `OsProcessor`:

Expand Down Expand Up @@ -216,9 +216,10 @@ Use them for index preparation, schema migration, cache warm-up, etc.

## Built-in transformer stacks

`src/transformers/index.ts` defines two pre-built transformer arrays used by the built-in presets. They are **not exported from the `@webiny/data-transfer` public API** — custom presets that need them must import via the source path:
`src/transformers/index.ts` defines two pre-built transformer arrays used by the built-in presets. They are **not exported from the `@webiny/data-transfer` public API** — they are internal to the package. Custom presets running from within the monorepo can import via the source path, but installed-package users must replicate the stack inline:

```ts
// Only works from within the data-transfer monorepo, not from an installed package:
import { cmsEntryTransformers } from "../../src/transformers/index.ts";
import { osCmsEntryTransformers } from "../../src/transformers/index.ts";
```
Expand All @@ -230,10 +231,13 @@ import { osCmsEntryTransformers } from "../../src/transformers/index.ts";

## Built-in presets

Two built-in presets ship with the package (pass by name in `config.pipeline.preset`):
Five built-in presets ship with the package (selected at runtime via wizard or `--preset`):

- **`v5-to-v6-ddb`** — full Webiny v5→v6 DDB + S3 migration. Pipelines (registration order): AuditLogs (blackholed when no audit log table), AcoSearchRecordsPage (blackhole), ContentModelGroups, BackgroundTasks (blackhole), FileManagerSettings, FileManagerFiles, MailerSettings, SecurityGroups, SecurityTeams, CmsModels, FolderPermissions, CmsEntries. AuditLogs MUST be registered before AcoSearchRecordsPage and CmsEntries (audit log records share the acoSearchRecord modelId prefix).
- **`v5-to-v6-os`** — OpenSearch companion table migration. Pipelines: BackgroundTasks (blackhole), MailerSettings (blackhole), FileManagerFiles, CmsEntries. Uses `OsScanner` + `OsProcessor`. Registration order is load-bearing: blackhole pipelines BEFORE CmsEntries because background tasks and mailer settings ARE CMS entries in the OS table.
- **`copy-ddb`** — verbatim DDB + S3 copy (zero transformers).
- **`copy-os`** — verbatim OpenSearch copy (zero transformers).
- **`copy-files`** — S3-only file copy.

## `addLiveField` — querying siblings via cache

Expand Down
Loading
Loading