simstudioai
diff --git a/.agents/skills/memory-load-check/SKILL.md b/.agents/skills/memory-load-check/SKILL.md
Lines changed: 138 additions & 0 deletions
diff --git a/.agents/skills/validate-integration/SKILL.md b/.agents/skills/validate-integration/SKILL.md
Lines changed: 17 additions & 7 deletions
diff --git a/.claude/commands/add-enrichment.md b/.claude/commands/add-enrichment.md
Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,138 @@
+---
+name: memory-load-check
+description: Review PRs and diffs for unbounded memory loading, concurrency explosions, oversized payload materialization, and missing pagination or byte caps. Use when reviewing cleanup jobs, background jobs, data imports/exports, file parsing, API fan-out, workflow execution payloads, large arrays/files, or any change that reads many rows, files, responses, logs, or external API pages into process memory.
+---
+
+# Memory Load Check
+
+Use this skill when a PR or diff could load unbounded data into a Node/Bun process, especially in cron routes, background tasks, API routes, workflow execution, file parsing, cleanup jobs, migrations, import/export flows, and external API integrations.
+
+## Review Goal
+
+Prove each changed path has explicit bounds for:
+- rows held in memory
+- bytes held in memory
+- concurrent promises, DB queries, HTTP calls, storage operations, and jobs
+- number of pages, batches, chunks, retries, and retained intermediate objects
+
+If a bound holds only because of current production size or an assumption that the data is "probably small", treat it as a finding.
+
+## References
+
+Read these when doing a deeper pass:
+- Node.js streams/backpressure: https://nodejs.org/learn/modules/backpressuring-in-streams
+- Node.js stream usage: https://nodejs.org/en/learn/modules/how-to-use-streams
+- Keyset/cursor pagination over offset scans: https://blog.sequinstream.com/keyset-cursors-not-offsets-for-postgres-pagination/
+- Postgres pagination tradeoffs: https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/
+
+## Sim Helpers To Prefer
+
+- `apps/sim/lib/cleanup/batch-delete.ts`
+  - `chunkedBatchDelete`: bounded SELECT -> optional side effect -> DELETE loop.
+  - `batchDeleteByWorkspaceAndTimestamp`: common workspace/timestamp cleanup wrapper.
+  - `selectRowsByIdChunks`: chunks large ID sets and enforces an overall row cap.
+  - `chunkArray`: use only after the input set itself is already bounded.
+- `apps/sim/lib/core/utils/stream-limits.ts`
+  - `PayloadSizeLimitError`
+  - `assertKnownSizeWithinLimit`
+  - `assertContentLengthWithinLimit`
+  - `readStreamToBufferWithLimit`
+  - `readNodeStreamToBufferWithLimit`
+  - `readResponseToBufferWithLimit`
+  - `readResponseTextWithLimit`
+- Cleanup dispatcher pattern in `apps/sim/lib/billing/cleanup-dispatcher.ts`
+  - page active workspaces with `WHERE id > afterId ORDER BY id LIMIT N`
+  - dispatch concrete chunks (`workspaceIds`, retention, label) instead of one giant scope
+  - prefer Trigger.dev queue/concurrency keys when available
+  - execute inline fallback chunks sequentially, not with unbounded `Promise.all`
+- File parse route pattern in `apps/sim/app/api/files/parse/route.ts`
+  - cap downloads and parsed output separately
+  - preserve partial results when a later item exceeds the cap
+  - never read untrusted response bodies without a byte cap
+- Large workflow value payloads
+  - prefer durable references/manifests over inlining large arrays or files
+  - materialize refs only behind an explicit byte budget
+
+## Review Workflow
+
+1. Identify every changed data source:
+   - database queries
+   - storage lists/downloads/uploads
+   - external API pagination
+   - file reads and HTTP responses
+   - workflow logs, snapshots, payloads, arrays, and manifests
+   - queues, cron routes, and background jobs
+2. For each source, write down the maximum cardinality and maximum bytes. If the code does not enforce one, it is unbounded.
+3. Trace whether data is processed incrementally or accumulated:
+   - arrays from `select`, `findMany`, `Promise.all`, `map`, `filter`, `flatMap`
+   - maps/sets keyed by all users, workspaces, executions, files, or rows
+   - `Buffer.concat`, `response.arrayBuffer()`, `response.text()`, `JSON.stringify`, `JSON.parse`
+   - queues of promises or job payloads built before dispatch
+4. Check concurrency separately from memory:
+   - no `Promise.all(items.map(...))` unless `items` is already small and bounded
+   - use chunks, sequential loops, queue concurrency, or a concurrency limiter
+   - align concurrency with DB pool size, storage/API limits, and task queue semantics
+5. Verify SQL shape:
+   - every bulk query has `LIMIT`
+   - large pagination uses cursor/keyset style (`id > afterId`, timestamps plus unique ID), not deep `OFFSET`
+   - `IN (...)` lists are chunked
+   - side-effect rows selected before delete have per-batch and per-run caps
+6. Verify byte safety:
+   - check `Content-Length` when available
+   - stream with cumulative byte accounting
+   - cap both input bytes and expanded output bytes
+   - reject or reference oversized values before serializing large JSON responses
+7. Confirm failure behavior:
+   - exceeding a cap should stop before loading more data
+   - partial successful work should be preserved when the API contract expects it
+   - retries should not duplicate huge in-memory state
+   - cleanup jobs should make progress over future runs instead of widening one run
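As an illustration of steps 4, 5, and 7 above, the following TypeScript sketch pages by an id cursor with a per-batch cap and a per-run cap, and handles rows sequentially. The names `drainByCursor`, `FetchPage`, `Row`, and `DrainOptions` are hypothetical and invented for this example; they are not Sim helpers. `fetchPage` stands in for a query shaped like `WHERE id > afterId ORDER BY id LIMIT N`.

```typescript
// Hypothetical row shape and page fetcher, used only for illustration.
interface Row {
  id: string
}

type FetchPage = (afterId: string | null, limit: number) => Promise<Row[]>

interface DrainOptions {
  pageSize: number // per-batch row cap
  maxRowsPerRun: number // per-run row cap
  startAfter?: string | null // cursor saved by a previous run, if any
}

interface DrainResult {
  processed: number
  cursor: string | null
  done: boolean
}

/**
 * Walks rows in id order one page at a time and stops at the per-run cap.
 * Only the current page is held in memory; the returned cursor lets a later
 * run resume where this one stopped.
 */
async function drainByCursor(
  fetchPage: FetchPage,
  handleRow: (row: Row) => Promise<void>,
  options: DrainOptions
): Promise<DrainResult> {
  const { pageSize, maxRowsPerRun } = options
  let cursor: string | null = options.startAfter ?? null
  let processed = 0

  while (processed < maxRowsPerRun) {
    // Never request more rows than the remaining per-run budget.
    const limit = Math.min(pageSize, maxRowsPerRun - processed)
    const page = await fetchPage(cursor, limit)
    if (page.length === 0) return { processed, cursor, done: true }

    for (const row of page) {
      await handleRow(row) // sequential on purpose: no Promise.all fan-out
      processed += 1
      cursor = row.id
    }

    // A short page means the source is exhausted.
    if (page.length < limit) return { processed, cursor, done: true }
  }

  // Cap reached: report progress so the backlog drains across runs.
  return { processed, cursor, done: false }
}
```

Because the function returns the last cursor, a run that hits `maxRowsPerRun` can persist it and continue on the next run instead of widening a single run.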
+
+## Red Flags
+
+- loads all active workspaces, users, executions, logs, files, messages, or subscriptions before filtering
+- builds a full `Map` or `Set` for a platform-wide scope
+- uses `Promise.all` over rows from an unbounded query
+- fetches all pages from an external API before processing
+- reads an entire file, HTTP response, or stream without a max byte budget
+- checks size only after `Buffer.concat`, `arrayBuffer`, `text`, `JSON.parse`, or parse expansion
+- chunks only after loading the complete dataset
+- paginates with unbounded/deep `OFFSET` on a mutable or large table
+- creates one queue job per row without batching or a queue-level concurrency key
+- accumulates per-row errors/results with no maximum
+- adds a cache, singleton, or module-level collection without eviction or size limits
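To make the "checks size only after `Buffer.concat`" red flag concrete, here is a minimal TypeScript sketch of cumulative byte accounting that rejects a stream before the oversized chunk is retained. It is a generic illustration, not the implementation of the Sim stream-limit helpers listed earlier (which should be preferred in real code); `readChunksWithCap` and `ByteCapExceededError` are hypothetical names.

```typescript
// Hypothetical error type for this sketch; Sim's own PayloadSizeLimitError may differ.
class ByteCapExceededError extends Error {
  limit: number
  observed: number

  constructor(limit: number, observed: number) {
    super(`Payload exceeded ${limit} bytes (observed at least ${observed})`)
    this.name = 'ByteCapExceededError'
    this.limit = limit
    this.observed = observed
  }
}

/**
 * Reads an async stream of chunks into one Uint8Array, enforcing the cap
 * while reading rather than after the full body has been buffered.
 */
async function readChunksWithCap(
  source: AsyncIterable<Uint8Array>,
  maxBytes: number
): Promise<Uint8Array> {
  const chunks: Uint8Array[] = []
  let total = 0

  for await (const chunk of source) {
    total += chunk.byteLength
    // Throwing inside `for await` also closes the iterator, so the
    // producer is not asked for further chunks.
    if (total > maxBytes) throw new ByteCapExceededError(maxBytes, total)
    chunks.push(chunk)
  }

  // Safe to concatenate: total is known to be within the cap.
  const out = new Uint8Array(total)
  let offset = 0
  for (const chunk of chunks) {
    out.set(chunk, offset)
    offset += chunk.byteLength
  }
  return out
}
```

The key design point is ordering: the size check runs before the chunk is stored, so memory held by the reader never exceeds `maxBytes` plus one in-flight chunk.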
+
+## Preferred Fixes
+
+- Move filters into SQL/API requests and select only needed columns.
+- Replace full-table loads with cursor/keyset pagination and a deterministic order.
+- Process one page/batch at a time; do not keep previous pages unless needed.
+- Add per-batch and per-run row caps so long backlogs drain across repeated jobs.
+- Split large ID lists with `selectRowsByIdChunks` or `chunkArray` after bounding the source.
+- Use `chunkedBatchDelete` for cleanup loops with row side effects.
+- Use stream-limit helpers for file/HTTP/body reads.
+- Store large workflow values as refs/manifests and materialize only within a caller budget.
+- Replace unbounded `Promise.all` with sequential chunk loops, queue concurrency, or a small limiter.
+- Include tests that prove caps stop work early and partial results or progress are preserved.
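One way to replace an unbounded `Promise.all(items.map(...))` is a small limiter that runs a fixed number of lanes over an already-bounded input. The sketch below is a generic illustration under that assumption; `mapWithConcurrency` is a hypothetical name, not an existing Sim utility, and a queue-level concurrency key may be the better fit where one is available.

```typescript
/**
 * Maps `items` through `worker` with at most `concurrency` calls in flight.
 * The input must already be bounded; this limits fan-out, not input size.
 */
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  concurrency: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError('concurrency must be a positive integer')
  }

  const results = new Array<R>(items.length)
  let next = 0
  let failed = false

  // Each lane pulls the next unclaimed index until the input is exhausted
  // or another lane has failed.
  const runLane = async (): Promise<void> => {
    while (!failed && next < items.length) {
      const index = next
      next += 1
      try {
        results[index] = await worker(items[index], index)
      } catch (error) {
        failed = true // stop other lanes from claiming more work
        throw error
      }
    }
  }

  // Promise.all here is over a fixed number of lanes, not over the items.
  const laneCount = Math.min(concurrency, items.length)
  await Promise.all(Array.from({ length: laneCount }, () => runLane()))
  return results
}
```

When choosing `concurrency`, the guidance in the Review Workflow still applies: keep it at or below what the DB pool, storage API, or external rate limit can absorb.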
+
+## Findings Format
+
+Lead with concrete findings, ordered by risk:
+
+```markdown
+## Findings
+
+- **P1 Unbounded workspace load in cleanup dispatch** (`path/to/file.ts`)
+  The new path calls `select().from(workspace)` without a limit, then builds maps for every row before dispatch. In production this scales with all active workspaces and can exhaust the app process. Page by `workspace.id` with a fixed limit and dispatch bounded chunks.
+
+## Good Signals
+
+- Uses `readResponseToBufferWithLimit` for external downloads.
+- Inline fallback processes chunks sequentially.
+
+## Residual Risk
+
+- The row cap is explicit, but no test currently proves the loop stops at the cap.
+```
+
+Only say "good to go" when every changed source has explicit row, byte, and concurrency bounds or the boundedness is proven by a stable invariant.
@@ -102,8 +102,8 @@ For **every** tool file, check:
 - [ ] No fields are missing that the API provides and users would commonly need
 - [ ] No phantom fields defined that the API doesn't return
 - [ ] `optional: true` is set on fields that may not exist in all responses
-- [ ] When using `type: 'json'` and the shape is known, `properties` defines the inner fields
-- [ ] When using `type: 'array'`, `items` defines the item structure with `properties`
+- [ ] When using `type: 'json'` and the shape is known, `properties` defines the inner fields (tool outputs only — block outputs do not support `properties`)
+- [ ] When using `type: 'array'`, `items` defines the item structure with `properties` (tool outputs only)
 - [ ] Field descriptions are accurate and helpful
 
 ### Types (types.ts)
@@ -190,9 +190,8 @@ For **each tool** in `tools.access`:
 ### Block Outputs
 - [ ] Outputs cover the key fields returned by ALL tools (not just one operation)
 - [ ] Output types are correct (`'string'`, `'number'`, `'boolean'`, `'json'`)
-- [ ] `type: 'json'` outputs either:
-  - Describe inner fields in the description string (GOOD): `'User profile (id, name, username, bio)'`
-  - Use nested output definitions (BEST): `{ id: { type: 'string' }, name: { type: 'string' } }`
+- [ ] `type: 'json'` outputs describe inner fields in the description string: `'User profile (id, name, username, bio)'` or `'[{address, status, type}]'` for arrays
+- [ ] **Do NOT add a `properties: {...}` field on block outputs.** Block-level `OutputFieldDefinition` (from `@sim/workflow-types/blocks`) only accepts `{ type, description?, condition?, hiddenFromDisplay? }`. Nested `properties` is a tool-level construct (`OutputProperty`) — adding it to a block output will fail TypeScript at build time
 - [ ] No opaque `type: 'json'` with vague descriptions like `'Response data'`
 - [ ] Outputs that only appear for certain operations use `condition` if supported, or document which operations return them
 
@@ -232,13 +231,23 @@ If any tools support pagination:
 - [ ] Pagination response fields (`nextToken`, `cursor`, etc.) are included in tool outputs
 - [ ] Pagination subBlocks are set to `mode: 'advanced'`
 
-## Step 7: Validate Error Handling
+## Step 7: Validate Memory Load Safety
+
+If any tool lists, searches, exports, imports, downloads, uploads, paginates, batches, transforms arrays, or reads file/HTTP bodies, read `.agents/skills/memory-load-check/SKILL.md` and apply it to the integration.
+
+- [ ] List/search tools expose API limits and do not auto-fetch every page into memory
+- [ ] Transform logic does not build unbounded arrays, maps, sets, or `Promise.all` fan-outs
+- [ ] File and HTTP body reads use explicit byte caps or existing stream-limit helpers
+- [ ] Large result payloads are summarized, paginated, referenced, or capped rather than raw-dumped
+- [ ] Pagination and download tests cover caps, early stop behavior, or partial-result preservation when relevant
+
+## Step 8: Validate Error Handling
 
 - [ ] `transformResponse` checks for error conditions before accessing data
 - [ ] Error responses include meaningful messages (not just generic "failed")
 - [ ] HTTP error status codes are handled (check `response.ok` or status codes)
 
-## Step 8: Report and Fix
+## Step 9: Report and Fix
 
 ### Report Format
 
@@ -297,6 +306,7 @@ After fixing, confirm:
 - [ ] Validated OAuth scopes use centralized utilities (getScopesForService, getCanonicalScopesForProvider) — no hardcoded arrays
 - [ ] Validated scope descriptions exist in `SCOPE_DESCRIPTIONS` within `lib/oauth/utils.ts` for all scopes
 - [ ] Validated pagination consistency across tools and block
+- [ ] Validated memory load safety using `.agents/skills/memory-load-check/SKILL.md` when tools list/search/download/import/export/batch data
 - [ ] Validated error handling (error checks, meaningful messages)
 - [ ] Validated registry entries (tools and block, alphabetical, correct imports)
 - [ ] Reported all issues grouped by severity
 
@@ -0,0 +1,142 @@
+---
+description: Add a code-defined table enrichment (registry entry) backed by a provider cascade, ensuring each provider tool has hosted-key support
+argument-hint: <enrichment-name>
+---
+
+# Adding a Table Enrichment
+
+Enrichments are code-defined entries in `apps/sim/enrichments/` that run **directly per table row** (no workflow). Each enrichment declares inputs, outputs, and an ordered list of **providers**; the cascade runner tries providers in order and the first non-empty result fills the cell. Each provider calls one existing Sim tool via `executeTool`, which injects the workspace's BYOK key or a **hosted key** and bills usage automatically.
+
+Because enrichments run on Sim's hosted keys by default, **every provider tool you reference must have hosted-key support** — otherwise it can only run when the workspace brings its own key. This command makes that check a required step.
+
+## Overview
+
+| Step | What | Where |
+|------|------|-------|
+| 1 | Pick the data-source tool(s) for each output | `tools/{service}/` + `tools/registry.ts` |
+| 2 | **Verify each tool has `hosting`; if not, run `/add-hosted-key`** | `tools/{service}/{action}.ts` |
+| 3 | Write the enrichment definition | `enrichments/{name}/{name}.ts` + `index.ts` |
+| 4 | Register it | `enrichments/registry.ts` |
+| 5 | Verify | tsc / biome / manual run |
+
+## Architecture (what you're plugging into)
+
+- **`enrichments/types.ts`** — `EnrichmentConfig { id, name, description, icon, inputs, outputs, providers }` and `EnrichmentProvider { id, label, toolId, buildParams, mapOutput }`. Providers are **plain data** (no `@/tools` import) so the catalog stays client-safe.
+- **`enrichments/providers.ts`** — `toolProvider(...)` (typed passthrough) plus shared input helpers: `str(v)`, `normalizeDomain(v)`, `firstNonEmpty(arr)`, `splitName(fullName)`.
+- **`enrichments/run.ts`** — the server-only cascade runner. Calls `executeTool(provider.toolId, { ...params, _context: { workspaceId } })`, accumulates hosted-key cost, returns the first non-empty mapped result. **You do not edit this** — it works for any registry entry.
+- **`enrichments/registry.ts`** — `ENRICHMENT_REGISTRY` / `ALL_ENRICHMENTS` / `getEnrichment`. Register new entries here.
+
+Outputs automatically become table columns; billing, the catalog/sidebar UI, the column meta-header icon, and per-row execution all work with no extra wiring.
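The cascade behavior described above (try providers in order, skip a provider whose `buildParams` returns `null`, stop at the first non-empty mapped result) can be sketched as follows. This is a simplified illustration, not the code in `enrichments/run.ts`: `SketchProvider`, `ExecuteToolFn`, and `runCascadeSketch` are hypothetical names, the executor is passed in as a stand-in for `executeTool`, and cost accumulation and error-cell handling are omitted. Treating an unsuccessful tool call as a fall-through is an assumption.

```typescript
type Values = Record<string, unknown>

// Simplified, hypothetical provider shape based on the fields named above.
interface SketchProvider {
  id: string
  toolId: string
  buildParams: (inputs: Values) => Values | null
  mapOutput: (output: Values) => Values | null
}

// Stand-in for the real tool executor; the result shape is an assumption.
type ExecuteToolFn = (
  toolId: string,
  params: Values
) => Promise<{ success: boolean; output: Values }>

// A mapped result counts as a hit only if it carries at least one real value.
function hasValue(values: Values | null): values is Values {
  if (values === null) return false
  return Object.values(values).some((v) => v !== undefined && v !== null && v !== '')
}

async function runCascadeSketch(
  providers: readonly SketchProvider[],
  inputs: Values,
  executeTool: ExecuteToolFn
): Promise<{ providerId: string; values: Values } | null> {
  for (const provider of providers) {
    const params = provider.buildParams(inputs)
    if (params === null) continue // not enough inputs: skip without calling the tool

    const result = await executeTool(provider.toolId, params)
    if (!result.success) continue // assumption: failures fall through

    const values = provider.mapOutput(result.output)
    if (hasValue(values)) return { providerId: provider.id, values }
  }
  return null // every provider was skipped or missed
}
```

Because the loop returns on the first hit, later providers are never called for that row, which is why provider order affects both cost and hit rate.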
+
+## Step 1: Pick the data-source tool(s)
+
+For each output the enrichment produces, decide which existing tool provides it. Look up the service's API and the tool in `apps/sim/tools/{service}/` (e.g. `hunter_email_finder`, `pdl_person_enrich`, `pdl_company_enrich`). Confirm:
+
+- The tool id is registered in `apps/sim/tools/registry.ts`.
+- Its `params` accept what you can derive from table columns (read the tool's `params`).
+- Its `outputs` / `transformResponse` actually expose the field you need (read the real output shape — don't assume).
+
+Order providers **cheapest / most-likely-to-hit first**; the cascade stops at the first non-empty result. Apollo / LinkedIn are not hosted-safe (ToS) — don't use them.
+
+## Step 2: Verify hosted-key support — chain to `/add-hosted-key` if missing
+
+**This is the required gate.** For every tool a provider calls, open `apps/sim/tools/{service}/{action}.ts` and check for a `hosting` block:
+
+```typescript
+hosting: {
+  envKeyPrefix: 'SERVICE_API_KEY',
+  apiKeyParam: 'apiKey',
+  byokProviderId: 'service',
+  pricing: { /* ... */ },
+  rateLimit: { /* ... */ },
+}
+```
+
+- **If `hosting` is present** — good. Note the `envKeyPrefix`; the deployment needs `{PREFIX}_COUNT` + `{PREFIX}_1..N` env vars set for the hosted key to actually resolve at runtime (ops concern, not code). If those env vars aren't set in the target environment, the provider will only run with a workspace BYOK key.
+- **If `hosting` is absent** — the tool can't use a Sim-provided key, so the enrichment would silently produce blank cells on hosted Sim. **Stop and run `/add-hosted-key <service>`** to add hosted-key support to that tool first, then come back. Do this for every provider tool that lacks it.
+
+Why it matters: the cascade runner only bills (and only reads `output.cost.total`) when `executeTool` injected a hosted key, which requires the tool's `hosting` config. No `hosting` → no hosted key → the enrichment depends entirely on per-workspace BYOK.
+
+## Step 3: Write the enrichment definition
+
+Create `apps/sim/enrichments/{name}/{name}.ts` and a barrel `index.ts`. Mirror the existing entries (`work-email`, `phone-number`, `company-domain`, `company-info`).
+
+```typescript
+import { SomeIcon } from 'lucide-react'
+import { filterUndefined } from '@sim/utils/object'
+import { normalizeDomain, splitName, str, toolProvider } from '@/enrichments/providers'
+import type { EnrichmentConfig } from '@/enrichments/types'
+
+export const myEnrichment: EnrichmentConfig = {
+  id: 'my-enrichment',
+  name: 'My Enrichment',
+  description: 'One concise sentence describing what it finds.',
+  icon: SomeIcon,
+  inputs: [
+    // Person enrichments take a single canonical `fullName` (Clay-style);
+    // split it with splitName() for tools that need first/last.
+    { id: 'fullName', name: 'Full name', type: 'string', required: true },
+    { id: 'companyDomain', name: 'Company domain', type: 'string' },
+  ],
+  outputs: [{ id: 'value', name: 'value', type: 'string' }],
+  providers: [
+    toolProvider({
+      id: 'provider-a',
+      label: 'Provider A',
+      toolId: 'service_action', // must have `hosting` (Step 2)
+      buildParams: (inputs) => {
+        // Return null when there aren't enough inputs → cascade skips this provider.
+        const name = splitName(inputs.fullName)
+        const domain = normalizeDomain(inputs.companyDomain)
+        if (!name || !domain) return null
+        return { domain, first_name: name.firstName, last_name: name.lastName }
+      },
+      mapOutput: (output) => {
+        // Return { [outputId]: value } on a hit, or null to fall through.
+        const value = str(output.value)
+        return value ? { value } : null
+      },
+    }),
+    // ...additional fallback providers, in priority order.
+  ],
+}
+```
+
+```typescript
+// apps/sim/enrichments/{name}/index.ts
+export { myEnrichment } from './my-enrichment'
+```
+
+Rules:
+- Keep the file **client-safe**: import only `lucide-react`, `@sim/utils/*`, `@/enrichments/providers`, and the types. **Never import `@/tools`** here — the runner does the tool call.
+- `buildParams` returns `null` when inputs are insufficient (provider skipped). `mapOutput` returns `null`/empty for a miss (falls through). Use `filterUndefined` when assembling optional tool params; coerce numbers explicitly (don't pass `''` to number outputs).
+- Output `id`s are the keys `mapOutput` returns; output `name`s are the default column names (the user can rename them in the config).
+
+## Step 4: Register it
+
+In `apps/sim/enrichments/registry.ts`, import and add the entry (catalog order is registration order):
+
+```typescript
+import { myEnrichment } from '@/enrichments/my-enrichment'
+
+export const ENRICHMENT_REGISTRY: EnrichmentRegistry = {
+  // ...existing
+  [myEnrichment.id]: myEnrichment,
+}
+```
+
+## Step 5: Verify
+
+1. Run `bunx tsc --noEmit` from `apps/sim` (with `NODE_OPTIONS=--max-old-space-size=8192` set), then run `bunx biome check` on the changed files.
+2. In a table → **+ New column → Enrichments** → pick the new enrichment, map its inputs to columns, name the output column(s), Save. Confirm it appears in the catalog with its icon/description.
+3. With hosted keys (or a workspace BYOK key) configured for each provider's service, run a row and confirm the cell fills; the dev-server log shows `Enrichment hit { provider }`. A row whose providers all miss completes blank; a row where every provider errored shows an error cell.
+
+## Checklist
+
+- [ ] Each output mapped to a real tool field (verified against the tool's `params`/`outputs`)
+- [ ] **Every provider tool has a `hosting` block — ran `/add-hosted-key` for any that didn't**
+- [ ] Providers ordered cheapest / most-likely-first; Apollo/LinkedIn not used
+- [ ] Enrichment file is client-safe (no `@/tools` import); uses `toolProvider` + shared helpers
+- [ ] `buildParams` returns `null` on insufficient inputs; `mapOutput` returns `null` on a miss
+- [ ] Registered in `enrichments/registry.ts`
+- [ ] tsc + biome clean; created and ran the column end-to-end