Commit 0403765

Merge remote-tracking branch 'origin/staging' into dev
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
# apps/sim/app/api/logs/route.ts
# apps/sim/app/api/mcp/copilot/route.ts
# apps/sim/app/api/v1/copilot/chat/route.ts
# apps/sim/lib/copilot/chat/payload.test.ts
# apps/sim/lib/copilot/chat/payload.ts
# apps/sim/lib/copilot/generated/tool-catalog-v1.ts
# apps/sim/lib/copilot/generated/tool-schemas-v1.ts
# apps/sim/lib/copilot/tools/server/workflow/get-execution-summary.ts
# apps/sim/lib/copilot/tools/server/workflow/get-workflow-logs.ts
# apps/sim/providers/models.ts
2 parents 84f5377 + b4787dd commit 0403765

1,175 files changed

Lines changed: 315629 additions & 22635 deletions

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
1+
---
2+
name: memory-load-check
3+
description: Review PRs and diffs for unbounded memory loading, concurrency explosions, oversized payload materialization, and missing pagination or byte caps. Use when reviewing cleanup jobs, background jobs, data imports/exports, file parsing, API fan-out, workflow execution payloads, large arrays/files, or any change that reads many rows, files, responses, logs, or external API pages into process memory.
4+
---
5+
6+
# Memory Load Check
7+
8+
Use this skill when a PR or diff could load unbounded data into a Node/Bun process, especially in cron routes, background tasks, API routes, workflow execution, file parsing, cleanup jobs, migrations, import/export flows, and external API integrations.
9+
10+
## Review Goal
11+
12+
Prove each changed path has explicit bounds for:
13+
- rows held in memory
14+
- bytes held in memory
15+
- concurrent promises, DB queries, HTTP calls, storage operations, and jobs
16+
- number of pages, batches, chunks, retries, and retained intermediate objects
17+
18+
If any bound depends only on current production size or "probably small" data, treat it as a finding.
19+
20+
## References
21+
22+
Read these when doing a deeper pass:
23+
- Node.js streams/backpressure: https://nodejs.org/learn/modules/backpressuring-in-streams
24+
- Node.js stream usage: https://nodejs.org/en/learn/modules/how-to-use-streams
25+
- Keyset/cursor pagination over offset scans: https://blog.sequinstream.com/keyset-cursors-not-offsets-for-postgres-pagination/
26+
- Postgres pagination tradeoffs: https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/
27+
28+
## Sim Helpers To Prefer
29+
30+
- `apps/sim/lib/cleanup/batch-delete.ts`
31+
- `chunkedBatchDelete`: bounded SELECT -> optional side effect -> DELETE loop.
32+
- `batchDeleteByWorkspaceAndTimestamp`: common workspace/timestamp cleanup wrapper.
33+
- `selectRowsByIdChunks`: chunks large ID sets and enforces an overall row cap.
34+
- `chunkArray`: use only after the input set itself is already bounded.
35+
- `apps/sim/lib/core/utils/stream-limits.ts`
36+
- `PayloadSizeLimitError`
37+
- `assertKnownSizeWithinLimit`
38+
- `assertContentLengthWithinLimit`
39+
- `readStreamToBufferWithLimit`
40+
- `readNodeStreamToBufferWithLimit`
41+
- `readResponseToBufferWithLimit`
42+
- `readResponseTextWithLimit`
43+
- Cleanup dispatcher pattern in `apps/sim/lib/billing/cleanup-dispatcher.ts`
44+
- page active workspaces with `WHERE id > afterId ORDER BY id LIMIT N`
45+
- dispatch concrete chunks (`workspaceIds`, retention, label) instead of one giant scope
46+
- prefer Trigger.dev queue/concurrency keys when available
47+
- execute inline fallback chunks sequentially, not with unbounded `Promise.all`
48+
- File parse route pattern in `apps/sim/app/api/files/parse/route.ts`
49+
- cap downloads and parsed output separately
50+
- preserve partial results when a later item exceeds the cap
51+
- never read untrusted response bodies without a byte cap
52+
- Large workflow value payloads
53+
- prefer durable references/manifests over inlining large arrays or files
54+
- materialize refs only behind an explicit byte budget
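
The byte-cap idea behind the stream-limit helpers can be sketched as follows. This is a minimal illustration, not the code in `stream-limits.ts`: `readChunksWithLimit` and `ByteLimitExceededError` are hypothetical names, and the real helpers may have different signatures. The point it shows is that the cumulative size check runs before a chunk is retained, so the reader stops pulling data as soon as the budget is exceeded.

```typescript
// Hypothetical stand-ins for illustration; the real Sim helpers may differ.
class ByteLimitExceededError extends Error {
  limit: number
  observed: number

  constructor(limit: number, observed: number) {
    super(`Payload exceeded ${limit} bytes (saw at least ${observed})`)
    this.name = 'ByteLimitExceededError'
    this.limit = limit
    this.observed = observed
  }
}

// Reads an async iterable of byte chunks into one buffer,
// failing as soon as the running total passes maxBytes.
async function readChunksWithLimit(
  source: AsyncIterable<Uint8Array>,
  maxBytes: number
): Promise<Uint8Array> {
  const chunks: Uint8Array[] = []
  let total = 0

  for await (const chunk of source) {
    total += chunk.byteLength
    // Check before retaining the chunk: nothing past the cap is accumulated,
    // and throwing here stops the loop from requesting more data.
    if (total > maxBytes) {
      throw new ByteLimitExceededError(maxBytes, total)
    }
    chunks.push(chunk)
  }

  // Single final copy; total is known to be within the cap.
  const out = new Uint8Array(total)
  let offset = 0
  for (const chunk of chunks) {
    out.set(chunk, offset)
    offset += chunk.byteLength
  }
  return out
}
```

A size check placed after the final copy would report the same error but only after the whole body had already been held in memory, which is the pattern the Red Flags section warns about.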
55+
56+
## Review Workflow
57+
58+
1. Identify every changed data source:
59+
- database queries
60+
- storage lists/downloads/uploads
61+
- external API pagination
62+
- file reads and HTTP responses
63+
- workflow logs, snapshots, payloads, arrays, and manifests
64+
- queues, cron routes, and background jobs
65+
2. For each source, write down the maximum cardinality and maximum bytes. If the code does not enforce one, it is unbounded.
66+
3. Trace whether data is processed incrementally or accumulated:
67+
- arrays from `select`, `findMany`, `Promise.all`, `map`, `filter`, `flatMap`
68+
- maps/sets keyed by all users, workspaces, executions, files, or rows
69+
- `Buffer.concat`, `response.arrayBuffer()`, `response.text()`, `JSON.stringify`, `JSON.parse`
70+
- queues of promises or job payloads built before dispatch
71+
4. Check concurrency separately from memory:
72+
- no `Promise.all(items.map(...))` unless `items` is already small and bounded
73+
- use chunks, sequential loops, queue concurrency, or a concurrency limiter
74+
- align concurrency with DB pool size, storage/API limits, and task queue semantics
75+
5. Verify SQL shape:
76+
- every bulk query has `LIMIT`
77+
- large pagination uses cursor/keyset style (`id > afterId`, timestamps plus unique ID), not deep `OFFSET`
78+
- `IN (...)` lists are chunked
79+
- side-effect rows selected before delete have per-batch and per-run caps
80+
6. Verify byte safety:
81+
- check `Content-Length` when available
82+
- stream with cumulative byte accounting
83+
- cap both input bytes and expanded output bytes
84+
- reject or reference oversized values before serializing large JSON responses
85+
7. Confirm failure behavior:
86+
- exceeding a cap should stop before loading more data
87+
- partial successful work should be preserved when the API contract expects it
88+
- retries should not duplicate huge in-memory state
89+
- cleanup jobs should make progress over future runs instead of widening one run
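
Step 4 above can be illustrated with a small concurrency limiter. This is a sketch under assumptions: `mapWithConcurrency` is a hypothetical helper, not something this document says exists in the repo, and it presumes `items` has already been bounded by an earlier step (it limits in-flight work, not the size of the input array).

```typescript
// Hypothetical helper: runs fn over items with at most `limit` calls in flight.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer')
  }

  const results = new Array<R>(items.length)
  let next = 0

  // Each worker claims the next unprocessed index until none remain.
  // Claiming is synchronous, so two workers never take the same index.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++
      results[index] = await fn(items[index], index)
    }
  }

  const workerCount = Math.min(limit, items.length)
  const workers = Array.from({ length: workerCount }, () => worker())
  // Promise.all here is over a fixed, small number of workers, not over items.
  await Promise.all(workers)
  return results
}
```

Compared with `Promise.all(items.map(...))`, the number of concurrent DB queries or HTTP calls is fixed by `limit`, which makes it possible to align it with a pool size or an upstream rate limit.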
90+
91+
## Red Flags
92+
93+
- loads all active workspaces, users, executions, logs, files, messages, or subscriptions before filtering
94+
- builds a full `Map` or `Set` for a platform-wide scope
95+
- uses `Promise.all` over rows from an unbounded query
96+
- fetches all pages from an external API before processing
97+
- reads an entire file, HTTP response, or stream without a max byte budget
98+
- checks size only after `Buffer.concat`, `arrayBuffer`, `text`, `JSON.parse`, or parse expansion
99+
- chunks only after loading the complete dataset
100+
- paginates with unbounded/deep `OFFSET` on a mutable or large table
101+
- creates one queue job per row without batching or a queue-level concurrency key
102+
- accumulates per-row errors/results with no maximum
103+
- adds a cache, singleton, or module-level collection without eviction or size limits
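
For the last red flag, one way to give a module-level collection a size limit is a small least-recently-used cache. The sketch below is illustrative only; `BoundedCache` is a hypothetical class, and it relies on the fact that a JavaScript `Map` iterates keys in insertion order.

```typescript
// Hypothetical size-capped cache with least-recently-used eviction.
class BoundedCache<K, V> {
  private readonly entries = new Map<K, V>()
  private readonly maxEntries: number

  constructor(maxEntries: number) {
    if (!Number.isInteger(maxEntries) || maxEntries < 1) {
      throw new RangeError('maxEntries must be a positive integer')
    }
    this.maxEntries = maxEntries
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined
    const value = this.entries.get(key) as V
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key)
    this.entries.set(key, value)
    return value
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key)
    } else if (this.entries.size >= this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next()
      if (!oldest.done) this.entries.delete(oldest.value)
    }
    this.entries.set(key, value)
  }

  get size(): number {
    return this.entries.size
  }
}
```

This caps the entry count only. If values can be large, a reviewer would still want a byte budget on what is stored.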
104+
105+
## Preferred Fixes
106+
107+
- Move filters into SQL/API requests and select only needed columns.
108+
- Replace full-table loads with cursor/keyset pagination and a deterministic order.
109+
- Process one page/batch at a time; do not keep previous pages unless needed.
110+
- Add per-batch and per-run row caps so long backlogs drain across repeated jobs.
111+
- Split large ID lists with `selectRowsByIdChunks` or `chunkArray` after bounding the source.
112+
- Use `chunkedBatchDelete` for cleanup loops with row side effects.
113+
- Use stream-limit helpers for file/HTTP/body reads.
114+
- Store large workflow values as refs/manifests and materialize only within a caller budget.
115+
- Replace unbounded `Promise.all` with sequential chunk loops, queue concurrency, or a small limiter.
116+
- Include tests that prove caps stop work early and partial results or progress are preserved.
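
Several of the fixes above (keyset pagination, one batch at a time, per-batch and per-run caps) combine into one loop shape. The following is a sketch, not the repo's `chunkedBatchDelete`: `drainByKeyset`, `fetchPage`, and `handleBatch` are hypothetical names, and `fetchPage` is assumed to behave like `WHERE id > afterId ORDER BY id LIMIT limit`.

```typescript
interface Row {
  id: number
}

interface DrainResult {
  processed: number
  lastId: number
  // True when the run stopped because of maxRowsPerRun, so a later run should continue.
  reachedCap: boolean
}

// Hypothetical loop: holds at most one page in memory and stops at a per-run cap.
async function drainByKeyset<T extends Row>(
  fetchPage: (afterId: number, limit: number) => Promise<T[]>,
  handleBatch: (rows: T[]) => Promise<void>,
  opts: { startAfterId?: number; batchSize: number; maxRowsPerRun: number }
): Promise<DrainResult> {
  let afterId = opts.startAfterId ?? 0
  let processed = 0

  while (processed < opts.maxRowsPerRun) {
    // Never ask for more rows than the remaining per-run budget.
    const limit = Math.min(opts.batchSize, opts.maxRowsPerRun - processed)
    const rows = await fetchPage(afterId, limit)
    if (rows.length === 0) {
      return { processed, lastId: afterId, reachedCap: false }
    }

    await handleBatch(rows)
    processed += rows.length
    // The cursor is the last id seen; the previous page is not retained.
    afterId = rows[rows.length - 1].id

    if (rows.length < limit) {
      return { processed, lastId: afterId, reachedCap: false }
    }
  }

  return { processed, lastId: afterId, reachedCap: true }
}
```

When `reachedCap` is true, the caller can persist `lastId` (or simply rely on deleted rows no longer matching) so the backlog drains across repeated jobs instead of widening a single run.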
117+
118+
## Findings Format
119+
120+
Lead with concrete findings, ordered by risk:
121+
```markdown
## Findings

- **P1 Unbounded workspace load in cleanup dispatch** (`path/to/file.ts`)
  The new path calls `select().from(workspace)` without a limit, then builds maps for every row before dispatch. In production this scales with all active workspaces and can exhaust the app process. Page by `workspace.id` with a fixed limit and dispatch bounded chunks.

## Good Signals

- Uses `readResponseToBufferWithLimit` for external downloads.
- Inline fallback processes chunks sequentially.

## Residual Risk

- The row cap is explicit, but no test currently proves the loop stops at the cap.
```
137+
138+
Only say "good to go" when every changed source has explicit row, byte, and concurrency bounds or the boundedness is proven by a stable invariant.

.agents/skills/validate-integration/SKILL.md

Lines changed: 17 additions & 7 deletions
@@ -102,8 +102,8 @@ For **every** tool file, check:
102102
- [ ] No fields are missing that the API provides and users would commonly need
103103
- [ ] No phantom fields defined that the API doesn't return
104104
- [ ] `optional: true` is set on fields that may not exist in all responses
105-
- [ ] When using `type: 'json'` and the shape is known, `properties` defines the inner fields
106-
- [ ] When using `type: 'array'`, `items` defines the item structure with `properties`
105+
- [ ] When using `type: 'json'` and the shape is known, `properties` defines the inner fields (tool outputs only — block outputs do not support `properties`)
106+
- [ ] When using `type: 'array'`, `items` defines the item structure with `properties` (tool outputs only)
107107
- [ ] Field descriptions are accurate and helpful
108108

109109
### Types (types.ts)
@@ -190,9 +190,8 @@ For **each tool** in `tools.access`:
190190
### Block Outputs
191191
- [ ] Outputs cover the key fields returned by ALL tools (not just one operation)
192192
- [ ] Output types are correct (`'string'`, `'number'`, `'boolean'`, `'json'`)
193-
- [ ] `type: 'json'` outputs either:
194-
- Describe inner fields in the description string (GOOD): `'User profile (id, name, username, bio)'`
195-
- Use nested output definitions (BEST): `{ id: { type: 'string' }, name: { type: 'string' } }`
193+
- [ ] `type: 'json'` outputs describe inner fields in the description string: `'User profile (id, name, username, bio)'` or `'[{address, status, type}]'` for arrays
194+
- [ ] **Do NOT add a `properties: {...}` field on block outputs.** Block-level `OutputFieldDefinition` (from `@sim/workflow-types/blocks`) only accepts `{ type, description?, condition?, hiddenFromDisplay? }`. Nested `properties` is a tool-level construct (`OutputProperty`) — adding it to a block output will fail TypeScript at build time
196195
- [ ] No opaque `type: 'json'` with vague descriptions like `'Response data'`
197196
- [ ] Outputs that only appear for certain operations use `condition` if supported, or document which operations return them
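
The block-output rule above can be illustrated with local stand-in types. These are not the real `OutputFieldDefinition` or `OutputProperty` definitions: `BlockOutputSketch` and `ToolOutputSketch` are hypothetical and only mirror the field lists quoted in this checklist, and the type of `condition` is left as `unknown` because it is not shown here.

```typescript
// Illustrative local types only; the real definitions live in the Sim codebase.
type OutputTypeSketch = 'string' | 'number' | 'boolean' | 'json' | 'array'

// Block-level shape as described above: there is no `properties` key.
interface BlockOutputSketch {
  type: OutputTypeSketch
  description?: string
  condition?: unknown // real type not shown in this document
  hiddenFromDisplay?: boolean
}

// Tool-level shape: nested structure is allowed via `properties` / `items`.
interface ToolOutputSketch {
  type: OutputTypeSketch
  description?: string
  optional?: boolean
  properties?: Record<string, ToolOutputSketch>
  items?: ToolOutputSketch
}

// Block outputs: inner fields are described in the description string.
const blockOutputs: Record<string, BlockOutputSketch> = {
  profile: { type: 'json', description: 'User profile (id, name, username, bio)' },
  addresses: { type: 'json', description: '[{address, status, type}]' },
}

// Tool outputs: inner fields are declared structurally.
const toolOutputs: Record<string, ToolOutputSketch> = {
  profile: {
    type: 'json',
    description: 'User profile',
    properties: {
      id: { type: 'string', description: 'User ID' },
      name: { type: 'string', description: 'Display name' },
      bio: { type: 'string', description: 'Profile bio', optional: true },
    },
  },
}
```

With types shaped like these, adding `properties` to an entry in `blockOutputs` is rejected by TypeScript's excess property check on the object literal, which matches the build-time failure the checklist item describes.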
198197

@@ -232,13 +231,23 @@ If any tools support pagination:
232231
- [ ] Pagination response fields (`nextToken`, `cursor`, etc.) are included in tool outputs
233232
- [ ] Pagination subBlocks are set to `mode: 'advanced'`
234233

235-
## Step 7: Validate Error Handling
234+
## Step 7: Validate Memory Load Safety
235+
236+
If any tool lists, searches, exports, imports, downloads, uploads, paginates, batches, transforms arrays, or reads file/HTTP bodies, read `.agents/skills/memory-load-check/SKILL.md` and apply it to the integration.
237+
238+
- [ ] List/search tools expose API limits and do not auto-fetch every page into memory
239+
- [ ] Transform logic does not build unbounded arrays, maps, sets, or `Promise.all` fan-outs
240+
- [ ] File and HTTP body reads use explicit byte caps or existing stream-limit helpers
241+
- [ ] Large result payloads are summarized, paginated, referenced, or capped rather than raw-dumped
242+
- [ ] Pagination and download tests cover caps, early stop behavior, or partial-result preservation when relevant
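
The first checklist item can be sketched as a list tool that forwards a clamped page size and a cursor instead of looping over every page. The names `clampPageSize` and `buildListUrl`, the `limit` and `cursor` query parameters, and the default and maximum values are all assumptions for illustration; real APIs name and bound these differently.

```typescript
// Hypothetical defaults; a real tool would take these from the provider's API docs.
const DEFAULT_PAGE_SIZE = 25
const MAX_PAGE_SIZE = 100

// Returns a page size that is always an integer in [1, MAX_PAGE_SIZE].
function clampPageSize(requested: number | undefined): number {
  if (requested === undefined || !Number.isFinite(requested) || requested < 1) {
    return DEFAULT_PAGE_SIZE
  }
  return Math.min(Math.floor(requested), MAX_PAGE_SIZE)
}

// Builds a single-page request URL. The caller receives the next cursor in the
// tool output and decides whether to request another page.
function buildListUrl(baseUrl: string, params: { limit?: number; cursor?: string }): string {
  const url = new URL(baseUrl)
  url.searchParams.set('limit', String(clampPageSize(params.limit)))
  if (params.cursor) {
    url.searchParams.set('cursor', params.cursor)
  }
  return url.toString()
}
```

Because each call fetches one page, memory use is bounded by the page size rather than by the size of the remote collection.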
243+
244+
## Step 8: Validate Error Handling
236245

237246
- [ ] `transformResponse` checks for error conditions before accessing data
238247
- [ ] Error responses include meaningful messages (not just generic "failed")
239248
- [ ] HTTP error status codes are handled (check `response.ok` or status codes)
240249

241-
## Step 8: Report and Fix
250+
## Step 9: Report and Fix
242251

243252
### Report Format
244253

@@ -297,6 +306,7 @@ After fixing, confirm:
297306
- [ ] Validated OAuth scopes use centralized utilities (getScopesForService, getCanonicalScopesForProvider) — no hardcoded arrays
298307
- [ ] Validated scope descriptions exist in `SCOPE_DESCRIPTIONS` within `lib/oauth/utils.ts` for all scopes
299308
- [ ] Validated pagination consistency across tools and block
309+
- [ ] Validated memory load safety using `.agents/skills/memory-load-check/SKILL.md` when tools list/search/download/import/export/batch data
300310
- [ ] Validated error handling (error checks, meaningful messages)
301311
- [ ] Validated registry entries (tools and block, alphabetical, correct imports)
302312
- [ ] Reported all issues grouped by severity

.claude/commands/add-enrichment.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
1+
---
2+
description: Add a code-defined table enrichment (registry entry) backed by a provider cascade, ensuring each provider tool has hosted-key support
3+
argument-hint: <enrichment-name>
4+
---
5+
6+
# Adding a Table Enrichment
7+
8+
Enrichments are code-defined entries in `apps/sim/enrichments/` that run **directly per table row** (no workflow). Each enrichment declares inputs, outputs, and an ordered list of **providers**; the cascade runner tries providers in order and the first non-empty result fills the cell. Each provider calls one existing Sim tool via `executeTool`, which injects the workspace's BYOK key or a **hosted key** and bills usage automatically.
9+
10+
Because enrichments run on Sim's hosted keys by default, **every provider tool you reference must have hosted-key support** — otherwise it can only run when the workspace brings its own key. This command makes that check a required step.
11+
12+
## Overview
13+
14+
| Step | What | Where |
15+
|------|------|-------|
16+
| 1 | Pick the data-source tool(s) for each output | `tools/{service}/` + `tools/registry.ts` |
17+
| 2 | **Verify each tool has `hosting`; if not, run `/add-hosted-key`** | `tools/{service}/{action}.ts` |
18+
| 3 | Write the enrichment definition | `enrichments/{name}/{name}.ts` + `index.ts` |
19+
| 4 | Register it | `enrichments/registry.ts` |
20+
| 5 | Verify | tsc / biome / manual run |
21+
22+
## Architecture (what you're plugging into)
23+
24+
- **`enrichments/types.ts`**: `EnrichmentConfig { id, name, description, icon, inputs, outputs, providers }` and `EnrichmentProvider { id, label, toolId, buildParams, mapOutput }`. Providers are **plain data** (no `@/tools` import) so the catalog stays client-safe.
25+
- **`enrichments/providers.ts`**: `toolProvider(...)` (typed passthrough) plus shared input helpers: `str(v)`, `normalizeDomain(v)`, `firstNonEmpty(arr)`, `splitName(fullName)`.
26+
- **`enrichments/run.ts`** — the server-only cascade runner. Calls `executeTool(provider.toolId, { ...params, _context: { workspaceId } })`, accumulates hosted-key cost, returns the first non-empty mapped result. **You do not edit this** — it works for any registry entry.
27+
- **`enrichments/registry.ts`**: `ENRICHMENT_REGISTRY` / `ALL_ENRICHMENTS` / `getEnrichment`. Register new entries here.
28+
29+
Outputs automatically become table columns; billing, the catalog/sidebar UI, the column meta-header icon, and per-row execution all work with no extra wiring.
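
To make the cascade behavior concrete, here is a simplified sketch of the "try providers in order, first non-empty result wins" loop. It is not the code in `enrichments/run.ts`: `runCascadeSketch`, `ProviderSketch`, and `ExecuteToolSketch` are hypothetical, the `executeTool` result shape is assumed, and the `_context` injection and hosted-key cost accumulation described above are left out.

```typescript
type Values = Record<string, unknown>

// Hypothetical, trimmed-down view of a provider.
interface ProviderSketch {
  id: string
  toolId: string
  buildParams: (inputs: Values) => Values | null
  mapOutput: (output: Values) => Values | null
}

// Assumed result shape; the real executeTool may return more fields.
type ExecuteToolSketch = (
  toolId: string,
  params: Values
) => Promise<{ success: boolean; output: Values }>

interface CascadeResult {
  providerId: string | null
  values: Values | null
  errors: string[]
}

async function runCascadeSketch(
  providers: readonly ProviderSketch[],
  inputs: Values,
  executeTool: ExecuteToolSketch
): Promise<CascadeResult> {
  const errors: string[] = [] // bounded by the number of providers

  for (const provider of providers) {
    const params = provider.buildParams(inputs)
    if (params === null) continue // not enough inputs: skip this provider

    try {
      const result = await executeTool(provider.toolId, params)
      if (!result.success) {
        errors.push(`${provider.id}: tool reported failure`)
        continue
      }
      const values = provider.mapOutput(result.output)
      if (values !== null && Object.keys(values).length > 0) {
        return { providerId: provider.id, values, errors } // first hit wins
      }
    } catch (err) {
      errors.push(`${provider.id}: ${err instanceof Error ? err.message : String(err)}`)
    }
  }

  return { providerId: null, values: null, errors }
}
```

Providers run sequentially, so later (typically more expensive) tools are only called when earlier ones miss.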
30+
31+
## Step 1: Pick the data-source tool(s)
32+
33+
For each output the enrichment produces, decide which existing tool provides it. Look up the service's API and the tool in `apps/sim/tools/{service}/` (e.g. `hunter_email_finder`, `pdl_person_enrich`, `pdl_company_enrich`). Confirm:
34+
35+
- The tool id is registered in `apps/sim/tools/registry.ts`.
36+
- Its `params` accept what you can derive from table columns (read the tool's `params`).
37+
- Its `outputs` / `transformResponse` actually expose the field you need (read the real output shape — don't assume).
38+
39+
Order providers **cheapest / most-likely-to-hit first**; the cascade stops at the first non-empty result. Apollo / LinkedIn are not hosted-safe (ToS) — don't use them.
40+
41+
## Step 2: Verify hosted-key support — chain to `/add-hosted-key` if missing
42+
43+
**This is the required gate.** For every tool a provider calls, open `apps/sim/tools/{service}/{action}.ts` and check for a `hosting` block:
44+
```typescript
hosting: {
  envKeyPrefix: 'SERVICE_API_KEY',
  apiKeyParam: 'apiKey',
  byokProviderId: 'service',
  pricing: { /* ... */ },
  rateLimit: { /* ... */ },
}
```
54+
55+
- **If `hosting` is present** — good. Note the `envKeyPrefix`; the deployment needs `{PREFIX}_COUNT` + `{PREFIX}_1..N` env vars set for the hosted key to actually resolve at runtime (ops concern, not code). If those env vars aren't set in the target environment, the provider will only run with a workspace BYOK key.
56+
- **If `hosting` is absent** — the tool can't use a Sim-provided key, so the enrichment would silently produce blank cells on hosted Sim. **Stop and run `/add-hosted-key <service>`** to add hosted-key support to that tool first, then come back. Do this for every provider tool that lacks it.
57+
58+
Why it matters: the cascade runner only bills (and only reads `output.cost.total`) when `executeTool` injected a hosted key, which requires the tool's `hosting` config. No `hosting` → no hosted key → the enrichment depends entirely on per-workspace BYOK.
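
The `{PREFIX}_COUNT` + `{PREFIX}_1..N` convention mentioned above can be checked with a small preflight helper. This is only a sketch of the naming scheme as described here; `listHostedKeys` and `MAX_HOSTED_KEYS` are hypothetical, and this is not how Sim itself resolves or rotates hosted keys.

```typescript
// Hypothetical upper bound so a bad COUNT value cannot drive a huge loop.
const MAX_HOSTED_KEYS = 100

// Returns the configured keys for a prefix, or an empty array when none resolve.
function listHostedKeys(prefix: string, env: Record<string, string | undefined>): string[] {
  const count = Number.parseInt(env[`${prefix}_COUNT`] ?? '', 10)
  if (!Number.isInteger(count) || count < 1) return []

  const keys: string[] = []
  for (let i = 1; i <= Math.min(count, MAX_HOSTED_KEYS); i++) {
    const key = env[`${prefix}_${i}`]
    if (key) keys.push(key) // skip unset or empty slots
  }
  return keys
}
```

For example, with `envKeyPrefix: 'SERVICE_API_KEY'`, an empty result from `listHostedKeys('SERVICE_API_KEY', process.env)` would indicate that the provider can only run with a workspace BYOK key in that environment.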
59+
60+
## Step 3: Write the enrichment definition
61+
62+
Create `apps/sim/enrichments/{name}/{name}.ts` and a barrel `index.ts`. Mirror the existing entries (`work-email`, `phone-number`, `company-domain`, `company-info`).
63+
```typescript
import { SomeIcon } from 'lucide-react'
import { filterUndefined } from '@sim/utils/object'
import { normalizeDomain, splitName, str, toolProvider } from '@/enrichments/providers'
import type { EnrichmentConfig } from '@/enrichments/types'

export const myEnrichment: EnrichmentConfig = {
  id: 'my-enrichment',
  name: 'My Enrichment',
  description: 'One concise sentence describing what it finds.',
  icon: SomeIcon,
  inputs: [
    // Person enrichments take a single canonical `fullName` (Clay-style);
    // split it with splitName() for tools that need first/last.
    { id: 'fullName', name: 'Full name', type: 'string', required: true },
    { id: 'companyDomain', name: 'Company domain', type: 'string' },
  ],
  outputs: [{ id: 'value', name: 'value', type: 'string' }],
  providers: [
    toolProvider({
      id: 'provider-a',
      label: 'Provider A',
      toolId: 'service_action', // must have `hosting` (Step 2)
      buildParams: (inputs) => {
        // Return null when there aren't enough inputs → cascade skips this provider.
        const name = splitName(inputs.fullName)
        const domain = normalizeDomain(inputs.companyDomain)
        if (!name || !domain) return null
        return { domain, first_name: name.firstName, last_name: name.lastName }
      },
      mapOutput: (output) => {
        // Return { [outputId]: value } on a hit, or null to fall through.
        const value = str(output.value)
        return value ? { value } : null
      },
    }),
    // ...additional fallback providers, in priority order.
  ],
}
```
104+
```typescript
// apps/sim/enrichments/{name}/index.ts
export { myEnrichment } from './my-enrichment'
```
109+
110+
Rules:
111+
- Keep the file **client-safe**: import only `lucide-react`, `@sim/utils/*`, `@/enrichments/providers`, and the types. **Never import `@/tools`** here — the runner does the tool call.
112+
- `buildParams` returns `null` when inputs are insufficient (provider skipped). `mapOutput` returns `null`/empty for a miss (falls through). Use `filterUndefined` when assembling optional tool params; coerce numbers explicitly (don't pass `''` to number outputs).
113+
- Output `id`s are the keys `mapOutput` returns; output `name`s are the default column names (the user can rename them in the config).
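
The second rule (optional params and explicit number coercion) can be sketched as below. `dropUndefined` is a hypothetical stand-in for what `filterUndefined` is described as doing, and `toNumberOrNull` and `mapEmployeeCount` are invented for illustration; the real helpers in `@sim/utils/object` and `@/enrichments/providers` may behave differently.

```typescript
// Hypothetical stand-in: removes keys whose value is undefined.
function dropUndefined(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {}
  for (const [key, value] of Object.entries(obj)) {
    if (value !== undefined) out[key] = value
  }
  return out
}

// Converts a tool field to a finite number, or null when it is not numeric.
// Empty strings are treated as a miss rather than coerced to 0.
function toNumberOrNull(value: unknown): number | null {
  if (typeof value === 'number') return Number.isFinite(value) ? value : null
  if (typeof value !== 'string' || value.trim() === '') return null
  const parsed = Number(value)
  return Number.isFinite(parsed) ? parsed : null
}

// Example mapOutput for a number-typed output with the hypothetical id `employees`.
function mapEmployeeCount(output: Record<string, unknown>): { employees: number } | null {
  const employees = toNumberOrNull(output.employeeCount)
  return employees === null ? null : { employees }
}
```

Returning `null` for a non-numeric value lets the cascade fall through to the next provider instead of writing `''` or `NaN` into a number column.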
114+
115+
## Step 4: Register it
116+
117+
In `apps/sim/enrichments/registry.ts`, import and add the entry (catalog order is registration order):
118+
```typescript
import { myEnrichment } from '@/enrichments/my-enrichment'

export const ENRICHMENT_REGISTRY: EnrichmentRegistry = {
  // ...existing
  [myEnrichment.id]: myEnrichment,
}
```
127+
128+
## Step 5: Verify
129+
130+
1. `bunx tsc --noEmit` (from `apps/sim`, `NODE_OPTIONS=--max-old-space-size=8192`) and `bunx biome check` on the changed files.
131+
2. In a table → **+ New column → Enrichments** → pick the new enrichment, map its inputs to columns, name the output column(s), Save. Confirm it appears in the catalog with its icon/description.
132+
3. With hosted keys (or a workspace BYOK key) configured for each provider's service, run a row and confirm the cell fills; the dev-server log shows `Enrichment hit { provider }`. A row whose providers all miss completes blank; a row where every provider errored shows an error cell.
133+
134+
## Checklist
135+
136+
- [ ] Each output mapped to a real tool field (verified against the tool's `params`/`outputs`)
137+
- [ ] **Every provider tool has a `hosting` block — ran `/add-hosted-key` for any that didn't**
138+
- [ ] Providers ordered cheapest / most-likely-first; Apollo/LinkedIn not used
139+
- [ ] Enrichment file is client-safe (no `@/tools` import); uses `toolProvider` + shared helpers
140+
- [ ] `buildParams` returns `null` on insufficient inputs; `mapOutput` returns `null` on a miss
141+
- [ ] Registered in `enrichments/registry.ts`
142+
- [ ] tsc + biome clean; created and ran the column end-to-end
