diff --git a/develop-docs/sdk/foundations/client/data-collection/index.mdx b/develop-docs/sdk/foundations/client/data-collection/index.mdx new file mode 100644 index 0000000000000..8f158e326c46e --- /dev/null +++ b/develop-docs/sdk/foundations/client/data-collection/index.mdx @@ -0,0 +1,327 @@ +--- +title: Data Collection +description: Configuration for what data SDKs collect by default — technical context, PII, and sensitive data. +spec_id: sdk/foundations/client/data-collection +spec_version: 1.0.0 +spec_status: draft +spec_depends_on: + - id: sdk/foundations/client + version: ">=1.0.0" +spec_changelog: + - version: 1.0.0 + date: 2025-03-05 + summary: Initial spec; dataCollection config, three data tiers, cookies/headers denylist, replace sendDefaultPii. +sidebar_order: 1 +--- + + + + + +## Overview + +This spec defines how SDKs control **what data is collected automatically** from the runtime (device, requests, responses, user context). It replaces the single `sendDefaultPii` (or platform-equivalent) flag with a structured `dataCollection` configuration so users can enable or restrict collection by category and by field. + +Related specs: + +- [Data Handling](/sdk/expected-features/data-handling/) — structuring data for scrubbing (spans, breadcrumbs), variable size limits +- [Client](/sdk/foundations/client/) — client lifecycle and event pipeline +- [Configuration](/sdk/foundations/client/configuration/) — top-level init options including `send_default_pii` (deprecated in favor of this spec) + +--- + +## Concepts + + + +### Data Tiers + +Collected data is grouped into three tiers. SDKs **MUST** treat these tiers consistently when applying defaults and user configuration. + +#### 1. 
Technical Context Data + +Non-identifying context used for debugging and performance: + +- Device and environment context (OS, runtime, non-PII identifiers) +- Performance and error context (stack frames, breadcrumbs, span metadata) +- Framework/routing context where it does not contain PII or secrets +- AI agent messages (input, output, metadata) + +This tier is **not** gated by the data collection configuration. SDKs **MAY** collect it by default. + +#### 2. PII Data + +Personally identifiable or user-linked data: + +- User identifiers (user ID, username, email) +- IP address +- Cookies and headers that identify the user or session +- HTTP request data (TBD) + +This tier **MUST** be off by default unless the user opts in via `includeUserInfo` and/or explicit `collect` allowlists. See [`includeUserInfo`](#include-user-info-behavior), [`collect` options](#collect-option-behavior), and [Default Denylist](#default-denylist). + +#### 3. Sensitive Data + +Credentials and secrets that **MUST** never be sent by default: + +- Passwords, tokens, API keys, bearer tokens +- Header or cookie values that match known sensitive names (auth, token, secret, password, key, jwt, etc.) + +SDKs **MUST** never send sensitive **values** through automatic instrumentation — values are replaced with `"[Filtered]"` while keys are retained (see [Default Denylist](#default-denylist)). Users can use `beforeSend` (or equivalent) to remove or redact keys if needed. + + + +--- + +## Behavior + + + +### Configuration Requirements + +All data-collection options live under a single top-level key: `dataCollection`. SDKs **MUST** support at least `includeUserInfo` and the `collect` object. SDKs **MAY** omit options that do not apply to the platform (e.g. no `incomingRequestBody` on a browser-only SDK, which never receives incoming HTTP requests). + +`dataCollection` accepts two fields: + +- **`includeUserInfo`** — the primary toggle for Personally Identifiable Information (PII). 
Controls whether user-identity fields are included in automatic collection, and sets the default for PII-heavy `collect` options (such as HTTP request bodies - TBD). Defaults to `false`. +- **`collect`** — controls which categories of request/response and runtime data are gathered. See [`collect` Option Behavior](#collect-option-behavior) and [How Defaults Cascade](#how-defaults-cascade). + + + + + +### `includeUserInfo` Behavior + +`includeUserInfo` controls whether the SDK automatically attaches user identity fields to events (e.g. `user.id`, `user.email`, `user.username`, `user.ip_address`). This is the primary PII gate: its value also sets the effective default for PII-heavy `collect` options. + +| Value | Behavior | +|-------|----------| +| `true` | Attach all user identity fields captured by automatic instrumentation. Equivalent to the legacy `sendDefaultPii` flag scoped to user data. | +| `false` | Do not attach user identity fields from automatic instrumentation. | + +When user data is set **explicitly** on the scope (or equivalent), it is **always** attached regardless of this setting. See [User-Set Data and Scrubbing](#user-set-data-and-scrubbing). + + + + + +### `collect` Option Behavior + +Each key under `collect` maps to a category of automatically collected data and uses one of two option types, depending on whether the data is structured as key-value pairs. + +**Boolean options** — used where data cannot be meaningfully filtered at the key level. The SDK either collects the entire category or skips it. + +| Value | Behavior | +|-------|----------| +| `true` | Collect and attach this data category. | +| `false` | Do not collect this data category at all. | + +**Collection options** — used for key-value data (cookies, headers, query params), where the SDK can inspect individual keys and apply filtering rules before attaching. + +| Value | Behavior | +|-------|----------| +| `true` | Collect this category. 
Apply the default denylist — values for sensitive key names are replaced with `"[Filtered]"` (see [Default Denylist](#default-denylist)). | +| `false` | Do not collect this category at all. | +| `{ deny: string[] }` | Collect this category. Apply the default denylist **plus** these additional key names. | +| `{ allow: string[] }` | Collect **only** keys in this list. The default denylist is bypassed, but sensitive values **MUST** still be scrubbed regardless. | + +> **Note:** Sensitive key **values** are always scrubbed — replaced with `"[Filtered]"` — regardless of collection option configuration. The allow/deny lists control which keys are included, not whether scrubbing applies. + + + + + +### How Defaults Cascade + +`includeUserInfo` determines the effective default for PII-related `collect` options. Explicitly set `collect` options always override this default. + +| Option type | Default when `includeUserInfo: true` | Default when `includeUserInfo: false` | +|-------------|--------------------------------------|----------------------------------------| +| Collection (key-value pairs) | `true` — use default denylist | `true` — use default denylist, plus PII keys denied | +| PII Boolean (e.g. `incomingRequestBody`) | `true` — attach | `false` — do not attach | + +Non-PII boolean options (e.g. `stackFrameVariables`) are not affected by `includeUserInfo` and always default to their configured value. + + + + + +### Default Denylist + +For key-value data (HTTP headers, cookies, URL query params), SDKs **MUST** apply a **default denylist** by key name: values for known-sensitive keys are replaced with `"[Filtered]"`; **keys are never scrubbed** by the SDK. + +#### Matching Rule + +SDKs **MUST** perform a **partial, case-insensitive match** when comparing key names against the denylist. A key is treated as sensitive if any denylist term appears as a substring in the key name (e.g. the term `auth` matches `Authorization` and `X-Auth-Token`). 
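
The matching rule above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names `isSensitiveKey` and `scrubValues` are hypothetical, and the two-term denylist is a shortened stand-in for the full default list. It shows substring matching on lowercased key names, with values replaced and keys retained.

```typescript
// Hypothetical helpers for illustration; the spec does not prescribe these names.
const FILTERED = "[Filtered]";

// Partial, case-insensitive match: a key is sensitive if any denylist
// term appears anywhere inside the key name.
function isSensitiveKey(key: string, denylist: readonly string[]): boolean {
  const lowered = key.toLowerCase();
  return denylist.some((term) => lowered.indexOf(term.toLowerCase()) !== -1);
}

// Replace the values of sensitive keys; the keys themselves are kept.
function scrubValues(
  data: Record<string, string>,
  denylist: readonly string[],
): Record<string, string> {
  const result: Record<string, string> = {};
  for (const key of Object.keys(data)) {
    result[key] = isSensitiveKey(key, denylist) ? FILTERED : data[key];
  }
  return result;
}

// Shortened denylist, assumed for this example only.
const scrubbed = scrubValues(
  { Authorization: "Bearer abc", "X-Auth-Token": "t0k3n", Accept: "text/html" },
  ["auth", "token"],
);
```

Here `Authorization` and `X-Auth-Token` both contain `auth`, so their values are replaced, while `Accept` passes through unchanged. Building a new object rather than mutating the input keeps the original request data intact for the application.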
+ +#### Base Denylist (Sensitive Data) + +The following terms **MUST** be included in the default denylist for headers, and **SHOULD** be applied to cookies and query params where applicable: + +`["auth", "token", "secret", "password", "passwd", "pwd", "key", "jwt", "bearer", "sso", "saml", "csrf", "xsrf", "credentials", "session", "sid", "identity"]` + +Values for keys that match **MUST** be replaced with `"[Filtered]"`. + +#### PII Denylist (when `includeUserInfo` is `false`) + +When `includeUserInfo` is `false`, SDKs **MUST** apply the base denylist **and** additionally treat the following as sensitive: + +- Any data that contains email, user ID, IP address, username, or machine name (if applicable) +- Any key containing **`x-forwarded-`** (e.g. `x-forwarded-for`, `x-forwarded-host`) — often carries client IP or host +- Any key ending with or containing **`-user`** (e.g. `x-user-id`, `remote-user`) — often carries user identifiers + +Effective denylist when PII is disabled: base list + `["x-forwarded-", "-user"]` (partial match, case-insensitive). + +#### Cookies and Cookie Headers + +- SDKs **SHOULD** maintain a default denylist of cookie names using the same matching rule (e.g. `session`, `auth`, `identity`). Values for matching cookie names **MUST** be replaced with `"[Filtered]"`. +- **When individual cookie key-value pairs cannot be extracted** (e.g. malformed or opaque cookie string), the entire `Cookie` or `Set-Cookie` header value **MUST** be replaced with `"[Filtered]"`. Unfiltered raw cookie header values **MUST NOT** be sent. When in doubt, treat the whole cookie header as sensitive. + +#### Request Bodies + +When request or response bodies are collected (`incomingRequestBody` / `outgoingRequestBody`): + +- **Parseable as JSON or form data:** SDKs **MAY** extract key-value pairs and apply the same denylist rules to keys. Values for matching keys **MUST** be replaced with `"[Filtered]"`. 
This allows selective scrubbing while retaining non-sensitive fields for debugging. +- **Not parseable (raw bodies):** When the SDK cannot parse the body into a key-value structure, the raw content **MUST NOT** be attached to the event; the entire body value **MUST** be replaced with `"[Filtered]"`. + +No built-in option scrubs **keys**; users who need to hide header or cookie names **MUST** use `beforeSend` (or equivalent). + + + + + +### User-Set Data and Scrubbing + +When the user **explicitly** sets data on the scope (user, request, response, tags, contexts, etc.) or on a span, log, or other telemetry, that data is **not** gated by `dataCollection`. It **MUST** always be attached to outgoing telemetry. The same applies to data the user provides via `beforeSend` or event processors. + +SDKs **SHOULD** only replace sensitive values with `"[Filtered]"` when the data is gathered **automatically** through instrumentation. If the user explicitly provides data (e.g. by setting a request object on the scope), the SDK **MUST NOT** modify it; the user is responsible for what they attach. + +Users can register callbacks (e.g. `beforeSend`, event processors) to remove or redact any data — including keys — before events are sent. This spec does not replace those hooks; they remain the mechanism for custom filtering and key removal. + + + +--- + +## Public API + +The `dataCollection` option is passed to the SDK's init function. All fields are optional; an omitted field falls back to its default. 
+ +```pseudocode +init({ + dataCollection: { + includeUserInfo: boolean, // default: false + collect: { + cookies: Collection, // default: true + httpHeaders: Collection, // default: true + queryParams: Collection, // default: true + aiAgentMessages: boolean, // default: true + stackFrameVariables: boolean, // default: true + incomingRequestBody: boolean, // default: TBD + outgoingRequestBody: boolean, // default: TBD + frameContextLines: number, // default: 5 (boolean fallback: true) + }, + }, +}) +``` + +### `dataCollection.includeUserInfo` + +| Property | Value | +|----------|-------| +| Type | Boolean | +| Default | `false` | +| Since | 1.0.0 | +| Description | Primary PII toggle. Enables automatic collection of user identity fields (`user.id`, `user.email`, `user.username`, `user.ip_address`). Also sets the effective default for PII-heavy `collect` options. | + +### `dataCollection.collect` Options + +| Key | Option Type | Default | Since | Description | +|-----|-------------|---------|-------|-------------| +| `cookies` | Collection | `true` | 1.0.0 | Include cookie values; keys filtered by the default denylist or by allow/deny lists. | +| `httpHeaders` | Collection | `true` | 1.0.0 | Include HTTP header values; keys filtered by the default denylist or by allow/deny lists. | +| `queryParams` | Collection | `true` | 1.0.0 | Include URL query parameter values; keys filtered by the default denylist or by allow/deny lists. | +| `aiAgentMessages` | Boolean | `true` | 1.0.0 | Include AI agent input and output messages. | +| `stackFrameVariables` | Boolean | `true` | 1.0.0 | Include local variable values captured within stack frames. | +| `incomingRequestBody` | Boolean | TBD | 1.0.0 | Include full body of the incoming HTTP request. | +| `outgoingRequestBody` | Boolean | TBD | 1.0.0 | Include full body of outgoing HTTP requests. | +| `frameContextLines` | Number (Boolean fallback) | `5` (`true`) | 1.0.0 | Number of lines of context to include around stack frames. 
| + + + Unlike cookies or headers, some data (e.g. request bodies) has no predictable key structure for the SDK to filter. Data can still be redacted in `beforeSend` or event processors if needed. + + +--- + +## Examples + +### Default Configuration + +An explicit representation of all defaults (with `includeUserInfo: false`): + +```typescript +init({ + dsn: "...", + dataCollection: { + includeUserInfo: false, + collect: { + cookies: true, + httpHeaders: true, + queryParams: true, + aiAgentMessages: true, + stackFrameVariables: true, + incomingRequestBody: false, + outgoingRequestBody: false, + frameContextLines: 5, + }, + }, +}); +``` + +### Maximum PII (Full Collection) + +Enable full PII collection, including request bodies and AI messages: + +```typescript +init({ + dsn: "...", + dataCollection: { + includeUserInfo: true, + collect: { + incomingRequestBody: true, + outgoingRequestBody: true, + }, + }, +}); +``` + +**Result:** Technical context and request/response data (headers, cookies, query params) are collected with the default denylist; request bodies, user identifiers, and AI agent messages are included; sensitive values are still replaced with `"[Filtered]"`. + +### Granular Debugging + +Include user info and only specific headers for debugging; exclude query params entirely: + +```typescript +init({ + dsn: "...", + dataCollection: { + includeUserInfo: true, + collect: { + httpHeaders: { allow: ['x-request-id', 'x-trace-id', 'x-correlation-id'] }, + queryParams: false, + }, + }, +}); +``` + +### Migration from `sendDefaultPii` + +- **`sendDefaultPii: true`** (legacy) → `dataCollection: { includeUserInfo: true, collect: { aiAgentMessages: false } }`, keep most `collect` defaults +- **`sendDefaultPii: false`** (legacy) → `dataCollection: { includeUserInfo: false }` (or omit entirely — same as default) + +SDKs **SHOULD** document this mapping and **MAY** implement `send_default_pii` as a compatibility shim that sets `includeUserInfo`. 
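
A compatibility shim along these lines could implement the mapping above. This is a sketch under assumptions: the function name `resolveDataCollection` and the option interfaces are hypothetical, and the rule that an explicit `dataCollection` takes precedence over the legacy flag is an assumption this spec does not state.

```typescript
// Hypothetical option shapes, reduced to the fields the mapping touches.
interface DataCollectionOptions {
  includeUserInfo?: boolean;
  collect?: { aiAgentMessages?: boolean };
}

interface InitOptions {
  sendDefaultPii?: boolean; // legacy flag
  dataCollection?: DataCollectionOptions;
}

// Translate the legacy flag into the structured configuration.
function resolveDataCollection(options: InitOptions): DataCollectionOptions {
  // Assumption: an explicitly provided dataCollection wins over the legacy flag.
  if (options.dataCollection !== undefined) {
    return options.dataCollection;
  }
  if (options.sendDefaultPii === true) {
    // Mapping for `sendDefaultPii: true` as listed in the migration notes.
    return { includeUserInfo: true, collect: { aiAgentMessages: false } };
  }
  // `sendDefaultPii: false` or unset: same as the default configuration.
  return { includeUserInfo: false };
}

const fromLegacyTrue = resolveDataCollection({ sendDefaultPii: true });
const fromLegacyFalse = resolveDataCollection({ sendDefaultPii: false });
const explicitWins = resolveDataCollection({
  sendDefaultPii: true,
  dataCollection: { includeUserInfo: false },
});
```

Resolving the legacy flag once at init time means the rest of the SDK only ever reads `dataCollection`, which keeps the deprecated option out of the instrumentation code paths.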
+ +--- + +## Changelog + + diff --git a/develop-docs/sdk/foundations/data-scrubbing.mdx b/develop-docs/sdk/foundations/data-scrubbing.mdx index 103fa9261c55c..3a88210f945f8 100644 --- a/develop-docs/sdk/foundations/data-scrubbing.mdx +++ b/develop-docs/sdk/foundations/data-scrubbing.mdx @@ -3,63 +3,21 @@ title: Data Scrubbing sidebar_order: 6 --- -Data handling is the standardized context in how we want SDKs help users filter data. - -## Sensitive Data - -SDKs should not include PII or other sensitive data in the payload by default. -When building an SDK we can come across some API that can give useful information to debug a problem. -In the event that API returns data considered PII, we guard that behind a flag called _Send Default PII_. -This is an option in the SDK called [_send-default-pii_](https://docs.sentry.io/platforms/python/configuration/options/#send-default-pii) -and is **disabled by default**. That means that data that is naturally sensitive is not sent by default. +Data handling is the standardized context in how we want SDKs to help users filter data. -When a user manually sets the data on the scope (user, contexts, tags, data, request, response, etc.), this data should not be gated by the _Send Default PII_ flag and should always be attached to all outgoing telemetry. This also applies to the data that the user manually sets on a span, log, metric and other types of telemetry (directly or, for example, via `BeforeSend`). +**Data collection and scrubbing:** The canonical spec for what data SDKs collect, default denylists (headers, cookies, query params), request body and cookie scrubbing, user-set data, and `beforeSend` is [Data Collection](/sdk/foundations/client/data-collection/). That spec supersedes the sensitive-data and cookie sections below for SDK behavior. This page retains **Structuring Data** and **Variable Size** and the legacy `send_default_pii` context for reference. 
-Certain sensitive data must never be sent through SDK instrumentation, regardless of any configuration: - -- HTTP Headers: The keys of known sensitive headers are added, while their values must be replaced with `"[Filtered]"`. - - The SDK performs a **partial, case-insensitive match** against the following headers to determine if they are sensitive: `["auth", "token", "secret", "password", "passwd", "pwd", "key", "jwt", "bearer", "sso", "saml", "csrf", "xsrf", "credentials"]` - -SDKs should only replace sensitive data with `"[Filtered]"` when the data is gathered automatically through instrumentation. -If a user explicitly provides data (for example, by setting a request object on the scope), the SDK must not modify it. - -Some examples of data guarded by `send_default_pii: false`: - -- When attaching data of HTTP requests and/or responses to events - - Request Body: "raw" HTTP bodies (bodies which cannot be parsed as JSON or FormData) are removed - - HTTP Headers: header values, containing information about the user are replaced with `"[Filtered]"` -- User-specific information (e.g. the current user ID according to the used web-framework) is not collected and therefore not sent at all. -- On desktop applications - - The username logged in the device is not included. This is often a person's name. - - The machine name is not included, for example `Bruno's laptop` -- SDKs don't set `{{auto}}` as `user.ip_address`. This instructs the server to keep the connection's IP address. -- Server SDKs remove the IP address of incoming HTTP requests. - -Sentry server is always aware of the connecting IP address and can use it for logging in some platforms. Namely JavaScript and iOS/macOS/tvOS. -All other platforms require the event to include `user.ip_address={{auto}}` which happens if `sendDefaultPii` is set to true. - -Before sending events to Sentry, the SDKs should invokes callbacks. That allows users to remove any sensitive data client-side. 
- -- [`before-send` and `event-processors`](/sdk/foundations/client/#event-pipeline) can be used to register a callback with custom logic to remove sensitive data. - -### Cookies - -Since `Cookie` and `Set-Cookie` headers can contain a mix of sensitive and non-sensitive data, SDKs should parse the cookie header and filter values on a per-key basis, depending on the SDK setting and the sensitivity of the cookie value. -In case, the SDK cannot parse each cookie key-value pair, the entire cookie header must be replaced with `"[Filtered]"`. An unfiltered, raw cookie header value must never be sent. - -This selective filtering prevents capturing sensitive data while retaining harmless contextual information for debugging. -For example, a sensitive session cookie's value is replaced with "[Filtered]", but a non-sensitive cookie for the theme preference can be sent as-is. +## Sensitive Data -When attached as span attributes, the results should be as follows: +The normative rules for sensitive data, PII, cookies, request bodies, and user-set data are in [Data Collection](/sdk/foundations/client/data-collection/). The following is kept for context: -- `http.request.header.cookie.user_session: "[Filtered]"` -- `http.request.header.cookie.theme: "dark-mode"` -- `http.request.header.set_cookie.theme: "light-mode"` -- `http.request.header.cookie: "[Filtered]"` (Used as a fallback if the cookie header cannot be parsed) +- SDKs should not include PII or other sensitive data in the payload by default. The legacy option [_send-default-pii_](https://docs.sentry.io/platforms/python/configuration/options/#send-default-pii) is **disabled by default**; the replacement is `dataCollection.includeUserInfo` and `dataCollection.collect` (see [Data Collection](/sdk/foundations/client/data-collection/)). +- Certain sensitive data must never be sent through SDK instrumentation: header/cookie/query values matching the default denylist are replaced with `"[Filtered]"`. 
User-set data is always attached; only automatically gathered data is scrubbed. Users can use `beforeSend` / event processors to remove or redact any data. +- For the exact default denylist (partial, case-insensitive match), PII denylist (`x-forwarded-`, `-user`), cookies when unparseable, and raw request bodies, see [Data Collection — Default Denylist](/sdk/foundations/client/data-collection/#default-denylist) and [User-Set Data and Scrubbing](/sdk/foundations/client/data-collection/#user-set-data-and-scrubbing). ### Application State