Skip to content

chore: improvements to clickhouse and data masking#2985

Open
marcklingen wants to merge 1 commit into
mainfrom
self-host-docs-improvements
Open

chore: improvements to clickhouse and data masking#2985
marcklingen wants to merge 1 commit into
mainfrom
self-host-docs-improvements

Conversation

@marcklingen
Copy link
Copy Markdown
Member

@marcklingen marcklingen commented May 22, 2026

Greptile Summary

This PR improves two self-hosting documentation pages: it adds a "Direct ClickHouse Access for Custom Tools" section to clickhouse.mdx, and refines data-masking.mdx with a new blob-storage timing warning, a corrected error-handling table (fail-open is the actual default, not fail-closed), and fixed container references (Worker vs Web) throughout the troubleshooting section.

  • clickhouse.mdx: Adds guidance for users querying ClickHouse directly, clarifies the CLICKHOUSE_READ_ONLY_URL scope, and aligns the Cloud/BYOC recommendation paragraph with the updated description.
  • data-masking.mdx: Fixes the inverted error-handling table (default is fail-open/false, not fail-closed), corrects "Langfuse Web container" → "Langfuse Worker container" in troubleshooting, adds a warning that events are written to blob storage before the masking callback runs, and clarifies that only invalid JSON (not structural schema mismatch) triggers error-handling behavior.

Confidence Score: 4/5

Safe to merge for the ClickHouse page; the data-masking page would benefit from one additional sentence clarifying blob-storage retention before operators rely on it for compliance decisions.

Both files are documentation-only changes. The ClickHouse additions are clear and accurate. The data-masking corrections (table column order, container name fixes) are genuine improvements. The one gap is that the new blob-storage warning in data-masking.mdx omits whether unmasked blob-storage data is ever cleaned up — this matters for operators reading this page to evaluate GDPR/HIPAA compliance posture, and a misleading silence there could lead to incorrect architectural decisions.

content/self-hosting/security/data-masking.mdx — the blob-storage data retention clause in the new warning callout (lines 55–61).

Security Review

  • Unmasked data in blob storage: data-masking.mdx now discloses that events are written to blob storage before the masking callback runs, meaning blob storage retains unmasked PII. The documentation does not clarify whether this copy is ever replaced or deleted. Operators deploying server-side masking for GDPR/HIPAA/PCI DSS compliance may incorrectly assume their entire storage layer is protected.

Sequence Diagram

sequenceDiagram
    participant SDK as Langfuse SDK
    participant Web as Langfuse Web
    participant Bucket as Blob Storage (S3)
    participant Worker as Langfuse Worker
    participant Callback as Masking Callback
    participant CH as ClickHouse

    SDK->>Web: Send trace event (may contain PII)
    Web->>Bucket: Store UNMASKED event
    Note over Bucket: ⚠️ Unmasked data persists here
    Web->>Worker: Forward S3 reference (via Redis)
    Worker->>Bucket: Fetch unmasked event
    Worker->>Callback: POST OpenTelemetry object
    alt Callback succeeds
        Callback->>Worker: Return masked object
        Worker->>CH: Write MASKED data
    else Fail open (default)
        Worker->>CH: Write UNMASKED data, log warning
    else "Fail closed (FAIL_CLOSED=true)"
        Worker->>Worker: Drop event, log warning
    end
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
content/self-hosting/security/data-masking.mdx:55-61
**Blob storage data fate after masking is unspecified**

The new warning correctly flags that events land in blob storage *before* the masking callback runs, but it doesn't say whether that unmasked blob-storage copy is ever replaced or deleted after the Worker processes the masked version. Looking at the sequence diagram (SDK → Web → S3 (unmasked) → Worker → Callback → ClickHouse), there is no "update S3" or "delete from S3" step. If unmasked data persists in blob storage indefinitely, users deploying this feature for GDPR, HIPAA, or PCI DSS compliance may believe their storage layer is protected when only ClickHouse and downstream views receive masked data. A single sentence clarifying the blob-storage retention behavior (e.g., whether it is retained as-is, or whether client-side masking is required to prevent PII reaching blob storage at all) would prevent a compliance blind-spot for operators who read this warning and draw the wrong conclusion.

Reviews (1): Last reviewed commit: "improvements to clickhouse and data mask..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label May 22, 2026
@vercel
Copy link
Copy Markdown

vercel Bot commented May 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
langfuse-docs Error Error May 22, 2026 11:28am

Request Review

@dosubot dosubot Bot added the documentation Improvements or additions to documentation label May 22, 2026
@github-actions
Copy link
Copy Markdown

@claude review

Comment on lines +55 to +61
<Callout type="warning">

Server-side masking is a centralized safety net, not a replacement for client-side masking when sensitive data must never leave the application boundary.
In the self-hosted ingestion pipeline, events are written to the event blob storage bucket before the Worker calls the masking callback.
The callback masks data before it is processed into ClickHouse and downstream Langfuse views.

</Callout>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 security Blob storage data fate after masking is unspecified

The new warning correctly flags that events land in blob storage before the masking callback runs, but it doesn't say whether that unmasked blob-storage copy is ever replaced or deleted after the Worker processes the masked version. Looking at the sequence diagram (SDK → Web → S3 (unmasked) → Worker → Callback → ClickHouse), there is no "update S3" or "delete from S3" step. If unmasked data persists in blob storage indefinitely, users deploying this feature for GDPR, HIPAA, or PCI DSS compliance may believe their storage layer is protected when only ClickHouse and downstream views receive masked data. A single sentence clarifying the blob-storage retention behavior (e.g., whether it is retained as-is, or whether client-side masking is required to prevent PII reaching blob storage at all) would prevent a compliance blind-spot for operators who read this warning and draw the wrong conclusion.

Prompt To Fix With AI
This is a comment left during a code review.
Path: content/self-hosting/security/data-masking.mdx
Line: 55-61

Comment:
**Blob storage data fate after masking is unspecified**

The new warning correctly flags that events land in blob storage *before* the masking callback runs, but it doesn't say whether that unmasked blob-storage copy is ever replaced or deleted after the Worker processes the masked version. Looking at the sequence diagram (SDK → Web → S3 (unmasked) → Worker → Callback → ClickHouse), there is no "update S3" or "delete from S3" step. If unmasked data persists in blob storage indefinitely, users deploying this feature for GDPR, HIPAA, or PCI DSS compliance may believe their storage layer is protected when only ClickHouse and downstream views receive masked data. A single sentence clarifying the blob-storage retention behavior (e.g., whether it is retained as-is, or whether client-side masking is required to prevent PII reaching blob storage at all) would prevent a compliance blind-spot for operators who read this warning and draw the wrong conclusion.

How can I resolve this? If you propose a fix, please make it concise.

| `CLICKHOUSE_DB` | `default` | Name of the ClickHouse database to use. |
| `CLICKHOUSE_CLUSTER_ENABLED` | `true` | Whether to run ClickHouse commands `ON CLUSTER`. Set to `false` for single-container setups. |
| `LANGFUSE_AUTO_CLICKHOUSE_MIGRATION_DISABLED` | `false` | Whether to disable automatic ClickHouse migrations. |
| `CLICKHOUSE_READ_ONLY_URL` | | Optional read-only endpoint used for public-API reads and selected UI/filter read queries. Falls back to `CLICKHOUSE_URL` when unset. Reuses `CLICKHOUSE_USER`, `CLICKHOUSE_PASSWORD`, and `CLICKHOUSE_DB`. Only useful on [compute-compute separated](https://clickhouse.com/docs/cloud/reference/warehouses) clusters. See [Scaling](/self-hosting/configuration/scaling#clickhouse-read-only-url). |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The CLICKHOUSE_READ_ONLY_URL description is narrowed here (lines 37 and 120) to 'public-API reads and selected UI/filter read queries', but two sibling docs still carry the original broader wording and now contradict this page: content/self-hosting/configuration/index.mdx:25 (master env-var index) still says 'UI and public-API read queries on traces, observations, scores, and sessions', and content/self-hosting/configuration/scaling.mdx:77 — the page this row links to as 'See [Scaling]' — still says 'route UI and public-API read queries to the given endpoint'. Please update those two locations to match the new narrower wording so users following the link don't get conflicting answers about what the env var actually routes.

Extended reasoning...

The bug

This PR intentionally narrows the description of CLICKHOUSE_READ_ONLY_URL in content/self-hosting/deployment/infrastructure/clickhouse.mdx in two places:

  • Line 37 (env-var table): 'UI and public-API read queries on traces, observations, scores, and sessions' → 'public-API reads and selected UI/filter read queries'
  • Line 120 (Cloud/BYOC narrative): 'read-heavy UI and public-API traffic' → 'public-API reads and selected UI/filter read traffic'

But the same variable is described in two other docs that were not touched and now disagree with the new wording.

Where the contradictions live

  1. content/self-hosting/configuration/index.mdx:25 — the master self-hosting env-var index table — still says:

    Optional read-only endpoint used for UI and public-API read queries on traces, observations, scores, and sessions.

  2. content/self-hosting/configuration/scaling.mdx:72-77 — the deep-dive page that this PR's new row explicitly links to as the canonical reference via See [Scaling](/self-hosting/configuration/scaling#clickhouse-read-only-url) — still says:

    This keeps dashboard and public-API read traffic from contending with ingestion inserts… Set CLICKHOUSE_READ_ONLY_URL and Langfuse will route UI and public-API read queries to the given endpoint while writes, migrations, and ingestion continue to use CLICKHOUSE_URL.

Step-by-step proof a reader gets contradictory information

  1. A self-hosting operator lands on /self-hosting/deployment/infrastructure/clickhouse.mdx and reads the new row at line 37: only public-API reads and selected UI/filter read queries are routed.
  2. They click the in-row link See [Scaling](/self-hosting/configuration/scaling#clickhouse-read-only-url) to learn more.
  3. The target section on scaling.mdx claims all UI and public-API read queries are routed — a strictly broader scope than what the page they came from just promised.
  4. If they cross-check the central env-var index at /self-hosting/configuration, they see yet a third variant of the same description (broader scope plus a specific entity list 'traces, observations, scores, and sessions' that the new wording dropped).

Impact

Documentation-only inconsistency, no runtime behavior change. But it is directly caused by this PR (the prior state of the docs was self-consistent at the broader wording), and the inconsistency is one hop away from the edited row via a link the PR itself places. A user trying to decide whether to provision a separate ClickHouse read compute group gets three different answers about what reads will be served from it.

Fix

Update the two sibling occurrences to the narrowed wording introduced here:

  • content/self-hosting/configuration/index.mdx:25 — replace the env-var row description with wording consistent with the new clickhouse.mdx text.
  • content/self-hosting/configuration/scaling.mdx:72-77 — adjust 'dashboard and public-API read traffic' and 'route UI and public-API read queries' to reflect that only the public API plus selected UI/filter reads are routed.

Decide whether the dropped entity list ('traces, observations, scores, and sessions') should be retained anywhere, or whether 'selected UI/filter read queries' is sufficiently specific in the new framing.

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation lgtm This PR has been approved by a maintainer size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants