feat: pipes and datasources for CDP aggs (CDP-804) #3714
Conversation
Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability. Please add a Jira issue key to your PR title.
services/libs/tinybird/pipes/cdp_member_aggregates_bucket_backfiller_sink.pipe
Outdated
services/libs/tinybird/pipes/cdp_member_aggregates_bucket_backfiller_sink.pipe
Outdated
services/libs/tinybird/pipes/cdp_organization_aggregates_changed_parent_segments_sink.pipe
Outdated
```sql
    from cdp_member_segment_aggregates_ds
    where updatedAt >= toStartOfDay(toTimeZone(now(), 'Europe/Berlin') - INTERVAL 1 DAY)
)
GROUP BY segmentId, memberId, updatedAt
```
GROUP BY includes updatedAt causing incorrect aggregation
High Severity
The GROUP BY clause includes updatedAt from the table column, which will produce multiple rows per (segmentId, memberId/organizationId) pair when there are multiple distinct updatedAt values. The parent and grandparent segment pipes correctly use GROUP BY parentId, memberId without updatedAt. The updatedAt in GROUP BY references the table column, not the now() alias, causing partial instead of full aggregations.
Additional Locations (1)
services/libs/tinybird/pipes/cdp_organization_aggregates_bucket_backfiller_sink.pipe
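A sketch of the suggested fix, grouping only by the entity keys as the parent and grandparent pipes do. The aggregate column shown is illustrative; the actual pipe selects more merge-state columns than are quoted here.

```sql
-- Sketch: drop the source-column updatedAt from GROUP BY so each
-- (segmentId, memberId) pair merges into a single row, and emit the
-- export time via now() instead. Column list is abbreviated.
SELECT
    segmentId,
    memberId,
    countMerge(activityCountState) AS activityCount,
    now() AS updatedAt
FROM cdp_member_segment_aggregates_ds
WHERE updatedAt >= toStartOfDay(toTimeZone(now(), 'Europe/Berlin') - INTERVAL 1 DAY)
GROUP BY segmentId, memberId
```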
```sql
groupArrayDistinctMerge(activeOnState) AS activeOn,
countMerge(activityCountState) AS activityCount,
countDistinctMerge(memberCountState) as memberCount,
round(avgMerge(avgContributorEngagement)) AS avgContributorEngagement,
```
Inconsistent round() usage causes data precision mismatch
Medium Severity
The backfiller pipe uses round(avgMerge(avgContributorEngagement)) while the changed segments sink pipes use avgMerge(avgContributorEngagement) without round(). Both export to the same Kafka topic (organizationSegmentsAgg_sink), causing data from backfills to have different precision than incremental updates. This inconsistency affects data integrity downstream.
Additional Locations (2)
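One way to resolve the mismatch, assuming rounding is the intended behaviour for this field: apply the same expression in both the backfiller and the changed-segments sinks (or drop `round()` from the backfiller instead, if full precision is wanted downstream).

```sql
-- Use one expression everywhere that writes to organizationSegmentsAgg_sink:
round(avgMerge(avgContributorEngagement)) AS avgContributorEngagement
```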
```sql
        required=False,
    )
}}
{% end %}
```
Backfillers include NULL parent/grandparent IDs without bucket_id
Medium Severity
When bucket_id is not defined, the member backfiller pipes have no WHERE clause and will include segments with NULL parentId or grandparentId, producing records with NULL segmentId. The daily changed segments sinks filter these out via their IN subqueries (since NULL doesn't match IN clauses), creating inconsistent behavior between backfill and incremental exports.
Additional Locations (1)
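A sketch of how the backfiller could match the incremental sinks, assuming Tinybird template syntax as used elsewhere in these pipes; the bucket condition is a placeholder for whatever filter the pipe already applies when `bucket_id` is defined.

```sql
-- Sketch: always exclude rows with no parent segment so the backfiller
-- never emits NULL segmentId, regardless of whether bucket_id is set.
WHERE parentId IS NOT NULL
{% if defined(bucket_id) %}
  AND <existing bucket_id condition>  -- unchanged, illustrative placeholder
{% end %}
```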
```shell
    Format files containing "cdp_organization" sequentially

EOF
exit 0
```
Script exits with success code on argument errors
Low Severity
The show_help function always exits with code 0, but it's also called from error conditions (missing --match argument, unknown option). This causes the script to report success when it actually failed due to invalid arguments, which could mislead automated pipelines or CI/CD systems that rely on exit codes.
Additional Locations (2)
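A minimal sketch of the fix, assuming a `show_help` function like the one in the script (the usage text here is illustrative): let the caller pass an exit status so error paths fail loudly while `--help` still exits 0.

```shell
# Sketch: show_help takes an optional exit status, defaulting to 0.
# Error paths (missing --match argument, unknown option) call it with
# a non-zero status so CI/CD systems see the failure.
show_help() {
  cat <<'EOF'
Usage: format.sh [--help] [--match PATTERN] [--sequential]
EOF
  exit "${1:-0}"
}

# Demonstration: error path vs help path (subshells so the script continues).
( show_help 1 > /dev/null ); echo "error exit: $?"
( show_help > /dev/null ); echo "help exit: $?"
```

Call sites in the error branches then become `show_help 1` (and should write the usage text to stderr there, if desired), while `--help` keeps calling `show_help` with no argument.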
…wdDotDev/crowd.dev into feat/cdp-aggs-through-tinybird-sinks
```sql
    from cdp_member_segment_aggregates_ds
    where updatedAt >= toStartOfDay(toTimeZone(now(), 'Europe/Berlin') - INTERVAL 1 DAY)
)
GROUP BY segmentId, memberId, updatedAt
```
GROUP BY includes updatedAt causing incorrect aggregation
High Severity
The GROUP BY clause includes updatedAt, which prevents proper merging of aggregate states. Unlike the parent and grandparent segment pipes which group only by segmentId, memberId, this causes each distinct updatedAt value in the source table to produce a separate row with partial aggregation results. The query outputs now() as updatedAt, but the GROUP BY references the source column, resulting in multiple incorrectly aggregated rows per entity being sent to Kafka.
Additional Locations (1)
```sql
countMerge(activityCountState) AS activityCount,
countDistinctMerge(memberCountState) as memberCount,
round(avgMerge(avgContributorEngagement)) AS avgContributorEngagement,
max(updatedAt) AS updatedAt
```
Organization backfiller uses different updatedAt semantics than other pipes
Medium Severity
The organization backfiller uses max(updatedAt) to preserve source timestamps, while all member backfillers and all changed segment sinks use now() for the export timestamp. This inconsistency means the updatedAt field has different semantics depending on how data reaches Kafka, which could confuse downstream systems expecting uniform timestamp behavior.
Additional Locations (2)
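A sketch of the alignment, assuming the uniform `now()` semantics of the other pipes are what downstream consumers expect; the surrounding columns are abbreviated.

```sql
-- Sketch: emit the export time, matching the member backfillers and the
-- changed-segment sinks, instead of preserving the source timestamp.
SELECT
    segmentId,
    countMerge(activityCountState) AS activityCount,
    now() AS updatedAt  -- was: max(updatedAt) AS updatedAt
FROM cdp_organization_segment_aggregates_ds
GROUP BY segmentId
```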
```sql
    from cdp_member_segment_aggregates_ds
    where updatedAt >= toStartOfDay(toTimeZone(now(), 'Europe/Berlin') - INTERVAL 1 DAY)
)
)
```
Missing filter for empty parentId/grandparentId causes incorrect aggregation
High Severity
The segments datasource defaults parentId and grandparentId to empty string for top-level segments. The parent and grandparent aggregation queries don't filter out empty values, causing all segments without parents/grandparents to be aggregated together into a single row with segmentId = ''. This combines unrelated data and produces invalid segment identifiers for Kafka export.
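A sketch of the missing filter, given that the segments datasource defaults `parentId` and `grandparentId` to the empty string for top-level segments; the grandparent query would add the analogous `grandparentId != ''` condition.

```sql
-- Sketch: skip top-level segments before aggregating by parent, so no
-- merged row is exported to Kafka with segmentId = ''.
WHERE parentId != ''
  AND updatedAt >= toStartOfDay(toTimeZone(now(), 'Europe/Berlin') - INTERVAL 1 DAY)
```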
Note
Introduces Tinybird infrastructure for CDP aggregates across members and organizations.
- `cdp_member_segment_aggregates_ds` and `cdp_organization_segment_aggregates_ds` (AggregatingMergeTree) to store per-segment aggregate states
- `@on-demand` backfiller and changed-segment sink pipes
- `scripts/format.sh` adding `--help`, `--match`, and `--sequential` options

Written by Cursor Bugbot for commit 76c850c. This will update automatically on new commits.