## Overview
We should allow cube authors to declare materialization configuration via cube YAML. When deployed, the deployment process orchestrates the full three-step materialization flow behind the scenes: planning pre-agg records, scheduling pre-agg Spark workflows, and scheduling the Druid cube ingestion workflow.
## Cube YAML with Materialization

```yaml
name: ${prefix}my_cube
node_type: cube
metrics: [...]
dimensions: [...]
materialization:
  schedule: "0 6 * * *"       # cron schedule
  strategy: incremental_time  # or: full
  lookback_window: 1 DAY      # only relevant for incremental_time
  partition:                  # required if strategy is incremental_time
    dimension: shared.dims.date
    granularity: DAY          # or: HOUR
  backfill_from: "20250101"   # optional; if set, backfills from this date to today
```
## Deployment Flow

When a cube with a `materialization` block is deployed, the deployment process:

1. `POST /preaggs/plan`: plans and creates pre-agg records
2. `POST /preaggs/{id}/materialize`: schedules the Spark workflow for each pre-agg
3. `POST /cubes/{name}/materialize`: schedules the Druid ingestion workflow
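The three calls above can be sketched as a small orchestration function. This is a minimal sketch, assuming a `post` callable (e.g. a thin wrapper around `requests.post`); the request payload shapes are illustrative assumptions, not the actual API contract:

```python
def deploy_materialization(post, cube_name: str, config: dict) -> dict:
    """Orchestrate the three-step materialization flow for one cube.

    `post` is any callable post(path, json=...) -> dict. The JSON bodies
    below are hypothetical; only the endpoint paths come from the design.
    """
    # Step 1: plan and create pre-agg records for the cube
    plan = post("/preaggs/plan", json={"cube": cube_name})

    # Step 2: schedule the Spark workflow for each planned pre-agg
    for preagg in plan["preaggs"]:
        post(f"/preaggs/{preagg['id']}/materialize",
             json={"schedule": config["schedule"]})

    # Step 3: schedule the Druid cube ingestion workflow
    return post(f"/cubes/{cube_name}/materialize",
                json={"schedule": config["schedule"]})
```

Because the steps are sequential and step 2 depends on the plan returned by step 1, a failure at any step leaves later workflows unscheduled, which keeps the deployment failure mode simple.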
## Partition Resolution
The user declares the partition once at the cube level as a dimension reference. DJ derives the physical partition column and format for each pre-agg automatically by looking up how that dimension is linked on the upstream node. This means users think in terms of dimensions rather than physical column names, and the correct column is resolved per pre-agg without any additional configuration.
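A rough sketch of that lookup, assuming each pre-agg carries a record of how dimensions are linked on its upstream node (the `dimension_links` structure and its field names are hypothetical):

```python
def resolve_partition_column(preagg: dict, partition_dimension: str) -> str:
    """Resolve the physical column backing the declared partition dimension.

    `preagg["dimension_links"]` is a hypothetical mapping from a dimension
    reference (e.g. "shared.dims.date") to the physical column it is
    linked to on the pre-agg's upstream node.
    """
    links = preagg["dimension_links"]
    if partition_dimension not in links:
        raise ValueError(
            f"Dimension {partition_dimension!r} is not linked on the "
            f"upstream node of pre-agg {preagg['id']!r}"
        )
    return links[partition_dimension]["column"]
```

The point of the indirection is that two pre-aggs can resolve the same declared dimension to differently named physical columns without any extra user configuration.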
## Validation

- `partition` is required when `strategy: incremental_time`; deployment fails with a clear error if it is omitted.
- `lookback_window` is ignored if `strategy: full`.
- If `backfill_from` and `backfill_to` are both set, the backfill runs between those two dates. If `backfill_to` is not set, it defaults to today.
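These rules could be enforced at deploy time roughly as follows. A sketch only: the error message, the normalization of the config dict, and the `YYYYMMDD` default for `backfill_to` are illustrative assumptions:

```python
from datetime import date


def validate_materialization(config: dict) -> dict:
    """Validate and normalize a cube's materialization block (sketch)."""
    cfg = dict(config)

    # partition is required for incremental_time; fail fast with a clear error
    if cfg.get("strategy") == "incremental_time" and "partition" not in cfg:
        raise ValueError(
            "materialization.partition is required when "
            "strategy is incremental_time"
        )

    # lookback_window is meaningless for full refreshes, so drop it
    if cfg.get("strategy") == "full":
        cfg.pop("lookback_window", None)

    # backfill_to defaults to today when only backfill_from is given
    if "backfill_from" in cfg and "backfill_to" not in cfg:
        cfg["backfill_to"] = date.today().strftime("%Y%m%d")

    return cfg
```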
## Pre-agg Level Spark Config

Pre-agg names are content-addressed and not user-controllable, so there is no stable handle for attaching per-pre-agg config in YAML. Instead, Spark execution hints for pre-agg computation are declared on dimension links (see #1910).