[SPARK-56882][SDP] Implement SCD1 Batch Processor; Target Column Projection by AnishMahto · Pull Request #55991 · apache/spark

AnishMahto · 2026-05-19T17:18:20Z

Approved AutoCDC SPIP: https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7

This is a stacked PR. Review incremental diff here: AnishMahto/spark@SPARK-56870-extend-microbatch-with-cdc-metadata...SPARK-56882-SCD1-project-target-columns-onto-microbatch

Preamble:

The SCD type 1 flow is a foreachBatch streaming query over an input change data feed. It is responsible for reconciling the incoming change data into a target table that follows SCD1 replication semantics.

SCD1 flows also maintain an "auxiliary" table that tracks the state of events received out of order (early-arriving events). Each microbatch must be reconciled against this auxiliary table as well, and must update the auxiliary table's state so that future microbatches see it.
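To make the reconciliation described above concrete, here is a minimal, Spark-free sketch in plain Python. It is not the implementation in this PR. It assumes that each change carries a key and a monotonically meaningful sequence value, and it assumes (this is only one plausible reading of the auxiliary table's role, not something this PR confirms) that the auxiliary state remembers the sequence of the latest delete per key so that a late, older upsert cannot resurrect a deleted row. `Change` and `reconcile_scd1` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Change:
    """Hypothetical change event: row=None marks a delete."""
    key: str
    seq: int
    row: Optional[Dict[str, Any]]


def reconcile_scd1(
    batch: List[Change],
    target: Dict[str, Tuple[int, Dict[str, Any]]],
    aux: Dict[str, int],
) -> None:
    """Apply one microbatch in place.

    target maps key -> (seq, row) for live rows.
    aux maps key -> seq of the latest delete seen (an assumed tombstone model).
    """
    for change in batch:
        live_seq = target[change.key][0] if change.key in target else -math.inf
        tombstone_seq = aux.get(change.key, -math.inf)
        # Skip any event that is not newer than what was already reconciled.
        if change.seq <= max(live_seq, tombstone_seq):
            continue
        if change.row is None:
            # Delete: drop the live row and remember the delete's sequence.
            target.pop(change.key, None)
            aux[change.key] = change.seq
        else:
            # Upsert: SCD1 keeps only the latest version of each key.
            target[change.key] = (change.seq, change.row)
            aux.pop(change.key, None)
```

In this sketch, a delete at sequence 3 followed in a later microbatch by an upsert at sequence 2 leaves the key deleted, while an upsert at sequence 4 restores it and clears the tombstone.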

Target Column Projection:

As per the SPIP and ChangeArgs.columnSelection, users can specify the set of columns that is actually persisted in the target table. Any column that is not selected should be dropped before the microbatch is merged into the target table.

We should project only the selected columns onto the microbatch so that its DataFrame has the correct shape before CDC processing and merging into the target table.
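As an illustration of the projection step, the sketch below models a microbatch as a list of Python dicts instead of a Spark DataFrame. It is not the code in this PR, and it does not reflect the actual shape of ChangeArgs.columnSelection: here the selection is assumed to be either `None` (keep every column) or an explicit list of column names to keep. `project_target_columns` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def project_target_columns(rows: List[Row], selected: Optional[List[str]]) -> List[Row]:
    """Return copies of rows that contain only the selected columns.

    selected=None is assumed to mean "no selection given, keep all columns".
    """
    if selected is None:
        return [dict(row) for row in rows]
    projected = []
    for row in rows:
        missing = [col for col in selected if col not in row]
        if missing:
            # Fail loudly instead of silently emitting rows with a wrong shape.
            raise KeyError(f"selected columns not found in row: {missing}")
        # Build the row in selection order so every row has the same shape.
        projected.append({col: row[col] for col in selected})
    return projected


microbatch = [
    {"id": 1, "name": "a", "debug_payload": "x"},
    {"id": 2, "name": "b", "debug_payload": "y"},
]
print(project_target_columns(microbatch, ["id", "name"]))
```

Projecting before the merge means the later steps only ever see rows that already match the target table's column set.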

AnishMahto added 21 commits May 12, 2026 21:02

Introduce ChangeArgs

8b08cbe

linting

202f3a5

reorder error condition

4ac75e7

PR feedback

11606c5

linting

d1a38e6

PR feedback

bbe5335

buff error message and revert to case class

95ca0e1

test UnqualifiedColumnName('col')

481ca9f

minor test buff

0126659

address PR feedbak

ac15be5

Implement deduplicateMicrobatch

248b57c

indenting cleanup

ef526af

schema comment

b707511

casing

267e64e

linting

5deb653

validation

2f0865f

buff scaladoc

5eecda7

use spark resolver

032206d

lingint

9a566ff

project target columns onto microbatch

dec7426

reuse applyToSchema

1be3cba

AnishMahto wants to merge 21 commits into apache:master from AnishMahto:SPARK-56882-SCD1-project-target-columns-onto-microbatch
