[SPARK-56923][SDP] Implement SCD1 Batch Processor; Merge Microbatch onto Auxiliary#55995
[SPARK-56923][SDP] Implement SCD1 Batch Processor; Merge Microbatch onto Auxiliary#55995AnishMahto wants to merge 21 commits into
Conversation
szehon-ho
left a comment
There was a problem hiding this comment.
Reviewed the incremental diff on top of SPARK-56249-merge-tombstones-onto-microbatch (stacked on #55993). The mergeMicrobatchOntoAuxiliaryTable implementation and Scd1BatchProcessorMergeSuite coverage look solid overall — merge clauses match the stated aux-table semantics, projection to keys + _cdc_metadata is correct, and tie-breaking/idempotency cases are well documented in tests.
Suggestions
-
Add merge tests for composite and dotted keys —
deduplicateMicrobatchandapplyTombstonesToMicrobatchalready have coverage for composite keys (region,customer_id) and literal-dot names (`user.id`), butScd1BatchProcessorMergeSuiteonly uses a singleidkey. SincemergeMicrobatchOntoAuxiliaryTableuses the samek.quoted/doKeysMatchplumbing, a lightweight test (e.g. advance tombstone or insert for composite keys, and one dotted-key case) would lock in MERGE/catalog behavior without duplicating the full merge matrix. -
PR description typo — "reconcile agianst" → "reconcile against".
Thanks for the stacked compare link — much easier to review just this slice.
Approved AutoCDC SPIP: https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7
This is a stacked PR. Review incremental diff here: AnishMahto/spark@SPARK-56249-merge-tombstones-onto-microbatch...SPARK-56923-SCD1-merge-microbatch-onto-auxiliary
Link to previous PR: #55993
Preamble:
The SCD type 1 flow is a foreachBatch streaming query on an input change-data-feed, and is responsible for reconciling the incoming change data onto some target table that follows SCD1 replication semantics.
SCD1 flows also maintain an "auxiliary" table to keep track of early-arriving out-of-order received events state. Each microbatch will need to reconcile against this auxiliary table as well, and update the auxiliary table's state appropriately for future microbatches.
Merge Microbatch onto Auxiliary:
The auxiliary table should be updated such that;
Implement this update operation as a MERGE.