Skip to content

[SPARK-55720][SS] Deprecate MapPartitions in Streaming Real-Time Mode#54523

Open
eason-yuchen-liu wants to merge 2 commits intoapache:masterfrom
eason-yuchen-liu:deprecateMapPartitionsInRTM
Open

[SPARK-55720][SS] Deprecate MapPartitions in Streaming Real-Time Mode#54523
eason-yuchen-liu wants to merge 2 commits intoapache:masterfrom
eason-yuchen-liu:deprecateMapPartitionsInRTM

Conversation

@eason-yuchen-liu
Copy link
Contributor

What changes were proposed in this pull request?

This PR proposes remove operator MapPartitions from the allowlist of Streaming Real-Time Mode. Users are encouraged to use Map + Filter + Explode instead.

Why are the changes needed?

MapPartitions takes in the iterator of the entire input, and produce the iterator of the entire output. Although giving users the full flexibility, these APIs are very easy to misuse in Streaming Real-Time Mode, where the streamlined behavior of an operator is required. Streamlining means no prefetching or buffering that when the operator consumes one input, it outputs all the results one by one. Users can easily violate the streamline requirement by calling input.hasNext twice or input.toSeq right now. Also, this operator makes potential future features difficult, for example Control Message Propagation.

Does this PR introduce any user-facing change?

Yes. Using MapPartitions in RTM now results in STREAMING_REAL_TIME_MODE.OPERATOR_OR_SINK_NOT_IN_ALLOWLIST exception. This is different from the behavior in Spark 4.1, where RTM is first released.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant