[SPARK-55792][PS] Optimize DataFrame diff axis=0 #55899

Open

emanhthangngot wants to merge 2 commits into apache:master from emanhthangngot:SPARK-55792
Conversation

@emanhthangngot

What changes were proposed in this pull request?

This PR optimizes pandas-on-Spark DataFrame.diff(axis=0) and Series.diff() to avoid using an unpartitioned Spark Window.

The new implementation range-partitions by the natural order column, computes pandas diff() within each Spark partition, and exchanges only the boundary rows needed to preserve correctness across partition boundaries. It also keeps the existing grouped diff() path unchanged.
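The boundary-row idea can be illustrated in plain pandas. This is a minimal sketch, not the PR's actual Spark code: `chunked_diff` is a hypothetical helper, a list of Series chunks stands in for Spark partitions, and for simplicity it assumes `|periods|` is at most one chunk's length.

```python
import pandas as pd

def chunked_diff(chunks, periods=1):
    """Positional diff over a list of Series chunks (stand-ins for Spark
    partitions). Each chunk only needs |periods| boundary rows from one
    neighbor, so no worker ever holds the whole dataset.

    Hypothetical sketch: assumes |periods| <= len(chunk) for every chunk.
    """
    out = []
    for i, chunk in enumerate(chunks):
        if periods >= 0:
            # Borrow the last `periods` rows of the previous chunk.
            boundary = chunks[i - 1].tail(periods) if i > 0 else chunk.iloc[:0]
            padded = pd.concat([boundary, chunk]) if len(boundary) else chunk
            # Drop the borrowed rows again after diffing.
            out.append(padded.diff(periods).iloc[len(boundary):])
        else:
            # Borrow the first |periods| rows of the next chunk.
            boundary = chunks[i + 1].head(-periods) if i + 1 < len(chunks) else chunk.iloc[:0]
            padded = pd.concat([chunk, boundary]) if len(boundary) else chunk
            out.append(padded.diff(periods).iloc[:len(chunk)])
    return pd.concat(out, ignore_index=True)
```

Under these assumptions, concatenating the per-chunk results reproduces the single-machine `full.diff(periods)` for positive, zero, and negative periods alike, including the NaN rows at the global edges.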

Additional tests cover:

  • absence of a Window in the analyzed plan for DataFrame.diff()
  • empty DataFrames
  • MultiIndex rows
  • null values
  • single-partition execution
  • zero, negative, and large periods
  • cross-partition boundary rows
  • Series.diff() delegation

Why are the changes needed?

DataFrame.diff(axis=0) currently delegates to Series._diff() without a partition specification. This creates a Spark Window over the whole DataFrame ordered by the natural order column, which can force all data into a single partition and cause scaling issues for large datasets.
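For reference, the positional contract that the Window `lag` expression encodes is just subtract-a-shift, for positive and negative periods alike. A plain-pandas check (shown only to pin down the semantics the new path must preserve, not code from this PR):

```python
import pandas as pd

s = pd.Series([3.0, 5.0, 2.0, 8.0])

# diff(p) is positionally equivalent to s - s.shift(p) ...
assert s.diff(2).equals(s - s.shift(2))
# ... including for negative periods, which look "forward".
assert s.diff(-1).equals(s - s.shift(-1))
```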

This change removes that unpartitioned Window from the DataFrame.diff(axis=0) / Series.diff() path while preserving pandas-compatible positional diff semantics, including rows at partition boundaries.

Does this PR introduce any user-facing change?

Yes. DataFrame.diff(axis=0) and Series.diff() no longer use the previous unpartitioned Window execution path, so query plans and partitioning behavior change; the returned values are unchanged.

How was this patch tested?

Ran:

python/run-tests --python-executables .venv/bin/python --testnames pyspark.pandas.tests.computation.test_compute

The test was run from a temporary path without spaces because the local checkout path contains spaces and Spark's Java launcher fails to start from that path.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex (GPT-5)

Codex was used to help inspect the existing implementation, identify the unpartitioned Window path, refine the patch, and prepare tests. The final changes were reviewed and validated by the author.

emanhthangngot force-pushed the SPARK-55792 branch 3 times, most recently from 65c6499 to a33e14a on May 15, 2026 at 17:00
… dtype

- Apply ruff format corrections to frame.py (dict comprehension layout, slice spacing)

- Remove rowsBetween from lag window in Series._diff for Spark Connect compatibility

- Update test_groupby_diff expectations to float dtype (remove .astype(int) cast)