[SPARK-56448][CONNECT] Fix NPE on Spark Connect client restart due to ammonite compile cache#55720
[SPARK-56448][CONNECT] Fix NPE on Spark Connect client restart due to ammonite compile cache#55720yadavay-amzn wants to merge 1 commit into
Conversation
attilapiros
left a comment
There was a problem hiding this comment.
I think this test should be refined to check for the actual NPE rather than the internal state of dirOpt (while this state currently triggers the crash in this Ammonite version, it's an indirect check).
0789e10 to
da81606
Compare
|
Thanks for the review @attilapiros. Updated the test to exercise the restart scenario directly: it calls Does this approach work? |
|
@yadavay-amzn please doublecheck whether it fails when |
da81606 to
a6485d6
Compare
|
@attilapiros I verified this both in-process (calling Updated the test to directly assert that |
|
@yadavay-amzn You can do something similar to what |
a6485d6 to
ba77fd6
Compare
|
@attilapiros Thanks for the suggestion! Added Kept the existing |
3139545 to
7e4bad9
Compare
7e4bad9 to
b8d51a3
Compare
|
Done -- renamed to |
attilapiros
left a comment
There was a problem hiding this comment.
We are close to finalizing!
… ammonite compile cache The Spark Connect REPL uses Ammonite. Ammonite's default Storage.Folder persists compiled predef classes under ~/.ammonite/<version>/cache. On a subsequent REPL start from the same working directory, the cached CodePredef class is reloaded but its reference to the per-session ArgsPredef helper is stale, producing a NullPointerException during predef initialization. Use Storage.InMemory for the Connect REPL's compile cache so every session starts fresh. Extract the Main construction into a package- private helper to keep the test localised to unit-level. Regression test added: asserts that the ammonite.Main returned by ConnectRepl.newAmmoniteMain exposes no on-disk cache directory (storageBackend.dirOpt.isEmpty) and is an instance of Storage.InMemory. The dirOpt assertion is an observable behavioural check -- if the Storage.InMemory wiring is reverted, ammonite.Main falls back to Storage.Folder(~/.ammonite) and the test fails with a clear message rather than silently compiling.
b8d51a3 to
82d6648
Compare
|
Done -- reverted the method extraction. The fix is now minimal: just the |
|
@yadavay-amzn I have updated the PR description please doublecheck it |
|
@attilapiros Thanks for all the feedback and review! |
|
I’ll leave this open for another 2–3 days to ensure everyone has a chance to provide feedback before I merge. |
… ammonite compile cache ### What changes were proposed in this pull request? The Spark Connect REPL uses Ammonite. Ammonite's default `Storage.Folder` persists compiled predef classes under `~/.ammonite/<version>/cache`. On a subsequent REPL start from the same working directory, the cached `CodePredef` class is reloaded but its reference to the per-session `ArgsPredef` helper is stale, producing a `NullPointerException` during predef initialization. This PR switches the Connect REPL's compile cache to `Storage.InMemory` so every session starts fresh and no stale cache is carried across restarts. ### Why are the changes needed? The stale-cache failure is a user-visible crash on every every subsequent call of `bin/spark-shell --remote sc://...` from the same working directory. Reproduction steps are on the JIRA. ### Does this PR introduce _any_ user-facing change? There is one minor observable tradeoff: because the compile cache is now in-memory rather than persisted, each REPL start recompiles the predef instead of reading the cached classfiles. This adds ~a few hundred milliseconds to subsequent REPL startups but eliminates the NPE. We believe this is the correct tradeoff — a small startup cost is preferable to a hard failure. ### How was this patch tested? Added `AmmoniteReplE2ESuite` with a test starting `bin/spark-shell --remote sc://...` twice and checking both run was successful. I verified the negative case locally by temporarily reverting only the `Storage.InMemory()` line and re-running the test; it fails with: ``` - SPARK-56448: restarting spark-shell --remote does not throw NPE *** FAILED *** 1 did not equal 0 Second spark-shell failed (exit=1): WARNING: Using incubator modules: jdk.incubator.vector Exception in thread "main" java.lang.NullPointerException: Cannot invoke "ammonite.predef.ArgsPredef$Helper.spark()" because the return value of "ammonite.predef.CodePredef.ArgsPredef()" is null at ammonite.predef.CodePredef$Helper.<init>((console):7) at ammonite.predef.CodePredef$.<clinit>((console):6) at ammonite.predef.CodePredef.$main((console)) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:569) at ammonite.runtime.Evaluator$$anon$1.$anonfun$evalMain$1(Evaluator.scala:108) at ammonite.util.Util$.withContextClassloader(Util.scala:21) at ammonite.runtime.Evaluator$$anon$1.evalMain(Evaluator.scala:90) at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$10(Interpreter.scala:594) at ammonite.util.Res$Success.map(Res.scala:63) at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$9(Interpreter.scala:594) at scala.Option$WithFilter.map(Option.scala:242) at ammonite.interp.Interpreter.loop$1(Interpreter.scala:574) at ammonite.interp.Interpreter.processAllScriptBlocks(Interpreter.scala:644) at ammonite.interp.Interpreter.$anonfun$processModule$6(Interpreter.scala:432) at ammonite.util.Catching.flatMap(Res.scala:110) at ammonite.interp.Interpreter.$anonfun$processModule$5(Interpreter.scala:423) ... ``` Restoring the fix makes the test pass. ### Was this patch authored or co-authored using generative AI tooling? Yes Closes #55720 from yadavay-amzn/fix/spark_SPARK-56448. Authored-by: Anupam Yadav <anupamya@amazon.com> Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com> (cherry picked from commit 3e83503) Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
… ammonite compile cache ### What changes were proposed in this pull request? The Spark Connect REPL uses Ammonite. Ammonite's default `Storage.Folder` persists compiled predef classes under `~/.ammonite/<version>/cache`. On a subsequent REPL start from the same working directory, the cached `CodePredef` class is reloaded but its reference to the per-session `ArgsPredef` helper is stale, producing a `NullPointerException` during predef initialization. This PR switches the Connect REPL's compile cache to `Storage.InMemory` so every session starts fresh and no stale cache is carried across restarts. ### Why are the changes needed? The stale-cache failure is a user-visible crash on every every subsequent call of `bin/spark-shell --remote sc://...` from the same working directory. Reproduction steps are on the JIRA. ### Does this PR introduce _any_ user-facing change? There is one minor observable tradeoff: because the compile cache is now in-memory rather than persisted, each REPL start recompiles the predef instead of reading the cached classfiles. This adds ~a few hundred milliseconds to subsequent REPL startups but eliminates the NPE. We believe this is the correct tradeoff — a small startup cost is preferable to a hard failure. ### How was this patch tested? Added `AmmoniteReplE2ESuite` with a test starting `bin/spark-shell --remote sc://...` twice and checking both run was successful. I verified the negative case locally by temporarily reverting only the `Storage.InMemory()` line and re-running the test; it fails with: ``` - SPARK-56448: restarting spark-shell --remote does not throw NPE *** FAILED *** 1 did not equal 0 Second spark-shell failed (exit=1): WARNING: Using incubator modules: jdk.incubator.vector Exception in thread "main" java.lang.NullPointerException: Cannot invoke "ammonite.predef.ArgsPredef$Helper.spark()" because the return value of "ammonite.predef.CodePredef.ArgsPredef()" is null at ammonite.predef.CodePredef$Helper.<init>((console):7) at ammonite.predef.CodePredef$.<clinit>((console):6) at ammonite.predef.CodePredef.$main((console)) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:569) at ammonite.runtime.Evaluator$$anon$1.$anonfun$evalMain$1(Evaluator.scala:108) at ammonite.util.Util$.withContextClassloader(Util.scala:21) at ammonite.runtime.Evaluator$$anon$1.evalMain(Evaluator.scala:90) at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$10(Interpreter.scala:594) at ammonite.util.Res$Success.map(Res.scala:63) at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$9(Interpreter.scala:594) at scala.Option$WithFilter.map(Option.scala:242) at ammonite.interp.Interpreter.loop$1(Interpreter.scala:574) at ammonite.interp.Interpreter.processAllScriptBlocks(Interpreter.scala:644) at ammonite.interp.Interpreter.$anonfun$processModule$6(Interpreter.scala:432) at ammonite.util.Catching.flatMap(Res.scala:110) at ammonite.interp.Interpreter.$anonfun$processModule$5(Interpreter.scala:423) ... ``` Restoring the fix makes the test pass. ### Was this patch authored or co-authored using generative AI tooling? Yes Closes #55720 from yadavay-amzn/fix/spark_SPARK-56448. Authored-by: Anupam Yadav <anupamya@amazon.com> Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com> (cherry picked from commit 3e83503) Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
… ammonite compile cache ### What changes were proposed in this pull request? The Spark Connect REPL uses Ammonite. Ammonite's default `Storage.Folder` persists compiled predef classes under `~/.ammonite/<version>/cache`. On a subsequent REPL start from the same working directory, the cached `CodePredef` class is reloaded but its reference to the per-session `ArgsPredef` helper is stale, producing a `NullPointerException` during predef initialization. This PR switches the Connect REPL's compile cache to `Storage.InMemory` so every session starts fresh and no stale cache is carried across restarts. ### Why are the changes needed? The stale-cache failure is a user-visible crash on every every subsequent call of `bin/spark-shell --remote sc://...` from the same working directory. Reproduction steps are on the JIRA. ### Does this PR introduce _any_ user-facing change? There is one minor observable tradeoff: because the compile cache is now in-memory rather than persisted, each REPL start recompiles the predef instead of reading the cached classfiles. This adds ~a few hundred milliseconds to subsequent REPL startups but eliminates the NPE. We believe this is the correct tradeoff — a small startup cost is preferable to a hard failure. ### How was this patch tested? Added `AmmoniteReplE2ESuite` with a test starting `bin/spark-shell --remote sc://...` twice and checking both run was successful. I verified the negative case locally by temporarily reverting only the `Storage.InMemory()` line and re-running the test; it fails with: ``` - SPARK-56448: restarting spark-shell --remote does not throw NPE *** FAILED *** 1 did not equal 0 Second spark-shell failed (exit=1): WARNING: Using incubator modules: jdk.incubator.vector Exception in thread "main" java.lang.NullPointerException: Cannot invoke "ammonite.predef.ArgsPredef$Helper.spark()" because the return value of "ammonite.predef.CodePredef.ArgsPredef()" is null at ammonite.predef.CodePredef$Helper.<init>((console):7) at ammonite.predef.CodePredef$.<clinit>((console):6) at ammonite.predef.CodePredef.$main((console)) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:569) at ammonite.runtime.Evaluator$$anon$1.$anonfun$evalMain$1(Evaluator.scala:108) at ammonite.util.Util$.withContextClassloader(Util.scala:21) at ammonite.runtime.Evaluator$$anon$1.evalMain(Evaluator.scala:90) at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$10(Interpreter.scala:594) at ammonite.util.Res$Success.map(Res.scala:63) at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$9(Interpreter.scala:594) at scala.Option$WithFilter.map(Option.scala:242) at ammonite.interp.Interpreter.loop$1(Interpreter.scala:574) at ammonite.interp.Interpreter.processAllScriptBlocks(Interpreter.scala:644) at ammonite.interp.Interpreter.$anonfun$processModule$6(Interpreter.scala:432) at ammonite.util.Catching.flatMap(Res.scala:110) at ammonite.interp.Interpreter.$anonfun$processModule$5(Interpreter.scala:423) ... ``` Restoring the fix makes the test pass. ### Was this patch authored or co-authored using generative AI tooling? Yes Closes #55720 from yadavay-amzn/fix/spark_SPARK-56448. Authored-by: Anupam Yadav <anupamya@amazon.com> Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com> (cherry picked from commit 3e83503) Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
… ammonite compile cache ### What changes were proposed in this pull request? The Spark Connect REPL uses Ammonite. Ammonite's default `Storage.Folder` persists compiled predef classes under `~/.ammonite/<version>/cache`. On a subsequent REPL start from the same working directory, the cached `CodePredef` class is reloaded but its reference to the per-session `ArgsPredef` helper is stale, producing a `NullPointerException` during predef initialization. This PR switches the Connect REPL's compile cache to `Storage.InMemory` so every session starts fresh and no stale cache is carried across restarts. ### Why are the changes needed? The stale-cache failure is a user-visible crash on every every subsequent call of `bin/spark-shell --remote sc://...` from the same working directory. Reproduction steps are on the JIRA. ### Does this PR introduce _any_ user-facing change? There is one minor observable tradeoff: because the compile cache is now in-memory rather than persisted, each REPL start recompiles the predef instead of reading the cached classfiles. This adds ~a few hundred milliseconds to subsequent REPL startups but eliminates the NPE. We believe this is the correct tradeoff — a small startup cost is preferable to a hard failure. ### How was this patch tested? Added `AmmoniteReplE2ESuite` with a test starting `bin/spark-shell --remote sc://...` twice and checking both run was successful. I verified the negative case locally by temporarily reverting only the `Storage.InMemory()` line and re-running the test; it fails with: ``` - SPARK-56448: restarting spark-shell --remote does not throw NPE *** FAILED *** 1 did not equal 0 Second spark-shell failed (exit=1): WARNING: Using incubator modules: jdk.incubator.vector Exception in thread "main" java.lang.NullPointerException: Cannot invoke "ammonite.predef.ArgsPredef$Helper.spark()" because the return value of "ammonite.predef.CodePredef.ArgsPredef()" is null at ammonite.predef.CodePredef$Helper.<init>((console):7) at ammonite.predef.CodePredef$.<clinit>((console):6) at ammonite.predef.CodePredef.$main((console)) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:569) at ammonite.runtime.Evaluator$$anon$1.$anonfun$evalMain$1(Evaluator.scala:108) at ammonite.util.Util$.withContextClassloader(Util.scala:21) at ammonite.runtime.Evaluator$$anon$1.evalMain(Evaluator.scala:90) at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$10(Interpreter.scala:594) at ammonite.util.Res$Success.map(Res.scala:63) at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$9(Interpreter.scala:594) at scala.Option$WithFilter.map(Option.scala:242) at ammonite.interp.Interpreter.loop$1(Interpreter.scala:574) at ammonite.interp.Interpreter.processAllScriptBlocks(Interpreter.scala:644) at ammonite.interp.Interpreter.$anonfun$processModule$6(Interpreter.scala:432) at ammonite.util.Catching.flatMap(Res.scala:110) at ammonite.interp.Interpreter.$anonfun$processModule$5(Interpreter.scala:423) ... ``` Restoring the fix makes the test pass. ### Was this patch authored or co-authored using generative AI tooling? Yes Closes #55720 from yadavay-amzn/fix/spark_SPARK-56448. Authored-by: Anupam Yadav <anupamya@amazon.com> Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com> (cherry picked from commit 3e83503) Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
… ammonite compile cache The Spark Connect REPL uses Ammonite. Ammonite's default `Storage.Folder` persists compiled predef classes under `~/.ammonite/<version>/cache`. On a subsequent REPL start from the same working directory, the cached `CodePredef` class is reloaded but its reference to the per-session `ArgsPredef` helper is stale, producing a `NullPointerException` during predef initialization. This PR switches the Connect REPL's compile cache to `Storage.InMemory` so every session starts fresh and no stale cache is carried across restarts. The stale-cache failure is a user-visible crash on every every subsequent call of `bin/spark-shell --remote sc://...` from the same working directory. Reproduction steps are on the JIRA. There is one minor observable tradeoff: because the compile cache is now in-memory rather than persisted, each REPL start recompiles the predef instead of reading the cached classfiles. This adds ~a few hundred milliseconds to subsequent REPL startups but eliminates the NPE. We believe this is the correct tradeoff — a small startup cost is preferable to a hard failure. Added `AmmoniteReplE2ESuite` with a test starting `bin/spark-shell --remote sc://...` twice and checking both run was successful. I verified the negative case locally by temporarily reverting only the `Storage.InMemory()` line and re-running the test; it fails with: ``` - SPARK-56448: restarting spark-shell --remote does not throw NPE *** FAILED *** 1 did not equal 0 Second spark-shell failed (exit=1): WARNING: Using incubator modules: jdk.incubator.vector Exception in thread "main" java.lang.NullPointerException: Cannot invoke "ammonite.predef.ArgsPredef$Helper.spark()" because the return value of "ammonite.predef.CodePredef.ArgsPredef()" is null at ammonite.predef.CodePredef$Helper.<init>((console):7) at ammonite.predef.CodePredef$.<clinit>((console):6) at ammonite.predef.CodePredef.$main((console)) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:569) at ammonite.runtime.Evaluator$$anon$1.$anonfun$evalMain$1(Evaluator.scala:108) at ammonite.util.Util$.withContextClassloader(Util.scala:21) at ammonite.runtime.Evaluator$$anon$1.evalMain(Evaluator.scala:90) at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$10(Interpreter.scala:594) at ammonite.util.Res$Success.map(Res.scala:63) at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$9(Interpreter.scala:594) at scala.Option$WithFilter.map(Option.scala:242) at ammonite.interp.Interpreter.loop$1(Interpreter.scala:574) at ammonite.interp.Interpreter.processAllScriptBlocks(Interpreter.scala:644) at ammonite.interp.Interpreter.$anonfun$processModule$6(Interpreter.scala:432) at ammonite.util.Catching.flatMap(Res.scala:110) at ammonite.interp.Interpreter.$anonfun$processModule$5(Interpreter.scala:423) ... ``` Restoring the fix makes the test pass. Yes Closes #55720 from yadavay-amzn/fix/spark_SPARK-56448. Authored-by: Anupam Yadav <anupamya@amazon.com> Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com> (cherry picked from commit 3e83503) Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
|
Merged into master and branch-4.x, branch-4.2, branch-4.1, branch-4.0, branch-3.5. Thanks @yadavay-amzn! |
| (process.exitValue(), stdout.mkString("\n"), stderr.mkString("\n")) | ||
| } | ||
|
|
||
| test("SPARK-56448: restarting spark-shell --remote does not throw NPE") { |
There was a problem hiding this comment.
This test case seems to introduce a flakiness. I didn't dig much, but have you observed the following failures in your CIs, @yadavay-amzn and @attilapiros ?
[info] AmmoniteReplE2ESuite:
[info] - SPARK-56448: restarting spark-shell --remote does not throw NPE *** FAILED *** (1 second, 975 milliseconds)
[info] 1 did not equal 0 First spark-shell failed (exit=1): WARNING: Using incubator modules: jdk.incubator.vector
[info] Exception in thread "main" java.lang.ExceptionInInitializerError
[info] at org.apache.arrow.memory.netty.DefaultAllocationManagerFactory.<clinit>(DefaultAllocationManagerFactory.java:26)
...
[info] Caused by: java.lang.UnsupportedOperationException
[info] at org.sparkproject.connect.client.io.netty.buffer.EmptyByteBuf.memoryAddress(EmptyByteBuf.java:961)
[info] at org.sparkproject.connect.client.io.netty.buffer.UnsafeDirectLittleEndian.<init>(UnsafeDirectLittleEndian.java:45)
[info] at org.sparkproject.connect.client.io.netty.buffer.PooledByteBufAllocatorL.<init>(PooledByteBufAllocatorL.java:47)
[info] ... 32 more (AmmoniteReplE2ESuite.scala:66)
cc @peter-toth
There was a problem hiding this comment.
@dongjoon-hyun Thanks for taking a look. We haven't seen this specific Arrow/Netty failure in our CI runs.
The CI failures we saw were always the OracleIntegrationSuite Docker test.
The test launches spark-shell --remote as a subprocess, so it's sensitive to the JDK environment the CI runner uses. Would it help to add a retry or mark this test as flaky?
I'll also investigate the Arrow/Netty root cause by running the test locally to see if the CI flaky behavior is reproducible locally.
There was a problem hiding this comment.
Ran the AmmoniteReplE2ESuite test 10 times locally on current master (JDK 17) and its passing consistently. So the flakiness seems specific to the CI run.
The error (ExceptionInInitializerError in Arrow's DefaultAllocationManagerFactory / Netty's PooledByteBufAllocatorL) looks like a JDK/Arrow memory allocator incompatibility which may be JDK version sensitive. The test itself just launches spark-shell --remote as a subprocess and it doesn't do anything Arrow/Netty-specific.
@dongjoon-hyun / @peter-toth Could you share which JDK vendor/version and CI environment produced the failure? Or a link to the CI job if you have it? I can deep dive to figure out the issue and try repro with same setup.
There was a problem hiding this comment.
Thank you for confirming, @yadavay-amzn . For my cases, the failures are consistently observed with Java 25 + UBI 10 combination.
[root@1d494a9a9969 /]# java --version
openjdk 25.0.3 2026-04-21 LTS
...
[root@1d494a9a9969 /]# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="10.1 (Coughlan)"
There was a problem hiding this comment.
To @yadavay-amzn and @attilapiros , Apache Spark 4.2.0 RC1 will start tomorrow according to @huaxingao 's announcement.
- Given that, could you provide us some fallback option or environment variable to disable this change for all release branches, @yadavay-amzn and @attilapiros ?
- Personally, I believe it's a little too hasty to backport this to
branch-4.1,branch-4.0, andbranch-3.5. It would be better if we can revert this from those release branches. We can backport this after verifying globally duringApache Spark 4.2.0release cycle. After releasing 4.2.0, we can backport again more safely.
|
@dongjoon-hyun Thanks for confirming the JDK 25 + UBI 10 environment. The production fix (Storage.InMemory) is JDK-independent as it's pure ammonite/Scala code with no Arrow/Netty dependency. The test failure is because spark-shell --remote initializes Arrow, which hits the Netty memoryAddress() incompatibility on JDK 25. Based on my understanding, this would affect any test that launches spark-shell on JDK 25, not just ours. I've submitted PR #55999 to skip the test on JDK 25+ with an assume() guard. Please let me know if that is sufficient for now or if you'd like to disable the test some other way. @attilapiros Could you please keep me honest here on the production change (Storage.InMemory in ConnectRepl.scala) has no JDK-version sensitivity? It should be safe on all JDKs since it's just ammonite cache configuration, right? |
|
@dongjoon-hyun I agree @yadavay-amzn this must be independent from the ammonite fix and it means on JDK 25 But let me doublecheck this by trying this out locally on JDK 25 |
|
@attilapiros Thanks for confirming the analysis! I also ran this locally on JDK
So yes, I've updated SPARK-56955 with the full reproduction details and stack trace. PR |
|
I would not switch off the test! It is very good that we found this error. |
|
@yadavay-amzn for me But the test not. This changes everything! |
|
@attilapiros Oh that's interesting, I reproduced the failure on Corretto 25.0.3+9-LTS (Amazon Linux). May I ask which JDK 25 vendor/build are you using? Also wondering, the test runs
I was disabling the test since it sounded like it was blocking the release, can close PR #55999 and focus on identifying the real root cause instead. |
|
This way as this is an error in the test switching off the test is good workaround for now. |
|
Thanks @dongjoon-hyun for flagging me on this PR. |
|
Oh, this is not a blocker for anything. It would be great but we can investigate more on this and decide on RC2, @huaxingao ~ |
|
@dongjoon-hyun, @huaxingao, @yadavay-amzn Sorry for the confusion! Still I think it is unrelated to the ammonite fix! |
|
We are not alone: apache/iceberg#15930 |
|
The |
|
Opened #56006 with the fix. |
|
@attilapiros Thanks for digging into this and finding the root cause + fix! I updated PR #55999 to add |
What changes were proposed in this pull request?
The Spark Connect REPL uses Ammonite. Ammonite's default
Storage.Folderpersists compiled predef classes under
~/.ammonite/<version>/cache. On asubsequent REPL start from the same working directory, the cached
CodePredefclass is reloaded but its reference to the per-sessionArgsPredefhelper is stale, producing aNullPointerExceptionduringpredef initialization.
This PR switches the Connect REPL's compile cache to
Storage.InMemoryso every session starts fresh and no stale cache is carried across
restarts.
Why are the changes needed?
The stale-cache failure is a user-visible crash on every every subsequent call
of
bin/spark-shell --remote sc://...from the same workingdirectory. Reproduction steps are on the JIRA.
Does this PR introduce any user-facing change?
There is one minor observable tradeoff: because the compile cache is
now in-memory rather than persisted, each REPL start recompiles the
predef instead of reading the cached classfiles. This adds ~a few
hundred milliseconds to subsequent REPL startups but eliminates the
NPE. We believe this is the correct tradeoff — a small startup cost
is preferable to a hard failure.
How was this patch tested?
Added
AmmoniteReplE2ESuitewith a test startingbin/spark-shell --remote sc://...twice and checking both run was successful.
I verified the negative case locally by temporarily reverting only the
Storage.InMemory()line and re-running the test; it fails with:
Restoring the fix makes the test pass.
Was this patch authored or co-authored using generative AI tooling?
Yes