[SPARK-56448][CONNECT] Fix NPE on Spark Connect client restart due to ammonite compile cache #55720

Closed
yadavay-amzn wants to merge 1 commit into apache:master from yadavay-amzn:fix/spark_SPARK-56448

Contributor

@yadavay-amzn yadavay-amzn commented May 6, 2026

What changes were proposed in this pull request?

The Spark Connect REPL uses Ammonite. Ammonite's default Storage.Folder
persists compiled predef classes under ~/.ammonite/<version>/cache. On a
subsequent REPL start from the same working directory, the cached
CodePredef class is reloaded but its reference to the per-session
ArgsPredef helper is stale, producing a NullPointerException during
predef initialization.

This PR switches the Connect REPL's compile cache to Storage.InMemory
so every session starts fresh and no stale cache is carried across
restarts.

Why are the changes needed?

The stale-cache failure is a user-visible crash on every subsequent invocation
of bin/spark-shell --remote sc://... from the same working
directory. Reproduction steps are on the JIRA.

Does this PR introduce any user-facing change?

There is one minor observable tradeoff: because the compile cache is
now in-memory rather than persisted, each REPL start recompiles the
predef instead of reading the cached classfiles. This adds roughly a few
hundred milliseconds to subsequent REPL startups but eliminates the
NPE. We believe this is the correct tradeoff: a small startup cost
is preferable to a hard failure.

How was this patch tested?

Added AmmoniteReplE2ESuite with a test that starts bin/spark-shell --remote sc://...
twice and checks that both runs succeed.

I verified the negative case locally by temporarily reverting only the Storage.InMemory()
line and re-running the test; it fails with:

- SPARK-56448: restarting spark-shell --remote does not throw NPE *** FAILED ***
  1 did not equal 0 Second spark-shell failed (exit=1): WARNING: Using incubator modules: jdk.incubator.vector
  Exception in thread "main" java.lang.NullPointerException: Cannot invoke "ammonite.predef.ArgsPredef$Helper.spark()" because the return value of "ammonite.predef.CodePredef.ArgsPredef()" is null
  	at ammonite.predef.CodePredef$Helper.<init>((console):7)
  	at ammonite.predef.CodePredef$.<clinit>((console):6)
  	at ammonite.predef.CodePredef.$main((console))
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
  	at ammonite.runtime.Evaluator$$anon$1.$anonfun$evalMain$1(Evaluator.scala:108)
  	at ammonite.util.Util$.withContextClassloader(Util.scala:21)
  	at ammonite.runtime.Evaluator$$anon$1.evalMain(Evaluator.scala:90)
  	at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$10(Interpreter.scala:594)
  	at ammonite.util.Res$Success.map(Res.scala:63)
  	at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$9(Interpreter.scala:594)
  	at scala.Option$WithFilter.map(Option.scala:242)
  	at ammonite.interp.Interpreter.loop$1(Interpreter.scala:574)
  	at ammonite.interp.Interpreter.processAllScriptBlocks(Interpreter.scala:644)
  	at ammonite.interp.Interpreter.$anonfun$processModule$6(Interpreter.scala:432)
  	at ammonite.util.Catching.flatMap(Res.scala:110)
  	at ammonite.interp.Interpreter.$anonfun$processModule$5(Interpreter.scala:423)
...

Restoring the fix makes the test pass.

Was this patch authored or co-authored using generative AI tooling?

Yes

Contributor

@attilapiros attilapiros left a comment

I think this test should be refined to check for the actual NPE rather than the internal state of dirOpt (while this state currently triggers the crash in this Ammonite version, it's an indirect check).

@yadavay-amzn yadavay-amzn force-pushed the fix/spark_SPARK-56448 branch from 0789e10 to da81606 Compare May 8, 2026 22:42
@yadavay-amzn
Contributor Author

Thanks for the review @attilapiros. Updated the test to exercise the restart scenario directly: it calls ConnectRepl.newAmmoniteMain(...).run(...) twice with the same predef and bind setup, reproducing the "second REPL start from the same working directory" path that previously triggered the NPE. With the fix, both starts complete cleanly; without it the second run throws NullPointerException during predef initialization.

Does this approach work?

@attilapiros
Contributor

@yadavay-amzn please double-check whether it fails when Storage.Folder is used.

@yadavay-amzn yadavay-amzn force-pushed the fix/spark_SPARK-56448 branch from da81606 to a6485d6 Compare May 11, 2026 06:28
@yadavay-amzn
Contributor Author

yadavay-amzn commented May 11, 2026

@attilapiros I verified this both in-process (calling run() twice in the same JVM) and cross-process (forking separate JVMs with Storage.Folder sharing a temp cache directory). The NPE does not reproduce in either case — it requires a specific Ammonite classloader state that only occurs in real interactive usage (your original reproduction in the JIRA).

Updated the test to directly assert that newAmmoniteMain wires Storage.InMemory (which is the semantic the fix adds), and smoke-tests the restart scenario by calling run() twice. Also fixed the non-ASCII character.

@yadavay-amzn yadavay-amzn requested a review from attilapiros May 11, 2026 19:16
@attilapiros
Contributor

@yadavay-amzn You can do something similar to what SparkShellSuite does: run two spark-shells one after the other.

@yadavay-amzn yadavay-amzn force-pushed the fix/spark_SPARK-56448 branch from a6485d6 to ba77fd6 Compare May 11, 2026 22:26
@yadavay-amzn
Contributor Author

yadavay-amzn commented May 11, 2026

@attilapiros Thanks for the suggestion!

Added ConnectReplE2ESuite that follows the SparkShellSuite pattern: starts a Connect server via RemoteSparkSession, then runs bin/spark-shell --remote sc://localhost:$port twice as a subprocess (closing stdin immediately so it exits on EOF). Asserts both invocations exit cleanly and the second one does not contain NullPointerException in stderr.
Tested locally, both invocations pass cleanly.

Kept the existing ConnectReplSuite unit test as well (fast, no server needed).
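
The subprocess pattern described above could look roughly like the following. This is a minimal sketch, not the suite's actual code: `RestartCheck`, `runOnce`, and the stand-in `java -version` command are hypothetical choices for illustration, and the real test would pass the bin/spark-shell --remote sc://localhost:$port command instead. It uses only scala.sys.process from the standard library.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
import scala.sys.process.{Process, ProcessIO}

// Hypothetical helper object; the names here are illustrative, not from the PR.
object RestartCheck {

  /** Runs `command` once with stdin closed immediately; returns (exit code, stdout, stderr). */
  def runOnce(command: Seq[String]): (Int, String, String) = {
    val stdout = ArrayBuffer.empty[String]
    val stderr = ArrayBuffer.empty[String]
    val io = new ProcessIO(
      in => in.close(), // immediate EOF on stdin so an interactive shell exits
      out => { Source.fromInputStream(out).getLines().foreach(stdout += _); out.close() },
      err => { Source.fromInputStream(err).getLines().foreach(stderr += _); err.close() })
    val process = Process(command).run(io)
    // exitValue() blocks until the process ends and the output readers finish.
    val exit = process.exitValue()
    (exit, stdout.mkString("\n"), stderr.mkString("\n"))
  }

  def main(args: Array[String]): Unit = {
    // Stand-in command (assumption): the current JVM's own launcher.
    val javaBin =
      java.nio.file.Paths.get(System.getProperty("java.home"), "bin", "java").toString
    val command = Seq(javaBin, "-version")

    val (firstExit, _, firstErr) = runOnce(command)
    assert(firstExit == 0, s"First run failed (exit=$firstExit): $firstErr")

    // The second start is the one that would hit a stale on-disk cache.
    val (secondExit, _, secondErr) = runOnce(command)
    assert(secondExit == 0, s"Second run failed (exit=$secondExit): $secondErr")
  }
}
```

Checking the exit code of each run separately keeps the failure message specific about which start broke, which matters here because only the second start was expected to fail.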

@yadavay-amzn yadavay-amzn force-pushed the fix/spark_SPARK-56448 branch 2 times, most recently from 3139545 to 7e4bad9 Compare May 12, 2026 02:04
@yadavay-amzn yadavay-amzn force-pushed the fix/spark_SPARK-56448 branch from 7e4bad9 to b8d51a3 Compare May 12, 2026 05:53
@yadavay-amzn
Contributor Author

Done -- renamed to AmmoniteReplE2ESuite, removed the unreachable NPE assertion, and dropped ConnectReplSuite since the E2E test covers it.

Contributor

@attilapiros attilapiros left a comment

We are close to finalizing!

… ammonite compile cache

The Spark Connect REPL uses Ammonite. Ammonite's default Storage.Folder
persists compiled predef classes under ~/.ammonite/<version>/cache. On
a subsequent REPL start from the same working directory, the cached
CodePredef class is reloaded but its reference to the per-session
ArgsPredef helper is stale, producing a NullPointerException during
predef initialization.

Use Storage.InMemory for the Connect REPL's compile cache so every
session starts fresh. Extract the Main construction into a package-
private helper to keep the test localised to unit-level.

Regression test added: asserts that the ammonite.Main returned by
ConnectRepl.newAmmoniteMain exposes no on-disk cache directory
(storageBackend.dirOpt.isEmpty) and is an instance of Storage.InMemory.
The dirOpt assertion is an observable behavioural check -- if the
Storage.InMemory wiring is reverted, ammonite.Main falls back to
Storage.Folder(~/.ammonite) and the test fails with a clear message
rather than silently compiling.
@yadavay-amzn yadavay-amzn force-pushed the fix/spark_SPARK-56448 branch from b8d51a3 to 82d6648 Compare May 12, 2026 18:33
@yadavay-amzn
Contributor Author

Done -- reverted the method extraction. The fix is now minimal: just the Storage.InMemory argument + import + inline comment explaining SPARK-56448.

@attilapiros
Contributor

@yadavay-amzn I have updated the PR description; please double-check it.

Contributor

@attilapiros attilapiros left a comment

LGTM

@yadavay-amzn
Contributor Author

@attilapiros Thanks for all the feedback and review!
PR description looks good

@attilapiros
Contributor

I’ll leave this open for another 2–3 days to ensure everyone has a chance to provide feedback before I merge.

attilapiros pushed a commit that referenced this pull request May 16, 2026
… ammonite compile cache

### What changes were proposed in this pull request?

The Spark Connect REPL uses Ammonite. Ammonite's default `Storage.Folder`
persists compiled predef classes under `~/.ammonite/<version>/cache`. On a
subsequent REPL start from the same working directory, the cached
`CodePredef` class is reloaded but its reference to the per-session
`ArgsPredef` helper is stale, producing a `NullPointerException` during
predef initialization.

This PR switches the Connect REPL's compile cache to `Storage.InMemory`
so every session starts fresh and no stale cache is carried across
restarts.

### Why are the changes needed?

The stale-cache failure is a user-visible crash on every subsequent invocation
of `bin/spark-shell --remote sc://...` from the same working
directory. Reproduction steps are on the JIRA.

### Does this PR introduce _any_ user-facing change?

There is one minor observable tradeoff: because the compile cache is
now in-memory rather than persisted, each REPL start recompiles the
predef instead of reading the cached classfiles. This adds roughly a few
hundred milliseconds to subsequent REPL startups but eliminates the
NPE. We believe this is the correct tradeoff: a small startup cost
is preferable to a hard failure.

### How was this patch tested?

Added `AmmoniteReplE2ESuite` with a test that starts `bin/spark-shell --remote sc://...`
twice and checks that both runs succeed.

I verified the negative case locally by temporarily reverting only the `Storage.InMemory()`
line and re-running the test; it fails with:
```
- SPARK-56448: restarting spark-shell --remote does not throw NPE *** FAILED ***
  1 did not equal 0 Second spark-shell failed (exit=1): WARNING: Using incubator modules: jdk.incubator.vector
  Exception in thread "main" java.lang.NullPointerException: Cannot invoke "ammonite.predef.ArgsPredef$Helper.spark()" because the return value of "ammonite.predef.CodePredef.ArgsPredef()" is null
  	at ammonite.predef.CodePredef$Helper.<init>((console):7)
  	at ammonite.predef.CodePredef$.<clinit>((console):6)
  	at ammonite.predef.CodePredef.$main((console))
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
  	at ammonite.runtime.Evaluator$$anon$1.$anonfun$evalMain$1(Evaluator.scala:108)
  	at ammonite.util.Util$.withContextClassloader(Util.scala:21)
  	at ammonite.runtime.Evaluator$$anon$1.evalMain(Evaluator.scala:90)
  	at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$10(Interpreter.scala:594)
  	at ammonite.util.Res$Success.map(Res.scala:63)
  	at ammonite.interp.Interpreter.$anonfun$processAllScriptBlocks$9(Interpreter.scala:594)
  	at scala.Option$WithFilter.map(Option.scala:242)
  	at ammonite.interp.Interpreter.loop$1(Interpreter.scala:574)
  	at ammonite.interp.Interpreter.processAllScriptBlocks(Interpreter.scala:644)
  	at ammonite.interp.Interpreter.$anonfun$processModule$6(Interpreter.scala:432)
  	at ammonite.util.Catching.flatMap(Res.scala:110)
  	at ammonite.interp.Interpreter.$anonfun$processModule$5(Interpreter.scala:423)
...
```

Restoring the fix makes the test pass.

### Was this patch authored or co-authored using generative AI tooling?

Yes

Closes #55720 from yadavay-amzn/fix/spark_SPARK-56448.

Authored-by: Anupam Yadav <anupamya@amazon.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
(cherry picked from commit 3e83503)
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
@attilapiros
Contributor

Merged into master and branch-4.x, branch-4.2, branch-4.1, branch-4.0, branch-3.5. Thanks @yadavay-amzn!

(process.exitValue(), stdout.mkString("\n"), stderr.mkString("\n"))
}

test("SPARK-56448: restarting spark-shell --remote does not throw NPE") {
Member

@dongjoon-hyun dongjoon-hyun May 19, 2026

This test case seems to introduce flakiness. I didn't dig much, but have you observed the following failures in your CI runs, @yadavay-amzn and @attilapiros?

    [info] AmmoniteReplE2ESuite:
    [info] - SPARK-56448: restarting spark-shell --remote does not throw NPE *** FAILED *** (1 second, 975 milliseconds)
    [info]   1 did not equal 0 First spark-shell failed (exit=1): WARNING: Using incubator modules: jdk.incubator.vector
    [info]   Exception in thread "main" java.lang.ExceptionInInitializerError
    [info]   	at org.apache.arrow.memory.netty.DefaultAllocationManagerFactory.<clinit>(DefaultAllocationManagerFactory.java:26)
...
    [info]   Caused by: java.lang.UnsupportedOperationException
    [info]   	at org.sparkproject.connect.client.io.netty.buffer.EmptyByteBuf.memoryAddress(EmptyByteBuf.java:961)
    [info]   	at org.sparkproject.connect.client.io.netty.buffer.UnsafeDirectLittleEndian.<init>(UnsafeDirectLittleEndian.java:45)
    [info]   	at org.sparkproject.connect.client.io.netty.buffer.PooledByteBufAllocatorL.<init>(PooledByteBufAllocatorL.java:47)
    [info]   	... 32 more (AmmoniteReplE2ESuite.scala:66)

cc @peter-toth

Contributor Author

@dongjoon-hyun Thanks for taking a look. We haven't seen this specific Arrow/Netty failure in our CI runs.
The CI failures we saw were always the OracleIntegrationSuite Docker test.

The test launches spark-shell --remote as a subprocess, so it's sensitive to the JDK environment the CI runner uses. Would it help to add a retry or mark this test as flaky?

I'll also investigate the Arrow/Netty root cause by running the test locally to see if the CI flaky behavior is reproducible locally.

Contributor Author

Ran the AmmoniteReplE2ESuite test 10 times locally on current master (JDK 17) and it's passing consistently. So the flakiness seems specific to the CI environment.

The error (ExceptionInInitializerError in Arrow's DefaultAllocationManagerFactory / Netty's PooledByteBufAllocatorL) looks like a JDK/Arrow memory allocator incompatibility which may be JDK version sensitive. The test itself just launches spark-shell --remote as a subprocess and it doesn't do anything Arrow/Netty-specific.

@dongjoon-hyun / @peter-toth Could you share which JDK vendor/version and CI environment produced the failure, or a link to the CI job if you have it? I can dig in to figure out the issue and try to reproduce it with the same setup.

Member

Thank you for confirming, @yadavay-amzn. In my case, the failures are consistently observed with the Java 25 + UBI 10 combination.

[root@1d494a9a9969 /]# java --version
openjdk 25.0.3 2026-04-21 LTS
...

[root@1d494a9a9969 /]# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="10.1 (Coughlan)"

Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To @yadavay-amzn and @attilapiros , Apache Spark 4.2.0 RC1 will start tomorrow according to @huaxingao 's announcement.

  • Given that, could you provide us some fallback option or environment variable to disable this change for all release branches, @yadavay-amzn and @attilapiros ?
  • Personally, I believe it's a little too hasty to backport this to branch-4.1, branch-4.0, and branch-3.5. It would be better if we can revert this from those release branches. We can backport this after verifying globally during Apache Spark 4.2.0 release cycle. After releasing 4.2.0, we can backport again more safely.

@yadavay-amzn
Contributor Author

@dongjoon-hyun Thanks for confirming the JDK 25 + UBI 10 environment.

The production fix (Storage.InMemory) is JDK-independent as it's pure ammonite/Scala code with no Arrow/Netty dependency. The test failure is because spark-shell --remote initializes Arrow, which hits the Netty memoryAddress() incompatibility on JDK 25. Based on my understanding, this would affect any test that launches spark-shell on JDK 25, not just ours.

I've submitted PR #55999 to skip the test on JDK 25+ with an assume() guard. Please let me know if that is sufficient for now or if you'd like to disable the test some other way.
Also filed SPARK-56955 to track the underlying Arrow/Netty JDK 25 issue separately.
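
For reference, the version check behind such a guard can be sketched as below. This is a minimal illustration in plain Java, not the code in PR #55999: the class name `JdkGuard`, the method `shouldSkip`, and the threshold constant are hypothetical, and the threshold of 25 is taken only from the environment reported in this thread.

```java
// Sketch only: JdkGuard and shouldSkip are hypothetical names, not part of Spark.
final class JdkGuard {
    // Assumption from this thread: the failure was reported on JDK 25.
    static final int FIRST_AFFECTED_FEATURE_VERSION = 25;

    /** Returns true when a test should be skipped for the given JDK feature version. */
    static boolean shouldSkip(int featureVersion) {
        return featureVersion >= FIRST_AFFECTED_FEATURE_VERSION;
    }

    public static void main(String[] args) {
        // Runtime.version().feature() yields the major release number, e.g. 17, 21 or 25.
        int feature = Runtime.version().feature();
        if (shouldSkip(feature)) {
            System.out.println("Skipping on JDK " + feature);
        } else {
            System.out.println("Running on JDK " + feature);
        }
    }
}
```

In a ScalaTest suite the same boolean would feed `assume(...)`, so the test is reported as canceled rather than failed on the affected JDKs.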

@attilapiros Could you please keep me honest here: the production change (Storage.InMemory in ConnectRepl.scala) has no JDK-version sensitivity, correct? It should be safe on all JDKs since it's just ammonite cache configuration.

@attilapiros
Contributor

@dongjoon-hyun I agree with @yadavay-amzn: this must be independent of the ammonite fix. It means bin/spark-shell --remote is not working on JDK 25, as this could easily be the first test where we call spark-shell with the remote flag.

But let me double-check this by trying it out locally on JDK 25.

@yadavay-amzn
Contributor Author

@attilapiros Thanks for confirming the analysis! I also ran this locally on JDK
25 (Corretto 25.0.3) and can confirm:

  • AmmoniteReplE2ESuite fails consistently on JDK 25 with the same
    EmptyByteBuf.memoryAddress() error
  • Passes 10/10 on JDK 17 and passes on JDK 21

So yes, spark-shell --remote itself cannot start on JDK 25 due to the Arrow/Netty
issue, independent of our ammonite fix.

I've updated SPARK-56955 with the full reproduction details and stack trace. PR
#55999 has the test skip workaround.
cc @dongjoon-hyun

@attilapiros
Contributor

I would not switch off the test! It is very good that we found this error.

@attilapiros
Contributor

@yadavay-amzn for me spark-shell --remote works (with the Ammonite fix):

$ ./bin/spark-shell sc://localhost:15002
WARNING: Using incubator modules: jdk.incubator.vector
WARNING: package sun.security.action not in java.base
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.2.0-SNAPSHOT
      /_/

Using Scala version 2.13.18 (OpenJDK 64-Bit Server VM, Java 25.0.3)
Type in expressions to have them evaluated.
Type :help for more information.
26/05/19 15:15:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARNING: A terminally deprecated method in sun.misc.Unsafe has been called
WARNING: sun.misc.Unsafe::allocateMemory has been called by io.netty.util.internal.PlatformDependent0$2 (file:/Users/apiros/git/attilapiros/spark3_II/assembly/target/scala-2.13/jars/netty-common-4.2.13.Final.jar)
WARNING: Please consider reporting this to the maintainers of class io.netty.util.internal.PlatformDependent0$2
WARNING: sun.misc.Unsafe::allocateMemory will be removed in a future release
26/05/19 15:15:24 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
26/05/19 15:15:24 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Spark context Web UI available at http://10.96.96.252:4042
Spark context available as 'sc' (master = local[*], app id = local-1779228924892).
Spark session available as 'spark'.

scala>

But the test does not. This changes everything!

@yadavay-amzn
Contributor Author

@attilapiros Oh that's interesting, I reproduced the failure on Corretto 25.0.3+9-LTS (Amazon Linux). May I ask which JDK 25 vendor/build you are using?

Also wondering, the test runs spark-shell --remote as a subprocess via ProcessBuilder. Could there be an environment difference between running it interactively vs as a subprocess (e.g., different classpath assembly, missing JVM flags etc)?
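
To compare the two environments, something like the following could be used to see which JVM a child process actually resolves. It is only a sketch: `SubprocessProbe`, `Result` and `run` are hypothetical names, it is not the code in AmmoniteReplE2ESuite, and it launches `java -version` rather than spark-shell.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

// Sketch only: all names here are hypothetical, not taken from the Spark test suite.
final class SubprocessProbe {
    /** Exit code plus merged stdout/stderr of a finished child process. */
    static final class Result {
        final int exitCode;
        final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    /** Runs the command with extra environment variables layered on the inherited ones. */
    static Result run(List<String> command, Map<String, String> extraEnv) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.environment().putAll(extraEnv); // child inherits the parent env by default
        builder.redirectErrorStream(true);      // merge stderr into stdout
        try {
            Process child = builder.start();
            // Drain the output before waiting so the child cannot block on a full pipe.
            String output = new String(
                child.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            return new Result(child.waitFor(), output);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }

    public static void main(String[] args) {
        // Use the JVM that runs this program, so the result does not depend on PATH.
        String javaBinary = System.getProperty("java.home") + "/bin/java";
        Result result = run(List.of(javaBinary, "-version"), Map.of());
        System.out.println("exit code: " + result.exitCode);
        System.out.print(result.output);
    }
}
```

Logging the child's version banner and the relevant environment variables from inside the test would show whether the subprocess picks up a different JDK or different JVM flags than the interactive shell.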

> I would not switch off the test! It is very good that we found this error.

I was disabling the test since it sounded like it was blocking the release. I can close PR #55999 and focus on identifying the real root cause instead.

@attilapiros
Contributor

Given that this is an error in the test, switching off the test is a good workaround for now.

@huaxingao
Contributor

Thanks @dongjoon-hyun for flagging me on this PR.
Do we need the fallback switch in branch-4.2 before I cut RC1 tomorrow?

@dongjoon-hyun
Member

Oh, this is not a blocker for anything. It would be great but we can investigate more on this and decide on RC2, @huaxingao ~

@attilapiros
Contributor

attilapiros commented May 20, 2026

@dongjoon-hyun, @huaxingao, @yadavay-amzn

Sorry for the confusion!
I was wrong before: this is NOT a simple test issue but a bug in spark-shell --remote on JDK 25:

./bin/spark-shell --remote sc://localhost:15002
WARNING: Using incubator modules: jdk.incubator.vector
WARNING: package sun.security.action not in java.base
26/05/19 17:38:31 INFO BaseAllocator: Debug mode disabled. Enable with the VM option -Darrow.memory.debug.allocator=true.
26/05/19 17:38:31 INFO DefaultAllocationManagerOption: allocation manager type not specified, using netty as the default type
26/05/19 17:38:31 INFO CheckAllocator: Using DefaultAllocationManager at memory/netty/DefaultAllocationManagerFactory.class
Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.sparkproject.org.apache.arrow.memory.netty.DefaultAllocationManagerFactory.<clinit>(DefaultAllocationManagerFactory.java:26)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:467)
	at java.base/java.lang.Class.forName(Class.java:458)
	at org.sparkproject.org.apache.arrow.memory.DefaultAllocationManagerOption.getFactory(DefaultAllocationManagerOption.java:105)
	at org.sparkproject.org.apache.arrow.memory.DefaultAllocationManagerOption.getDefaultAllocationManagerFactory(DefaultAllocationManagerOption.java:92)
	at org.sparkproject.org.apache.arrow.memory.BaseAllocator$Config.getAllocationManagerFactory(BaseAllocator.java:826)
	at org.sparkproject.org.apache.arrow.memory.ImmutableConfig.access$001(ImmutableConfig.java:20)
	at org.sparkproject.org.apache.arrow.memory.ImmutableConfig$InitShim.getAllocationManagerFactory(ImmutableConfig.java:80)
	at org.sparkproject.org.apache.arrow.memory.ImmutableConfig.<init>(ImmutableConfig.java:43)
	at org.sparkproject.org.apache.arrow.memory.ImmutableConfig$Builder.build(ImmutableConfig.java:492)
	at org.sparkproject.org.apache.arrow.memory.BaseAllocator.<clinit>(BaseAllocator.java:72)
	at org.apache.spark.sql.connect.SparkSession.<init>(SparkSession.scala:89)
	at org.apache.spark.sql.connect.SparkSession$Builder.tryCreateSessionFromClient(SparkSession.scala:1059)
	at org.apache.spark.sql.connect.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:1119)
	at org.apache.spark.sql.connect.SparkSession$.withLocalConnectServer(SparkSession.scala:949)
	at org.apache.spark.sql.connect.SparkSession$Builder.getOrCreate(SparkSession.scala:1118)
	at org.apache.spark.sql.application.ConnectRepl$.$anonfun$doMain$1(ConnectRepl.scala:91)
	at org.apache.spark.sql.connect.SparkSession$.withLocalConnectServer(SparkSession.scala:949)
	at org.apache.spark.sql.application.ConnectRepl$.doMain(ConnectRepl.scala:68)
	at org.apache.spark.sql.application.ConnectRepl$.main(ConnectRepl.scala:58)
	at org.apache.spark.sql.application.ConnectRepl.main(ConnectRepl.scala)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:565)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1033)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:226)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:95)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1171)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1180)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.UnsupportedOperationException
	at org.sparkproject.io.netty.buffer.EmptyByteBuf.memoryAddress(EmptyByteBuf.java:961)
	at org.sparkproject.io.netty.buffer.DuplicatedByteBuf.memoryAddress(DuplicatedByteBuf.java:115)
	at org.sparkproject.io.netty.buffer.UnsafeDirectLittleEndian.<init>(UnsafeDirectLittleEndian.java:45)
	at org.sparkproject.io.netty.buffer.PooledByteBufAllocatorL.<init>(PooledByteBufAllocatorL.java:47)
	at org.sparkproject.org.apache.arrow.memory.netty.NettyAllocationManager.<clinit>(NettyAllocationManager.java:54)
	... 32 more

Still I think it is unrelated to the ammonite fix!
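
For readers following the trace, the shape of the failure (an ExceptionInInitializerError wrapping an UnsupportedOperationException thrown from a static initializer) can be reproduced in isolation. The sketch below is a stand-alone illustration with hypothetical names (`InitFailureDemo`, `BrokenAllocator`); it does not use Arrow or Netty.

```java
// Sketch only: BrokenAllocator is a stand-in, not the Arrow/Netty code from the trace.
final class InitFailureDemo {
    /** A class whose static initializer fails the first time it is used. */
    static final class BrokenAllocator {
        // Not a compile-time constant, so reading it triggers class initialization.
        static final long ADDRESS = resolveAddress();

        private static long resolveAddress() {
            throw new UnsupportedOperationException("memoryAddress not available");
        }
    }

    private static Throwable cachedCause;

    /**
     * Touches BrokenAllocator once and returns the exception that broke its initializer.
     * The result is cached because later uses of a class whose initialization failed
     * raise NoClassDefFoundError instead of ExceptionInInitializerError.
     */
    static synchronized Throwable causeOfFirstUse() {
        if (cachedCause == null) {
            try {
                long unused = BrokenAllocator.ADDRESS;
                throw new IllegalStateException("initializer unexpectedly succeeded: " + unused);
            } catch (ExceptionInInitializerError e) {
                cachedCause = e.getCause(); // the original exception from the initializer
            }
        }
        return cachedCause;
    }

    public static void main(String[] args) {
        System.out.println("cause: " + causeOfFirstUse());
    }
}
```

This is why the interesting frame in the trace above is the `Caused by:` section: the outer error only says that a static initializer failed, while the cause names the call that actually threw.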

@attilapiros
Contributor

We are not alone: apache/iceberg#15930

@attilapiros
Contributor

The --sun-misc-unsafe-memory-access=allow flag solves the problem:

[INFO] --- scalatest:2.2.0:test (test) @ spark-connect-client-jvm_2.13 ---
[INFO] ScalaTest report directory: /Users/apiros/git/attilapiros/spark3_II/sql/connect/client/jvm/target/surefire-reports
WARNING: Using incubator modules: jdk.incubator.vector
WARNING: package sun.security.action not in java.base
Discovery starting.
Discovery completed in 110 milliseconds.
Run starting. Expected test count is: 1
AmmoniteReplE2ESuite:
WARNING: Using incubator modules: jdk.incubator.vector
WARNING: package sun.security.action not in java.base
WARNING: A terminally deprecated method in sun.misc.Unsafe has been called
WARNING: sun.misc.Unsafe::allocateMemory has been called by io.netty.util.internal.PlatformDependent0$2 (file:/Users/apiros/.m2/repository/io/netty/netty-common/4.2.13.Final/netty-common-4.2.13.Final.jar)
WARNING: Please consider reporting this to the maintainers of class io.netty.util.internal.PlatformDependent0$2
WARNING: sun.misc.Unsafe::allocateMemory will be removed in a future release
Ready for client connections.
- SPARK-56448: restarting spark-shell --remote does not throw NPE
Run completed in 19 seconds, 109 milliseconds.
Total number of tests run: 1
Suites: completed 1, aborted 0
Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
No more client connections.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  33.246 s
[INFO] Finished at: 2026-05-19T18:21:40-07:00
[INFO] ------------------------------------------------------------------------
➜  spark3_II git:(bd8872a0cc7) ✗ java -version
openjdk version "25.0.3" 2026-04-21 LTS
OpenJDK Runtime Environment Temurin-25.0.3+9 (build 25.0.3+9-LTS)
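
A rough sketch of how a launcher could add this flag conditionally is shown below. It is not the change in #56006: `UnsafeAccessFlag` and `jvmArgs` are hypothetical names, and the cutoff of JDK 23 is an assumption based on JEP 471, which introduced the option; older JVMs are expected to reject it as an unrecognized option.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical helper, not Spark's JavaModuleOptions.
final class UnsafeAccessFlag {
    static final String FLAG = "--sun-misc-unsafe-memory-access=allow";

    // Assumption: the option is recognized from JDK 23 onward (JEP 471).
    static final int FIRST_SUPPORTED_FEATURE_VERSION = 23;

    /** Returns a copy of baseArgs with the flag appended when the target JDK accepts it. */
    static List<String> jvmArgs(int featureVersion, List<String> baseArgs) {
        List<String> args = new ArrayList<>(baseArgs);
        if (featureVersion >= FIRST_SUPPORTED_FEATURE_VERSION && !args.contains(FLAG)) {
            args.add(FLAG);
        }
        return args;
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature();
        System.out.println(jvmArgs(feature, List.of("-Xmx1g")));
    }
}
```

Keeping the version check next to the flag avoids breaking startup on JDK 17 and 21, where the test already passes without it.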

@attilapiros
Contributor

attilapiros commented May 20, 2026

Opened #56006 with the fix.

@yadavay-amzn
Contributor Author

@attilapiros Thanks for digging into this and finding the root cause + fix! I updated PR #55999 to add --sun-misc-unsafe-memory-access=allow in JavaModuleOptions.java, but I see that you've created a PR with the fix, so I'm closing mine.
