JAVA-6071 by strogiyotec · Pull Request #1885 · mongodb/mongo-java-driver

strogiyotec · 2026-02-11T19:40:29Z

JAVA-6071
~~JAVA-5907~~ (this ticket is be addressed in #1893)

Problem statement

Original failed test case
The test case only fails on a standalone variant with SSL and ordered = false
After adding a try catch to the submit in AsynchronousTlsChannelGroup executor I was able to catch an error

14:23:28.914 [async-channel-group-0-handler-executor] ERROR org.mongodb.driver.connection.tls - error in submitted task
java.lang.StackOverflowError: null
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.SingleResultCallback$1.completed(SingleResultCallback.java:51)
	at com.mongodb.connection.AsyncCompletionHandler.lambda$asCallback$0(AsyncCompletionHandler.java:52)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.SingleResultCallback$1.completed(SingleResultCallback.java:51)
	at com.mongodb.connection.AsyncCompletionHandler.lambda$asCallback$0(AsyncCompletionHandler.java:52)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.SingleResultCallback$1.completed(SingleResultCallback.java:51)
	at com.mongodb.connection.AsyncCompletionHandler.lambda$asCallback$0(AsyncCompletionHandler.java:52)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.SingleResultCallback$1.completed(SingleResultCallback.java:51)
	at com.mongodb.connection.AsyncCompletionHandler.lambda$asCallback$0(AsyncCompletionHandler.java:52)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.SingleResultCallback$1.completed(SingleResultCallback.java:51)
	at com.mongodb.connection.AsyncCompletionHandler.lambda$asCallback$0(AsyncCompletionHandler.java:52)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)

The exception happened because the maxBatchSize is used to generate a huge payload, then once client reads the payload from the server it wraps original callback as handler and handler to callback

Callback to handler
Handler to callback

Every time it happens we create a new BasicCompletionHandler

Once whole payload is read and client starts to unwind the callback chain it fails occasionally on overflow error

The fix

The fix is to avoid the creation of chained callbacks and only pass a top this callback to the chain of reads

Tests

To test the fix I deliberately used a small stack size locally Xss=512kb compared to default 1M
This PR also changes the stack size for all evergreen tests so that we can catch any excessive call stacks during testing
You also have to run SSL variant locally to reproduce the error

stIncMale · 2026-02-12T03:09:10Z

As far as I can tell, this PR addresses JAVA-5907, so I

added that ticket to the description of the PR;
linked that ticket with JAVA-6071;
added a link to this PR to that ticket;
updated the "Reviewers" on that ticket to be the same as for JAVA-6071;
assigned @strogiyotec to that ticket.

stIncMale · 2026-02-12T02:57:45Z

...e/src/main/com/mongodb/internal/connection/tlschannel/async/AsynchronousTlsChannelGroup.java

+        executor.submit(() -> {
+            try {
+                r.run();
+            } catch (Throwable e) {
+                // catch Throwable to also catch unexpected StackoverflowErrors and similar,
+                // to avoid silent loss of exceptions in asynchronous code
+                // see JAVA-6071 for an example of such a case
+                LOGGER.error("error in submitted runnable in AsynchronousTlsChannelGroup", e);


I don't think this comment is useful to a reader. What we do here seems obvious without any comments.

Is it not the case that the stack trace of e has information about AsynchronousTlsChannelGroup? If it is the case, then there is no reason to duplicate that info in the log message, and we should just do LOGGER.error(null, e).

stIncMale · 2026-02-12T03:24:50Z

...e/src/main/com/mongodb/internal/connection/tlschannel/async/AsynchronousTlsChannelGroup.java

+        executor.submit(() -> {
+            try {
+                r.run();
+            } catch (Throwable e) {
+                // catch Throwable to also catch unexpected StackoverflowErrors and similar,
+                // to avoid silent loss of exceptions in asynchronous code
+                // see JAVA-6071 for an example of such a case
+                LOGGER.error("error in submitted runnable in AsynchronousTlsChannelGroup", e);


@vbabanin, in JAVA-5907 you said

Adding defensive try/catch with explicit assertion logging ensures these errors appear in logs, avoiding silent failures that make debugging and identification of such issues hard.

Could you please clarify what you meant by "assertion logging"? Note that ErrorHandlingResultCallback simply does

logger.error("Callback onResult call produced an error", e);

while AsyncFunction.finish is more sophisticated, but when it reaches the "there is nothing we can do" state, it throws the unexpected exception instead of logging it. The latter allows an application to register a default Thread.UncaughtExceptionHandler, and react to the exception thrown.

P.S. @katcharov, Do you remember why we decided not to log t in AsyncFunction.finish? I wonder if it's because by default such exceptions are logged to the standard error stream (@strogiyotec, do you know this, or maybe you could check what happens to uncaught exceptions when no explicit Thread.UncaughtExceptionHandler is configured?).

I wonder if it's because by default such exceptions are logged to the standard error stream

I checked this with the following "shebang" script:

#! /usr/bin/env -S java --source 23 import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; final class Script { public static final void main(final String... args) throws InterruptedException { Thread t = new Thread(() -> { throw new RuntimeException("from bare thread"); }); t.start(); t.join(); try (ExecutorService executorService = Executors.newFixedThreadPool(1)) { executorService.execute(() -> { throw new RuntimeException("from ExecutorService.execute"); }); executorService.submit(() -> { throw new RuntimeException("from ExecutorService.submit"); }); } } }

It produces the following output:

Exception in thread "Thread-0" java.lang.RuntimeException: from bare thread at Script.lambda$main$0(script:9) at java.base/java.lang.Thread.run(Thread.java:1575) Exception in thread "pool-1-thread-1" java.lang.RuntimeException: from ExecutorService.execute at Script.lambda$main$1(script:15) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) at java.base/java.lang.Thread.run(Thread.java:1575)

That is, even when an application does not explicitly register a default Thread.UncaughtExceptionHandler, uncaught exceptions are still printed to the standard error stream (I checked that it's the error and not the output stream separately).

However, it would still be useful for us to log them, because we could suggest in the message that the application should register a default UncaughtExceptionHandler, at the very lest for the purpose of logging the uncaught exceptions via the means more suitable to the particular application.

thanks @stIncMale , my take is that I would like to have the error printed explicitly in catch when we will use execute

Currently, I think that we should log in both the place you are suggesting, and in AsyncFunction/AsyncSupplier.finish. We should come back to discussing this.

stIncMale · 2026-02-12T03:36:48Z

...e/src/main/com/mongodb/internal/connection/tlschannel/async/AsynchronousTlsChannelGroup.java


    void submit(final Runnable r) {
-        executor.submit(r);
+        executor.submit(() -> {


We should replace submit with execute, and rename AsynchronousTlsChannelGroup.submit accordingly (this is our custom method, it is not present in the TLS Channel library.

Executor.execute is void, which means that when the passed Runnable.run throws, the thread that was executing it has to be abruptly terminated, provided that it was a thread managed by the executor (depending on the executor implementation, it may not be the case). ExecutorService.submit returns a Future, which we ignore, thus hiding the exception from anyone.

Us using Executor.execute allows an application to register a default Thread.UncaughtExceptionHandler, and react to the exception thrown.

P.S. The majority of applications should terminate if VirtualMachineError is thrown, because such exceptions may be thrown from unexpected places, leading to violating application invariants. These places may be especially surprising given that these exceptions are asynchronous exceptions.

P.P.S. Let's check the codebase for other inappropriate submit/invokeAll usages where we ignore the returned Futures. We may also look at the usages of ScheduledExecutorService/ScheduledThreadPoolExecutor.

stIncMale · 2026-02-12T07:33:10Z

driver-core/src/main/com/mongodb/internal/connection/AsynchronousChannelStream.java

This needs to be discussed in-person either during one of our catchup meetings, or at a dedicated meeting (the latter may be a 1on1 meeting).

I still don't understand the issue clearly:

[just to make sure I understand this part correctly] Is it true that the graph of callbacks and handlers is created successfully, and only when this c.complete(localByteBuf) is eventually called, do we construct the long stack of method calls?

The above means that the actual stack trance, which is not fully provided in the PR description as it's too large, has the aforementioned complete call in it? Answer: yes.

Why does the issue reproduce only on a standalone deployment?

Answer: Because for some reason the standalone deployment responds to the bulkWrite command with all the documents in the created cursor, but a replicaset responds with only a subset, requiring the driver to execute getMores, which means, the driver does not read enough data via each AsynchronousChannelStream.readAsync to cause the StackOverflowError. We have https://jira.mongodb.org/plugins/servlet/desk/portal/13/HELP-88629, but it looks like it should be completely reformulated with our current understanding.

Does TLS results in us reading the server response in smaller chunks (leading to a larger number of chunks) compared to no TLS?

Answer: we don't know, this wasn't investigated.

Nonetheless

The immediate reaction I have is that we should try to preserve the use of the beginAsync API usage here. One of the reasons it was introduced is that implementing callbacks/handlers without it is too error-prone: we still throw exceptions/assertions in callback-based code, and unexpected exceptions are still being thrown, yet we can't practically always properly handle them.

That being said, if keeping the beginAsync API usage is impossible, the idea of the fix proposed seems right, though I haven't thoroughly reviewed the change to the BasicCompletionHandler.completed method.

Alternative ideas:

a) I have a hunch that passing localHandler instead of c.asHandler() here may solve the issue.

Even if it does, I don't think it's correct, because if we do that, the beginAsync API won't be able to know that we have "completed" localHandler in a situation when we do it, and localHandler throws an exception. This may lead to localHandler being "completed" more than once by the AsyncSupplier.finish method, which absolutely must not be done. This also leads to not executing any code that is "placed between" thenSupply and finish (currently, there is none).

I left this note only to warn us from trying this.

b) Instead of passing localHandler.asCallback() into finish here, pass a callback that at first extracts localHandler via getHandlerAndClear, and then completes it. This way we should be able to pass the current BasicCompletionHandler (this) into getChannel().read(...) here, similarly to how it is proposed in the current PR. But this approach may have the same problem as a).

c) Instead of creating a new BasicCompletionHandler here, we could re-set BasicCompletionHandler.byteBufReference and BaseCompletionHandler.handlerReference on the existing BasicCompletionHandler (this) to localByteBuf and c.asHandler() respectively. This way we will save on creating a new BasicCompletionHandler / a new stack frame per each chunk of data received.

d) Use the beginAsync API at a higher level than BasicCompletionHandler.completed. Maybe in AsynchronousChannelStream.readAsync? Will it give us the equivalent guarantees?

e) Somehow change the implementation of AsyncFunction/AsyncSupplier.finish such that executing all the callbacks in the onion of callbacks we currently build does not require growing the call stack by at least one frame per each callback. I do now know if this is doable, and currently fail to imagine how this could even be achieved.

c) and d) seem to be the most promising to me, with d) being very nice and simple, provided that it preserves the existing guarantees, which is an open question.

We discussed this with @katcharov and @strogiyotec via Zoom, and the idea we ended up with is neither of the aforementioned. It is closest to a combination of c) and e), but is still quite different. What we should do:

Get rid of BasicCompletionHandler and rewrite AsynchronousChannelStream.readAsync via the beginAsync API using AsyncRunnable.thenRunDoWhileLoop or a similar method.

Change AsyncCallbackLoop and RetryingAsyncCallbackSupplier such that the amount of stack memory they consume does not depend on the number of loop iterations / retry attempts.

Currently:

@katcharov has an AI-assisted PoC that passes the below test written by @katcharov in both SameThreadAsyncFunctionsTest and SeparateThreadAsyncFunctionsTest.

I have a no-AI-involved PoC that passes the below test written by @katcharov in both SameThreadAsyncFunctionsTest and SeparateThreadAsyncFunctionsTest.

@vbabanin had ideas when we talked, maybe he'll also have something.

The aforementioned test is

// AsyncFunctionsAbstractTest.java @Test void testThenRunDoWhileLoop() { assertBehavesSameVariations(8, () -> { int i = 0; do { i++; sync(i); } while (i < 3 && plainTest(i)); }, (callback) -> { final int[] i = new int[1]; beginAsync().thenRunDoWhileLoop((c) -> { i[0]++; async(i[0], c); }, () -> i[0] < 3 && plainTest(i[0])).finish(callback); }); }

P.S. ~~I selfishly want the AI-assisted solution to turn out incorrect once we inspect the code, because I don't want to lose this battle to AI—it would be sorrowful.~~ At least I didn't lose to the damned AI.
P.P.S. I inputted that — (em dash) above via option + shift + - in macOS. It wasn't generated by AI.

stIncMale · 2026-02-12T07:36:28Z

@strogiyotec Let's give this PR a descriptive name, which "JAVA-6071" is not.

stIncMale · 2026-02-12T08:34:37Z

driver-core/src/main/com/mongodb/internal/connection/AsynchronousChannelStream.java

We should check other implementations of com.mongodb.connection.AsyncCompletionHandler, java.nio.channels.CompletionHandler we have to see if their completed/failed methods may throw exceptions (they must try their best not to, and the beginAsync API helps with that).

Examples:

AsynchronousChannelStream.AsyncWritableByteChannelAdapter.WriteCompletionHandler seems fine.

com.mongodb.internal.connection.FutureAsyncCompletionHandler is dead code. Let's remove it.

etc.

stIncMale · 2026-02-12T09:10:00Z

driver-core/src/main/com/mongodb/internal/connection/AsynchronousChannelStream.java

@katcharov, it seems to me that AsyncSupplier.finish should be implemented simply as finish(null, callback). Not only this seems to be trivially equivalent to the current code, but also all the AsyncFunctionsAbstractTest tests pass when I do such a change.

- mongodb#1885 - mongodb#1890 - https://github.com/katcharov/mongo-java-driver/commits/async-loop/

[JAVA-6071](https://jira.mongodb.org/browse/JAVA-6071)

fbda5fa

strogiyotec requested a review from a team as a code owner February 11, 2026 19:40

strogiyotec requested review from rozza and vbabanin February 11, 2026 19:40

stIncMale requested changes Feb 12, 2026

View reviewed changes

stIncMale reviewed Feb 12, 2026

View reviewed changes

rename submit to execute

2ee2bda

stIncMale mentioned this pull request Feb 14, 2026

[PoC] Manual tail-call optimization for AsyncCallbackLoop #1889

Draft

stIncMale added a commit to stIncMale/mongo-java-driver that referenced this pull request Feb 18, 2026

combine tests from

e55822a

- mongodb#1885 - mongodb#1890 - https://github.com/katcharov/mongo-java-driver/commits/async-loop/

stIncMale mentioned this pull request Feb 19, 2026

JAVA-5907 #1893

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JAVA-6071#1885

JAVA-6071#1885
strogiyotec wants to merge 2 commits intomongodb:mainfrom
strogiyotec:JAVA-6071

strogiyotec commented Feb 11, 2026 •

edited by stIncMale

Loading

Uh oh!

stIncMale commented Feb 12, 2026 •

edited

Loading

Uh oh!

stIncMale Feb 12, 2026

Uh oh!

stIncMale Feb 12, 2026 •

edited

Loading

Uh oh!

stIncMale Feb 13, 2026

Uh oh!

strogiyotec Feb 14, 2026

Uh oh!

stIncMale Feb 14, 2026 •

edited

Loading

Uh oh!

stIncMale Feb 12, 2026

Uh oh!

stIncMale Feb 12, 2026 •

edited

Loading

Uh oh!

stIncMale Feb 14, 2026 •

edited

Loading

Uh oh!

stIncMale commented Feb 12, 2026

Uh oh!

stIncMale Feb 12, 2026

Uh oh!

stIncMale Feb 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

strogiyotec commented Feb 11, 2026 • edited by stIncMale Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem statement

The fix

Tests

Uh oh!

stIncMale commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stIncMale Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

stIncMale Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stIncMale Feb 13, 2026

Choose a reason for hiding this comment

Uh oh!

strogiyotec Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

stIncMale Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stIncMale Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

stIncMale Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stIncMale Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stIncMale commented Feb 12, 2026

Uh oh!

stIncMale Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

stIncMale Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

strogiyotec commented Feb 11, 2026 •

edited by stIncMale

Loading

stIncMale commented Feb 12, 2026 •

edited

Loading

stIncMale Feb 12, 2026 •

edited

Loading

stIncMale Feb 14, 2026 •

edited

Loading

stIncMale Feb 12, 2026 •

edited

Loading

stIncMale Feb 14, 2026 •

edited

Loading