Skip to content

JAVA-6071#1885

Open
strogiyotec wants to merge 2 commits intomongodb:mainfrom
strogiyotec:JAVA-6071
Open

JAVA-6071#1885
strogiyotec wants to merge 2 commits intomongodb:mainfrom
strogiyotec:JAVA-6071

Conversation

@strogiyotec
Copy link
Contributor

@strogiyotec strogiyotec commented Feb 11, 2026

JAVA-6071
JAVA-5907 (this ticket is be addressed in #1893)

Problem statement

Original failed test case
The test case only fails on a standalone variant with SSL and ordered = false
After adding a try catch to the submit in AsynchronousTlsChannelGroup executor I was able to catch an error

14:23:28.914 [async-channel-group-0-handler-executor] ERROR org.mongodb.driver.connection.tls - error in submitted task
java.lang.StackOverflowError: null
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.SingleResultCallback$1.completed(SingleResultCallback.java:51)
	at com.mongodb.connection.AsyncCompletionHandler.lambda$asCallback$0(AsyncCompletionHandler.java:52)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.SingleResultCallback$1.completed(SingleResultCallback.java:51)
	at com.mongodb.connection.AsyncCompletionHandler.lambda$asCallback$0(AsyncCompletionHandler.java:52)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.SingleResultCallback$1.completed(SingleResultCallback.java:51)
	at com.mongodb.connection.AsyncCompletionHandler.lambda$asCallback$0(AsyncCompletionHandler.java:52)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.SingleResultCallback$1.completed(SingleResultCallback.java:51)
	at com.mongodb.connection.AsyncCompletionHandler.lambda$asCallback$0(AsyncCompletionHandler.java:52)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.SingleResultCallback$1.completed(SingleResultCallback.java:51)
	at com.mongodb.connection.AsyncCompletionHandler.lambda$asCallback$0(AsyncCompletionHandler.java:52)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)
	at com.mongodb.internal.async.AsyncSupplier.lambda$finish$0(AsyncSupplier.java:73)

The exception happened because the maxBatchSize is used to generate a huge payload, then once client reads the payload from the server it wraps original callback as handler and handler to callback

  1. Callback to handler
  2. Handler to callback

Every time it happens we create a new BasicCompletionHandler

Once whole payload is read and client starts to unwind the callback chain it fails occasionally on overflow error

The fix

The fix is to avoid the creation of chained callbacks and only pass a top this callback to the chain of reads

Tests

To test the fix I deliberately used a small stack size locally Xss=512kb compared to default 1M
This PR also changes the stack size for all evergreen tests so that we can catch any excessive call stacks during testing
You also have to run SSL variant locally to reproduce the error

@strogiyotec strogiyotec requested a review from a team as a code owner February 11, 2026 19:40
@stIncMale
Copy link
Member

stIncMale commented Feb 12, 2026

As far as I can tell, this PR addresses JAVA-5907, so I

  • added that ticket to the description of the PR;
  • linked that ticket with JAVA-6071;
  • added a link to this PR to that ticket;
  • updated the "Reviewers" on that ticket to be the same as for JAVA-6071;
  • assigned @strogiyotec to that ticket.

Comment on lines 228 to 235
executor.submit(() -> {
try {
r.run();
} catch (Throwable e) {
// catch Throwable to also catch unexpected StackoverflowErrors and similar,
// to avoid silent loss of exceptions in asynchronous code
// see JAVA-6071 for an example of such a case
LOGGER.error("error in submitted runnable in AsynchronousTlsChannelGroup", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I don't think this comment is useful to a reader. What we do here seems obvious without any comments.
  2. Is it not the case that the stack trace of e has information about AsynchronousTlsChannelGroup? If it is the case, then there is no reason to duplicate that info in the log message, and we should just do LOGGER.error(null, e).

Comment on lines 228 to 235
executor.submit(() -> {
try {
r.run();
} catch (Throwable e) {
// catch Throwable to also catch unexpected StackoverflowErrors and similar,
// to avoid silent loss of exceptions in asynchronous code
// see JAVA-6071 for an example of such a case
LOGGER.error("error in submitted runnable in AsynchronousTlsChannelGroup", e);
Copy link
Member

@stIncMale stIncMale Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vbabanin, in JAVA-5907 you said

Adding defensive try/catch with explicit assertion logging ensures these errors appear in logs, avoiding silent failures that make debugging and identification of such issues hard.

Could you please clarify what you meant by "assertion logging"? Note that ErrorHandlingResultCallback simply does

logger.error("Callback onResult call produced an error", e);

while AsyncFunction.finish is more sophisticated, but when it reaches the "there is nothing we can do" state, it throws the unexpected exception instead of logging it. The latter allows an application to register a default Thread.UncaughtExceptionHandler, and react to the exception thrown.

P.S. @katcharov, Do you remember why we decided not to log t in AsyncFunction.finish? I wonder if it's because by default such exceptions are logged to the standard error stream (@strogiyotec, do you know this, or maybe you could check what happens to uncaught exceptions when no explicit Thread.UncaughtExceptionHandler is configured?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's because by default such exceptions are logged to the standard error stream

I checked this with the following "shebang" script:

#! /usr/bin/env -S java --source 23

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class Script {
    public static final void main(final String... args) throws InterruptedException {
        Thread t = new Thread(() -> {
            throw new RuntimeException("from bare thread");
        });
        t.start();
        t.join();
        try (ExecutorService executorService = Executors.newFixedThreadPool(1)) {
            executorService.execute(() -> {
                throw new RuntimeException("from ExecutorService.execute");
            });
            executorService.submit(() -> {
                throw new RuntimeException("from ExecutorService.submit");
            });
        }
    }
}

It produces the following output:

Exception in thread "Thread-0" java.lang.RuntimeException: from bare thread
	at Script.lambda$main$0(script:9)
	at java.base/java.lang.Thread.run(Thread.java:1575)
Exception in thread "pool-1-thread-1" java.lang.RuntimeException: from ExecutorService.execute
	at Script.lambda$main$1(script:15)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1575)

That is, even when an application does not explicitly register a default Thread.UncaughtExceptionHandler, uncaught exceptions are still printed to the standard error stream (I checked that it's the error and not the output stream separately).

However, it would still be useful for us to log them, because we could suggest in the message that the application should register a default UncaughtExceptionHandler, at the very lest for the purpose of logging the uncaught exceptions via the means more suitable to the particular application.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @stIncMale , my take is that I would like to have the error printed explicitly in catch when we will use execute

Copy link
Member

@stIncMale stIncMale Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I think that we should log in both the place you are suggesting, and in AsyncFunction/AsyncSupplier.finish. We should come back to discussing this.


void submit(final Runnable r) {
executor.submit(r);
executor.submit(() -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should replace submit with execute, and rename AsynchronousTlsChannelGroup.submit accordingly (this is our custom method, it is not present in the TLS Channel library.

Executor.execute is void, which means that when the passed Runnable.run throws, the thread that was executing it has to be abruptly terminated, provided that it was a thread managed by the executor (depending on the executor implementation, it may not be the case). ExecutorService.submit returns a Future, which we ignore, thus hiding the exception from anyone.

Us using Executor.execute allows an application to register a default Thread.UncaughtExceptionHandler, and react to the exception thrown.

P.S. The majority of applications should terminate if VirtualMachineError is thrown, because such exceptions may be thrown from unexpected places, leading to violating application invariants. These places may be especially surprising given that these exceptions are asynchronous exceptions.

P.P.S. Let's check the codebase for other inappropriate submit/invokeAll usages where we ignore the returned Futures. We may also look at the usages of ScheduledExecutorService/ScheduledThreadPoolExecutor.

Copy link
Member

@stIncMale stIncMale Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be discussed in-person either during one of our catchup meetings, or at a dedicated meeting (the latter may be a 1on1 meeting).

  • I still don't understand the issue clearly:
    • [just to make sure I understand this part correctly] Is it true that the graph of callbacks and handlers is created successfully, and only when this c.complete(localByteBuf) is eventually called, do we construct the long stack of method calls?
      • The above means that the actual stack trance, which is not fully provided in the PR description as it's too large, has the aforementioned complete call in it? Answer: yes.
    • Why does the issue reproduce only on a standalone deployment?
      • Answer: Because for some reason the standalone deployment responds to the bulkWrite command with all the documents in the created cursor, but a replicaset responds with only a subset, requiring the driver to execute getMores, which means, the driver does not read enough data via each AsynchronousChannelStream.readAsync to cause the StackOverflowError. We have https://jira.mongodb.org/plugins/servlet/desk/portal/13/HELP-88629, but it looks like it should be completely reformulated with our current understanding.
    • Does TLS results in us reading the server response in smaller chunks (leading to a larger number of chunks) compared to no TLS?
      • Answer: we don't know, this wasn't investigated.

Nonetheless

  • The immediate reaction I have is that we should try to preserve the use of the beginAsync API usage here. One of the reasons it was introduced is that implementing callbacks/handlers without it is too error-prone: we still throw exceptions/assertions in callback-based code, and unexpected exceptions are still being thrown, yet we can't practically always properly handle them.
    • That being said, if keeping the beginAsync API usage is impossible, the idea of the fix proposed seems right, though I haven't thoroughly reviewed the change to the BasicCompletionHandler.completed method.

Alternative ideas:

  • a) I have a hunch that passing localHandler instead of c.asHandler() here may solve the issue.
    • Even if it does, I don't think it's correct, because if we do that, the beginAsync API won't be able to know that we have "completed" localHandler in a situation when we do it, and localHandler throws an exception. This may lead to localHandler being "completed" more than once by the AsyncSupplier.finish method, which absolutely must not be done. This also leads to not executing any code that is "placed between" thenSupply and finish (currently, there is none).
      • I left this note only to warn us from trying this.
  • b) Instead of passing localHandler.asCallback() into finish here, pass a callback that at first extracts localHandler via getHandlerAndClear, and then completes it. This way we should be able to pass the current BasicCompletionHandler (this) into getChannel().read(...) here, similarly to how it is proposed in the current PR. But this approach may have the same problem as a).
  • c) Instead of creating a new BasicCompletionHandler here, we could re-set BasicCompletionHandler.byteBufReference and BaseCompletionHandler.handlerReference on the existing BasicCompletionHandler (this) to localByteBuf and c.asHandler() respectively. This way we will save on creating a new BasicCompletionHandler / a new stack frame per each chunk of data received.
  • d) Use the beginAsync API at a higher level than BasicCompletionHandler.completed. Maybe in AsynchronousChannelStream.readAsync? Will it give us the equivalent guarantees?
  • e) Somehow change the implementation of AsyncFunction/AsyncSupplier.finish such that executing all the callbacks in the onion of callbacks we currently build does not require growing the call stack by at least one frame per each callback. I do now know if this is doable, and currently fail to imagine how this could even be achieved.

c) and d) seem to be the most promising to me, with d) being very nice and simple, provided that it preserves the existing guarantees, which is an open question.

Copy link
Member

@stIncMale stIncMale Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this with @katcharov and @strogiyotec via Zoom, and the idea we ended up with is neither of the aforementioned. It is closest to a combination of c) and e), but is still quite different. What we should do:

  1. Get rid of BasicCompletionHandler and rewrite AsynchronousChannelStream.readAsync via the beginAsync API using AsyncRunnable.thenRunDoWhileLoop or a similar method.
  2. Change AsyncCallbackLoop and RetryingAsyncCallbackSupplier such that the amount of stack memory they consume does not depend on the number of loop iterations / retry attempts.

Currently:

  • @katcharov has an AI-assisted PoC that passes the below test written by @katcharov in both SameThreadAsyncFunctionsTest and SeparateThreadAsyncFunctionsTest.
  • I have a no-AI-involved PoC that passes the below test written by @katcharov in both SameThreadAsyncFunctionsTest and SeparateThreadAsyncFunctionsTest.
  • @vbabanin had ideas when we talked, maybe he'll also have something.

The aforementioned test is

// AsyncFunctionsAbstractTest.java
@Test
void testThenRunDoWhileLoop() {
    assertBehavesSameVariations(8,
            () -> {
                int i = 0;
                do {
                    i++;
                    sync(i);
                } while (i < 3 && plainTest(i));
            },
            (callback) -> {
                final int[] i = new int[1];
                beginAsync().thenRunDoWhileLoop((c) -> {
                    i[0]++;
                    async(i[0], c);
                }, () -> i[0] < 3 && plainTest(i[0])).finish(callback);
            });
}

P.S. I selfishly want the AI-assisted solution to turn out incorrect once we inspect the code, because I don't want to lose this battle to AI—it would be sorrowful. At least I didn't lose to the damned AI.
P.P.S. I inputted that (em dash) above via option + shift + - in macOS. It wasn't generated by AI.

@stIncMale
Copy link
Member

@strogiyotec Let's give this PR a descriptive name, which "JAVA-6071" is not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check other implementations of com.mongodb.connection.AsyncCompletionHandler, java.nio.channels.CompletionHandler we have to see if their completed/failed methods may throw exceptions (they must try their best not to, and the beginAsync API helps with that).

Examples:

  1. AsynchronousChannelStream.AsyncWritableByteChannelAdapter.WriteCompletionHandler seems fine.
  2. com.mongodb.internal.connection.FutureAsyncCompletionHandler is dead code. Let's remove it.
  3. etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@katcharov, it seems to me that AsyncSupplier.finish should be implemented simply as finish(null, callback). Not only this seems to be trivially equivalent to the current code, but also all the AsyncFunctionsAbstractTest tests pass when I do such a change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments