Conversation
As far as I can tell, this PR addresses JAVA-5907, so I
```java
executor.submit(() -> {
    try {
        r.run();
    } catch (Throwable e) {
        // catch Throwable to also catch unexpected StackOverflowErrors and similar,
        // to avoid silent loss of exceptions in asynchronous code
        // see JAVA-6071 for an example of such a case
        LOGGER.error("error in submitted runnable in AsynchronousTlsChannelGroup", e);
```
- I don't think this comment is useful to a reader. What we do here seems obvious without any comments.
- Is it not the case that the stack trace of `e` has information about `AsynchronousTlsChannelGroup`? If it is, there is no reason to duplicate that info in the log message, and we should just do `LOGGER.error(null, e)`.
```java
executor.submit(() -> {
    try {
        r.run();
    } catch (Throwable e) {
        // catch Throwable to also catch unexpected StackOverflowErrors and similar,
        // to avoid silent loss of exceptions in asynchronous code
        // see JAVA-6071 for an example of such a case
        LOGGER.error("error in submitted runnable in AsynchronousTlsChannelGroup", e);
```
@vbabanin, in JAVA-5907 you said
> Adding defensive try/catch with explicit assertion logging ensures these errors appear in logs, avoiding silent failures that make debugging and identification of such issues hard.
Could you please clarify what you meant by "assertion logging"? Note that `ErrorHandlingResultCallback` simply does `logger.error("Callback onResult call produced an error", e);`, while `AsyncFunction.finish` is more sophisticated, but when it reaches the "there is nothing we can do" state, it throws the unexpected exception instead of logging it. The latter allows an application to register a default `Thread.UncaughtExceptionHandler` and react to the exception thrown.
P.S. @katcharov, Do you remember why we decided not to log t in AsyncFunction.finish? I wonder if it's because by default such exceptions are logged to the standard error stream (@strogiyotec, do you know this, or maybe you could check what happens to uncaught exceptions when no explicit Thread.UncaughtExceptionHandler is configured?).
> I wonder if it's because by default such exceptions are logged to the standard error stream
I checked this with the following "shebang" script:

```java
#! /usr/bin/env -S java --source 23
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class Script {
    public static final void main(final String... args) throws InterruptedException {
        Thread t = new Thread(() -> {
            throw new RuntimeException("from bare thread");
        });
        t.start();
        t.join();
        try (ExecutorService executorService = Executors.newFixedThreadPool(1)) {
            executorService.execute(() -> {
                throw new RuntimeException("from ExecutorService.execute");
            });
            executorService.submit(() -> {
                throw new RuntimeException("from ExecutorService.submit");
            });
        }
    }
}
```

It produces the following output:

```
Exception in thread "Thread-0" java.lang.RuntimeException: from bare thread
	at Script.lambda$main$0(script:9)
	at java.base/java.lang.Thread.run(Thread.java:1575)
Exception in thread "pool-1-thread-1" java.lang.RuntimeException: from ExecutorService.execute
	at Script.lambda$main$1(script:15)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1575)
```
That is, even when an application does not explicitly register a default Thread.UncaughtExceptionHandler, uncaught exceptions are still printed to the standard error stream (I checked that it's the error and not the output stream separately).
However, it would still be useful for us to log them, because we could suggest in the message that the application register a default `UncaughtExceptionHandler`, at the very least for the purpose of logging uncaught exceptions by means more suitable to the particular application.
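For illustration, here is a minimal sketch of registering such a default handler (the class and names are mine, not the driver's); a real application would call its logging framework where this sketch records the exception:

```java
import java.util.concurrent.atomic.AtomicReference;

public final class UncaughtHandlerDemo {
    static final AtomicReference<Throwable> LAST = new AtomicReference<>();

    // Route uncaught exceptions from any thread into application-chosen handling
    // instead of relying on the default printing to the standard error stream.
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, e) -> {
            LAST.set(e); // a real application would log via its logging framework here
            System.err.println("uncaught in " + thread.getName() + ": " + e);
        });
    }

    public static void main(String[] args) throws InterruptedException {
        install();
        Thread t = new Thread(() -> { throw new RuntimeException("boom"); });
        t.start();
        t.join(); // the handler runs on the dying thread before join returns
        System.out.println("handled: " + LAST.get().getMessage());
    }
}
```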
Thanks @stIncMale. My take is that I would still like the error printed explicitly in the catch block when we use execute.
Currently, I think that we should log in both the place you are suggesting, and in AsyncFunction/AsyncSupplier.finish. We should come back to discussing this.
```diff
 void submit(final Runnable r) {
-    executor.submit(r);
+    executor.submit(() -> {
```
We should replace `submit` with `execute`, and rename `AsynchronousTlsChannelGroup.submit` accordingly (this is our custom method; it is not present in the TLS Channel library).
`Executor.execute` is void, which means that when the passed `Runnable.run` throws, the thread that was executing it has to be abruptly terminated, provided it was a thread managed by the executor (depending on the executor implementation, that may not be the case). `ExecutorService.submit` returns a `Future`, which we ignore, thus hiding the exception from everyone.
Using `Executor.execute` allows an application to register a default `Thread.UncaughtExceptionHandler` and react to the exception thrown.
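A small self-contained demonstration of the difference, assuming a standard `ThreadPoolExecutor` (this is a sketch, not driver code): the `submit`-ed exception is swallowed by the ignored `Future`, while the `execute`-d one reaches the thread's `UncaughtExceptionHandler`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public final class ExecuteVsSubmitDemo {
    public static void main(String[] args) throws Exception {
        AtomicInteger uncaught = new AtomicInteger();
        CountDownLatch handlerRan = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(1, r -> {
            Thread t = new Thread(r);
            t.setUncaughtExceptionHandler((th, e) -> {
                uncaught.incrementAndGet();
                handlerRan.countDown();
            });
            return t;
        });
        // submit: the exception is stored in the returned Future; since the Future
        // is ignored, nothing is reported and the handler never fires.
        pool.submit((Runnable) () -> { throw new RuntimeException("hidden in Future"); });
        // execute: Runnable.run throws, the worker thread dies abruptly, and the
        // thread's UncaughtExceptionHandler is invoked.
        pool.execute(() -> { throw new RuntimeException("reported"); });
        handlerRan.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        System.out.println("uncaught exceptions seen: " + uncaught.get());
    }
}
```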
P.S. The majority of applications should terminate if VirtualMachineError is thrown, because such exceptions may be thrown from unexpected places, leading to violating application invariants. These places may be especially surprising given that these exceptions are asynchronous exceptions.
P.P.S. Let's check the codebase for other inappropriate submit/invokeAll usages where we ignore the returned Futures. We may also look at the usages of ScheduledExecutorService/ScheduledThreadPoolExecutor.
This needs to be discussed in-person either during one of our catchup meetings, or at a dedicated meeting (the latter may be a 1on1 meeting).
- I still don't understand the issue clearly:
  - [just to make sure I understand this part correctly] Is it true that the graph of callbacks and handlers is created successfully, and only when this `c.complete(localByteBuf)` is eventually called do we construct the long stack of method calls?
    - The above means that the actual stack trace, which is not fully provided in the PR description as it's too large, has the aforementioned `complete` call in it? Answer: yes.
  - Why does the issue reproduce only on a standalone deployment?
    - Answer: Because for some reason the standalone deployment responds to the `bulkWrite` command with all the documents in the created cursor, but a replica set responds with only a subset, requiring the driver to execute `getMore`s, which means the driver does not read enough data via each `AsynchronousChannelStream.readAsync` to cause the `StackOverflowError`. We have https://jira.mongodb.org/plugins/servlet/desk/portal/13/HELP-88629, but it looks like it should be completely reformulated with our current understanding.
  - Does TLS result in us reading the server response in smaller chunks (leading to a larger number of chunks) compared to no TLS?
    - Answer: we don't know; this wasn't investigated.
Nonetheless:
- The immediate reaction I have is that we should try to preserve the `beginAsync` API usage here. One of the reasons it was introduced is that implementing callbacks/handlers without it is too error-prone: we still throw exceptions/assertions in callback-based code, and unexpected exceptions are still being thrown, yet we can't practically always properly handle them.
  - That being said, if keeping the `beginAsync` API usage is impossible, the idea of the proposed fix seems right, though I haven't thoroughly reviewed the change to the `BasicCompletionHandler.completed` method.
Alternative ideas:
- a) I have a hunch that passing `localHandler` instead of `c.asHandler()` here may solve the issue.
  - Even if it does, I don't think it's correct, because if we do that, the `beginAsync` API won't be able to know that we have "completed" `localHandler` in a situation when we do it, and `localHandler` throws an exception. This may lead to `localHandler` being "completed" more than once by the `AsyncSupplier.finish` method, which absolutely must not be done. This also leads to not executing any code that is "placed between" `thenSupply` and `finish` (currently, there is none).
  - I left this note only to warn us against trying this.
- b) Instead of passing `localHandler.asCallback()` into `finish` here, pass a callback that first extracts `localHandler` via `getHandlerAndClear`, and then completes it. This way we should be able to pass the current `BasicCompletionHandler` (`this`) into `getChannel().read(...)` here, similarly to how it is proposed in the current PR. But this approach may have the same problem as a).
- c) Instead of creating a new `BasicCompletionHandler` here, we could re-set `BasicCompletionHandler.byteBufReference` and `BaseCompletionHandler.handlerReference` on the existing `BasicCompletionHandler` (`this`) to `localByteBuf` and `c.asHandler()` respectively. This way we will save on creating a new `BasicCompletionHandler` / a new stack frame per each chunk of data received.
- d) Use the `beginAsync` API at a higher level than `BasicCompletionHandler.completed`. Maybe in `AsynchronousChannelStream.readAsync`? Will it give us the equivalent guarantees?
- e) Somehow change the implementation of `AsyncFunction`/`AsyncSupplier.finish` such that executing all the callbacks in the onion of callbacks we currently build does not require growing the call stack by at least one frame per each callback. I do not know if this is doable, and currently fail to imagine how this could even be achieved.

c) and d) seem the most promising to me, with d) being very nice and simple, provided that it preserves the existing guarantees, which is an open question.
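For what it's worth, option e) resembles a classic trampoline: nested completions are queued and drained by the outermost caller in a flat loop, so the call stack never grows with the chain length. A hypothetical sketch (none of these names exist in the driver):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

public final class TrampolineSketch {
    private final Deque<Runnable> queue = new ArrayDeque<>();
    private boolean draining;

    // Schedule a completion; only the outermost call drains the queue,
    // so nested completions never deepen the call stack.
    public void complete(Runnable callback) {
        queue.add(callback);
        if (draining) {
            return; // the outermost complete(...) call will run it
        }
        draining = true;
        try {
            Runnable next;
            while ((next = queue.poll()) != null) {
                next.run();
            }
        } finally {
            draining = false;
        }
    }

    public static void main(String[] args) {
        TrampolineSketch trampoline = new TrampolineSketch();
        int[] count = {0};
        IntConsumer[] step = new IntConsumer[1];
        // Simulate a million chained "read completed" callbacks; with direct
        // recursion this depth would overflow the default stack.
        step[0] = i -> {
            count[0]++;
            if (i < 1_000_000) {
                trampoline.complete(() -> step[0].accept(i + 1));
            }
        };
        trampoline.complete(() -> step[0].accept(1));
        System.out.println("callbacks run: " + count[0]);
    }
}
```

Whether this can be retrofitted into `AsyncFunction`/`AsyncSupplier.finish` while preserving their exception-handling guarantees is exactly the open question raised above.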
We discussed this with @katcharov and @strogiyotec via Zoom, and the idea we ended up with is none of the aforementioned. It is closest to a combination of c) and e), but is still quite different. What we should do:
- Get rid of `BasicCompletionHandler` and rewrite `AsynchronousChannelStream.readAsync` via the `beginAsync` API using `AsyncRunnable.thenRunDoWhileLoop` or a similar method.
- Change `AsyncCallbackLoop` and `RetryingAsyncCallbackSupplier` such that the amount of stack memory they consume does not depend on the number of loop iterations / retry attempts.

Currently:
- @katcharov has an AI-assisted PoC that passes the below test written by @katcharov in both `SameThreadAsyncFunctionsTest` and `SeparateThreadAsyncFunctionsTest`.
- I have a no-AI-involved PoC that passes the below test written by @katcharov in both `SameThreadAsyncFunctionsTest` and `SeparateThreadAsyncFunctionsTest`.
- @vbabanin had ideas when we talked; maybe he'll also have something.
The aforementioned test is:

```java
// AsyncFunctionsAbstractTest.java
@Test
void testThenRunDoWhileLoop() {
    assertBehavesSameVariations(8,
            () -> {
                int i = 0;
                do {
                    i++;
                    sync(i);
                } while (i < 3 && plainTest(i));
            },
            (callback) -> {
                final int[] i = new int[1];
                beginAsync().thenRunDoWhileLoop((c) -> {
                    i[0]++;
                    async(i[0], c);
                }, () -> i[0] < 3 && plainTest(i[0])).finish(callback);
            });
}
```

P.S. I selfishly want the AI-assisted solution to turn out incorrect once we inspect the code, because I don't want to lose this battle to AI—it would be sorrowful. At least I didn't lose to the damned AI.
P.P.S. I inputted that — (em dash) above via option + shift + - in macOS. It wasn't generated by AI.
@strogiyotec Let's give this PR a descriptive name, which "JAVA-6071" is not.
We should check the other implementations of `com.mongodb.connection.AsyncCompletionHandler` and `java.nio.channels.CompletionHandler` we have, to see if their `completed`/`failed` methods may throw exceptions (they must try their best not to, and the `beginAsync` API helps with that).
Examples:
- `AsynchronousChannelStream.AsyncWritableByteChannelAdapter.WriteCompletionHandler` seems fine.
- `com.mongodb.internal.connection.FutureAsyncCompletionHandler` is dead code. Let's remove it.
- etc.
@katcharov, it seems to me that `AsyncSupplier.finish` should be implemented simply as `finish(null, callback)`. Not only does this seem trivially equivalent to the current code, but all the `AsyncFunctionsAbstractTest` tests also pass when I make such a change.
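As a toy model of the claimed equivalence (the interface here is a simplified stand-in, not the driver's actual `SingleResultCallback` or `finish` implementation): a no-result `finish` can delegate to the (result, callback) overload with a null result.

```java
public final class FinishSketch {
    // Simplified stand-in for the driver's callback type.
    interface SingleResultCallback<T> {
        void onResult(T result, Throwable t);
    }

    // The (result, callback) overload: completes the callback with a result.
    static <T> void finish(T result, SingleResultCallback<T> callback) {
        callback.onResult(result, null);
    }

    // The proposed simplification: finish(callback) == finish(null, callback).
    static void finish(SingleResultCallback<Void> callback) {
        finish(null, callback);
    }

    public static void main(String[] args) {
        finish((r, t) -> System.out.println("result=" + r + ", error=" + t));
    }
}
```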
JAVA-6071
JAVA-5907 (this ticket is addressed in #1893)

Problem statement
Original failed test case
The test case only fails on a standalone variant with SSL and `ordered = false`.

After adding a try/catch to the `submit` in the `AsynchronousTlsChannelGroup` executor, I was able to catch an error.

The exception happens because `maxBatchSize` is used to generate a huge payload; then, once the client reads the payload from the server, it wraps the original callback as a handler, and that handler as a callback. Every time this happens we create a new `BasicCompletionHandler`. Once the whole payload is read and the client starts to unwind the callback chain, it occasionally fails with a `StackOverflowError`.
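The unwind failure can be illustrated with a toy model (not the driver's actual classes): each wrap adds one stack frame when the chain is finally completed, so a deep enough chain overflows.

```java
public final class ChainedCallbackOverflow {
    // Toy stand-in for a completion callback.
    interface Callback {
        void completed();
    }

    // Wrapping adds one extra stack frame per layer when the chain unwinds.
    static Callback chain(Callback inner) {
        return () -> inner.completed();
    }

    public static void main(String[] args) {
        Callback cb = () -> System.out.println("done");
        // One wrap per "chunk of data received"; a large payload means many wraps.
        for (int i = 0; i < 1_000_000; i++) {
            cb = chain(cb);
        }
        try {
            cb.completed(); // unwinding needs ~1,000,000 nested frames
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError while unwinding the callback chain");
        }
    }
}
```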
The fix
The fix is to avoid the creation of chained callbacks and only pass a top `this` callback to the chain of reads.

Tests
To test the fix I deliberately used a small stack size locally (`-Xss512k` compared to the default 1M). This PR also changes the stack size for all Evergreen tests so that we can catch any excessive call stacks during testing.

You also have to run the SSL variant locally to reproduce the error.
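To see why a smaller `-Xss` surfaces such issues sooner, a quick depth probe can be used (a hypothetical helper, run with e.g. `java -Xss512k StackDepthProbe.java`): the smaller the stack, the fewer frames fit, so marginal call chains fail earlier.

```java
public final class StackDepthProbe {
    private static int depth = 0;

    // Recurse until the stack overflows, counting the frames reached.
    private static void recurse() {
        depth++;
        recurse();
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            // With -Xss512k this number is roughly half of what the default
            // stack size allows.
            System.out.println("overflowed after " + depth + " frames");
        }
    }
}
```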