enhancement(core): Do not terminate when accept fails#24722
Open
fbs wants to merge 2 commits intovectordotdev:masterfrom
Open
enhancement(core): Do not terminate when accept fails#24722fbs wants to merge 2 commits intovectordotdev:masterfrom
fbs wants to merge 2 commits intovectordotdev:masterfrom
Conversation
The `accept` syscall can fail[0], e.g. when hitting the open FD limit. This doesn't have to be a fatal error and the accept can be retried. Currently it is not retried. Instead the listener is terminated and all subsequent requests are rejected. The implementation is simple: if accept fails its retried 1 second later and an error is logged. This can probably be improved, but its already an improvement over the current behaviour of terminating the listener. The retry behaviour is opt in. In some cases it might be reasonable to not retry and instead terminate. The whole thing is a bit ugly and it looks hard to test in a unit test setup, as fault injection is hard to do there. [0]: the docs for tokio's TcpListener say: > Note that accepting a connection can lead to various errors and not > all of them are necessarily fatal ‒ for example having too many open > file descriptors or the other side closing the connection while it > waits in an accept queue. These would terminate the stream if not > handled in any way
This prevents the otlp listener from terminating when a busy log aggregator hits its open file limit.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Errors from the
acceptsyscall are currently treated as fatal and terminate the listener, instead of retrying when possible. There is no indication that this happens, the source just stop. This is especially bad when vector acts as a 'central' aggregator.This first changes the core
MaybeTlsListenerto support retries:The 2nd one enables it for the opentelemetry source only. If it works well there it can be used in other places
Vector configuration
How did you test this PR?
before:
vector stops listening after getting an EMFILE:
new
Keeps working:
Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changeloglabel to this PR.Not sure if this qualifies as user facing.
References
Related: #24705
Notes
@vectordotdev/vectorto reach out to us regarding this PR.pre-pushhook, please see this template.make fmtmake check-clippy(if there are failures it's possible some of them can be fixed withmake clippy-fix)make testgit merge origin masterandgit push.Cargo.lock), pleaserun
make build-licensesto regenerate the license inventory and commit the changes (if any). More details here.