enhancement(core): Do not terminate when accept fails #24722

Open
fbs wants to merge 2 commits into vectordotdev:master from fbs:24705_retry_accept
@fbs fbs commented Feb 24, 2026

Summary

Errors from the accept syscall are currently treated as fatal and terminate the listener instead of being retried when possible. Nothing indicates that this has happened; the source just stops accepting connections. This is especially bad when Vector acts as a 'central' aggregator.

The first commit changes the core MaybeTlsListener to support retries:

The `accept` syscall can fail[0], e.g. when hitting the open FD limit. This
doesn't have to be a fatal error and the accept can be retried.
Currently it is not retried. Instead the listener is terminated and all
subsequent requests are rejected.

The implementation is simple: if accept fails, it is retried 1 second
later and an error is logged. This can probably be improved, but it is
already an improvement over the current behaviour of terminating the
listener.

The retry behaviour is opt in. In some cases it might be reasonable to
not retry and instead terminate.
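
To illustrate the retry-after-a-delay idea and the opt-in switch described above, here is a minimal sketch. It is not the code in this PR: `OnAcceptError` and `accept_with_policy` are hypothetical names, and a blocking closure with `thread::sleep` stands in for the async listener so the example only needs the standard library.

```rust
use std::io;
use std::thread;
use std::time::Duration;

/// Hypothetical policy describing what to do when `accept` returns an error.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum OnAcceptError {
    /// Propagate the error to the caller, which stops listening.
    Terminate,
    /// Log the error, wait for the given delay, then call `accept` again.
    Retry(Duration),
}

/// Calls `accept` until it yields a connection, or until it fails and the
/// policy says to give up. `accept` is a stand-in for the real accept call.
pub fn accept_with_policy<T, F>(mut accept: F, policy: OnAcceptError) -> io::Result<T>
where
    F: FnMut() -> io::Result<T>,
{
    loop {
        match accept() {
            Ok(conn) => return Ok(conn),
            Err(error) => match policy {
                OnAcceptError::Terminate => return Err(error),
                OnAcceptError::Retry(delay) => {
                    // A real implementation would use the project's logging macros.
                    eprintln!("accept error: {}", error);
                    thread::sleep(delay);
                }
            },
        }
    }
}
```

With `OnAcceptError::Retry(Duration::from_secs(1))` this mirrors the described behaviour of logging and retrying one second later; `OnAcceptError::Terminate` corresponds to the current behaviour.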

The whole thing is a bit ugly and it looks hard to test in a unit test
setup, as fault injection is hard to do there.

[0]: the docs for tokio's TcpListener say:

> Note that accepting a connection can lead to various errors and not
> all of them are necessarily fatal ‒ for example having too many open
> file descriptors or the other side closing the connection while it
> waits in an accept queue. These would terminate the stream if not
> handled in any way
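
One possible refinement, following the distinction in the quoted docs, would be to retry only errors that are likely transient. The sketch below is an assumption rather than part of this PR: `is_transient_accept_error` is a hypothetical helper, and the errno values are the Linux and macOS numbers (24 matches the "os error 24" seen in the logs of this PR).

```rust
use std::io;

// errno values for "too many open files", per process and system-wide.
// Assumption: Linux and macOS numbering; other platforms may differ.
const EMFILE: i32 = 24;
const ENFILE: i32 = 23;

/// Returns true for accept errors that are usually worth retrying:
/// file descriptor exhaustion, a peer that went away while waiting in the
/// accept queue, or an interrupted call.
pub fn is_transient_accept_error(error: &io::Error) -> bool {
    if matches!(error.raw_os_error(), Some(EMFILE) | Some(ENFILE)) {
        return true;
    }
    matches!(
        error.kind(),
        io::ErrorKind::ConnectionAborted
            | io::ErrorKind::ConnectionReset
            | io::ErrorKind::Interrupted
    )
}
```

Which errors belong in the transient set is a judgment call; anything not listed here would still terminate the listener.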

The second commit enables retries for the opentelemetry source only. If this works well there, it can be used in other places.

Vector configuration

How did you test this PR?

before:

$ bash -c "ulimit -n 12; vector --config vector-otel.yaml"
2026-02-24T15:28:23.770354Z  INFO source{component_kind="source" component_id=otel component_type=opentelemetry}: vector::sources::util::grpc: Building gRPC server. address=127.0.0.1:4317
2026-02-24T15:28:23.770808Z  INFO source{component_kind="source" component_id=otel component_type=opentelemetry}: vector::sources::opentelemetry::http: Building HTTP server. address=127.0.0.1:4318

Vector stops listening on port 4318 after getting an EMFILE:

$ nc -v localhost 4318
Connection to localhost port 4318 [tcp/*] succeeded!

$ nc -v localhost 4318
nc: connectx to localhost port 4318 (tcp) failed: Connection refused
nc: connectx to localhost port 4318 (tcp) failed: Connection refused

$ lsof -a -p $(pgrep vector) -iTCP -nn -P | grep LIST
vector  58184 bsmit   10u  IPv4 0xdc9280a9a8422fc1      0t0  TCP 127.0.0.1:4317 (LISTEN)

after:

$ bash -c "ulimit -n 12; target/debug/vector --config vector-otel.yaml"
2026-02-24T15:27:40.457658Z  INFO source{component_kind="source" component_id=otel component_type=opentelemetry}: vector::sources::util::grpc: Building gRPC server. address=127.0.0.1:4317
2026-02-24T15:27:40.459152Z  INFO source{component_kind="source" component_id=otel component_type=opentelemetry}: vector::sources::opentelemetry::http: Building HTTP server. address=127.0.0.1:4318
2026-02-24T15:27:43.001975Z ERROR source{component_kind="source" component_id=otel component_type=opentelemetry}: vector_core::tls::incoming: accept error: Too many open files (os error 24)
2026-02-24T15:27:46.107262Z ERROR source{component_kind="source" component_id=otel component_type=opentelemetry}: vector_core::tls::incoming: Internal log [accept error: Too many open files (os error 24)] is being suppressed to avoid flooding.

Keeps working:

$ nc -v localhost 4318
Connection to localhost port 4318 [tcp/*] succeeded!
$ nc -v localhost 4318
Connection to localhost port 4318 [tcp/*] succeeded!

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

Not sure if this qualifies as user facing.

References

Related: #24705

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details here.

fbs added 2 commits February 23, 2026 22:25
This prevents the otlp listener from terminating when a busy log
aggregator hits its open file limit.
@fbs fbs requested a review from a team as a code owner February 24, 2026 15:30
@github-actions github-actions bot added the domain: sources and domain: core labels Feb 24, 2026