Reconnect stale vCenter service instance on socket-level errors (fixes #61983) #69233

Open
ggiesen wants to merge 1 commit into saltstack:3006.x from ggiesen:fix-61983-vmware-ssl-reconnect

Conversation

Contributor

ggiesen commented May 27, 2026

What does this PR do?

Reconnects the cached vCenter/ESXi service instance when its connection has gone stale at the socket level, instead of letting the error escape and fail the operation.

salt.utils.vmware.get_service_instance caches the pyVmomi ServiceInstance (pyVmomi's process global, GetSi()), which holds open, pooled TLS sockets. After fetching it, the function probes the connection with service_instance.CurrentTime() and reconnects if the probe fails. The probe caught only vim.fault.NotAuthenticated, so when the connection was broken or corrupted at the socket level, ssl.SSLError, BrokenPipeError or ConnectionResetError propagated straight to the caller.

The common trigger is salt-cloud: Cloud.do_action calls map_providers_parallel, which uses a multiprocessing.Pool. On Linux that forks, and the workers inherit the parent's cached service instance and reuse/tear down the shared TLS sockets (the inherited atexit Disconnect also runs on worker exit). The parent's next request on those sockets then fails the TLS record MAC check. Because pyVmomi keeps a pool of connections per stub, the request that fails alternates -- which matches the "every other cloud.present state ID fails" report in #61983.
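
The sharing described above can be illustrated with a minimal sketch. It is not Salt or pyVmomi code: a plain `socket.socketpair()` stands in for the pooled TLS socket, `os.fork()` stands in for the `multiprocessing.Pool` fork, and the name `demo_shared_socket` is hypothetical. It runs only on platforms that provide `os.fork()`.

```python
import os
import socket


def demo_shared_socket():
    """Show that a forked child writes to the very socket the parent holds."""
    # Stand-in for the cached connection (assumption: a plain socket pair,
    # not TLS). `peer` plays the role of the server end.
    cached, peer = socket.socketpair()

    pid = os.fork()
    if pid == 0:
        # Child: the inherited socket object refers to the same underlying
        # connection, so these bytes land in the parent's stream.
        try:
            cached.sendall(b"child")
        finally:
            os._exit(0)  # exit immediately, skipping inherited atexit handlers

    os.waitpid(pid, 0)
    cached.sendall(b"+parent")

    # Read until both writes have arrived at the "server" end.
    expected_len = len(b"child+parent")
    data = b""
    while len(data) < expected_len:
        data += peer.recv(64)

    cached.close()
    peer.close()
    return data
```

The peer receives one stream containing bytes from both processes. With a TLS connection, each process would also carry its own copy of the session state, which is consistent with the record MAC failure described in this PR.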

This PR treats those socket-level errors the same as an expired session: drop the dead connection and reconnect. The Disconnect is guarded because logging out over an already broken socket can raise as well.
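
A minimal sketch of that recovery path, using stand-ins rather than the real Salt code: `NotAuthenticated` is a local placeholder for `vim.fault.NotAuthenticated`, and `get_healthy_instance`, `connect` and `disconnect` are hypothetical names introduced for illustration. The actual change lives in `salt.utils.vmware.get_service_instance` and may be structured differently.

```python
import ssl


class NotAuthenticated(Exception):
    """Placeholder for vim.fault.NotAuthenticated (assumption for this sketch)."""


# Errors that mean "the cached connection is unusable, reconnect".
STALE_CONNECTION_ERRORS = (
    NotAuthenticated,
    ssl.SSLError,
    BrokenPipeError,
    ConnectionResetError,
)


def get_healthy_instance(cached, connect, disconnect):
    """Return `cached` if its probe succeeds, otherwise a fresh connection."""
    try:
        cached.CurrentTime()  # cheap round trip used as a health probe
        return cached
    except STALE_CONNECTION_ERRORS:
        try:
            disconnect(cached)
        except Exception:
            # Logging out over a broken socket can itself raise; ignore it.
            pass
        return connect()
```

Passing `connect` and `disconnect` in as callables keeps the sketch free of pyVmomi imports and makes the guarded disconnect easy to exercise with fakes.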

This is a recovery-level fix that self-heals every error variant reported. A deeper prevention (not sharing the pyVmomi connection across the pool fork) would be a larger, riskier change; I'm happy to follow up on that separately if the maintainers want it.

Note: if the plan is to retire the VMware code from Salt core in favor of the saltext.vmware extension, feel free to discard this PR and close #61983. I checked the extension and it is not affected -- its live modules connect fresh via connect.get_service_instance (no GetSi() cache) and there is no salt-cloud fork path there; the matching cached get_service_instance in utils/vsphere.py is unused.

What issues does this PR fix or reference?

Fixes #61983
References #58869 (same ssl.SSLError / BrokenPipeError / ConnectionResetError symptoms and workarounds).

Previous Behavior

A cached service instance whose connection was broken at the socket level caused get_service_instance to raise ssl.SSLError (SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC), BrokenPipeError or ConnectionResetError, failing cloud.present / cloud.has_instance and other vSphere operations.

New Behavior

Those socket-level errors are caught by the existing reconnect path; the stale connection is dropped and a fresh one is established transparently.

Merge requirements satisfied?

[NOTICE] Bug fixes or features added to Salt require tests.

  • Docs
  • Changelog - changelog/61983.fixed.md
  • Tests written/updated - tests/pytests/unit/utils/test_vmware.py (verified failing on current master, passing with this change)

Commits signed with GPG?

No

The cached pyVmomi service instance health check in get_service_instance
only reconnected on vim.fault.NotAuthenticated. When the cached connection
is corrupted at the socket level - notably when salt-cloud's
map_providers_parallel inherits the cached service instance across an
os.fork() and the shared TLS socket is used from more than one process -
CurrentTime() raises ssl.SSLError, BrokenPipeError or ConnectionResetError
instead. Those propagated to the caller and failed the operation, which is
the SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC seen on alternating
cloud.present state IDs in issue saltstack#61983.

Treat those socket-level errors the same as an expired session: drop the
dead connection and reconnect. Disconnect is guarded because logging out
over an already broken socket can raise as well.

ggiesen force-pushed the fix-61983-vmware-ssl-reconnect branch from ead7b5f to 0b8a40d May 27, 2026 15:39
ggiesen changed the base branch from master to 3006.x May 27, 2026 15:39

Contributor Author

ggiesen commented May 27, 2026

Re-targeted to 3006.x (was master): this is a bug fix and 3006.x is the oldest supported branch that contains the affected salt.utils.vmware code, per Salt's contributing guide. It will merge forward to 3007.x/3008.x/master.

dwoz added the test:full (Run the full test suite) label May 29, 2026

Labels

test:full Run the full test suite

Development

Successfully merging this pull request may close these issues.

[BUG] SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC on alternating cloud.present state IDs during state.apply on vSphere 7.0

2 participants