Reconnect stale vCenter service instance on socket-level errors (fixes #61983)#69233
Open
ggiesen wants to merge 1 commit into
Open
Reconnect stale vCenter service instance on socket-level errors (fixes #61983)#69233ggiesen wants to merge 1 commit into
ggiesen wants to merge 1 commit into
Conversation
The cached pyVmomi service instance health check in get_service_instance only reconnected on vim.fault.NotAuthenticated. When the cached connection is corrupted at the socket level - notably when salt-cloud's map_providers_parallel inherits the cached service instance across an os.fork() and the shared TLS socket is used from more than one process - CurrentTime() raises ssl.SSLError, BrokenPipeError or ConnectionResetError instead. Those propagated to the caller and failed the operation, which is the SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC seen on alternating cloud.present state IDs in issue saltstack#61983. Treat those socket-level errors the same as an expired session: drop the dead connection and reconnect. Disconnect is guarded because logging out over an already broken socket can raise as well.
ead7b5f to
0b8a40d
Compare
Contributor
Author
|
Re-targeted to 3006.x (was master): this is a bug fix and 3006.x is the oldest supported branch that contains the affected |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Reconnects the cached vCenter/ESXi service instance when its connection has gone stale at the socket level, instead of letting the error escape and fail the operation.
salt.utils.vmware.get_service_instancecaches the pyVmomiServiceInstance(pyVmomi's process global,GetSi()), which holds open, pooled TLS sockets. After fetching it, the function probes the connection withservice_instance.CurrentTime()and reconnects if the probe fails. That probe only caughtvim.fault.NotAuthenticated, so a connection that has been broken/corrupted at the socket level raisedssl.SSLError,BrokenPipeErrororConnectionResetErrorstraight through to the caller.The common trigger is salt-cloud:
Cloud.do_actioncallsmap_providers_parallel, which uses amultiprocessing.Pool. On Linux that forks, and the workers inherit the parent's cached service instance and reuse/tear down the shared TLS sockets (the inheritedatexitDisconnect also runs on worker exit). The parent's next request on those sockets then fails the TLS record MAC check. Because pyVmomi keeps a pool of connections per stub, the request that fails alternates -- which matches the "every othercloud.presentstate ID fails" report in #61983.This PR treats those socket-level errors the same as an expired session: drop the dead connection and reconnect. The
Disconnectis guarded because logging out over an already broken socket can raise as well.This is a recovery-level fix that self-heals every error variant reported. A deeper prevention (not sharing the pyVmomi connection across the pool fork) would be a larger, riskier change; I'm happy to follow up on that separately if the maintainers want it.
Note: if the plan is to retire the VMware code from Salt core in favor of the
saltext.vmwareextension, feel free to discard this PR and close #61983. I checked the extension and it is not affected -- its live modules connect fresh viaconnect.get_service_instance(noGetSi()cache) and there is no salt-cloud fork path there; the matching cachedget_service_instanceinutils/vsphere.pyis unused.What issues does this PR fix or reference?
Fixes #61983
References #58869 (same
ssl.SSLError/BrokenPipeError/ConnectionResetErrorsymptoms and workarounds).Previous Behavior
A cached service instance with a socket-level-broken connection caused
get_service_instanceto raisessl.SSLError(SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC),BrokenPipeErrororConnectionResetError, failingcloud.present/cloud.has_instanceand other vSphere operations.New Behavior
Those socket-level errors are caught by the existing reconnect path; the stale connection is dropped and a fresh one is established transparently.
Merge requirements satisfied?
[NOTICE] Bug fixes or features added to Salt require tests.
changelog/61983.fixed.mdtests/pytests/unit/utils/test_vmware.py(verified failing on current master, passing with this change)Commits signed with GPG?
No