feat(server): support TLS certificate hot-reload#1870
Conversation
|
All contributors have signed the DCO ✍️ ✅ |
|
I have read the DCO document and I hereby sign the DCO. |
|
recheck |
|
Thanks for taking this on and working to resolve the cert-manager renewal issue. One design concern: this currently implements TLS hot-reload as a timed polling loop rather than a file watcher. The worker uses Can we make this event-driven instead? A file watcher over the parent directories of That would make renewal detection immediate, reduce configuration surface, and avoid polling races where a tick lands during a projected-volume update and then waits a full interval before recovering. |
|
@TaylorMutch Appreciate your key design review and pointers. Now it has been refactored from polling-based to watch-based. PR description has been updated. I have tested the new code changes end-end locally, please find the steps and results in "E2E test record" toggle under Testing in the description. Refactor summaryChanges from polling-based to watch-based
Key behavioral differences vs polling approach
|
|
/ok to test 7d17a98 |
|
I found one issue worth fixing before relying on this test coverage:
Could you update this test to assert the served cert DER/fingerprint changed, similar to |
|
Label |
Signed-off-by: Yuedong Wu <dwcn22@outlook.com>
Signed-off-by: Yuedong Wu <dwcn22@outlook.com>
Incorporated by capturing the original cert DER before the file change and comparing the served cert DER in the retry loop. Now the loop only breaks when the cert actually differs. See diff: https://github.com/NVIDIA/OpenShell/compare/7d17a9819f52855a8e9c2c8e27b062ca7a60eb45..29d57cc38969e340b3700edcadad32b1e629a5fc |
Summary
Add file-watcher-based TLS certificate hot-reload to the gateway, allowing cert/key/CA rotation without restarting. Uses
notifyto watch parent directories andArcSwapfor atomic config swapping so in-flight TLS handshakes are never blocked.Related Issue
Fixes #1836
Changes
TlsAcceptorinternals withArcSwap<ServerConfig>for lock-free atomic config swapsnotify::recommended_watcher— watches cert/key/CA parent directories and reloads on filesystem changes with a 1-second debounce (handles Kubernetes Secret atomic-swap patterns)AtomicBoolguard to prevent duplicatespawn_reload_worker()callsConfigStateChangeevents on reload success (Informational) and failure (Medium)rustls::crypto::ring::sign— surfaces bad key types at startup instead of handshake timegenerate_test_certs_with_ca,install_rustls_provider,write_test_file) intotls_test_utils.rsTesting
mise run pre-commitpassesE2E test record
Step 1: Create cluster
Step 2: Enable TLS
Create
deploy/helm/openshell/ci/values-reload-test.yaml:Add to
deploy/helm/openshell/skaffold.yamlafter- ci/values-skaffold.yaml:- ci/values-reload-test.yamlStep 3: Deploy
Output:
Step 4: Verify file watcher started
KUBECONFIG=kubeconfig kubectl -n openshell logs openshell-0 | grep -i watcherOutput:
Step 5: Capture initial cert fingerprint
Output:
Step 6: Overwrite TLS secret (simulate cert-manager renewal)
Output:
Step 7: Verify new cert is served — no pod restart
Wait for kubelet sync + watcher detection:
7a. Pod restarts
Output:
7b. Current cert from gateway
Output:
7c. Cert stored in Secret (should match gateway)
Output:
7d. Reload logs
Output:
Results
AC:47:<...>:2C4F:01:<...>:FF/etc/openshell-tls/server,/etc/openshell-tls/client-caCONFIG:RELOADED [INFO]The file watcher detected the updated certificate in the watched directory, triggered a reload after the 1-second debounce window, and atomically swapped the active TLS config. Zero pod restarts. The OCSF log entry provides observability that was completely missing before.
Checklist