fix: add retry logic for tunnel reconnection in jmp shell proxy#679
fix: add retry logic for tunnel reconnection in jmp shell proxy#679ambient-code[bot] wants to merge 9 commits into
Conversation
|
CI is failing due to diff coverage being at 8% (threshold is 80%). The new retry logic in Adding tests for:
This brings diff coverage to 100%. |
|
Agreed with the review feedback. The duplicated Dial call in the router retry block is messy and needs refactoring. Fix plan:
Working on it now. |
Review Status UpdateThe refactoring requested by @mangelajo has been completed in commit 534f7b6. The duplicated Dial code has been eliminated by:
CI StatusAll checks are passing except for a flaky e2e test on ARM64:
The flaky test is in the hooks execution path, while this PR modifies the tunnel connection/retry logic. These are separate concerns. Would you like me to:
|
Review Update (2026-05-26)Checking the PR status: Comments & Reviews
CI StatusThe only failing check is an e2e test that appears unrelated to this PR:
Both ARM64 and AMD64 e2e test runs failed with the same issue, but all other checks are passing (lint, unit tests, type checks, other e2e compatibility tests). This appears to be a flaky test unrelated to the PR changes. Would you like me to:
@mangelajo - Since you requested the refactoring changes and they've been completed, please let me know your preference for handling the flaky e2e test. |
|
I've re-triggered the failed e2e tests to see if they pass on retry. |
|
The e2e test re-run failed because the build artifacts are no longer available (GitHub Actions artifact limitation when re-running individual jobs). To properly re-run the e2e tests, we would need to trigger a full workflow re-run from the beginning, which would rebuild all artifacts. However, given that:
I recommend one of:
|
CI Update (2026-05-27)All CI checks are now passing, including the previously flaky e2e tests:
The flaky All review feedback from @mangelajo has been addressed (code refactored in commit 534f7b6 to eliminate duplicated Dial code). This PR is ready for re-review. |
|
@ambient-code rebase onto main |
When the tunnel between the local proxy and the jumpstarter-router drops during a jmp shell session, subsequent j commands would time out with SETTINGS frame timeout errors because: 1. The Dial call to the controller could fail with transient UNAVAILABLE errors, but there was no retry logic for these (only FAILED_PRECONDITION was retried). 2. The connect_router_stream could hang indefinitely trying to establish an HTTP/2 connection to an unreachable router endpoint, with no timeout on the channel readiness check. This commit fixes both issues: - Adds retry with exponential backoff for transient gRPC errors (UNAVAILABLE, RESOURCE_EXHAUSTED, ABORTED, INTERNAL) in the Dial call within handle_async. - Adds retry with exponential backoff for the router connection establishment, including re-dialing to get fresh router tokens when retrying. - Adds a channel_ready() timeout (10s) in connect_router_stream so that connections to unreachable routers fail fast instead of hanging on the HTTP/2 SETTINGS frame exchange. Fixes #638 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add unit tests covering the new retry logic in handle_async (lease.py) and the channel_ready timeout in connect_router_stream (streams.py) to satisfy the 80% diff coverage requirement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract _dial_and_connect() to perform Dial + router connection as a single atomic operation. This eliminates the duplicated Dial call that was in the separate router retry block, addressing the code review feedback about entangled and repeated code. The single retry loop in handle_async now retries the full _dial_and_connect() on transient errors, which naturally handles both Dial failures and router connection failures with the same backoff logic and always gets a fresh router token on retry. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
534f7b6 to
4b8e7b4
Compare
|
Done — rebased onto main (3 commits, no conflicts). CI should kick off shortly on the new head |
CI Status Update (post-rebase)After rebasing onto main, CI is mostly passing except for one test failure on macOS: Failed Test: Analysis:
Passing Checks:
The test failure is unrelated to the PR changes. Would you like me to:
|
CI Update (2026-05-28)The post-rebase CI run had one failure in Another PR ( Pushed an empty commit ( |
| grpc.StatusCode.UNAVAILABLE, | ||
| grpc.StatusCode.RESOURCE_EXHAUSTED, | ||
| grpc.StatusCode.ABORTED, | ||
| grpc.StatusCode.INTERNAL, |
There was a problem hiding this comment.
Good question! UNKNOWN is a bit of a gray area for retries. In gRPC, UNKNOWN is returned when:
- The server raises an exception without an explicit status code (often a bug/unhandled exception)
- An RPC receives a status it doesn't understand
- The server crashes mid-response
Unlike UNAVAILABLE or RESOURCE_EXHAUSTED which are clearly transient network/load conditions, UNKNOWN can signal a persistent server-side bug that retrying won't fix. Including it risks masking real errors by silently retrying indefinitely until the timeout expires.
That said, there are scenarios where a transient server crash could surface as UNKNOWN, and in those cases a retry would help. I'd lean toward adding it with a comment noting the trade-off, since the retry is bounded by dial_timeout anyway. If you feel it's worth including, I'll add it.
What's your preference?
There was a problem hiding this comment.
Specifically with "watch channel closed" error
There was a problem hiding this comment.
Good point -- the "watch channel closed" error surfacing as UNKNOWN is clearly a transient tunnel teardown, not a permanent server bug. I'll add StatusCode.UNKNOWN to _TRANSIENT_GRPC_CODES with a comment noting this specific scenario.
There was a problem hiding this comment.
no, but we can't do this for any UNKNOWN status
There was a problem hiding this comment.
Agreed -- blindly retrying every UNKNOWN error is too broad and could mask real bugs. I'll remove UNKNOWN from the blanket _TRANSIENT_GRPC_CODES set and instead add a targeted check that only retries UNKNOWN when the error details contain the specific "watch channel closed" message. That way we handle the tunnel teardown case you flagged without swallowing unrelated errors.
Include grpc.StatusCode.UNKNOWN in _TRANSIENT_GRPC_CODES since tunnel teardowns (e.g. "watch channel closed") surface as UNKNOWN rather than UNAVAILABLE. The retry is still bounded by dial_timeout so this won't mask persistent server bugs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix planAddressing @bennyz's feedback and CI failures:
|
Remove StatusCode.UNKNOWN from the blanket _TRANSIENT_GRPC_CODES set to avoid masking unrelated server-side errors. Instead, add a targeted check via _TRANSIENT_UNKNOWN_MESSAGES that only retries UNKNOWN when the error details contain known transient messages like "watch channel closed". Also fixes lint E501 line-too-long in tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rage - Add inline comments explaining why transient-error and OSError paths return silently instead of raising (handle_async runs inside TemporaryUnixListener.serve's task group) - Add test_exponential_backoff_delay_values to verify sleep delays follow the expected pattern (0.3, 0.6, 1.2, 2.4, 4.8, 5.0) - Add type annotation for channel_ready_timeout parameter in connect_router_stream Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addressing @raballew's review feedbackPushed commit 661fa73 with the following changes:
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Lint fix pushed (commit 6c9f1db)The Fixed by adding |
CI Status Update (2026-05-29)After the lint fix (commit 6c9f1db), the latest CI run completed with two unrelated failures: 1. e2e-tests (ARM64) --
2. linkcheck -- Timeout on
Passing checks: lint-python, type-check, all pytest-matrix jobs (Python 3.11/3.12/3.13 on Ubuntu and macOS), e2e-compat-old-client, e2e-compat-old-controller, build, all image builds. Review status: All feedback from @mangelajo, @bennyz, and @raballew has been addressed with code changes and reactions. The PR is ready for re-review. |
| response = await self.controller.Dial(jumpstarter_pb2.DialRequest(lease_name=self.name)) | ||
| async with connect_router_stream( | ||
| response.router_endpoint, response.router_token, stream, self.tls_config, self.grpc_options | ||
| ): | ||
| pass |
There was a problem hiding this comment.
[HIGH] channel_ready_timeout is not bounded by the remaining dial_timeout deadline. _dial_and_connect calls connect_router_stream without passing channel_ready_timeout, so it defaults to 10s. The retry loop caps backoff sleep by remaining time, but channel.channel_ready() inside connect_router_stream is independent. When only 1-2s remain on dial_timeout, a single _dial_and_connect call can block for up to 10s, causing total wall-clock time to overshoot dial_timeout by up to 10s.
Suggestion: Pass remaining as an upper bound: await self._dial_and_connect(stream, channel_ready_timeout=max(min(10, remaining), 0.5)).
AI-generated, human reviewed
There was a problem hiding this comment.
Good catch. Will pass min(10, remaining) (with a 0.5s floor) as channel_ready_timeout to _dial_and_connect so the channel-ready wait is bounded by the overall deadline.
| # Return instead of raising: handle_async runs inside | ||
| # TemporaryUnixListener.serve's task group, so an | ||
| # unhandled exception would crash the listener. | ||
| return |
There was a problem hiding this comment.
[MEDIUM] Inconsistent raise vs return on timeout exhaustion. When remaining <= 0: the FAILED_PRECONDITION path raises (line 367), but transient gRPC errors return silently here (line 396), and OSError returns silently (line 436). The FAILED_PRECONDITION raise could crash the listener and terminate all sibling connections, while the other paths just exit quietly.
Suggestion: Either change the pre-existing raise to return with a warning log for consistency, or document the intentional distinction near the FAILED_PRECONDITION branch.
AI-generated, human reviewed
There was a problem hiding this comment.
Agreed. Will change the FAILED_PRECONDITION timeout path to return with a warning log for consistency -- since handle_async runs inside TemporaryUnixListener.serve's task group, raising would crash the listener and all sibling connections.
| "Connection failed with %s, retrying in %.1fs (attempt %d, %.1fs remaining): %s", | ||
| e.code().name, | ||
| delay, | ||
| attempt + 1, | ||
| remaining, | ||
| e.details(), |
There was a problem hiding this comment.
[MEDIUM] Three exception handlers repeat the same 4-line backoff pattern: delay = min(base_delay * (2**attempt), max_delay, remaining), log, await sleep(delay), attempt += 1, continue. The handle_async method is around 95 lines with deep nesting. Changing the backoff strategy (e.g., adding jitter) would require coordinated edits in three places.
Suggestion: Extract a helper like _compute_retry_delay(attempt, remaining) -> float to deduplicate the delay calculation while keeping the distinct log messages and timeout behaviors.
AI-generated, human reviewed
There was a problem hiding this comment.
Agreed. Will extract _retry_delay(attempt, remaining) to centralize the backoff computation. The distinct log messages and timeout behaviors will stay in the caller.
| @@ -1,10 +1,13 @@ | |||
| import asyncio | |||
There was a problem hiding this comment.
[MEDIUM] No test covers FAILED_PRECONDITION retry or timeout behavior. The refactoring changed this path materially: before, it retried only Dial(); after, it retries _dial_and_connect() (Dial + router connection). No test exercises FAILED_PRECONDITION "not ready" retry-then-success, and no test verifies that FAILED_PRECONDITION timeout raises.
Suggestion: Add (1) a test that raises FAILED_PRECONDITION "not ready" on first attempt, succeeds on second, and verifies retry happens; and (2) a test with dial_timeout=0.0 that asserts pytest.raises(AioRpcError) on FAILED_PRECONDITION.
AI-generated, human reviewed
There was a problem hiding this comment.
Agreed. Will add both: (1) FAILED_PRECONDITION "not ready" retry-then-success, and (2) FAILED_PRECONDITION with dial_timeout=0.0 that returns gracefully (matching the new consistent behavior).
| @@ -356,10 +416,24 @@ async def handle_async(self, stream): | |||
| else: | |||
| logger.warning("Connection to exporter lost: %s", e.details()) | |||
There was a problem hiding this comment.
[MEDIUM] No test verifies that PERMISSION_DENIED sets lease_transferred = True. When handle_async receives an AioRpcError with "permission denied" in the details, it sets self.lease_transferred = True. The surrounding control flow was restructured (errors now flow through _dial_and_connect and the new is_transient gate), making regression possible.
Suggestion: Add a test that raises AioRpcError(PERMISSION_DENIED, "permission denied") from _dial_and_connect and asserts lease.lease_transferred is True.
AI-generated, human reviewed
There was a problem hiding this comment.
Good point. Will add a test that raises AioRpcError(PERMISSION_DENIED, "permission denied") from _dial_and_connect and asserts lease.lease_transferred is True.
| """Dial the controller and connect to the router stream. | ||
|
|
||
| Performs a single Dial + router connection attempt. Raises on failure | ||
| so the caller can decide whether to retry. | ||
| """ |
There was a problem hiding this comment.
[LOW] The docstring's first line "Dial the controller and connect to the router stream" restates the method name _dial_and_connect. The second sentence ("Raises on failure so the caller can decide whether to retry") provides useful contract info though.
Suggestion: Trim to something like: """Single attempt; raises on failure for caller-driven retry."""
AI-generated, human reviewed
There was a problem hiding this comment.
Fair. Will trim to: """Single attempt; raises on failure for caller-driven retry."""
| with self.portal.wrap_async_context_manager(self) as value: | ||
| yield value | ||
|
|
||
| # gRPC status codes that indicate transient network failures worth retrying. |
There was a problem hiding this comment.
[LOW] The comment "gRPC status codes that indicate transient network failures worth retrying" restates what _TRANSIENT_GRPC_CODES already communicates. The name is self-explanatory.
Suggestion: Remove the comment, or replace with context not captured by the name (e.g., why DEADLINE_EXCEEDED or CANCELLED are excluded).
AI-generated, human reviewed
There was a problem hiding this comment.
Agreed. Will replace with context about why DEADLINE_EXCEEDED and CANCELLED are excluded.
| # Retry on transient network errors (e.g. tunnel to router dropped). | ||
| # Also retry UNKNOWN when the message matches a known transient | ||
| # tunnel teardown (e.g. "watch channel closed"). | ||
| is_transient = e.code() in self._TRANSIENT_GRPC_CODES or ( | ||
| e.code() == grpc.StatusCode.UNKNOWN | ||
| and any(msg in str(e.details()).lower() for msg in self._TRANSIENT_UNKNOWN_MESSAGES) |
There was a problem hiding this comment.
[LOW] Several comments restate what the adjacent code already says: (1) The three-line comment at lines 378-380 paraphrases the is_transient expression below it, when _TRANSIENT_GRPC_CODES and _TRANSIENT_UNKNOWN_MESSAGES already convey the intent. (2) The comment at line 420 "OSError can occur when the router endpoint is unreachable" states standard Python behavior.
Suggestion: Remove the restating portions. In streams.py, keep only the warning about HTTP/2 hang behavior (lines 46-48) as that provides valuable non-obvious context.
AI-generated, human reviewed
There was a problem hiding this comment.
Agreed. Will remove the restating comments at lines 378-380 and 420. Will keep the HTTP/2 hang warning in streams.py.
| # known to occur during tunnel reconnection. | ||
| _TRANSIENT_UNKNOWN_MESSAGES = ("watch channel closed",) | ||
|
|
||
| async def _dial_and_connect(self, stream): |
There was a problem hiding this comment.
[LOW] _dial_and_connect(self, stream) is a newly introduced method with no type annotation on stream (should be anyio.abc.SocketStream) and no return type (-> None). connect_router_stream was also modified but its existing parameters and return type remain unannotated.
Suggestion: Add annotations to _dial_and_connect and the existing parameters of connect_router_stream.
AI-generated, human reviewed
There was a problem hiding this comment.
Will add stream: anyio.abc.SocketStream and -> None to _dial_and_connect, plus channel_ready_timeout annotation on connect_router_stream (already done in the prior commit).
Fix planAddressing all 9 items from @raballew's review: Code changes (lease.py)
Test additions (lease_test.py)
Working on it now. |
- Bound channel_ready_timeout by remaining dial_timeout deadline - Change FAILED_PRECONDITION timeout from raise to return for consistency - Extract _retry_delay() to deduplicate backoff computation - Add type annotations to _dial_and_connect and handle_async - Trim restating docstrings and comments - Add tests for FAILED_PRECONDITION retry/timeout, PERMISSION_DENIED flag, and _retry_delay helper Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Review feedback addressed (commit f906fa2)All 9 items from @raballew's review have been implemented: Code changes
New tests
CI statusThe two pre-existing CI failures (e2e hooks_test.go, linkcheck kubernetes.io timeout) are unrelated to this PR's changes. CI should re-run on the new commit. |
Summary
Lease.handle_async, so that when the router tunnel drops, newjcommands retry connecting instead of failing immediately._dial_and_connect()to perform the Dial and router connection as a single atomic operation, keeping the retry logic in one unified loop (no duplicated Dial calls).channel_ready()timeout inconnect_router_streamso connections to unreachable routers fail fast with UNAVAILABLE instead of hanging indefinitely on the HTTP/2 SETTINGS frame exchange. The timeout is bounded by the remainingdial_timeoutdeadline to prevent overshooting._retry_delay()helper to centralize exponential backoff computation across all three exception handlers.handle_asyncconsistently return (never raise), since the method runs insideTemporaryUnixListener.serve's task group where an unhandled exception would crash the listener and terminate sibling connections.All retries are bounded by the existing
dial_timeout(default 30s).Fixes #638
Test plan
make lint)make pkg-test-jumpstarter,make pkg-test-jumpstarter-cli)lease_transferred_retry_delayhelperjmp shell, kill the router/network, verifyjcommands retry and recover when the network comes backGenerated with Claude Code