fix: add retry logic for tunnel reconnection in jmp shell proxy by ambient-code[bot] · Pull Request #679 · jumpstarter-dev/jumpstarter

ambient-code · 2026-05-12T17:54:02Z

Summary

Adds retry with exponential backoff for transient gRPC errors (UNAVAILABLE, RESOURCE_EXHAUSTED, ABORTED, INTERNAL) in the Dial + router connection within Lease.handle_async, so that when the router tunnel drops, new j commands retry connecting instead of failing immediately.
Extracts _dial_and_connect() to perform the Dial and router connection as a single atomic operation, keeping the retry logic in one unified loop (no duplicated Dial calls).
Adds a channel_ready() timeout in connect_router_stream so connections to unreachable routers fail fast with UNAVAILABLE instead of hanging indefinitely on the HTTP/2 SETTINGS frame exchange. The timeout is bounded by the remaining dial_timeout deadline to prevent overshooting.
Extracts _retry_delay() helper to centralize exponential backoff computation across all three exception handlers.
All error paths in handle_async consistently return (never raise), since the method runs inside TemporaryUnixListener.serve's task group where an unhandled exception would crash the listener and terminate sibling connections.

All retries are bounded by the existing dial_timeout (default 30s).

Fixes #638

Test plan

Lint passes (make lint)
Unit tests pass (make pkg-test-jumpstarter, make pkg-test-jumpstarter-cli)
Tests for FAILED_PRECONDITION retry-then-success and timeout
Tests for PERMISSION_DENIED setting lease_transferred
Tests for _retry_delay helper
Tests for exponential backoff delay values
Manual testing: start jmp shell, kill the router/network, verify j commands retry and recover when the network comes back

Generated with Claude Code

ambient-code · 2026-05-12T18:37:03Z

CI is failing due to diff coverage being at 8% (threshold is 80%). The new retry logic in handle_async (lease.py) and the channel_ready timeout in connect_router_stream (streams.py) have no test coverage.

Adding tests for:

lease.py: Transient Dial retry (UNAVAILABLE, RESOURCE_EXHAUSTED, ABORTED, INTERNAL), timeout after dial_timeout, router connection retry with transient errors, re-dial failure handling, OSError retry/timeout, and _TRANSIENT_GRPC_CODES validation.
streams.py: channel_ready() timeout raising AioRpcError(UNAVAILABLE), and successful channel ready flow.

This brings diff coverage to 100%.

ambient-code · 2026-05-13T15:32:25Z

Agreed with the review feedback. The duplicated Dial call in the router retry block is messy and needs refactoring.

Fix plan:

Refactor handle_async to use a single unified retry loop that covers both the Dial and router connection phases, eliminating the duplicated Dial code
Extract the retry/backoff logic to reduce nesting and improve readability
Update the tests to match the refactored structure

Working on it now.

ambient-code · 2026-05-13T18:33:50Z

Review Status Update

The refactoring requested by @mangelajo has been completed in commit 534f7b6. The duplicated Dial code has been eliminated by:

Extracting _dial_and_connect() as a single atomic operation for Dial + router connection
Unifying the retry loop to handle both transient errors and "not ready" states in one place

CI Status

All checks are passing except for a flaky e2e test on ARM64:

Test: F2: .py file auto-detects Python and uses driver API (hooks_test.go:302)
Issue: The test expects to find "PYTHON_HOOK: driver API works" in the output, which IS present in the logs, but the assertion is failing (likely an ANSI code or timing issue)
Impact: This appears to be unrelated to the PR changes (retry logic in lease.py and streams.py)
AMD64: The same test passed on AMD64 (the AMD64 run was cancelled, not failed)
Main branch: e2e tests are passing on main

The flaky test is in the hooks execution path, while this PR modifies the tunnel connection/retry logic. These are separate concerns.

Would you like me to:

Re-run the failed e2e-tests job to see if it passes on retry?
Investigate the flaky test in a separate issue/PR?
Proceed with merge if the code changes look good to you?

ambient-code · 2026-05-26T13:32:42Z

Review Update (2026-05-26)

Checking the PR status:

Comments & Reviews

All review comments from @mangelajo have been addressed (reacted to with 👍/👀)
The code refactoring requested in the review has been completed (commit 534f7b6)

CI Status

The only failing check is an e2e test that appears unrelated to this PR:

Test: F2: .py file auto-detects Python and uses driver API
Issue: The test expects "PYTHON_HOOK: driver API works" but the output contains "PYTHON_HOOK: driver API works" (with extra whitespace)
Analysis: This PR only modifies tunnel connection/retry logic in lease.py and streams.py. The failing test is in the hooks execution path (hooks_test.go), which is completely separate from the changes in this PR.
Evidence: The string IS present in the logs, just with different whitespace formatting, suggesting this is a flaky test assertion issue rather than a functional failure.

Both ARM64 and AMD64 e2e test runs failed with the same issue, but all other checks are passing (lint, unit tests, type checks, other e2e compatibility tests).

This appears to be a flaky test unrelated to the PR changes. Would you like me to:

Re-run the e2e tests to see if they pass on retry?
File a separate issue to track the flaky test?
Proceed if the code changes look good to you?

@mangelajo - Since you requested the refactoring changes and they've been completed, please let me know your preference for handling the flaky e2e test.

ambient-code · 2026-05-26T13:33:05Z

I've re-triggered the failed e2e tests to see if they pass on retry.

ambient-code · 2026-05-26T13:33:56Z

The e2e test re-run failed because the build artifacts are no longer available (GitHub Actions artifact limitation when re-running individual jobs).

To properly re-run the e2e tests, we would need to trigger a full workflow re-run from the beginning, which would rebuild all artifacts. However, given that:

The test failure is clearly unrelated to this PR's changes (tunnel retry logic vs hook execution formatting)
The test output shows the expected string IS present, just with different whitespace
All other CI checks are passing

I recommend one of:

Waiting for a maintainer to review and approve the PR despite the flaky test
Push an empty commit to trigger a fresh full workflow run
Merge if @mangelajo is satisfied with the refactoring changes

ambient-code · 2026-05-27T19:31:32Z

CI Update (2026-05-27)

All CI checks are now passing, including the previously flaky e2e tests:

e2e-tests (ubuntu-24.04, amd64): pass
e2e-tests (ubuntu-24.04-arm, arm64): pass
e2e-compat-old-client: pass
e2e-compat-old-controller: pass
All pytest-matrix jobs: pass (Python 3.11, 3.12, 3.13 on ubuntu and macOS)
Linters, type-check, docs: pass

The flaky F2: .py file auto-detects Python and uses driver API test that was failing previously has now passed on both architectures, confirming it was unrelated to this PR's changes.

All review feedback from @mangelajo has been addressed (code refactored in commit 534f7b6 to eliminate duplicated Dial code). This PR is ready for re-review.

raballew · 2026-05-27T20:13:24Z

@ambient-code rebase onto main

When the tunnel between the local proxy and the jumpstarter-router drops during a jmp shell session, subsequent j commands would time out with SETTINGS frame timeout errors because: 1. The Dial call to the controller could fail with transient UNAVAILABLE errors, but there was no retry logic for these (only FAILED_PRECONDITION was retried). 2. The connect_router_stream could hang indefinitely trying to establish an HTTP/2 connection to an unreachable router endpoint, with no timeout on the channel readiness check. This commit fixes both issues: - Adds retry with exponential backoff for transient gRPC errors (UNAVAILABLE, RESOURCE_EXHAUSTED, ABORTED, INTERNAL) in the Dial call within handle_async. - Adds retry with exponential backoff for the router connection establishment, including re-dialing to get fresh router tokens when retrying. - Adds a channel_ready() timeout (10s) in connect_router_stream so that connections to unreachable routers fail fast instead of hanging on the HTTP/2 SETTINGS frame exchange. Fixes #638 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add unit tests covering the new retry logic in handle_async (lease.py) and the channel_ready timeout in connect_router_stream (streams.py) to satisfy the 80% diff coverage requirement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Extract _dial_and_connect() to perform Dial + router connection as a single atomic operation. This eliminates the duplicated Dial call that was in the separate router retry block, addressing the code review feedback about entangled and repeated code. The single retry loop in handle_async now retries the full _dial_and_connect() on transient errors, which naturally handles both Dial failures and router connection failures with the same backoff logic and always gets a fresh router token on retry. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ambient-code · 2026-05-27T20:33:09Z

Done — rebased onto main (3 commits, no conflicts). CI should kick off shortly on the new head 4b8e7b43.

ambient-code · 2026-05-27T22:32:03Z

CI Status Update (post-rebase)

After rebasing onto main, CI is mostly passing except for one test failure on macOS:

Failed Test: test_infrastructure_messages_at_debug_not_info in hooks_test.py

Analysis:

This PR does NOT modify hooks_test.py or any files in the hooks execution path
The PR only changes: lease.py, lease_test.py, streams.py, streams_test.py
The failing test is checking log message verbosity levels, which is unrelated to tunnel reconnection retry logic
This appears to be a flaky test or an issue introduced in main after the rebase

Passing Checks:

All e2e tests (amd64 and arm64)
e2e compatibility tests
Python 3.13 tests on both Ubuntu and macOS
Linters, type-check, builds

The test failure is unrelated to the PR changes. Would you like me to:

Push an empty commit to re-trigger CI and see if it passes on retry?
Investigate the flaky test separately?
Proceed with merge if the code changes look good?

ambient-code · 2026-05-28T06:33:02Z

CI Update (2026-05-28)

The post-rebase CI run had one failure in test_infrastructure_messages_at_debug_not_info (hooks_test.py, macOS, Python 3.12). This test is not related to this PR -- the PR only modifies lease.py, lease_test.py, streams.py, and streams_test.py. The hooks_test.py file is identical on both main and this branch (same SHA 1b13862).

Another PR (fix/356-standalone-auth-default) ran the same test suite successfully ~2 hours after our failure, confirming this is an intermittent/flaky test.

Pushed an empty commit (cff7bf05) to retrigger CI. All other checks (e2e tests on both architectures, compatibility tests, linters, type-check, Python 3.13) are passing.

bennyz · 2026-05-28T07:02:31Z

+        grpc.StatusCode.UNAVAILABLE,
+        grpc.StatusCode.RESOURCE_EXHAUSTED,
+        grpc.StatusCode.ABORTED,
+        grpc.StatusCode.INTERNAL,


what about StatusCode.UNKNOWN?

Good question! UNKNOWN is a bit of a gray area for retries. In gRPC, UNKNOWN is returned when:

The server raises an exception without an explicit status code (often a bug/unhandled exception)

An RPC receives a status it doesn't understand

The server crashes mid-response

Unlike UNAVAILABLE or RESOURCE_EXHAUSTED which are clearly transient network/load conditions, UNKNOWN can signal a persistent server-side bug that retrying won't fix. Including it risks masking real errors by silently retrying indefinitely until the timeout expires.

That said, there are scenarios where a transient server crash could surface as UNKNOWN, and in those cases a retry would help. I'd lean toward adding it with a comment noting the trade-off, since the retry is bounded by dial_timeout anyway. If you feel it's worth including, I'll add it.

What's your preference?

Specifically with "watch channel closed" error

Good point -- the "watch channel closed" error surfacing as UNKNOWN is clearly a transient tunnel teardown, not a permanent server bug. I'll add StatusCode.UNKNOWN to _TRANSIENT_GRPC_CODES with a comment noting this specific scenario.

no, but we can't do this for any UNKNOWN status

Agreed -- blindly retrying every UNKNOWN error is too broad and could mask real bugs. I'll remove UNKNOWN from the blanket _TRANSIENT_GRPC_CODES set and instead add a targeted check that only retries UNKNOWN when the error details contain the specific "watch channel closed" message. That way we handle the tunnel teardown case you flagged without swallowing unrelated errors.

Include grpc.StatusCode.UNKNOWN in _TRANSIENT_GRPC_CODES since tunnel teardowns (e.g. "watch channel closed") surface as UNKNOWN rather than UNAVAILABLE. The retry is still bounded by dial_timeout so this won't mask persistent server bugs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ambient-code · 2026-05-28T09:33:01Z

Fix plan

Addressing @bennyz's feedback and CI failures:

Remove UNKNOWN from _TRANSIENT_GRPC_CODES -- Instead, add a targeted check that only retries UNKNOWN when the error details contain "watch channel closed" (the specific tunnel teardown scenario). This avoids masking unrelated server-side errors.
Fix lint-python failure -- lease_test.py:636:121 has a line exceeding 120 chars. Will wrap it.

Remove StatusCode.UNKNOWN from the blanket _TRANSIENT_GRPC_CODES set to avoid masking unrelated server-side errors. Instead, add a targeted check via _TRANSIENT_UNKNOWN_MESSAGES that only retries UNKNOWN when the error details contain known transient messages like "watch channel closed". Also fixes lint E501 line-too-long in tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…rage - Add inline comments explaining why transient-error and OSError paths return silently instead of raising (handle_async runs inside TemporaryUnixListener.serve's task group) - Add test_exponential_backoff_delay_values to verify sleep delays follow the expected pattern (0.3, 0.6, 1.2, 2.4, 4.8, 5.0) - Add type annotation for channel_ready_timeout parameter in connect_router_stream Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ambient-code · 2026-05-28T11:34:19Z

Addressing @raballew's review feedback

Pushed commit 661fa73 with the following changes:

Silent return comments (re: comment): Added inline code comments at both the transient-error return (line 393) and the OSError return (line 435) explaining why these paths return silently instead of raising -- handle_async runs inside TemporaryUnixListener.serve's task group, so an unhandled exception would crash the listener.
Exponential backoff delay test (re: comment): Added test_exponential_backoff_delay_values that triggers 6 transient failures, captures the sleep call arguments via mock_sleep.call_args_list, and asserts the delay values follow the expected exponential pattern: [0.3, 0.6, 1.2, 2.4, 4.8, 5.0] (capped at max_delay=5.0).
Type annotation (re: comment): Added channel_ready_timeout: float = 10 type annotation to connect_router_stream in streams.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ambient-code · 2026-05-28T12:32:39Z

Lint fix pushed (commit `6c9f1db`)

The lint-python CI job was failing due to ruff rule B905: zip() without an explicit strict= parameter at lease_test.py:906.

Fixed by adding strict=True to the zip(actual_delays, expected_delays) call in test_exponential_backoff_delay_values. CI should re-run shortly.

ambient-code · 2026-05-29T05:33:05Z

CI Status Update (2026-05-29)

After the lint fix (commit 6c9f1db), the latest CI run completed with two unrelated failures:

1. e2e-tests (ARM64) -- A2: afterLease hook executes (hooks_test.go:123)

This PR only modifies lease.py, lease_test.py, streams.py, and streams_test.py. The failing test is in the hooks execution path, which is completely separate.
The AMD64 e2e run was cancelled (not failed), and all e2e-compat tests passed.
Main branch e2e tests were green as of May 28.

2. linkcheck -- Timeout on https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector

External URL timeout, not related to any code changes.

Passing checks: lint-python, type-check, all pytest-matrix jobs (Python 3.11/3.12/3.13 on Ubuntu and macOS), e2e-compat-old-client, e2e-compat-old-controller, build, all image builds.

Review status: All feedback from @mangelajo, @bennyz, and @raballew has been addressed with code changes and reactions. The PR is ready for re-review.

raballew · 2026-05-28T14:59:56Z

+        response = await self.controller.Dial(jumpstarter_pb2.DialRequest(lease_name=self.name))
+        async with connect_router_stream(
+            response.router_endpoint, response.router_token, stream, self.tls_config, self.grpc_options
+        ):
+            pass


[HIGH] channel_ready_timeout is not bounded by the remaining dial_timeout deadline. _dial_and_connect calls connect_router_stream without passing channel_ready_timeout, so it defaults to 10s. The retry loop caps backoff sleep by remaining time, but channel.channel_ready() inside connect_router_stream is independent. When only 1-2s remain on dial_timeout, a single _dial_and_connect call can block for up to 10s, causing total wall-clock time to overshoot dial_timeout by up to 10s.

Suggestion: Pass remaining as an upper bound: await self._dial_and_connect(stream, channel_ready_timeout=max(min(10, remaining), 0.5)).

AI-generated, human reviewed

Good catch. Will pass min(10, remaining) (with a 0.5s floor) as channel_ready_timeout to _dial_and_connect so the channel-ready wait is bounded by the overall deadline.

raballew · 2026-05-28T14:59:56Z

+                        # Return instead of raising: handle_async runs inside
+                        # TemporaryUnixListener.serve's task group, so an
+                        # unhandled exception would crash the listener.
+                        return


[MEDIUM] Inconsistent raise vs return on timeout exhaustion. When remaining <= 0: the FAILED_PRECONDITION path raises (line 367), but transient gRPC errors return silently here (line 396), and OSError returns silently (line 436). The FAILED_PRECONDITION raise could crash the listener and terminate all sibling connections, while the other paths just exit quietly.

Suggestion: Either change the pre-existing raise to return with a warning log for consistency, or document the intentional distinction near the FAILED_PRECONDITION branch.

AI-generated, human reviewed

Agreed. Will change the FAILED_PRECONDITION timeout path to return with a warning log for consistency -- since handle_async runs inside TemporaryUnixListener.serve's task group, raising would crash the listener and all sibling connections.

raballew · 2026-05-28T14:59:56Z

+                        "Connection failed with %s, retrying in %.1fs (attempt %d, %.1fs remaining): %s",
+                        e.code().name,
                        delay,
                        attempt + 1,
                        remaining,
+                        e.details(),


[MEDIUM] Three exception handlers repeat the same 4-line backoff pattern: delay = min(base_delay * (2**attempt), max_delay, remaining), log, await sleep(delay), attempt += 1, continue. The handle_async method is around 95 lines with deep nesting. Changing the backoff strategy (e.g., adding jitter) would require coordinated edits in three places.

Suggestion: Extract a helper like _compute_retry_delay(attempt, remaining) -> float to deduplicate the delay calculation while keeping the distinct log messages and timeout behaviors.

AI-generated, human reviewed

Agreed. Will extract _retry_delay(attempt, remaining) to centralize the backoff computation. The distinct log messages and timeout behaviors will stay in the caller.

raballew · 2026-05-28T14:59:56Z

@@ -1,10 +1,13 @@
 import asyncio


[MEDIUM] No test covers FAILED_PRECONDITION retry or timeout behavior. The refactoring changed this path materially: before, it retried only Dial(); after, it retries _dial_and_connect() (Dial + router connection). No test exercises FAILED_PRECONDITION "not ready" retry-then-success, and no test verifies that FAILED_PRECONDITION timeout raises.

Suggestion: Add (1) a test that raises FAILED_PRECONDITION "not ready" on first attempt, succeeds on second, and verifies retry happens; and (2) a test with dial_timeout=0.0 that asserts pytest.raises(AioRpcError) on FAILED_PRECONDITION.

AI-generated, human reviewed

Agreed. Will add both: (1) FAILED_PRECONDITION "not ready" retry-then-success, and (2) FAILED_PRECONDITION with dial_timeout=0.0 that returns gracefully (matching the new consistent behavior).

raballew · 2026-05-28T14:59:56Z

@@ -356,10 +416,24 @@ async def handle_async(self, stream):
                else:
                    logger.warning("Connection to exporter lost: %s", e.details())


[MEDIUM] No test verifies that PERMISSION_DENIED sets lease_transferred = True. When handle_async receives an AioRpcError with "permission denied" in the details, it sets self.lease_transferred = True. The surrounding control flow was restructured (errors now flow through _dial_and_connect and the new is_transient gate), making regression possible.

Suggestion: Add a test that raises AioRpcError(PERMISSION_DENIED, "permission denied") from _dial_and_connect and asserts lease.lease_transferred is True.

AI-generated, human reviewed

Good point. Will add a test that raises AioRpcError(PERMISSION_DENIED, "permission denied") from _dial_and_connect and asserts lease.lease_transferred is True.

raballew · 2026-05-28T14:59:56Z

+        """Dial the controller and connect to the router stream.
+
+        Performs a single Dial + router connection attempt. Raises on failure
+        so the caller can decide whether to retry.
+        """


[LOW] The docstring's first line "Dial the controller and connect to the router stream" restates the method name _dial_and_connect. The second sentence ("Raises on failure so the caller can decide whether to retry") provides useful contract info though.

Suggestion: Trim to something like: """Single attempt; raises on failure for caller-driven retry."""

AI-generated, human reviewed

Fair. Will trim to: """Single attempt; raises on failure for caller-driven retry."""

raballew · 2026-05-28T14:59:56Z

        with self.portal.wrap_async_context_manager(self) as value:
            yield value

+    # gRPC status codes that indicate transient network failures worth retrying.


[LOW] The comment "gRPC status codes that indicate transient network failures worth retrying" restates what _TRANSIENT_GRPC_CODES already communicates. The name is self-explanatory.

Suggestion: Remove the comment, or replace with context not captured by the name (e.g., why DEADLINE_EXCEEDED or CANCELLED are excluded).

AI-generated, human reviewed

Agreed. Will replace with context about why DEADLINE_EXCEEDED and CANCELLED are excluded.

raballew · 2026-05-28T14:59:56Z

+                # Retry on transient network errors (e.g. tunnel to router dropped).
+                # Also retry UNKNOWN when the message matches a known transient
+                # tunnel teardown (e.g. "watch channel closed").
+                is_transient = e.code() in self._TRANSIENT_GRPC_CODES or (
+                    e.code() == grpc.StatusCode.UNKNOWN
+                    and any(msg in str(e.details()).lower() for msg in self._TRANSIENT_UNKNOWN_MESSAGES)


[LOW] Several comments restate what the adjacent code already says: (1) The three-line comment at lines 378-380 paraphrases the is_transient expression below it, when _TRANSIENT_GRPC_CODES and _TRANSIENT_UNKNOWN_MESSAGES already convey the intent. (2) The comment at line 420 "OSError can occur when the router endpoint is unreachable" states standard Python behavior.

Suggestion: Remove the restating portions. In streams.py, keep only the warning about HTTP/2 hang behavior (lines 46-48) as that provides valuable non-obvious context.

AI-generated, human reviewed

Agreed. Will remove the restating comments at lines 378-380 and 420. Will keep the HTTP/2 hang warning in streams.py.

raballew · 2026-05-28T14:59:56Z

+    # known to occur during tunnel reconnection.
+    _TRANSIENT_UNKNOWN_MESSAGES = ("watch channel closed",)
+
+    async def _dial_and_connect(self, stream):


[LOW] _dial_and_connect(self, stream) is a newly introduced method with no type annotation on stream (should be anyio.abc.SocketStream) and no return type (-> None). connect_router_stream was also modified but its existing parameters and return type remain unannotated.

Suggestion: Add annotations to _dial_and_connect and the existing parameters of connect_router_stream.

AI-generated, human reviewed

Will add stream: anyio.abc.SocketStream and -> None to _dial_and_connect, plus channel_ready_timeout annotation on connect_router_stream (already done in the prior commit).

ambient-code · 2026-05-29T08:34:22Z

Fix plan

Addressing all 9 items from @raballew's review:

Code changes (lease.py)

Bound channel_ready_timeout by remaining deadline -- Pass max(min(10, remaining), 0.5) to _dial_and_connect so channel_ready() cannot overshoot dial_timeout
Consistent return-on-timeout -- Change FAILED_PRECONDITION timeout from raise to return with warning log, matching transient/OSError paths (all run in the task group)
Extract _retry_delay helper -- Deduplicate the 4-line backoff pattern across the three exception handlers
Trim docstring on _dial_and_connect -- """Single attempt; raises on failure for caller-driven retry."""
Replace restating comments -- Remove comment above _TRANSIENT_GRPC_CODES (replace with why DEADLINE_EXCEEDED/CANCELLED excluded), remove is_transient paraphrase, remove OSError restatement
Add type annotations -- _dial_and_connect(self, stream: anyio.abc.SocketStream, ...) -> None

Test additions (lease_test.py)

FAILED_PRECONDITION retry-then-success -- First attempt raises "not ready", second succeeds
FAILED_PRECONDITION timeout returns -- dial_timeout=0.0, verify returns (not raises)
PERMISSION_DENIED sets lease_transferred -- Verify the flag is set to True

Working on it now.

- Bound channel_ready_timeout by remaining dial_timeout deadline - Change FAILED_PRECONDITION timeout from raise to return for consistency - Extract _retry_delay() to deduplicate backoff computation - Add type annotations to _dial_and_connect and handle_async - Trim restating docstrings and comments - Add tests for FAILED_PRECONDITION retry/timeout, PERMISSION_DENIED flag, and _retry_delay helper Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ambient-code · 2026-05-29T08:36:45Z

Review feedback addressed (commit `f906fa2`)

All 9 items from @raballew's review have been implemented:

Code changes

channel_ready_timeout bounded by deadline -- handle_async now computes max(min(10.0, remaining), 0.5) and passes it to _dial_and_connect, which forwards it to connect_router_stream. This prevents the channel-ready wait from overshooting dial_timeout.
Consistent return-on-timeout -- FAILED_PRECONDITION timeout now returns with a warning log instead of raising, matching the transient and OSError paths. Added a top-level comment explaining why all error paths return.
Extracted _retry_delay helper -- Static method _retry_delay(attempt, remaining, base=0.3, cap=5.0) centralizes the backoff computation. Changing the strategy (e.g. adding jitter) now requires a single edit.
Trimmed docstring -- _dial_and_connect docstring simplified to """Single attempt; raises on failure for caller-driven retry."""
Replaced restating comments -- _TRANSIENT_GRPC_CODES comment now explains why DEADLINE_EXCEEDED/CANCELLED are excluded. Removed is_transient paraphrase and OSError restatement.
Added type annotations -- _dial_and_connect(self, stream: SocketStream, channel_ready_timeout: float = 10.0) -> None and handle_async(self, stream: SocketStream) -> None

New tests

test_failed_precondition_not_ready_retries_then_succeeds -- First attempt raises "not ready", second succeeds
test_failed_precondition_returns_after_timeout -- dial_timeout=0.0, verifies it returns (not raises)
test_permission_denied_sets_lease_transferred -- Verifies lease.lease_transferred is True
TestRetryDelay -- Tests for the extracted helper (basic exponential, cap, remaining bound)

CI status

The two pre-existing CI failures (e2e hooks_test.go, linkcheck kubernetes.io timeout) are unrelated to this PR's changes. CI should re-run on the new commit.

ambient-code Bot mentioned this pull request May 12, 2026

jmp shell Unix socket becomes unreachable when router tunnel drops; tunnel should auto-reconnect #638

Open

bennyz self-requested a review May 13, 2026 08:55

mangelajo reviewed May 13, 2026

View reviewed changes

Comment thread python/packages/jumpstarter/jumpstarter/client/lease.py Outdated

mangelajo requested changes May 13, 2026

View reviewed changes

Ambient Code Bot and others added 3 commits May 27, 2026 20:32

ambient-code Bot force-pushed the fix/tunnel-reconnect-638 branch from 534f7b6 to 4b8e7b4 Compare May 27, 2026 20:33

ci: retrigger CI for flaky hooks_test on macOS

cff7bf0

bennyz reviewed May 28, 2026

View reviewed changes

raballew requested changes May 28, 2026

View reviewed changes

Comment thread python/packages/jumpstarter/jumpstarter/client/lease.py

Comment thread python/packages/jumpstarter/jumpstarter/client/lease_test.py

Comment thread python/packages/jumpstarter/jumpstarter/common/streams.py Outdated

fix: add strict=True to zip() call to satisfy ruff B905 lint rule

6c9f1db

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

raballew requested changes May 29, 2026

View reviewed changes

		@@ -356,10 +416,24 @@ async def handle_async(self, stream):
		else:
		logger.warning("Connection to exporter lost: %s", e.details())

Conversation

ambient-code Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

ambient-code Bot commented May 12, 2026

Uh oh!

Uh oh!

ambient-code Bot commented May 13, 2026

Uh oh!

ambient-code Bot commented May 13, 2026

Review Status Update

CI Status

Uh oh!

ambient-code Bot commented May 26, 2026

Review Update (2026-05-26)

Comments & Reviews

CI Status

Uh oh!

ambient-code Bot commented May 26, 2026

Uh oh!

ambient-code Bot commented May 26, 2026

Uh oh!

ambient-code Bot commented May 27, 2026

CI Update (2026-05-27)

Uh oh!

raballew commented May 27, 2026

Uh oh!

ambient-code Bot commented May 27, 2026

Uh oh!

ambient-code Bot commented May 27, 2026

CI Status Update (post-rebase)

Uh oh!

ambient-code Bot commented May 28, 2026

CI Update (2026-05-28)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot commented May 28, 2026

Fix plan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ambient-code Bot commented May 28, 2026

Addressing @raballew's review feedback

Uh oh!

ambient-code Bot commented May 28, 2026

Lint fix pushed (commit 6c9f1db)

Uh oh!

ambient-code Bot commented May 29, 2026

CI Status Update (2026-05-29)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ambient-code Bot May 29, 2026

ambient-code Bot commented May 12, 2026 •

edited

Loading

Lint fix pushed (commit `6c9f1db`)

Review feedback addressed (commit `f906fa2`)