Skip to content

feat(proxy): shared registry, deterministic ports, and new management commands#261

Merged
stevenolen merged 8 commits into
mainfrom
proxy-registry
Apr 21, 2026
Merged

feat(proxy): shared registry, deterministic ports, and new management commands#261
stevenolen merged 8 commits into
mainfrom
proxy-registry

Conversation

@stevenolen
Copy link
Copy Markdown
Contributor

What

Replaces the single proxy.json with a shared proxies.json registry so multiple proxy sessions can run concurrently — unblocking parallel workon/ensure invocations and multi-session Claude usage.

Changes

Registry (lib/proxy/registry.go — new)

  • proxies.json keyed by target name, serialised under syscall.Flock (exclusive write, shared read) to avoid concurrent-write corruption
  • ListRunningProxies, PruneRegistry, StopAll public API
  • RunningProxy.Store() / DeleteFile() updated to upsert/remove from the registry transparently — no changes needed in aws/ or azure/ proxy code

Deterministic ports (lib/proxy/proxy.go)

  • WorkloadPort(name string) int — FNV-32a hash of the workload name, range 10000–10999
  • ensure, workon, and k9s use WorkloadPort(t.Name()) automatically
  • ptd proxy <target> defaults to 1080 for interactive/browser use; --daemon uses the deterministic port so it is reused by ensure/workon

New CLI surface (cmd/proxy.go)

  • ptd proxy --list — tabular output with live running/stopped status
  • ptd proxy --prune — remove stale registry entries (dead PIDs)
  • ptd proxy --stop (no target) — stop all running proxies
  • ptd proxy port <target> — print the deterministic port for a workload

Testing

  • New lib/proxy/registry_test.go: store/list round-trip, delete, prune dead PIDs, empty registry, upsert overwrite
  • Updated lib/proxy/proxy_test.go: WorkloadPort stability and range, DeleteFile registry semantics
  • Updated cmd/proxy_test.go: new flags and port subcommand registration
  • All pre-commit hooks pass (format, lint, test-cmd, test-lib)

@claude
Copy link
Copy Markdown

claude Bot commented Apr 17, 2026

Claude finished @stevenolen's task —— View job


PR Review: feat(proxy): shared registry, deterministic ports, and new management commands

  • Read review guidelines
  • Analyze PR diff and changed files
  • Post review feedback

Overall this is well-structured — the registry file-locking design is sound, the API surface is clean, and tests cover the key paths. A few items to consider:


Important

1. Truncate-then-write is not crash-safelib/proxy/registry.go:83-98

saveRegistryToHandle truncates the file, then encodes. If Encode fails partway (or the process is killed between truncate and write-complete), the file is left empty/corrupt. The next reader hits the slog.Warn("Registry file could not be parsed, starting fresh") path, silently dropping all entries. Consider encoding to a bytes.Buffer first, then truncating+writing only on success:

var buf bytes.Buffer
enc := json.NewEncoder(&buf)
enc.SetIndent("", "  ")
if err := enc.Encode(m); err != nil {
    return fmt.Errorf("encoding registry: %w", err)
}
if err := f.Truncate(0); err != nil { ... }
if _, err := f.Seek(0, io.SeekStart); err != nil { ... }
if _, err := f.Write(buf.Bytes()); err != nil { ... }

2. Port collision detection removed from Preflightlib/proxy/proxy.go:184-192

The old Preflight checked whether the existing proxy was for the same target before reusing it, and errored if a different target occupied the port. The new version looks up by target name only. If two workloads hash to the same deterministic port (collision), Preflight would not find the first target's entry (wrong name), see the port is open, and return an opaque "port X is unavailable" error. Consider scanning the registry for port conflicts to give a more helpful message.

3. syscall.Flock is not portablelib/proxy/registry.go:21

syscall.Flock is Linux/macOS only. If PTD is Linux/macOS-only, this is fine, but a //go:build !windows directive or comment noting the constraint would help future contributors.

4. --stop is PersistentFlags but new flags are localcmd/proxy.go:20-21

--stop and --daemon are PersistentFlags (inherited by subcommands like port), so ptd proxy port foo --stop is valid but nonsensical. Consider whether these should be local flags now that subcommands exist.

5. Daemon mode silently discards stopProxycmd/proxy.go:116-121

In daemon mode, stopSignal() is called and the function returns, but stopProxy is never called. This means the daemon process is intentionally orphaned. Worth a brief comment explaining that daemon proxies are only cleaned up via ptd proxy <target> --stop — the current code could look like a bug to a future reader.


Suggestions

6. Use filepath.Join in RegistryFilePathcmd/internal/helpers.go:30

String concatenation (DataDir() + "/proxies.json") is used, but the rest of the codebase consistently uses filepath.Join. Minor but worth keeping consistent:

func RegistryFilePath() string {
    return filepath.Join(DataDir(), "proxies.json")
}

7. GetRunningProxy does redundant worklib/proxy/proxy.go:46-61

It calls ListRunningProxies (which deserializes the entire registry and builds a slice) then linear-scans for the target name. Since the registry is a map[string]*RunningProxy, a direct withRegistryReadLock + map lookup would be simpler and avoid the extra allocation.

8. Deterministic port rangelib/proxy/proxy.go:19

1000-port range (10000–10999) gives ~40% collision probability at 30 workloads (birthday paradox). This is probably fine for current usage, but if the range can be widened cheaply (e.g., 10000–19999), it'd provide more headroom. At minimum, this ties back to point #2 — collisions should produce a clear error.


Looks Good

  • Registry locking (exclusive for writes, shared for reads) is the right pattern
  • Corrupt-file recovery in loadRegistryFromHandle is a nice touch
  • Test coverage for registry operations (store, delete, prune, upsert, empty) is thorough
  • WorkloadPort stability and range tests are good
  • The Preflight simplification (target-name-keyed lookup instead of single-file) is a clean improvement

@stevenolen
Copy link
Copy Markdown
Contributor Author

@claude re-review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 17, 2026

Claude finished @stevenolen's task —— View job


Re-review: feat(proxy): shared registry, deterministic ports, and new management commands

  • Read review guidelines
  • Analyze current PR diff and changed files
  • Check which previous review items were addressed
  • Post updated review feedback

Previous review items — status

# Item Status
1 Truncate-then-write crash safety FixedsaveRegistryToHandle now encodes to a bytes.Buffer first (registry.go:88-104)
2 Port collision detection in Preflight FixedPreflight now scans all registry entries for port conflicts and gives a clear "port %s is already in use by workload %q (hash collision)" message (proxy.go:201-212)
3 syscall.Flock portability comment Fixed — comment added at registry.go:3
4 PersistentFlags → local flags Fixeddaemon, stop now use proxyCmd.Flags() (cmd/proxy.go:20-21)
5 Daemon mode comment about stopProxy Fixed — comment added at cmd/proxy.go:117-118, stopSignal() is explicitly called
6 filepath.Join in RegistryFilePath Fixedcmd/internal/helpers.go:30
7 GetRunningProxy redundant work Fixed — now does a direct registry read + map lookup (proxy.go:46-68)
8 Port range widened to 10000–19999 FixedWorkloadPort uses %10000 with base 10000 (proxy.go:19), tests verify [10000, 19999]

All 8 items from the first review have been addressed.


Current state — looks good

The implementation is clean. A few minor observations (none blocking):

  1. StopAll doesn't collect errorsregistry.go:158-161 logs individual failures but the function always returns nil after the loop. If callers ever need to know whether all stops succeeded, this would need adjustment. Fine for now since the only caller (cmd/proxy.go:57) also logs.

  2. Preflight swallows ListRunningProxies errorproxy.go:202 uses allProxies, _ := ListRunningProxies(file). If the registry is unreadable, collision detection silently skips. This is consistent with the existing defensive style (the earlier GetRunningProxy error is also swallowed), but worth noting.

  3. --list output for long target names — the %-30s format in cmd/proxy.go:170 will misalign columns if a target name exceeds 30 chars. A tabwriter would handle this better, but this is cosmetic.

LGTM — nice work addressing all the feedback. The registry locking, buffer-then-write pattern, collision detection, and test coverage are all solid.


@stevenolen stevenolen marked this pull request as ready for review April 17, 2026 18:34
@stevenolen stevenolen requested a review from a team as a code owner April 17, 2026 18:34
@stevenolen
Copy link
Copy Markdown
Contributor Author

Follow-up: kubeconfig construction is duplicated across several files

While fixing the hardcoded port 1080 references, I noticed kubeconfig generation is implemented independently in multiple places:

  • lib/kube/setup.go — for workon / k9s (interactive use)
  • lib/steps/clusters_aws.go — for the clusters step (Pulumi)
  • lib/steps/clusters_azure.go — same, Azure
  • lib/steps/sites.go — for the sites step (Pulumi)

Each step builds its own kubeconfig inline because they run as isolated Pulumi invocations and each needs cluster access independently. This is why the hardcoded-1080 bug appeared in five places instead of one — every kubeconfig-building site had to be patched separately.

Worth consolidating into a single lib/kube function that accepts the target, credentials, and proxy port, and have all callers (workon, k9s, and every step) use it. lib/kube/setup.go is already the right shape for this.

@stevenolen
Copy link
Copy Markdown
Contributor Author

Heads up: one-time Pulumi state migration on first apply after merge

Steps that write proxy-url into kubeconfig state (clusters, sites) were previously storing socks5://localhost:1080. They now use the deterministic workload port, so the first ptd ensure after this merges will show a diff like:

~ proxy-url: "socks5://localhost:1080" => "socks5://localhost:13948"

for every affected cluster/site resource. This is expected — apply it and the state stabilises permanently. WorkloadPort is deterministic so the value will always match what PTD generates on every subsequent run.

This only matters in the Pulumi execution context (i.e. ptd ensure), so the specific port number never surfaces outside of automation.

@stevenolen
Copy link
Copy Markdown
Contributor Author

@claude review again

@claude
Copy link
Copy Markdown

claude Bot commented Apr 17, 2026

Claude finished @stevenolen's task —— View job


Re-review: feat(proxy): shared registry, deterministic ports, and new management commands

  • Read review guidelines
  • Analyze current PR diff (all 5 commits)
  • Check which previous review items were addressed
  • Verify hardcoded port 1080 replacement across codebase
  • Post updated review feedback

Previous review items — all 8 addressed ✓

All items from reviews #1 and #2 are addressed. The two new commits (5a6d351, 8331cf0) correctly thread localPort through SetupKubeConfig and replace all hardcoded 1080 references in Go step files.


New findings

1. Python Pulumi code still has hardcoded socks5://localhost:1080 (important)

The Go side is fully migrated, but three Python files still use the old hardcoded port:

  • python-pulumi/src/ptd/aws_workload.py:802"socks5://localhost:1080"
  • python-pulumi/src/ptd/azure_workload.py:253"socks5://localhost:1080"
  • python-pulumi/src/ptd/pulumi_resources/aws_eks_cluster.py:2712"socks5://localhost:1080"

These are used during helm and eks Pulumi steps. Since Go sets ALL_PROXY in the environment when launching Python steps, and the Python kubeconfig proxy-url also needs to match, these should use the deterministic port too. The Go→Python integration passes env vars (per lib/pulumi/python.go), so the Python code could read ALL_PROXY from the environment or accept the port via a new config/env var.

If these Python files are only called during steps where ALL_PROXY is already set by the Go caller, then the kubeconfig proxy-url and ALL_PROXY will disagree (1080 vs the deterministic port). This could cause issues depending on which mechanism the k8s client actually uses to route traffic.

Worth fixing in this PR or as a tracked follow-up — this is the same class of bug the last two commits addressed on the Go side.

2. lib/aws/proxy.go:31 and lib/azure/proxy.go:32 — dead default fallback

Both NewProxySession functions still default to "1080" when localPort == "". After this PR, all callers always pass an explicit port, so this fallback is dead code. Not harmful, but it could mislead future readers into thinking 1080 is still the expected port. Consider removing the fallback or updating the default to match the new behavior.


Looks good

  • SetupKubeConfig now correctly accepts and threads localPort through both AWS and Azure paths
  • All 7 hardcoded 1080 references in Go step files (clusters.go, clusters_aws.go, clusters_azure.go, postgres_config.go, sites.go x2, setup.go x2) are replaced with proxy.WorkloadPort(s.DstTarget.Name())
  • ensure.go, workon.go, and k9s.go all compute the port via proxy.WorkloadPort and pass it consistently
  • The defer stopSignal() move in cmd/proxy.go:113 is correct — ensures cleanup happens regardless of error path

LGTM on the Go side. The Python hardcoded ports (#1 above) are the remaining gap.


@stevenolen
Copy link
Copy Markdown
Contributor Author

@claude review again

@claude
Copy link
Copy Markdown

claude Bot commented Apr 17, 2026

Claude finished @stevenolen's task —— View job


Re-review: feat(proxy): shared registry, deterministic ports, and new management commands

  • Read review guidelines
  • Analyze current PR diff (all 6 commits)
  • Verify Python hardcoded port 1080 replacement
  • Verify dead Go fallback removal
  • Check ALL_PROXY propagation to all Python steps
  • Post updated review feedback

Previous review items — all addressed

All items from reviews #1#3 are addressed:

  • Python files now read os.environ.get("ALL_PROXY", "socks5://localhost:1080") instead of hardcoding
  • Dead Go fallbacks in lib/aws/proxy.go and lib/azure/proxy.go removed
  • import os already present in all three Python files

New finding

helm step does not pass ALL_PROXY to Python subprocess (important)

The clusters and postgres_config steps correctly set ALL_PROXY in their envVars map:

// clusters.go:74
envVars["ALL_PROXY"] = fmt.Sprintf("socks5://localhost:%d", proxy.WorkloadPort(s.DstTarget.Name()))

But helm.go:41 calls prepareEnvVarsForPulumi() (steps.go:178-184) which only copies credential env vars — it does not set ALL_PROXY. The Python helm code calls cluster_kubeconfig() which reads os.environ.get("ALL_PROXY", ...), but since ALL_PROXY is never passed, it falls through to the "socks5://localhost:1080" default.

Since ptd ensure now runs the proxy on the deterministic port (e.g., 13948), the helm step's kubeconfig will point at :1080 while the proxy is on :13948. This will break helm deployments.

Fix: add ALL_PROXY to helm.go the same way clusters.go and postgres_config.go do:

// helm.go, after line 44
if !s.DstTarget.TailscaleEnabled() {
    envVars["ALL_PROXY"] = fmt.Sprintf("socks5://localhost:%d", proxy.WorkloadPort(s.DstTarget.Name()))
}

Alternatively, set it in prepareEnvVarsForPulumi so all Python steps get it automatically, but that would require passing the target to check TailscaleEnabled(), which is already available.


Minor observations (non-blocking)

  1. Docs still reference localhost:1080docs/cli/PTD_CLI_REFERENCE.md and docs/cli/ensure-flow.md have several references to the old fixed port. Not code-breaking, but will confuse readers post-merge.

  2. CLAUDE.md proxy docs say "Interactive use — binds to port 1080" — this is correct for the default interactive mode, just noting it's consistent.


Everything else looks good

  • Python os.environ.get("ALL_PROXY", ...) pattern is clean and backward-compatible
  • Dead Go fallback removal is correct — all callers now pass explicit ports
  • clusters.go and postgres_config.go correctly propagate ALL_PROXY
  • sites.go builds kubeconfig inline in Go (no Python), uses proxy.WorkloadPort directly — correct
  • clusters_aws.go and clusters_azure.go build kubeconfig inline in Go — correct

The helm step gap is the only remaining issue.


@stevenolen
Copy link
Copy Markdown
Contributor Author

gap is the only remaining issue.

not the first time that's been said on this pr 😆

…ment commands

Replace single proxy.json with a shared proxies.json registry (file-locked via
syscall.Flock) so multiple proxies can run concurrently. Assign each workload a
deterministic SOCKS5 port derived from FNV-32a (range 10000–10999); ensure,
workon, and k9s use this port automatically, while `ptd proxy` defaults to 1080
for interactive use and the deterministic port for --daemon mode.

New: ptd proxy --list (with live status), --prune, --stop (no target stops all),
and ptd proxy port <target> to print the deterministic port.
- saveRegistryToHandle: encode to buffer before truncating (crash-safe write)
- Preflight: detect hash collisions by scanning registry for same-port conflicts
- registry.go: add platform comment (syscall.Flock is Linux/macOS only)
- --daemon/--stop: demote from PersistentFlags to Flags (don't bleed to subcommands)
- daemon branch: comment explaining intentional process orphan
- RegistryFilePath: use filepath.Join for consistency
- GetRunningProxy: direct map lookup instead of ListRunningProxies + linear scan
- WorkloadPort: widen range to 10000–19999 (10000 slots vs 1000)
…vives ptd exit

exec.CommandContext kills its child when the context is cancelled. In daemon
mode ptd exits after starting the proxy, which cancelled the context and
immediately killed the subprocess. Use context.Background() for daemon starts
so the proxy process is decoupled from ptd's lifetime.
…ses the correct SOCKS5 port

The kubeconfig proxy-url was hardcoded to socks5://localhost:1080. With
deterministic per-workload ports, SetupKubeConfig now receives the actual
local port and injects it correctly.
ensure steps (clusters, postgres_config, sites, clusters_aws, clusters_azure)
were injecting socks5://localhost:1080 into ALL_PROXY env vars and kubeconfig
proxy-url fields. These now use the deterministic workload port so they work
correctly when multiple proxies are running concurrently.
…ove dead Go fallbacks

Python aws_workload, azure_workload, and aws_eks_cluster now read ALL_PROXY from
the environment (with 1080 fallback) for kubeconfig proxy-url, matching the port
already injected by Go step callers. Dead localPort="1080" fallbacks removed from
aws and azure NewProxySession constructors.
PruneRegistry was removing stale entries without calling KillProcess,
leaving orphaned processes running for Azure dual-PID sessions where one
process had died but the other was still alive.
…ehavior

The helm step was not passing ALL_PROXY to the Python subprocess, causing
cluster_kubeconfig() to fall through to the socks5://localhost:1080 default
while the proxy runs on the deterministic workload port.

Also updates PTD_CLI_REFERENCE.md and ensure-flow.md to reflect the new
port scheme (interactive → 1080, daemon/ensure/workon → 10000–19999) and
the renamed proxies.json registry file.
@stevenolen stevenolen added this pull request to the merge queue Apr 21, 2026
Merged via the queue into main with commit 4303c2d Apr 21, 2026
8 checks passed
@stevenolen stevenolen deleted the proxy-registry branch April 21, 2026 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants