Optional cross-worker lock for shared adaptor repo#1416
Conversation
Add `--repo-lock` / `WORKER_REPO_LOCK` to coordinate adaptor installs across multiple workers sharing one repo directory (e.g. an NFS mount or k8s PVC). Uses proper-lockfile per-adaptor plus a sentinel cache, so the cache-hit path stays lock-free. Off by default; requires --repo-dir. Lock retry ceiling (6 min) is set above STALE_MS (5 min) so a dead holder's stale window expires before waiters give up, making cold- start of N pods against an empty repo recoverable rather than fatal.
Drop redundant pre-check in ensureLockTarget (wx already throws EEXIST), parallelise fileExists pairs in handleIsInstalled and the post-lock double-check, collapse trivial if/return, drop unused default export, and fix the worker test harness' mode list comment plus an any-typed logger.
|
ok @stuartc having looked a bit more closely - this implementation MUST move into the engine The engine already has an What is the sentinel stuff all about? |
| const locksDir = path.join(repoDir, '.locks'); | ||
| const sentinelsDir = path.join(repoDir, '.sentinels'); | ||
|
|
||
| const ensureDirs = (async () => { |
There was a problem hiding this comment.
This is weird. Why not remove the wrapper and just await the mkdir calls?
| logger.debug(`acquired install lock for ${specifier}`); | ||
|
|
||
| try { | ||
| const [hasSentinel, hasPkg] = await Promise.all([ |
There was a problem hiding this comment.
As we're going to move this to the engine, the engine already has an isInstalled function, which determines whether an adaptor is installed or not.
Rather than implement that test twice - as we're doing here - I'd like tot ry and re-use that logic if possible.
Short Description
Adds an optional filesystem lock so multiple ws-workers can safely share a single adaptor repo directory (e.g. an NFS mount or k8s PVC), without two workers racing on the same
npm install.Fixes #1414
Implementation Details
When the repo directory is shared between worker pods, two workers can otherwise hit
handleInstallfor the same adaptor at the same time and corrupt each other'snode_modules. This PR adds a new opt-in flag —--repo-lock/WORKER_REPO_LOCK— that wraps the engine's autoinstall handlers with cross-process coordination.How it works:
<repoDir>/.locks/<alias>.lockis acquired viaproper-lockfilebefore any install runs. Different adaptors don't block each other.<repoDir>/.sentinels/<alias>.done.handleIsInstalledrequires BOTH the sentinel ANDnode_modules/<alias>/package.jsonto be present, so a half-finished install left by a crashed worker is correctly re-attempted.handleIsInstalledreturns true) stays lock-free — no syscall fan-out for adaptors that are already installed.installFnthrows, the sentinel isn't written, and the next worker re-runs the install.--repo-lockrequires--repo-dir; if it's set without one we warn and continue without the lock.New dep:
proper-lockfile@4.1.2(withgraceful-fs,retry,signal-exittransitives). All checked for CVEs — clean.QA Notes
The feature is gated behind
WORKER_REPO_LOCK=true. With the flag off, behaviour is identical to before this PR — please confirm that path is unchanged.With the flag on, the interesting cases to exercise:
--repo-dirstarting a run that needs the same adaptor at the same time — only onenpm installshould actually run; the other should see the sentinel and skip.--repo-lock, multi-worker behaviour should be unchanged.There's a multi-process test suite using
child_process.forkthat exercises all of the above deterministically —pnpm test test/util/repo-lock-multiprocess.test.tsfrompackages/ws-worker.K8s deployments using this need NTP/chrony across nodes — stale-lock detection is mtime-based.
AI Usage
You can read more details in our
Responsible AI Policy