Skip to content

fix(git): cap and stream untracked-file line counting#2218

Merged
pauldambra merged 3 commits into
mainfrom
fix/count-file-lines-oom
May 19, 2026
Merged

fix(git): cap and stream untracked-file line counting#2218
pauldambra merged 3 commits into
mainfrom
fix/count-file-lines-oom

Conversation

@pauldambra
Copy link
Copy Markdown
Member

@pauldambra pauldambra commented May 19, 2026

Fixes an OOM in the main process that freezes the app a few seconds after the sidebar paints, on accounts whose selected directory is a monorepo with large untracked files (lockfiles, build artifacts, source maps). countFileLines was reading each file's full contents into V8 memory via fs.readFile, run 16-way concurrent against tens of thousands of paths — a single big file in any of the 16 in-flight reads could push the heap past 4 GB and crash main. Now we stream-count \n bytes with createReadStream, so peak per-stream memory is ~64 KB regardless of file size and the 16-way batch stays bounded around 1 MB total.

Problem

  • countFileLines in packages/git/src/queries.ts was reading each untracked file fully into memory via fs.readFile(filePath, "utf-8"), and the caller (getChangedFilesDetailed) runs it 16-way concurrent against up to 50k untracked paths from streamGitStatus.
  • On a monorepo with multi-MB build artifacts, lockfiles, or other large untracked files, peak heap was 16 × file_bytes × 2 (V8's UTF-16 cost). That blew past Electron's 4 GB main-process heap on real-world repos and OOM'd the main process. The renderer appeared frozen ~2–3 s after the sidebar painted because all its in-flight tRPC calls were waiting on a dead main. Symptom in chromium.log was a silent OOM error in V8: Scavenger: semi-space copy Allocation failed, with no [ipc-rate] warning before it (no IPC storm — just main quietly running out of heap).
  • Regression introduced by feat: virtualized and optimized diff rendering #2113 (c617988), which removed the previous 10k-file cap and switched to concurrent reads when migrating to streamGitStatus. Bisected on a freshly-onboarded data dir pointed at this monorepo.

Fix

  • Stream the file with createReadStream (64 KB highWaterMark) and count \n bytes byte-by-byte. Per-stream memory stays at ~64 KB regardless of file size, so peak across 16 concurrent reads is ~1 MB total — well under any heap pressure threshold.
  • Preserves the original return semantics: trailing-newline-aware count, 0 for empty / non-regular files / errors.
  • No artificial file-size cap — every untracked file gets an accurate line count. (An earlier version of this PR gated at 1 MB, which silently returned linesAdded: 0 for any larger file. That's a correctness regression and the streaming alone is already enough to fix the OOM, so the cap was dropped in the follow-up commit.)

Test plan

  • pnpm --filter @posthog/git test src/queries.test.ts — new getChangedFilesDetailed tests pass (small-file accuracy + correct counting of a multi-MB untracked file that previously OOM'd)
  • pnpm --filter code typecheck
  • pnpm dev against a fresh data dir pointed at this monorepo — previously OOM'd within ~3 s of MainSidebar paint, now stays responsive
  • CI

🤖 Generated with Claude Code

`countFileLines` previously read each untracked file's full content into
memory as UTF-8 via `fs.readFile(filePath, "utf-8")`. The caller in
`getChangedFilesDetailed` runs this 16-way concurrent against every
untracked path returned by `streamGitStatus` (up to 50k entries). On a
monorepo with multi-MB build artifacts or lockfiles, peak heap was
`16 * file_bytes * 2` (V8's UTF-16 cost), easily several GB. That OOM'd
the main process and froze the renderer waiting on a tRPC call that
would never return — the symptom was the app appearing to freeze a
few seconds after the sidebar painted, with no `[ipc-rate]` warnings,
just a silent V8 `Scavenger: semi-space copy Allocation failed`.

Fix: bail out on files larger than 1 MB, and stream the rest with
`createReadStream`, counting `\n` bytes byte-by-byte. Per-stream memory
stays at ~64 KB regardless of file size, so peak across 16 concurrent
reads is ~1 MB total. Preserves the original return semantics
(trailing-newline-aware line count) for files within the cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
packages/git/src/queries.test.ts:257-268
**Prefer parameterised tests for the two line-count cases**

The team's convention is to always use parameterised tests. The two sub-cases here — file with a trailing newline and file without one — test the same behaviour (accurate line counting) with different inputs and expected values, which is exactly the pattern parameterisation is designed for. A `it.each` table would express the intent more clearly and make it trivial to add further edge-cases (e.g. an empty file, a CRLF file) without repeating the repo-setup boilerplate.

Reviews (1): Last reviewed commit: "fix(git): cap and stream untracked-file ..." | Re-trigger Greptile

Comment thread packages/git/src/queries.test.ts Outdated
Earlier on this PR I gated countFileLines with a 1 MB size cap that
returned `linesAdded: 0` for anything bigger. That was a correctness
regression — a 5 MB untracked lockfile would show 0 lines in the diff
stats instead of its real count.

The streaming line counter is already enough to fix the OOM on its
own: `createReadStream`'s 64 KB highWaterMark means peak per-stream
memory is ~64 KB regardless of file size, so 16-way concurrent reads
top out around 1 MB total whether the files are 5 KB or 5 GB. The
only cost of removing the cap is wall-clock time for genuinely huge
files (a 1 GB untracked file now streams in a few seconds instead of
returning 0), which is the correct trade.

Updated the regression test to assert the real line count for a
~1.6 MB untracked file instead of asserting it gets skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member Author

@pauldambra pauldambra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

🤖 Automated comment by QA Swarm — not written by a human

Multi-perspective review by qa-team (reliability / performance / data-integrity / compatibility / two generalists), paul-reviewer, and xp-reviewer. All three reviewers converged on the same two MEDIUM findings, posted inline. Six lower-severity findings (Buffer.indexOf perf, dead string-chunk branch, symlink-follow risk, extra fs.stat, lastByte dead branch, edge-case test coverage, magic-number extraction) were noted but are being addressed silently in the next commit.

Verdict: 💬 APPROVE WITH NITS — no blocking findings.

Comment thread packages/git/src/queries.ts Outdated
Comment thread packages/git/src/queries.ts Outdated
Convergent feedback from three reviewers on the streaming line counter:

- thread `AbortSignal` from `getChangedFilesDetailed` through
  `countFileLines` into `createReadStream({ signal })` and check the
  signal between items in `mapWithConcurrency`. A cancelled tRPC call
  no longer streams every untracked file to completion before the
  result is discarded.
- log on `stream.on("error", ...)` via `console.warn` (matches the
  pattern already used in `utils.ts`). Silent zeros in a diff-stats
  field is the exact smell that hid the original OOM bug; now there's
  a breadcrumb when it happens.
- replace the per-byte JS scan with `Buffer.indexOf(0x0a)` — native, an
  order of magnitude faster on multi-MB buffers, which is the exact
  workload this function exists to handle. (The original
  `typeof chunk === "string"` branch was dead at runtime — Node's
  `data` listener types are wider than the actual emit — so it's now
  a focused cast with a comment explaining why.)
- switch `fs.stat` -> `fs.lstat` so an untracked symlink to /dev/zero
  (or any device / dir / fifo through a symlink) short-circuits to 0
  via `isFile()` instead of streaming forever. Symlinks in the
  untracked list are rare; preferring safety over a legitimate-symlink
  line count is the right trade.
- comment the `lastByte === -1` branch so a future cleanup doesn't
  treat it as dead — it's the TOCTOU guard.

Test coverage expanded to a parameterized `it.each` over the line-count
contract (trailing newline, no trailing newline, single byte, lone
newline, consecutive newlines, CRLF), plus an explicit empty-file case.
The multi-MB regression test now uses a named constant so a future
"shrink for speed" PR can't accidentally drop it below the stream's
highWaterMark and silently downgrade the test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pauldambra pauldambra added the Stamphog This will request an autostamp by stamphog on small changes label May 19, 2026
@pauldambra pauldambra requested a review from a team May 19, 2026 10:48
@jonathanlab jonathanlab added the Create Release This will trigger a new release label May 19, 2026
@pauldambra pauldambra merged commit a017c55 into main May 19, 2026
15 checks passed
@pauldambra pauldambra deleted the fix/count-file-lines-oom branch May 19, 2026 10:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Create Release This will trigger a new release Stamphog This will request an autostamp by stamphog on small changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants