
fix: replace O(n²) string concatenation in readLines with array buffering #252

Open
beran-t wants to merge 2 commits into main from fix-readlines-quadratic-concat

Conversation


@beran-t beran-t commented Apr 14, 2026

Summary

Fixes #251 — replaces quadratic buffer += chunk string concatenation in readLines() with O(n) array-based buffering.

Problem

readLines() in js/src/utils.ts uses buffer += new TextDecoder().decode(value) followed by buffer.indexOf('\n') on every chunk. The indexOf call forces V8 to flatten its internal cons-string representation, defeating the optimization that normally makes += amortized. For 22 MB of stdout arriving in ~1,400 chunks of ~16 KB:

  • Total bytes copied: ~15.7 GB of string allocations
  • Peak heap: 1.5 GB+
  • Event loop stalls: 10-20 seconds
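Those totals follow from simple arithmetic: if every one of the n chunks forces a flatten, chunk i re-copies roughly i × chunkSize characters. A quick sketch (the helper below is illustrative, not SDK code):

```typescript
// Total characters copied when every chunk flattens the accumulated
// buffer: sum over i of i * chunkSize, i.e. O(n²) in the chunk count.
function totalBytesCopied(chunkSize: number, chunkCount: number): number {
  let copied = 0
  for (let i = 1; i <= chunkCount; i++) {
    copied += i * chunkSize // flattening re-copies the whole buffer so far
  }
  return copied
}

// ~1,400 chunks of 16 KiB ≈ 22 MB of stdout
console.log(totalBytesCopied(16 * 1024, 1400)) // → 16067788800 (~16 GB)
```

This lands in the same ballpark as the ~15.7 GB figure above; the exact number depends on the real chunk sizes and on per-character storage.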

Fix

Replace with a pending: string[] array buffer:

  • No-newline chunks: O(1) push() — no string copying at all
  • Newline chunks: O(line_length) join(), called once per complete line
  • Worst case (no newlines): single join() at stream end
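A minimal sketch of the approach (illustrative; names and line-splitting details may differ from the actual js/src/utils.ts implementation):

```typescript
// Array-buffered line reader: chunks without a newline are pushed in
// O(1); join() runs only when a complete line is emitted.
async function* readLines(
  stream: ReadableStream<Uint8Array>
): AsyncGenerator<string> {
  const reader = stream.getReader()
  const decoder = new TextDecoder()
  const pending: string[] = [] // fragments of the current line
  try {
    while (true) {
      const { done, value } = await reader.read()
      if (done) {
        // flush any partially decoded multi-byte sequence
        const trailing = decoder.decode()
        if (trailing) pending.push(trailing)
        if (pending.length > 0) yield pending.join('')
        break
      }
      let chunk = decoder.decode(value, { stream: true })
      let nl: number
      while ((nl = chunk.indexOf('\n')) !== -1) {
        pending.push(chunk.slice(0, nl + 1))
        yield pending.join('') // one O(line_length) join per line
        pending.length = 0
        chunk = chunk.slice(nl + 1)
      }
      if (chunk) pending.push(chunk) // O(1); earlier chunks are not copied
    }
  } finally {
    reader.releaseLock()
  }
}
```

Note that indexOf now scans only the newly decoded chunk, never the accumulated buffer, so no flattening of a growing string ever occurs.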
| Metric | Before (`buffer +=`) | After (array + `join`) |
| --- | --- | --- |
| Peak heap | 211 MB | 20 MB |
| Peak RSS | 329 MB | 95 MB |
| Elapsed time | 29.8 s | 4.1 s |
| Memory amplification | 9x | 0.9x |

(Benchmark numbers from #251)
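The asymmetry is easy to reproduce outside the SDK. A micro-benchmark sketch (timings are machine- and V8-version-dependent; only the quadratic-vs-linear shape matters):

```typescript
// Compare += with per-iteration indexOf against array buffering.
const chunk = 'x'.repeat(16 * 1024)
const chunks = 200 // small enough that the quadratic path stays quick

let t0 = performance.now()
let buffer = ''
for (let i = 0; i < chunks; i++) {
  buffer += chunk
  buffer.indexOf('\n') // forces V8 to flatten the cons-string
}
const quadraticMs = performance.now() - t0

t0 = performance.now()
const pending: string[] = []
for (let i = 0; i < chunks; i++) {
  pending.push(chunk)
  chunk.indexOf('\n') // scans only the new chunk
}
const joined = pending.join('') // single O(n) copy at the end
const linearMs = performance.now() - t0

console.log({ quadraticMs, linearMs, equal: joined === buffer })
```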

Test plan

  • npm run build passes
  • Existing tests pass (requires code-interpreter template — not available in test account, all failures are template not found)

fix: replace O(n²) string concatenation in readLines with array buffering

The readLines() function used `buffer += chunk` followed by
`buffer.indexOf('\n')` on every iteration. The indexOf forces V8 to
flatten its internal cons-string representation, defeating the
optimization that makes += amortized. This results in O(n²) total
string copies — for 22 MB of stdout (~1,400 chunks), approximately
15.7 GB of allocations, 1.5 GB+ peak heap, and 10-20s event loop
stalls.

Replace with an array-based pending[] buffer:
- No-newline chunks: O(1) array push, no string copying
- Newline chunks: O(line_length) join, once per complete line
- Worst case (no newlines): single join() at stream end

Fixes #251

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d91e76a1d


Comment on lines 32 to 35
```diff
 if (done) {
-  if (buffer.length > 0) {
-    yield buffer
+  if (pending.length > 0) {
+    yield pending.join('')
   }
```

**P2**: Flush TextDecoder state before finishing stream

When done is reached, the generator yields pending.join('') and exits without calling decoder.decode() to flush buffered decoder state. Because this code now uses decoder.decode(value, { stream: true }), a stream that ends with an incomplete UTF-8 sequence will silently drop trailing bytes instead of emitting the replacement character, which is a data-loss regression from the previous behavior. Add a final flush on EOF and append its result before the last yield.
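The decoder behavior behind this can be demonstrated in isolation (a standalone sketch, not code from this PR):

```typescript
// TextDecoder with { stream: true } holds incomplete multi-byte
// sequences internally; a final decode() with no arguments flushes them.
const encoder = new TextEncoder()
const bytes = encoder.encode('🎉') // F0 9F 8E 89

const decoder = new TextDecoder()
const first = decoder.decode(bytes.slice(0, 2), { stream: true })
// first === '' — the two bytes are buffered, not emitted
const second = decoder.decode(bytes.slice(2), { stream: true })
// second === '🎉' — the sequence completed across chunks

// If the stream ends after the first chunk, only an explicit flush
// surfaces the incomplete sequence (as U+FFFD) instead of losing it:
const abandoned = new TextDecoder()
abandoned.decode(bytes.slice(0, 2), { stream: true })
const flushed = abandoned.decode() // '\uFFFD'
console.log({ first, second, flushed })
```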


Comment on lines +34 to 38
```ts
    yield pending.join('')
  }
  break
}
```


🔴 The new code never flushes the TextDecoder before breaking on stream end, silently dropping any incomplete multi-byte UTF-8 sequences (e.g. a 4-byte emoji split across the final chunk). Fix: add const trailing = decoder.decode(); if (trailing) pending.push(trailing) before the pending.length check in the done branch.

Extended reasoning...

What the bug is: The new code creates a single TextDecoder with { stream: true } and reuses it across all chunks. When { stream: true } is in effect, the TextDecoder intentionally holds incomplete multi-byte sequences in its internal buffer, emitting them only once the remaining bytes arrive in a later chunk. When the stream ends (done === true), the decoder's internal buffer must be explicitly flushed by calling decoder.decode() (without { stream: true }). The new code never does this.

Code path that triggers it: Lines 34–38 of the new readLines function:

```ts
if (done) {
  if (pending.length > 0) {
    yield pending.join('')
  }
  break  // ← decoder's internal buffer is never flushed
}
```

After the last chunk is processed with decoder.decode(value, { stream: true }), any incomplete multi-byte sequence is held internally. When done=true arrives, the code checks pending.length (which only tracks already-decoded strings) and exits without ever draining the decoder.

Why existing code doesn't prevent it: The pending array only receives strings that have already been returned by decoder.decode(..., { stream: true }). Bytes held inside the TextDecoder's internal state are invisible to pending. The pending.length > 0 guard therefore cannot detect this situation.

Impact: Any UTF-8 content whose final 1–3 bytes of a multi-byte code point arrive in the last stream chunk will be silently dropped. This is a regression from the old code: the old implementation created a new TextDecoder() per chunk without { stream: true }, which would at minimum emit a U+FFFD replacement character for incomplete sequences (visible corruption) rather than silent data loss. Affected scenarios include any stream whose last few bytes happen to form an emoji or other multi-byte character — a common occurrence in code output or emoji-laden text.

Step-by-step proof:

  1. Stream ends with a chunk containing bytes [0xF0, 0x9F] — the first two bytes of 🎉 (U+1F389, encoded as F0 9F 8E 89).
  2. decoder.decode(Uint8Array([0xF0, 0x9F]), { stream: true }) returns '' — bytes are held in the decoder's internal buffer.
  3. Since chunk === '', chunk.indexOf('\n') === -1, so pending.push('') is called (pending now has one empty string).
  4. Next iteration: done=true. pending.length === 1 (truthy), so yield pending.join('') yields '' — an empty string is emitted to the caller.
  5. break is reached. The two buffered bytes [0xF0, 0x9F] are silently discarded.
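The steps above replay directly as a standalone sketch of the buggy done-branch (not the PR's exact code):

```typescript
// Replay the failure: an emoji whose last bytes never flush.
const dec = new TextDecoder()
const tail = new Uint8Array([0xf0, 0x9f]) // first half of 🎉
const pending: string[] = []

const chunk = dec.decode(tail, { stream: true })
// Steps 2 and 3: chunk === '', no newline, so an empty string is buffered
if (chunk.indexOf('\n') === -1) pending.push(chunk)

// Step 4: done — the buggy branch yields pending.join('') and exits
const emitted = pending.length > 0 ? pending.join('') : null
// emitted === '' while bytes [0xF0, 0x9F] stay trapped inside `dec`
console.log(JSON.stringify(emitted)) // ""
```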

How to fix it:

```ts
if (done) {
  const trailing = decoder.decode()  // flush internal buffer
  if (trailing) pending.push(trailing)
  if (pending.length > 0) {
    yield pending.join('')
  }
  break
}
```

This ensures any bytes held inside the TextDecoder are emitted before the generator exits. Note also that the current code pushes an empty string '' onto pending when a chunk decodes to '' (bytes fully buffered internally), which causes the done branch to yield a spurious empty string to callers; the flush call also resolves this.

When using decoder.decode(value, { stream: true }), the TextDecoder may
hold incomplete multi-byte UTF-8 bytes internally. Without flushing on
EOF, those trailing bytes were silently dropped — a regression from the
previous behavior which would at minimum emit replacement characters.


Development

Successfully merging this pull request may close these issues.

O(n²) string concatenation in readLines causes OOM on large stdout (>1 MB)
