fix: replace O(n²) string concatenation in readLines with array buffering #252
The readLines() function used `buffer += chunk` followed by
`buffer.indexOf('\n')` on every iteration. The indexOf forces V8 to
flatten its internal cons-string representation, defeating the
optimization that makes += amortized. This results in O(n²) total
string copies — for 22 MB of stdout (~1,400 chunks), approximately
15.7 GB of allocations, 1.5 GB+ peak heap, and 10-20s event loop
stalls.
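The pattern being replaced can be sketched as follows (an illustrative stand-in for the real `readLines()` loop, not the PR's exact code — decoding is omitted and `concatSearch` is a hypothetical name):

```javascript
// Illustrative sketch of the O(n²) pattern: buffer += chunk followed by
// indexOf over the whole buffer. Each indexOf flattens V8's cons-string,
// so every append ends up copying the entire accumulated buffer.
function concatSearch(chunks) {
  let buffer = ''
  const lines = []
  for (const chunk of chunks) {
    buffer += chunk                                // cons-string append
    let idx
    while ((idx = buffer.indexOf('\n')) !== -1) {  // forces flattening
      lines.push(buffer.slice(0, idx))
      buffer = buffer.slice(idx + 1)               // copies the remainder again
    }
  }
  if (buffer.length > 0) lines.push(buffer)
  return lines
}
```

The output is correct; only the copying behavior is pathological, which is why the bug survived until large outputs hit it.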
Replace with an array-based pending[] buffer:
- No-newline chunks: O(1) array push, no string copying
- Newline chunks: O(line_length) join, once per complete line
- Worst case (no newlines): single join() at stream end
Fixes #251
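The replacement strategy can be sketched as a generator (hypothetical scaffolding around the approach the commit describes; `splitLines` is an illustrative name and byte decoding is omitted):

```javascript
// Array-buffered line splitting: fragments accumulate via O(1) push and
// are joined only when a complete line (or end of stream) is reached.
function* splitLines(chunks) {
  const pending = []             // decoded fragments since the last newline
  for (const chunk of chunks) {
    let start = 0
    let idx
    while ((idx = chunk.indexOf('\n', start)) !== -1) {
      pending.push(chunk.slice(start, idx))
      yield pending.join('')     // one O(line_length) join per complete line
      pending.length = 0
      start = idx + 1
    }
    if (start < chunk.length) pending.push(chunk.slice(start)) // O(1) push
  }
  if (pending.length > 0) yield pending.join('') // single join at stream end
}
```

Searching for `'\n'` within each new chunk (rather than the whole buffer) is what removes the repeated full-buffer scans and flattening.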
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9d91e76a1d
```diff
 if (done) {
-  if (buffer.length > 0) {
-    yield buffer
+  if (pending.length > 0) {
+    yield pending.join('')
   }
```
Flush TextDecoder state before finishing stream
When done is reached, the generator yields pending.join('') and exits without calling decoder.decode() to flush buffered decoder state. Because this code now uses decoder.decode(value, { stream: true }), a stream that ends with an incomplete UTF-8 sequence will silently drop trailing bytes instead of emitting the replacement character, which is a data-loss regression from the previous behavior. Add a final flush on EOF and append its result before the last yield.
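The buffering behavior can be demonstrated standalone with a streaming `TextDecoder` (Node 11+ and browsers; 🎉 is U+1F389, four UTF-8 bytes):

```javascript
// Split the 4-byte UTF-8 encoding of 🎉 (F0 9F 8E 89) across two chunks.
const decoder = new TextDecoder()
const first = decoder.decode(new Uint8Array([0xf0, 0x9f]), { stream: true })
// first === '' — the two bytes are held in the decoder's internal buffer
const rest = decoder.decode(new Uint8Array([0x8e, 0x89]), { stream: true })
// rest === '🎉' — the held bytes complete the code point

// If the stream ends mid-sequence instead, only a final flush surfaces it:
const atEof = new TextDecoder()
atEof.decode(new Uint8Array([0xf0, 0x9f]), { stream: true }) // bytes held
const trailing = atEof.decode() // flush: U+FFFD for the truncated sequence
```

Without that final argument-less `decode()`, the held bytes are simply discarded when the decoder is garbage-collected.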
```js
    yield pending.join('')
  }
  break
}
```
🔴 The new code never flushes the TextDecoder before breaking on stream end, silently dropping any incomplete multi-byte UTF-8 sequence (e.g. a 4-byte emoji split across the final chunk). Fix: add `const trailing = decoder.decode(); if (trailing) pending.push(trailing)` before the `pending.length` check in the done branch.
Extended reasoning:
What the bug is: The new code creates a single TextDecoder with { stream: true } and reuses it across all chunks. When { stream: true } is in effect, the TextDecoder intentionally holds incomplete multi-byte sequences in its internal buffer, emitting them only once the remaining bytes arrive in a later chunk. When the stream ends (done === true), the decoder's internal buffer must be explicitly flushed by calling decoder.decode() (without { stream: true }). The new code never does this.
Code path that triggers it: lines 34–38 of the new `readLines` function:

```js
if (done) {
  if (pending.length > 0) {
    yield pending.join('')
  }
  break // ← decoder's internal buffer is never flushed
}
```

After the last chunk is processed with `decoder.decode(value, { stream: true })`, any incomplete multi-byte sequence is held internally. When `done === true` arrives, the code checks `pending.length` (which only tracks already-decoded strings) and exits without ever draining the decoder.
Why existing code doesn't prevent it: The pending array only receives strings that have already been returned by decoder.decode(..., { stream: true }). Bytes held inside the TextDecoder's internal state are invisible to pending. The pending.length > 0 guard therefore cannot detect this situation.
Impact: Any UTF-8 content whose final 1–3 bytes of a multi-byte code point arrive in the last stream chunk will be silently dropped. This is a regression from the old code: the old implementation created a new TextDecoder() per chunk without { stream: true }, which would at minimum emit a U+FFFD replacement character for incomplete sequences (visible corruption) rather than silent data loss. Affected scenarios include any stream whose last few bytes happen to form an emoji or other multi-byte character — a common occurrence in code output or emoji-laden text.
Step-by-step proof:
1. Stream ends with a chunk containing bytes `[0xF0, 0x9F]` — the first two bytes of 🎉 (U+1F389, encoded as `F0 9F 8E 89`).
2. `decoder.decode(new Uint8Array([0xF0, 0x9F]), { stream: true })` returns `''` — the bytes are held in the decoder's internal buffer.
3. Since `chunk === ''`, `chunk.indexOf('\n') === -1`, so `pending.push('')` is called (`pending` now holds one empty string).
4. Next iteration: `done === true`. `pending.length === 1` is truthy, so `yield pending.join('')` yields `''` — an empty string is emitted to the caller.
5. `break` is reached. The two buffered bytes `[0xF0, 0x9F]` are silently discarded.
How to fix it:

```js
if (done) {
  const trailing = decoder.decode() // flush internal buffer
  if (trailing) pending.push(trailing)
  if (pending.length > 0) {
    yield pending.join('')
  }
  break
}
```

This ensures any bytes held inside the TextDecoder are emitted before the generator exits. Note also that the current code pushes an empty string `''` onto `pending` when a chunk decodes to `''` (bytes fully buffered internally), which causes the `done` branch to yield a spurious empty string to callers; the flush call also resolves this.
When using decoder.decode(value, { stream: true }), the TextDecoder may
hold incomplete multi-byte UTF-8 bytes internally. Without flushing on
EOF, those trailing bytes were silently dropped — a regression from the
previous behavior which would at minimum emit replacement characters.
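The difference between the two behaviors can be checked directly (a standalone demonstration; the byte values are a split 🎉, U+1F389):

```javascript
const head = new Uint8Array([0xf0, 0x9f]) // first half of F0 9F 8E 89
const tail = new Uint8Array([0x8e, 0x89]) // second half

// Old behavior: a fresh decoder per chunk, no { stream: true }.
// Each half is invalid on its own, so replacement characters appear —
// visible corruption, but nothing silently vanishes.
const oldStyle = new TextDecoder().decode(head) + new TextDecoder().decode(tail)

// New behavior without an EOF flush: the held bytes simply never surface.
const streaming = new TextDecoder()
const unflushed = streaming.decode(head, { stream: true }) // '' — bytes held
// ...stream ends here without a final streaming.decode() — silent data loss
```

This is why the review calls the missing flush a regression rather than a wash: U+FFFD output is detectable downstream, an empty string is not.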
Summary

Fixes #251 — replaces quadratic `buffer += chunk` string concatenation in `readLines()` with O(n) array-based buffering.

Problem

`readLines()` in `js/src/utils.ts` uses `buffer += new TextDecoder().decode(value)` followed by `buffer.indexOf('\n')` on every chunk. The `indexOf` call forces V8 to flatten its internal cons-string representation, defeating the optimization that normally makes `+=` amortized. For 22 MB of stdout arriving in ~1,400 chunks of ~16 KB, this means roughly 15.7 GB of total string allocations, 1.5 GB+ peak heap, and 10–20 s event loop stalls.

Fix

Replace with a `pending: string[]` array buffer:

- No-newline chunks: O(1) `push()` — no string copying at all
- Newline chunks: `join()`, called once per complete line
- Worst case (no newlines): a single `join()` at stream end (vs O(n²) with `buffer +=`)

(Benchmark numbers from #251)

Test plan

- `npm run build` passes
- … (template not found)