Skip to content

fix: incremental append-only parse for events.jsonl cache (#732)#736

Open
microsasa wants to merge 9 commits intomainfrom
fix/732-incremental-events-parse-5a80d2bac754a357
Open

fix: incremental append-only parse for events.jsonl cache (#732)#736
microsasa wants to merge 9 commits intomainfrom
fix/732-incremental-events-parse-5a80d2bac754a357

Conversation

@microsasa
Copy link
Copy Markdown
Owner

Closes #732

Problem

get_cached_events discarded the entire cache and re-parsed every line from the beginning whenever (st_mtime_ns, st_size) changed — even though events.jsonl files are append-only. For a 10,000-event session with 5 new events per refresh, this meant O(10,000) parse cost on every change.

Solution

  • Added end_offset: int to _CachedEvents — stores the byte position after the last parsed event
  • Added _parse_events_from_offset() — seeks to end_offset and parses only newly appended lines
  • Updated get_cached_events() with three paths:
    • Unchanged file (file_id match): cache hit, O(1)
    • Append-only growth (new_size ≥ cached.end_offset): incremental parse, O(new events only)
    • Cold start / truncation: full reparse via parse_events(), O(N)
Scenario Before After
10,000-event session, 5 new events O(10,000) O(5)
Cold start O(N) O(N)
Unchanged file O(1) O(1)

Tests

Added TestIncrementalEventsParsing with 7 tests:

  • Patches SessionEvent.model_validate with a counter to confirm only new events are validated (5,000-line fixture + 10 appends)
  • Verifies all events (original + appended) are returned
  • Truncated file triggers full reparse
  • Incremental path bypasses parse_events entirely
  • end_offset is correctly stored and updated
  • Cold start always calls parse_events

All 1141 existing tests pass. Coverage: 98.60%.

Generated by Issue Implementer · ● 13.6M ·

Extend _CachedEvents to store the byte offset (end_offset) at which
the cached events end.  When the file has grown (append-only pattern),
get_cached_events now seeks to end_offset and parses only the newly
appended lines via _parse_events_from_offset, avoiding a full re-read.

Falls back to a full reparse on cold start or file truncation.

For a 10,000-event session with 5 new events per refresh, parsing cost
drops from O(10,000) to O(5).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@microsasa microsasa added the aw Created by agentic workflow label Apr 4, 2026
Copilot AI review requested due to automatic review settings April 4, 2026 19:37
@microsasa microsasa enabled auto-merge April 4, 2026 19:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves performance of src/copilot_usage/parser.py:get_cached_events() by making the parsed events.jsonl cache incrementally update on append-only file growth, instead of fully re-parsing the file on every (st_mtime_ns, st_size) change.

Changes:

  • Add end_offset to _CachedEvents and plumb it through _insert_events_entry() / cache population.
  • Introduce _parse_events_from_offset() and update get_cached_events() to incrementally parse newly appended bytes when possible.
  • Add a dedicated test suite (TestIncrementalEventsParsing) validating the incremental path, end_offset behavior, and truncation fallback.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/copilot_usage/parser.py Adds end_offset-based incremental parsing path to avoid full events.jsonl re-reads on append-only growth.
tests/copilot_usage/test_parser.py Adds tests asserting only new appended events are validated/parsed and that end_offset is stored/updated correctly.

@microsasa microsasa added the aw-ci-fix-attempted CI fixer has attempted to fix CI label Apr 4, 2026
Cover the four untested error-handling branches in
_parse_events_from_offset() that caused diff-cover to fail at 78.9%
(< 90% threshold):

- blank/whitespace-only lines skipped (line 163)
- malformed JSON skipped with warning (lines 166-167, 172)
- Pydantic ValidationError skipped with warning (lines 175-176)
- UnicodeDecodeError returns partial results (lines 182-183)

Diff coverage is now 100%.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

Commit pushed: 1993da7

Generated by CI Fixer

@microsasa
Copy link
Copy Markdown
Owner Author

CI fix pushed — the check job was failing because diff-cover reported 78.9% coverage on changed lines (below the 90% threshold).

Root cause: The 4 error-handling branches in _parse_events_from_offset() had no test coverage:

  • Blank line skip (line 163)
  • Malformed JSON skip (lines 166–172)
  • Pydantic ValidationError skip (lines 175–176)
  • UnicodeDecodeError partial return (lines 182–183)

Fix: Added TestParseEventsFromOffset with 4 targeted tests. Diff coverage is now 100%.

Generated by CI Fixer · ● 10.5M ·

@microsasa
Copy link
Copy Markdown
Owner Author

CI fix already attempted once — stopping to prevent loops. Manual intervention needed.

Generated by CI Fixer · ● 431.1K ·

- _parse_events_from_offset now returns (events, safe_end) tuple and
  tracks per-line byte positions for accurate log messages
- Incomplete lines (no trailing newline) with JSONDecodeError stop
  incremental parsing instead of advancing end_offset, preventing
  permanently skipped events during mid-write reads
- get_cached_events cold-start path re-stats after parse_events() to
  capture consistent end_offset, avoiding duplicate events if the file
  grew during parsing
- get_all_sessions cache population re-stats after parsing instead of
  reusing discovery-time file size, fixing the same duplication risk
- Added test_incomplete_line_stops_parsing and
  test_incomplete_event_retried_on_next_refresh

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

Commit pushed: f73a800

Generated by Review Responder

Copilot AI review requested due to automatic review settings April 4, 2026 19:59
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

@microsasa microsasa added the aw-review-response-attempted Responder has attempted to address review comments label Apr 4, 2026
Use _parse_events_from_offset(path, 0) instead of parse_events() + post-parse
stat() in the full-reparse paths of get_cached_events() and get_all_sessions().
This ensures end_offset reflects a safe byte boundary actually consumed by the
parser, rather than a file size from stat() that could overstate what was parsed
if the file grew between parsing and the stat call.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

Commit pushed: 250f21a

Generated by Review Responder

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

@microsasa microsasa added aw-review-response-1 Review response round 1 and removed aw-review-response-attempted Responder has attempted to address review comments labels Apr 4, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

Commit pushed: e0993c3

Generated by Review Responder

@microsasa microsasa added the aw-review-response-attempted Responder has attempted to address review comments label Apr 4, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

Commit pushed: cc7f96b

Generated by Review Responder

Apply file_id/safe_end consistency fix to the incremental parse path,
matching the cold-start path pattern. When the pre-parse stat size
diverges from safe_end (e.g. incomplete trailing line), clamp the
cached file_id size to safe_end so the next call takes the incremental
path instead of returning a cache hit that skips unconsumed bytes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 4, 2026 20:34
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

Commit pushed: 279ff5b

Generated by Review Responder

@microsasa microsasa added the aw-review-response-attempted Responder has attempted to address review comments label Apr 4, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

Commit pushed: dc3d64a

Generated by Review Responder

Copilot AI review requested due to automatic review settings April 4, 2026 20:55
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

@microsasa microsasa added aw-review-response-3 Review response round 3 aw-review-response-attempted Responder has attempted to address review comments and removed aw-review-response-attempted Responder has attempted to address review comments labels Apr 4, 2026
- test_truncated_file_triggers_full_reparse: assert offset == 0
- test_incremental_does_not_call_parse_events: spy on _parse_events_from_offset
  and verify non-zero offset instead of vacuous parse_events check
- test_cold_start_full_reparse: assert_called_once_with(p, 0)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

Commit pushed: 8024366

Generated by Review Responder

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines +247 to +257
if cached is not None and file_id is not None:
new_size = file_id[1]
if cached.file_id == file_id:
_EVENTS_CACHE.move_to_end(events_path)
return cached.events
# Append-only growth: new size ≥ cached end_offset → incremental
if new_size >= cached.end_offset:
new_events, safe_end = _parse_events_from_offset(
events_path, cached.end_offset
)
merged = cached.events + tuple(new_events)
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_cached_events() takes the incremental path whenever new_size >= cached.end_offset. If the file identity changed but the size did not grow (e.g., file was rewritten in-place to the same size, or mtime was touched), this will call _parse_events_from_offset(..., end_offset) (which returns no events) and then refresh the cache with the new file_id while keeping the old events — returning stale data even though the identity changed. Consider only using the incremental path when the file has strictly grown beyond end_offset (append-only), and otherwise falling back to a full reparse (or another safe strategy).

Copilot uses AI. Check for mistakes.
Comment on lines +947 to 963
# Reuse _EVENTS_CACHE for incremental append-only parsing so
# that the session-summary rebuild only validates newly appended
# events instead of re-parsing the entire file from offset 0.
events_cached = _EVENTS_CACHE.get(events_path)
start_offset = 0
prior_events: tuple[SessionEvent, ...] = ()
if (
events_cached is not None
and file_id is not None
and file_id[1] >= events_cached.end_offset
and events_cached.end_offset > 0
):
start_offset = events_cached.end_offset
prior_events = events_cached.events
try:
events = parse_events(events_path)
new_events, safe_end = _parse_events_from_offset(events_path, start_offset)
except OSError as exc:
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In get_all_sessions(), the decision to reuse cached events and parse only from events_cached.end_offset is based on file_id[1] >= events_cached.end_offset. If the file’s identity changed but its size did not grow (or the file was rewritten in-place to the same size), this logic can prepend prior_events from the old file and parse zero new bytes, producing a summary from stale events. To keep correctness, gate the incremental reuse on a strict append-only growth condition (size increased beyond the cached safe boundary), otherwise parse from offset 0.

Copilot uses AI. Check for mistakes.
Comment on lines +207 to +214
except UnicodeDecodeError as exc:
logger.warning(
"{} — UTF-8 decode error at offset {}; returning {} new events (partial): {}",
events_path,
safe_offset,
len(new_events),
exc,
)
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UnicodeDecodeError warning says it occurred "at offset {safe_offset}", but safe_offset is the last fully consumed byte boundary, not necessarily where the decode error actually happened (it will usually be earlier than the failing line). Consider logging the failing line start/current offset (or exc.start relative to the current buffer) to make the warning location accurate.

Copilot uses AI. Check for mistakes.
@microsasa
Copy link
Copy Markdown
Owner Author

❌ Pipeline orchestrator: review-response loop reached 3 rounds. Marking as stuck for human review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

aw Created by agentic workflow aw-ci-fix-attempted CI fixer has attempted to fix CI aw-pr-stuck:review aw-review-response-1 Review response round 1 aw-review-response-2 Review response round 2 aw-review-response-3 Review response round 3 aw-review-response-attempted Responder has attempted to address review comments

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[aw][perf] parser: full re-parse of events.jsonl on every file change instead of incremental append-only read

2 participants