
fix(index): bound + source-cap parallel retention with re-read fallback (low-RAM peak RSS)#854

Merged: DeusData merged 1 commit into main from distill/685-retention-reshape on Jul 4, 2026
DeusData (Owner) commented Jul 4, 2026

fix(index): bound + source-cap parallel retention with re-read fallback (low-RAM peak RSS)

Distilled from #685 (nguyentamdat), rebased onto current main, with two
research-driven refinements and a genuine reproduce-first guard.

The tradeoff: peak RSS vs. graph quality

During a full index the parallel extract RETAINS each file's source text so the
later fused cross-file LSP resolve can re-parse it without re-opening the file.
That retention is TRANSIENT (freed at run end) but it is a PEAK-RSS driver:
every retained byte is resident at once across the extract→resolve handoff.

On main the caps are flat (PP_RETAIN_PER_FILE_MAX_BYTES 100 MiB,
PP_RETAIN_TOTAL_BUDGET_BYTES 2 GiB). A file over the per-file cap is silently
not retained, and because the resolver cannot re-read it, its cross-file LSP
resolution is skipped: a graph-quality gap. Lowering the cap to save RAM would
make that gap worse.

This change breaks the tradeoff: bound retention and keep every cross-file
edge, by re-reading dropped files on demand.
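
The re-read idea can be sketched roughly as follows. This is a minimal illustration, not the code in pass_parallel.c: `reread_bounded`, `source_for_resolve`, and their parameters are hypothetical names, and the size bound stands in for whatever limit the real resolver applies.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read `path` into a fresh NUL-terminated buffer, but
 * only if the file is at most max_bytes long. Returns NULL when the file is
 * missing, unreadable, or over the bound. The caller frees the result. */
static char *reread_bounded(const char *path, size_t max_bytes, size_t *out_len) {
    assert(path != NULL && out_len != NULL);
    *out_len = 0;
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    char *buf = NULL;
    if (fseek(f, 0, SEEK_END) == 0) {
        long size = ftell(f);
        if (size >= 0 && (unsigned long)size <= max_bytes &&
            fseek(f, 0, SEEK_SET) == 0) {
            buf = malloc((size_t)size + 1);
            if (buf != NULL) {
                size_t got = fread(buf, 1, (size_t)size, f);
                buf[got] = '\0';
                *out_len = got;
            }
        }
    }
    fclose(f);
    return buf;
}

/* Use the retained copy when there is one; otherwise fall back to a bounded
 * re-read. Sets *owned to 1 when the caller must free() the returned buffer. */
static const char *source_for_resolve(const char *retained, size_t retained_len,
                                      const char *path, size_t max_bytes,
                                      size_t *out_len, int *owned) {
    assert(out_len != NULL && owned != NULL);
    if (retained != NULL) {
        *out_len = retained_len;
        *owned = 0;
        return retained;
    }
    char *fresh = reread_bounded(path, max_bytes, out_len);
    *owned = (fresh != NULL);
    return fresh;
}
```

A resolver built this way would free an owned buffer as soon as the file's resolution finishes, so the extra memory is bounded by one file at a time per worker rather than by the whole run.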

What changed (scope: pass_parallel.c + pipeline_internal.h + tests only)

  • Budget-derived + source-text cap (a cap is a FLOOR too). Retention total
    now defaults to min(cbm_mem_budget()/8, 1 GiB) with a modest per-file cap of
    min(32 MiB, total). Following the rust-analyzer memory model, the RAM-derived
    default is clamped to a small absolute ceiling so a huge-RAM host does not
    HOLD tens of GB of source it would re-read cheaply anyway. Both caps are
    env-overridable via CBM_RETAIN_TOTAL_MB / CBM_RETAIN_PER_FILE_MB
    (limits.c strtol convention); the hard ceilings bound only the auto-derived
    default, never a deliberate operator/caller choice. A dropped file emits a
    single index.retain_capped path=… bytes=… WARN per run.
  • Bounded re-read fallback (the correctness guarantee). When the resolver
    needs an unretained file's source, resolve_worker re-reads it from disk
    (bounded by cbm_max_file_bytes, freed immediately) instead of skipping
    resolution. Wired at every cross-LSP site that consumes source
    (Python / C·C++·CUDA / C# / TS·JS / the per-file cbm_pxc_run_one(_ts) path),
    so lowering the cap only trades retained RAM for a bounded re-read — it never
    loses a cross-file edge.
  • cbm_parallel_extract_ex + opts struct (cbm_parallel_extract is now a
    thin wrapper passing NULL → env-derived defaults) and malloc/calloc
    NULL-check hardening (sorted, pkg_entries).
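
The cap derivation described in the first bullet could look something like the sketch below. The formulas and the two environment variable names come from the description above; everything else (`demo_retain_caps`, `demo_default_caps`, `demo_env_mb`, `demo_apply_env_overrides`) is a hypothetical stand-in. It also assumes the per-file default is taken from the auto-derived total and that a malformed or non-positive override is ignored, which the description does not spell out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MIB (UINT64_C(1) << 20)
#define DEMO_GIB (UINT64_C(1) << 30)

typedef struct {
    uint64_t total_bytes;    /* budget across all retained sources */
    uint64_t per_file_bytes; /* largest single file that may be retained */
} demo_retain_caps;

/* Auto-derived defaults: total = min(budget / 8, 1 GiB),
 * per-file = min(32 MiB, total). */
static demo_retain_caps demo_default_caps(uint64_t mem_budget_bytes) {
    demo_retain_caps caps;
    caps.total_bytes = mem_budget_bytes / 8;
    if (caps.total_bytes > DEMO_GIB) caps.total_bytes = DEMO_GIB;
    caps.per_file_bytes = 32 * DEMO_MIB;
    if (caps.per_file_bytes > caps.total_bytes) caps.per_file_bytes = caps.total_bytes;
    return caps;
}

/* Parse a positive MiB count from the environment with strtol.
 * Returns 0 when the variable is unset, malformed, or out of range. */
static uint64_t demo_env_mb(const char *name) {
    assert(name != NULL);
    const char *text = getenv(name);
    if (text == NULL || *text == '\0') return 0;
    errno = 0;
    char *end = NULL;
    long value = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0' || value <= 0) return 0;
    if ((uint64_t)value > UINT64_MAX / DEMO_MIB) return 0;
    return (uint64_t)value * DEMO_MIB;
}

/* An explicit operator override replaces the default outright; the 1 GiB and
 * 32 MiB ceilings apply only to the auto-derived values. */
static demo_retain_caps demo_apply_env_overrides(demo_retain_caps caps) {
    uint64_t total = demo_env_mb("CBM_RETAIN_TOTAL_MB");
    uint64_t per_file = demo_env_mb("CBM_RETAIN_PER_FILE_MB");
    if (total != 0) caps.total_bytes = total;
    if (per_file != 0) caps.per_file_bytes = per_file;
    return caps;
}
```

With a 16 GiB budget this sketch yields a 1 GiB total and a 32 MiB per-file cap; with a 64 MiB budget both caps collapse to 8 MiB.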

The production pipeline reaches the new caps + fallback through the existing
cbm_parallel_extract call — no change to pipeline.c / pipeline_incremental.c
is required, keeping this PR atomic and disjoint from the memory-workstream
keystone that owns mem.c / mcp.c / main.c.
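
The wrapper arrangement that lets existing callers pick up the new defaults unchanged can be illustrated with a small sketch. The names here (`demo_extract_opts`, `demo_extract_ex`, `demo_extract`) and the placeholder default values are hypothetical; the real signatures of cbm_parallel_extract and cbm_parallel_extract_ex are not shown in this PR description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool retain_sources;     /* false: retain nothing, always re-read */
    uint64_t total_cap;      /* bytes across all retained sources */
    uint64_t per_file_cap;   /* bytes for any single retained source */
} demo_extract_opts;

/* Stand-in for the env-derived defaults; the values are placeholders. */
static demo_extract_opts demo_default_opts(void) {
    demo_extract_opts opts = { true, UINT64_C(1) << 30, UINT64_C(32) << 20 };
    return opts;
}

/* Extended entry point: NULL opts means "use the defaults". The effective
 * options are written to *effective so the sketch has something observable;
 * a real implementation would run the extraction here. */
static int demo_extract_ex(const demo_extract_opts *opts, demo_extract_opts *effective) {
    assert(effective != NULL);
    *effective = (opts != NULL) ? *opts : demo_default_opts();
    return 0;
}

/* Legacy entry point kept as a thin wrapper, so existing call sites do not change. */
static int demo_extract(demo_extract_opts *effective) {
    return demo_extract_ex(NULL, effective);
}
```

Tests can then force the over-cap path deterministically by passing an explicit options struct (for example a 1-byte per-file cap) while production code keeps calling the wrapper.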

Reproduce-first (green ⟺ fixed)

RED/GREEN A — graph quality (parallel_cross_file_reread_preserves_unretained_edges).
A Java↔Kotlin pair with genuinely cross-language calls that ONLY the source-
dependent cross-file LSP resolves (JavaCaller.call → KotlinService.ping,
KotlinService.ping → JavaService.pong, strategy lsp). Three scenarios:
CONTROL (retained), NO-RETAIN (retain_sources=false), OVER-CAP (1-byte
per-file cap). All GREEN with the fallback.

Note: #685's original Python Greeter().hello() red test is a false guard: the
per-file py_lsp already resolves those calls, so cross_lsp_eligible is false
and the edge survives even without the fallback (verified: lsp edge count
unchanged with the re-read disabled). This distillation replaces it with the
JVM fixture, which is genuinely source-dependent.

Red-first evidence (fallback disabled to simulate main):

=== fallback ENABLED  ===  24 passed
=== fallback DISABLED ===  23 passed, 1 failed
  FAIL tests/test_parallel.c: strstr(java_to_kotlin->properties_json,
       "\"strategy\":\"lsp") is NULL

The CONTROL (retained) scenario passes both ways = non-vacuity; the drop
scenarios lose the lsp cross-file resolution without the re-read = RED.
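
For readers unfamiliar with the failing assertion above, the check amounts to a substring search on the edge's properties JSON. A minimal sketch, with hypothetical helper names, might be:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True when the edge's properties JSON records a strategy whose name starts
 * with "lsp". The needle has no closing quote, so it matches any strategy
 * name beginning with that prefix. */
static bool demo_edge_resolved_by_lsp(const char *properties_json) {
    if (properties_json == NULL) return false;
    return strstr(properties_json, "\"strategy\":\"lsp") != NULL;
}

/* Test-style guard: aborts when the edge was not resolved via the LSP path. */
static void demo_require_lsp_edge(const char *properties_json) {
    assert(demo_edge_resolved_by_lsp(properties_json));
}
```

Because this is a plain substring match, it depends on the serializer emitting the key and value without intervening whitespace.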

Guard B — peak bound (parallel_extract_tiny_source_retention_budget,
parallel_extract_without_source_retention, test_mem.c).
Index a fixture whose total source exceeds a tiny budget; assert
retained_bytes <= total_cap (and that retain_sources=false retains nothing)
while defs/nodes still extract. Correctness of the over-cap files is covered by
the re-read exercised in RED/GREEN A.

Caps are forced tiny via the opts/env knobs so the over-cap path is deterministic
(no giant fixtures).
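
The invariant Guard B asserts can be shown with a small accounting sketch. It is single-threaded for clarity (the real extract is parallel and would need atomic or locked updates), the names are hypothetical, and it assumes one WARN per run in total, which is one possible reading of the description.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    bool retain_sources;     /* false: never retain */
    uint64_t total_cap;      /* budget across all retained sources */
    uint64_t per_file_cap;   /* largest single file that may be retained */
    uint64_t retained_bytes; /* running total; stays <= total_cap */
    bool warned;             /* whether the capped WARN was already emitted */
} demo_retain_state;

/* Returns true when the file's source may be retained and charges it to the
 * budget. Returns false when it must be dropped and re-read on demand. */
static bool demo_try_retain(demo_retain_state *st, const char *path, uint64_t bytes) {
    assert(st != NULL && path != NULL);
    assert(st->retained_bytes <= st->total_cap);
    if (!st->retain_sources) return false;

    uint64_t remaining = st->total_cap - st->retained_bytes;
    if (bytes > st->per_file_cap || bytes > remaining) {
        if (!st->warned) {
            fprintf(stderr, "WARN index.retain_capped path=%s bytes=%" PRIu64 "\n",
                    path, bytes);
            st->warned = true;
        }
        return false;
    }
    st->retained_bytes += bytes;
    return true;
}
```

Checking `bytes > remaining` before adding keeps `retained_bytes <= total_cap` true after every call, which is the property the tiny-budget test asserts.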

Verification

  • make -f Makefile.cbm cbm — clean (-Werror)
  • make -f Makefile.cbm lint-ci — clean (cppcheck + clang-format + NOLINT)
  • CBM_INDEX_SUPERVISOR=0 ./build/c/test-runner parallel pipeline incremental py_lsp ts_lsp java_lsp kotlin_lsp c_lsp cs_lsp go_lsp rust_lsp mem
    2323 passed, 0 failed (ASan/UBSan)

Closes #685-review
Refs #832 (retention layer of the memory workstream)

…ck (low-RAM peak RSS)

Distilled from #685 (nguyentamdat), rebased onto current main, plus two
research-driven refinements and a genuine reproduce-first guard.

The parallel extract retains each file's source text so the fused cross-file
LSP resolve can re-parse it. That retention is transient but a peak-RSS driver.
On main the caps are flat (100 MiB/file, 2 GiB total) and a file over the cap is
silently unretained AND its cross-file resolution is skipped -- a graph-quality
gap. This change bounds retention AND keeps every cross-file edge.

- Source-text cap as a FLOOR, not just a ceiling: retention total defaults to
  min(cbm_mem_budget()/8, 1 GiB), per-file min(32 MiB, total). Following the
  rust-analyzer memory model, the RAM-derived default is clamped to a small
  absolute ceiling so a huge-RAM host does not hold tens of GB it would re-read
  cheaply. Both caps env-overridable via CBM_RETAIN_TOTAL_MB /
  CBM_RETAIN_PER_FILE_MB (limits.c convention); ceilings bound only the
  auto-derived default, never an explicit operator/caller choice. A dropped file
  emits one index.retain_capped WARN per run.
- Bounded re-read fallback (the correctness guarantee): resolve_worker re-reads
  an unretained file's source on demand (bounded, freed immediately) instead of
  skipping resolution, wired at every cross-LSP site that consumes source. The
  cap now only trades retained RAM for a bounded re-read, never a lost edge.
- cbm_parallel_extract_ex + opts struct (cbm_parallel_extract is now a wrapper
  passing NULL -> env-derived defaults); malloc/calloc NULL-check hardening.

Reproduce-first: parallel_cross_file_reread_preserves_unretained_edges uses a
Java<->Kotlin pair whose cross-file lsp edges are genuinely source-dependent;
the edges are lost when the caller is unretained and the fallback is absent
(RED), present with it (GREEN), with a retained CONTROL scenario proving
non-vacuity. #685's original Python red test was a false guard (per-file py_lsp
already resolves those calls) and is replaced. Peak-bound guards
(retained_bytes <= total_cap; retain_sources=false retains nothing) in test_mem.c.

Verify: make -f Makefile.cbm cbm && make -f Makefile.cbm lint-ci; test-runner
parallel pipeline incremental py_lsp ts_lsp java_lsp kotlin_lsp c_lsp cs_lsp
go_lsp rust_lsp mem -> 2323 passed.

Co-authored-by: nguyentamdat <nguyentamdat@gmail.com>
Signed-off-by: Martin Vogel <martin.vogel.tech@gmail.com>
@DeusData DeusData enabled auto-merge July 4, 2026 16:40
@DeusData DeusData merged commit 6404747 into main Jul 4, 2026
15 checks passed