Skip to content

Enrich process telemetry with binary memory metrics and fix stack_id in OTEL resource#3621

Merged
alco merged 22 commits into
mainfrom
alco/stack-telemetry-stack-id
Apr 14, 2026
Merged

Enrich process telemetry with binary memory metrics and fix stack_id in OTEL resource#3621
alco merged 22 commits into
mainfrom
alco/stack-telemetry-stack-id

Conversation

@alco
Copy link
Copy Markdown
Member

@alco alco commented Dec 17, 2025

Summary

  • New binary memory metrics: Add separate process.bin_memory.* metrics that track off-heap (refc) binary memory per process group, including total bytes, max/avg binary count, and max/avg reference count. This gives visibility into binary memory pressure that was previously invisible — process heap memory alone doesn't show refc binary accumulation.

  • Fix stack_id in OTEL resource: Stack telemetry was missing stack_id in its OTEL resource attributes, making it impossible to distinguish stacks in the metrics backend. Now merged into the resource so all stack metrics carry the identifier.

  • Configurable stack telemetry init delay: New ELECTRIC_STACK_TELEMETRY_INIT_DELAY env var replaces the hardcoded 30s delay, enabling faster metric emission in tests and tunable startup behavior in production.

  • Richer Processes module: Generalized top_memory_by_type into top_by(sort_key) supporting sorting by either process memory or binary memory. Added {:at_least_bytes, n} limit variant. Each process group now returns proc_mem, binary_mem, max_bin_count, avg_bin_count, max_ref_count, and avg_ref_count.

  • Expanded integration test: Replaced stack-telemetry.lux with otel-export.lux that verifies both application-level metrics (including the new process.bin_memory.* family) and stack-level metrics (including stack_id in the resource and LSN/storage metrics).

  • CI: Cache lux to avoid re-fetching it on every integration test run.

Test plan

  • electric-telemetry unit tests pass (including new tests for top_bin_memory_by_type and {:at_least_bytes, n})
  • sync-service compiles cleanly
  • otel-export.lux integration test via CI

@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 17, 2025

Codecov Report

❌ Patch coverage is 72.00000% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.86%. Comparing base (0bbb377) to head (5b9694e).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ry/lib/electric/telemetry/application_telemetry.ex 0.00% 12 Missing ⚠️
...tric-telemetry/lib/electric/telemetry/processes.ex 94.59% 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (0bbb377) and HEAD (5b9694e). Click for more details.

HEAD has 14 uploads less than BASE
Flag BASE (0bbb377) HEAD (5b9694e)
unit-tests 5 1
packages/y-electric 1 0
typescript 5 0
packages/experimental 1 0
packages/start 1 0
packages/react-hooks 1 0
packages/typescript-client 1 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3621       +/-   ##
===========================================
- Coverage   89.20%   67.86%   -21.35%     
===========================================
  Files          25       15       -10     
  Lines        2520      557     -1963     
  Branches      641        0      -641     
===========================================
- Hits         2248      378     -1870     
+ Misses        270      179       -91     
+ Partials        2        0        -2     
Flag Coverage Δ
electric-telemetry 67.86% <72.00%> (?)
elixir 67.86% <72.00%> (?)
packages/experimental ?
packages/react-hooks ?
packages/start ?
packages/typescript-client ?
packages/y-electric ?
typescript ?
unit-tests 67.86% <72.00%> (-21.35%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@alco alco mentioned this pull request Dec 17, 2025
@alco alco self-assigned this Dec 18, 2025
@alco alco removed their assignment Mar 2, 2026
@erik-the-implementer erik-the-implementer force-pushed the alco/stack-telemetry-stack-id branch from d419585 to a2322ac Compare March 20, 2026 15:44
@alco alco added the claude label Mar 20, 2026
@claude
Copy link
Copy Markdown

claude Bot commented Mar 20, 2026

.DS_store
.vscode
**/.vitest-temp.json
/dist/
/packages/sync-service/state/
/website/.netlify
/website/.vitepress/cache
/website/.vitepress/dist
/website/public/openapi.html
build
caching/nginx_cache
file.jsonl
integration-tests/lux_logs/
integration-tests/lux/
integration-tests/_storage
json_files
node_modules
shape-data.json
test-dbs
tsconfig.tsbuildinfo
wal
/shapes
.sst
sst-
.log
*/deps/
**/junit
**/coverage
response.tmp
.claude
!website/.claude/commands
!website/.claude/skills
_artifacts

@claude
Copy link
Copy Markdown

claude Bot commented Mar 20, 2026

test body via graphql

@netlify
Copy link
Copy Markdown

netlify Bot commented Mar 23, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 0e3579d
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/69c16a4e4710b90008919f8d
😎 Deploy Preview https://deploy-preview-3621--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@claude
Copy link
Copy Markdown

claude Bot commented Mar 25, 2026

Claude Code Review

Summary

Iteration 7 adds one new commit (5b9694e24): reverts the Dockerfile change from iteration 6, restoring MIX_OS_DEPS_COMPILE_PARTITION_COUNT=4. The revert was prompted by a reviewer question and confirmed by the author as a deliberate choice ahead of merge. No new issues introduced; the three carry-over suggestions remain open (all optional).

What's Working Well

  • The revert is the right call given the CI reliability concern the author noted — restoring the variable avoids flaky CI until the root cause is understood. The inline discussion (magnetised → alco) captures the rationale clearly.
  • All prior fixes (binary memory metrics, stack_id in OTEL resource, configurable ELECTRIC_STACK_TELEMETRY_INIT_DELAY) remain intact and unaffected by the revert.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

Unused variable in lux test (carry-over from iteration 4)

File: integration-tests/tests/otel-export.lux:27fn i -> should be fn _ -> to silence the Elixir compiler warning.

:infinity as arithmetic sentinel (carry-over from iteration 4)

File: packages/electric-telemetry/lib/electric/telemetry/processes.ex:65 — worth a one-liner explaining that integers always compare less-than atoms, so running_total >= :infinity never fires.

Hardcoded 1.000000 in lux assertion (carry-over from iterations 2–4)

File: integration-tests/tests/otel-export.lux:79 — a comment explaining why avg_ref_count is exactly 1.0 would help future readers.

Issue Conformance

No linked issue — acceptable for this observability improvement.

Previous Review Status

  • ✅ Fixed (iter 3): off_heap_strings OOM.
  • ✅ Fixed (iter 3): Binary double-counting comment at processes.ex:46-48.
  • ✅ Fixed (iter 3): Timeout reset in _macros.luxinc.
  • ✅ New (iter 5): ELECTRIC_STACK_TELEMETRY_INIT_DELAY — correctly implemented.
  • ✅ New (iter 5): stack_id in OTEL resource via Map.put_new — correctly implemented.
  • ✅ New (iter 7): MIX_OS_DEPS_COMPILE_PARTITION_COUNT=4 reverted — sensible CI stability choice, rationale documented in inline PR thread.
  • ⏳ Still open (suggestion): fn _ -> at otel-export.lux:27.
  • ⏳ Still open (suggestion): :infinity comment at processes.ex:65.
  • ⏳ Still open (suggestion): 1.000000 comment at otel-export.lux:79.

Review iteration: 7 | 2026-04-14

@alco alco force-pushed the alco/stack-telemetry-stack-id branch 2 times, most recently from 3392db7 to 6eea74f Compare April 13, 2026 15:42
alco and others added 17 commits April 14, 2026 12:09
put_new would silently drop stack_id when the user provides a custom
:resource map. Use Map.put_new on the existing resource map instead,
so stack_id is always included without overwriting user-provided keys.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The fallback branch when info[:binary] is not a list was returning a map
with only proc_mem, but the reduce function expects binary_mem,
ref_count_sum, and num_binaries keys too. This would cause a KeyError
if the branch were ever reached.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
process.memory.* now only exports total, sorted by aggregated process
memory. The new process.bin_memory.* metric sorts process groups by
referenced binary memory and exports: total, max_bin_count,
avg_bin_count, max_ref_count, avg_ref_count.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Extract shared top_by/3 to deduplicate top_memory and top_bin_memory

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wip
The new limit includes all process groups whose aggregated memory is at
least n bytes. This is useful for binary memory telemetry where a
percentage-of-total doesn't make sense due to refc binary double-counting.

Also makes take_until_target accept the low cutoff as an argument instead
of using the hardcoded @min_group_memory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rget

When sorting by binary_mem, the low cutoff and accumulator should compare
against binary_mem, not proc_mem.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
alco and others added 3 commits April 14, 2026 12:54
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The sort order test now independently verifies each list is sorted by
its respective key and asserts the top entries differ.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ETRY_INIT_DELAY

The default remains 30s for production. Integration tests set it to 1s
so stack metrics are exported quickly enough for test assertions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alco alco force-pushed the alco/stack-telemetry-stack-id branch from 6eea74f to 9db1cc8 Compare April 14, 2026 10:55
@alco alco changed the title Include binary_mem and average ref count in top process metrics Enrich process telemetry with binary memory metrics and fix stack_id in OTEL resource Apr 14, 2026
CI seems to be having trouble with this configuration option
Copy link
Copy Markdown
Contributor

@magnetised magnetised left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent stuff.


COPY mix.* /builder/electric/
RUN mix deps.get
RUN MIX_OS_DEPS_COMPILE_PARTITION_COUNT=4 mix deps.compile
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious, why remove?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason this causes trouble on CI runners occasionally. I can revert it just before merging.

@alco alco merged commit e9db22c into main Apr 14, 2026
16 of 17 checks passed
@alco alco deleted the alco/stack-telemetry-stack-id branch April 14, 2026 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants