Created per-worker RSS memory tracker via Prometheus resolved #6814 #7546

Open
SketchRudy wants to merge 35 commits into Flagsmith:main from mmaslov007:main

@SketchRudy

Thanks for submitting a PR! Please check the boxes below:

  • I have read the Contributing Guide.
  • I have added information to docs/ if required so people know about the feature.
  • I have filled in the "Changes" section below.
  • I have filled in the "How did you test this code" section below.

Changes

Closes Issue #6814

Adds a Prometheus gauge, flagsmith_worker_rss_bytes, that tracks the peak
resident-set size (RSS) of each API worker process, labelled by PID. This
gives operators per-worker memory visibility so leaks can be spotted on a
dashboard before a worker is OOM-killed.

Implementation

  • api/metrics/worker_metrics.py — reads the VmHWM (peak RSS) line from
    /proc/self/status and exposes it via a prometheus_client.Gauge with a
    pid label and multiprocess_mode="liveall", so each live gunicorn worker is
    exported as its own series when PROMETHEUS_MULTIPROC_DIR is set. Fails safe to
    a no-op on platforms where /proc/self/status is unavailable.
  • api/core/middleware/worker_rss.py: WorkerRSSMiddleware updates the
    gauge after each response. The update is isolated so a metrics failure can
    never affect request handling.
  • api/app/settings/common.py — the middleware is registered only when
    PROMETHEUS_ENABLED is true, mirroring the existing
    ENABLE_API_USAGE_TRACKING pattern, so deployments without Prometheus incur
    zero overhead.
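
For reviewers who have not opened the diff yet, here is a minimal sketch of the kind of helper described in the first bullet. It is not the code in this PR: the function names and the `None` fallback are assumptions made for illustration. Only the `VmHWM` field and the `/proc/self/status` path come from the description above. Linux reports `VmHWM` in kB, so the sketch multiplies by 1024.

```python
from __future__ import annotations

from pathlib import Path


def parse_vm_hwm_bytes(status_text: str) -> int | None:
    """Return peak RSS in bytes from /proc/<pid>/status text, or None if absent.

    Hypothetical helper name, used only for this illustration.
    """
    for line in status_text.splitlines():
        if line.startswith("VmHWM:"):
            fields = line.split()  # e.g. ["VmHWM:", "123456", "kB"]
            if len(fields) >= 2 and fields[1].isdigit():
                return int(fields[1]) * 1024  # the kernel reports the value in kB
            return None
    return None


def read_peak_rss_bytes(path: str = "/proc/self/status") -> int | None:
    """Read peak RSS for the current process; None where the file is unavailable."""
    try:
        return parse_vm_hwm_bytes(Path(path).read_text())
    except OSError:
        # Non-Linux platforms (or restricted containers) have no such file.
        return None
```

A caller would set the gauge only when the result is not `None`, which is one way to get the no-op behaviour on unsupported platforms.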

Design notes

The metric reports the high-water mark rather than current RSS. It is cheap
and robust to read, and sufficient for leak detection: a flat line is a
healthy worker that has stabilised, a continuously climbing line indicates a
leak, and recovery is observed via PID rotation when a worker is recycled.
These trade-offs and the interpretation guidance are documented for operators.
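
To make the interpretation concrete, the sketch below models a high-water mark as the running maximum of a series of RSS samples. The sample values are invented for illustration and are not measurements from this PR.

```python
from itertools import accumulate


def high_water_mark(samples):
    """Running maximum of the samples, i.e. what a peak-RSS gauge would report."""
    return list(accumulate(samples, max))


# A worker whose real RSS fluctuates but has stabilised: the peak goes flat.
stable = high_water_mark([100, 140, 120, 140, 130])
print(stable)  # [100, 140, 140, 140, 140]

# A worker whose RSS keeps growing: the peak keeps climbing.
leaking = high_water_mark([100, 140, 180, 220, 260])
print(leaking)  # [100, 140, 180, 220, 260]
```

The flip side is that the gauge never decreases within one process lifetime, so memory that is released stays invisible until the worker is recycled and a new PID appears.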

Documentation

  • New operator guide:
    docs/docs/deployment-self-hosting/observability/worker-rss-monitoring.md
    (enabling, PromQL examples, a Grafana panel, and interpretation notes).
  • Cross-linked from the metrics index (metrics.mdx).
  • Catalogue entry added for flagsmith_worker_rss_bytes.

How did you test this code?

Automated tests

  • Unit tests for the RSS helper and the gauge update/clear functions
    (api/tests/unit/metrics/test_unit_worker_metrics.py) — success, missing
    data, and unsupported-platform paths.
  • Unit tests for the middleware
    (api/tests/unit/core/middleware/test_unit_core_middleware_worker_rss.py) —
    call ordering, response pass-through, and failure isolation.
  • An integration test
    (api/tests/integration/core/test_integration_core_worker_rss_metric.py)
    drives a real request through the Django middleware stack and asserts the
    gauge appears in the Prometheus exposition output.
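
The three middleware properties named above (call ordering, response pass-through, failure isolation) can be illustrated with a dependency-free sketch. The class name and the injected `update_gauge` callable are assumptions for this example; the middleware in this PR may be structured differently, and a real Django middleware receives only `get_response` in its constructor.

```python
import logging

logger = logging.getLogger(__name__)


class WorkerRSSMiddlewareSketch:
    """Illustrative stand-in, not the WorkerRSSMiddleware from this PR."""

    def __init__(self, get_response, update_gauge):
        self.get_response = get_response
        self.update_gauge = update_gauge  # hypothetical injected callable

    def __call__(self, request):
        # Handle the request first so the gauge reflects memory used by it.
        response = self.get_response(request)
        try:
            self.update_gauge()
        except Exception:
            # A metrics failure is logged and swallowed; the response still returns.
            logger.exception("worker RSS gauge update failed")
        return response


calls = []


def view(request):
    calls.append("view")
    return "response"


def update():
    calls.append("update")


def broken_update():
    raise RuntimeError("metrics backend unavailable")


ok_result = WorkerRSSMiddlewareSketch(view, update)("request")
isolated_result = WorkerRSSMiddlewareSketch(lambda request: "ok", broken_update)("request")
```

Here `calls` ends up as `["view", "update"]`, and `isolated_result` is still `"ok"` even though the update raised.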

Run with:

```bash
cd api
uv sync --extra dev
uv run pytest tests/unit/metrics tests/unit/core/middleware tests/integration/core/test_integration_core_worker_rss_metric.py -n0
```

Manual verification

Built the OSS API image from this branch and ran the full stack with
PROMETHEUS_ENABLED=true. After sending traffic, scraping /metrics returned
the metric for each live gunicorn worker, alongside the built-in Flagsmith
metrics.

Confirmed one series per worker PID, with the expected VmHWM-based
description in the metric HELP text.
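
A check like "one series per worker PID" can also be scripted against the scraped text. The sketch below is one possible way to do that, not something included in this PR. The sample output, its HELP wording, and the PIDs are made up, and the parser is deliberately simplified: it ignores optional timestamps and escaped label values.

```python
import re

# Matches e.g.: flagsmith_worker_rss_bytes{pid="17"} 1.5e+08
SERIES_RE = re.compile(r'^flagsmith_worker_rss_bytes\{([^}]*)\}\s+(\S+)$')
PID_RE = re.compile(r'pid="([^"]*)"')

# Invented sample of Prometheus text exposition output.
SAMPLE = """\
# HELP flagsmith_worker_rss_bytes Peak RSS (VmHWM) of the worker process.
# TYPE flagsmith_worker_rss_bytes gauge
flagsmith_worker_rss_bytes{pid="17"} 1.5e+08
flagsmith_worker_rss_bytes{pid="18"} 1.6e+08
other_metric_total 3.0
"""


def rss_by_pid(exposition_text):
    """Map each pid label value to its reported gauge value."""
    result = {}
    for line in exposition_text.splitlines():
        match = SERIES_RE.match(line.strip())
        if not match:
            continue  # skips HELP/TYPE comments and unrelated metrics
        pid = PID_RE.search(match.group(1))
        if pid:
            result[pid.group(1)] = float(match.group(2))
    return result


series = rss_by_pid(SAMPLE)
```

With the sample above, `series` has exactly two entries, one per PID, which is the shape expected for a two-worker gunicorn deployment.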

Worked on this with @mmaslov007, @HumaGitGud, and @AAshGray.

mmaslov007 and others added 30 commits April 27, 2026 20:05
Add worker max RSS helper with tests
Add Prometheus gauge metric for worker process RSS memory
Wired middleware to worker RSS gauge for per-request metric updates
fixed hardcoded line into if statement
Story #4 closes out the documentation and verification work for the
flagsmith_worker_rss_bytes gauge (Flagsmith#6814).

- New operator guide at docs/docs/deployment-self-hosting/observability/worker-rss-monitoring.md
  covering enabling, PromQL examples, Grafana panel suggestions, and
  high-water-mark interpretation notes.
- Cross-link added to metrics.mdx so the guide is discoverable from the
  metrics index page.
- Corrected the stale catalogue description for flagsmith_worker_rss_bytes
  to match the post-PR-#3 Python docstring (high-water mark / VmHWM).
- Integration test in api/tests/integration/core/ exercises the full path:
  request through WorkerRSSMiddleware, gauge update, registry scrape.
  Satisfies story #3 AC #5 escape hatch.
- Temporary scaffold note at docs/development/ documents Windows
  limitations encountered and follow-ups for the team.
@SketchRudy SketchRudy requested review from a team as code owners May 19, 2026 20:17
@SketchRudy SketchRudy requested review from Holmus and emyller and removed request for a team May 19, 2026 20:17

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@vercel

vercel Bot commented May 19, 2026

@AAshGray is attempting to deploy a commit to the Flagsmith Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions github-actions Bot added the api (Issue related to the REST API) and docs (Documentation updates) labels May 19, 2026
@matthewelwell
Contributor

Hi @SketchRudy, thanks for the PR. Please can you review the linting failure here, and make sure that the title of the PR adheres to the conventional commit format?

Development

Successfully merging this pull request may close these issues.

Track per-worker RSS memory via Prometheus

5 participants