Created per-worker RSS memory tracker via Prometheus resolved #6814 #7546
Open
SketchRudy wants to merge 35 commits into
Add worker max RSS helper with tests
Add Prometheus gauge metric for worker process RSS memory
for more information, see https://pre-commit.ci
Mypy fixes
Wired middleware to worker RSS gauge for per-request metric updates
fixed hardcoded line into if statement
Story #4 closes out the documentation and verification work for the flagsmith_worker_rss_bytes gauge (Flagsmith#6814).
- New operator guide at docs/docs/deployment-self-hosting/observability/worker-rss-monitoring.md covering enabling, PromQL examples, Grafana panel suggestions, and high-water-mark interpretation notes.
- Cross-link added to metrics.mdx so the guide is discoverable from the metrics index page.
- Corrected the stale catalogue description for flagsmith_worker_rss_bytes to match the post-PR-#3 Python docstring (high-water mark / VmHWM).
- Integration test in api/tests/integration/core/ exercises the full path: request through WorkerRSSMiddleware, gauge update, registry scrape. Satisfies story #3 AC #5 escape hatch.
- Temporary scaffold note at docs/development/ documents Windows limitations encountered and follow-ups for the team.
docs: document worker RSS metric and add integration test
chore: remove sprint scaffold notes ahead of upstream submission
Hi @SketchRudy, thanks for the PR. Please can you review the linting failure here, and make sure that the title of the PR adheres to the conventional commit format?
Thanks for submitting a PR! Please check the boxes below:
docs/ if required so people know about the feature.
Changes
Closes Issue #6814
Adds a Prometheus gauge, `flagsmith_worker_rss_bytes`, that tracks the peak resident-set size (RSS) of each API worker process, labelled by PID. This gives operators per-worker memory visibility so leaks can be spotted on a dashboard before a worker is OOM-killed.
Implementation
- `api/metrics/worker_metrics.py`: reads the `VmHWM` (peak RSS) line from `/proc/self/status` and exposes it via a `prometheus_client.Gauge` with a `pid` label and `multiprocess_mode="liveall"`, so it aggregates correctly across gunicorn workers when `PROMETHEUS_MULTIPROC_DIR` is set. Fails safe to a no-op on platforms where `/proc/self/status` is unavailable.
- `api/core/middleware/worker_rss.py`: `WorkerRSSMiddleware` updates the gauge after each response. The update is isolated so a metrics failure can never affect request handling.
- `api/app/settings/common.py`: the middleware is registered only when `PROMETHEUS_ENABLED` is true, mirroring the existing `ENABLE_API_USAGE_TRACKING` pattern, so deployments without Prometheus incur zero overhead.
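For reviewers who want the shape of the change at a glance, here is a minimal, standard-library-only sketch of the two moving parts described above: parsing `VmHWM` and updating a metric after each response without letting a failure escape. It is not the code in this PR. The names `parse_vm_hwm_bytes`, `read_worker_max_rss_bytes`, `WorkerRSSMiddlewareSketch`, and the `update_gauge` callback are hypothetical, and the Prometheus gauge itself is left out so the sketch runs without third-party packages.

```python
from typing import Callable, Optional


def parse_vm_hwm_bytes(status_text: str) -> Optional[int]:
    """Return the peak RSS in bytes from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmHWM:"):
            fields = line.split()
            # Expected shape: "VmHWM:    123456 kB" (the kernel's kB is 1024 bytes).
            if len(fields) >= 2 and fields[1].isdigit():
                return int(fields[1]) * 1024
            return None
    return None


def read_worker_max_rss_bytes(path: str = "/proc/self/status") -> Optional[int]:
    """Read the peak RSS of the current process; None where procfs is unavailable."""
    try:
        with open(path, encoding="ascii", errors="replace") as status_file:
            return parse_vm_hwm_bytes(status_file.read())
    except OSError:
        # Non-Linux platforms or restricted containers: behave as a no-op.
        return None


class WorkerRSSMiddlewareSketch:
    """Django-style middleware shape: call the next handler, then update the metric."""

    def __init__(
        self,
        get_response: Callable[[object], object],
        update_gauge: Callable[[], None],  # hypothetical hook that sets the gauge
    ) -> None:
        self.get_response = get_response
        self.update_gauge = update_gauge

    def __call__(self, request: object) -> object:
        response = self.get_response(request)
        try:
            self.update_gauge()
        except Exception:
            # A metrics failure must never affect request handling.
            pass
        return response
```

In a real deployment the `update_gauge` hook would presumably read the value and set it on the gauge for the current PID; the sketch only demonstrates the ordering (response first, metric second) and the failure isolation.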
Design notes
The metric reports the high-water mark rather than current RSS. It is cheap
and robust to read, and sufficient for leak detection: a flat line is a
healthy worker that has stabilised, a continuously climbing line indicates a
leak, and recovery is observed via PID rotation when a worker is recycled.
These trade-offs and the interpretation guidance are documented for operators.
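The flat-versus-climbing reading can be made concrete with a small sketch. This is an illustration of the interpretation rule only, not part of the PR: `classify_worker` and its `min_growth_bytes` threshold are hypothetical, and the input is assumed to be scraped values for a single PID in time order. Because a high-water mark always rises during warm-up, the sketch looks only at growth across the most recent half of the window.

```python
from typing import Sequence


def classify_worker(
    samples: Sequence[int],
    min_growth_bytes: int = 16 * 1024 * 1024,  # hypothetical threshold
) -> str:
    """Classify one worker's peak-RSS series as stable or climbing."""
    if len(samples) < 2:
        return "insufficient-data"
    # Ignore the warm-up phase by comparing the latest sample with the midpoint.
    midpoint = (len(samples) - 1) // 2
    recent_growth = samples[-1] - samples[midpoint]
    return "climbing" if recent_growth >= min_growth_bytes else "stable"
```

A recycled worker appears under a new PID, so its series would start again from a fresh baseline and be classified independently.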
Documentation
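- New operator guide: `docs/docs/deployment-self-hosting/observability/worker-rss-monitoring.md` (enabling, PromQL examples, a Grafana panel, and interpretation notes).
- Cross-link from the metrics index page (`metrics.mdx`).
- Corrected catalogue description for `flagsmith_worker_rss_bytes`.
How did you test this code?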
Automated tests
- Helper unit tests (`api/tests/unit/metrics/test_unit_worker_metrics.py`): success, missing data, and unsupported-platform paths.
- Middleware unit tests (`api/tests/unit/core/middleware/test_unit_core_middleware_worker_rss.py`): call ordering, response pass-through, and failure isolation.
- Integration test (`api/tests/integration/core/test_integration_core_worker_rss_metric.py`): drives a real request through the Django middleware stack and asserts the gauge appears in the Prometheus exposition output.
Run with:
```bash
cd api
uv sync --extra dev
uv run pytest tests/unit/metrics tests/unit/core/middleware tests/integration/core/test_integration_core_worker_rss_metric.py -n0
```
Manual verification
Built the OSS API image from this branch and ran the full stack with `PROMETHEUS_ENABLED=true`. After sending traffic, scraping `/metrics` returned the metric for each live gunicorn worker, alongside the built-in Flagsmith metrics.
Confirmed one series per worker PID, with the expected `VmHWM`-based description in the metric HELP text.
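The "one series per worker PID" check can also be scripted against a saved scrape. The sketch below is an assumption-laden illustration, not something run as part of this PR: `rss_by_pid` is a hypothetical helper, and it assumes the Prometheus text exposition format with a `pid` label on each `flagsmith_worker_rss_bytes` sample.

```python
import re
from typing import Dict

# Matches sample lines such as: flagsmith_worker_rss_bytes{pid="123"} 1.0e+08
_SAMPLE = re.compile(r"^flagsmith_worker_rss_bytes\{([^}]*)\}\s+(\S+)")
# Matches the pid label only when it starts the label set or follows a comma.
_PID = re.compile(r'(?:^|,)pid="([^"]*)"')


def rss_by_pid(exposition: str) -> Dict[str, float]:
    """Map each worker PID to its reported peak RSS from /metrics text."""
    values: Dict[str, float] = {}
    for line in exposition.splitlines():
        sample = _SAMPLE.match(line)
        if sample is None:
            continue  # comments (# HELP / # TYPE) and other metrics
        pid = _PID.search(sample.group(1))
        if pid is not None:
            values[pid.group(1)] = float(sample.group(2))
    return values
```

Comparing `len(rss_by_pid(text))` with the configured gunicorn worker count would then confirm that every live worker is reporting.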
Worked on this with @mmaslov007 @HumaGitGud @AAshGray