[PERF] Reduce memory allocations through pre-sizing collections and batch allocation#5020
Conversation
Force-pushed from 421f42c to 83ccd8d
I believe the added benchmarks for every single commit are probably overkill, but I've left them in deliberately. Let me know if you'd prefer:
siavashs left a comment
Thank you!
There are some formatting issues failing the CI checks.
Force-pushed from 83ccd8d to 33ece32
Fixed it, a relic of the refactor 👍🏼
There are more, please rebase your branch (we just merged an updated …)
Force-pushed from 33ece32 to e196685
My bad! Should be all good now.
Spaceman1701 left a comment
Thanks for the contribution!
I certainly see no problem accepting the changes which pre-allocate capacity, but I have some concerns about some of the more sophisticated optimizations.
I think any optimization which trades code clarity for performance will need some justification explaining why performance is a superseding priority for that code. I have no doubt that the changes improve performance, but I'm not sure we should increase complexity unless we have observed performance issues in real world conditions.
```go
alertCopies := make([]types.Alert, len(alerts))

for i, alert := range alerts {
	alertCopies[i] = *alert
}
```
Is the idea here to batch the allocations of the copied alerts? I guess otherwise line 786 will result in one allocation per alert...
I'm a touch worried that this will leak memory - if anybody keeps a reference to any alert that's part of this slice we cannot deallocate the entire slice.
Have you observed a performance issue with the existing logic that justifies the additional complexity?
I might be mistaken, but it seems the old code did the same here: https://github.com/prometheus/alertmanager/pull/5020/changes#diff-2ec4e2e390946f66324eb00bb9e735280a4d018e65de41e64bebac8bebebd23bL782
Thanks for the review and for raising this concern!
I've updated the commit to inline the change directly in flush() rather than extracting a separate function, hopefully this makes the diff easier to follow. The initial extraction was mainly to allow dedicated unit tests for the alert preparation logic, but it made the actual optimization harder to see.
Regarding the memory concern, I think you're right that if something held a reference to one alert, it would retain the whole array. Looking at the flush() flow though, alertsSlice is passed to notify(), processed, and then released when flush() returns. However, if there is a code path where individual alerts are retained longer, that would have been a concern with the original code as well, wouldn't it? Does the introduction of alertCopies change that risk?
Happy to revert if you feel the complexity isn't worth it. As for your initial question about performance, we've been seeing quite a bit of memory pressure on the alertmanager from many small issues, so any low-hanging fruit that reduces allocations feels worth picking up 😄
Well, in the original code AlertSlice was a slice of pointers to alerts. So holding onto an individual alert would keep that one alert in memory, not the whole slice; only keeping a reference to the whole slice would keep the whole slice. In the new version the alerts are all part of one slice, so having a pointer to one alert unfortunately holds the whole slice...
If we see it's a problem to allocate and GC all these alerts, would we consider sync.Pool? Though I'm not sure if the sync aspect would slow us down then. Unfortunately it might be hard to guarantee that no code will ever hold onto an alert/store it in a map "for a while" and such...
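For illustration only, a minimal sketch of what a pooled copy buffer could look like (Go 1.21+ for the `clear` builtin; `Alert` is a simplified stand-in for `types.Alert`, and `withAlertCopies` is an invented name, not Alertmanager's API):

```go
package main

import (
	"fmt"
	"sync"
)

// Alert is a simplified stand-in for types.Alert.
type Alert struct{ Name string }

// copyPool holds reusable slices of alert copies.
var copyPool = sync.Pool{
	New: func() any { return new([]Alert) },
}

func withAlertCopies(alerts []*Alert, notify func([]Alert)) {
	buf := copyPool.Get().(*[]Alert)
	copies := (*buf)[:0]
	for _, a := range alerts {
		copies = append(copies, *a)
	}
	notify(copies)
	clear(copies)     // blank the data behind any retained pointer
	*buf = copies[:0] // keep the grown capacity for the next flush
	copyPool.Put(buf)
}

func main() {
	alerts := []*Alert{{Name: "a"}, {Name: "b"}}
	withAlertCopies(alerts, func(cs []Alert) { fmt.Println(len(cs)) })
}
```

Whether the Get/Put contention outweighs the saved allocations is exactly the "sync aspect" question above and would need benchmarking.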
@Spaceman1701 what would you think if we zeroed out alertCopies after use, or if we put it as part of a struct that we keep in a pool, and reset it before putting it in the pool? This way any pointers to it would break, and we wouldn't be able to hold on to them, so a user needing to keep a cache would need to do copy+cache... ?
I think my preferred solution would be to encapsulate all this logic into a struct that owns the memory, but I suppose in practice it's a minor difference. I also feel a little more comfortable with the inlined function since it makes things harder to misuse.
If I'm the only one with this concern, I'm also happy to be outvoted - it's not a strong concern.
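As a hedged sketch of the "struct that owns the memory" idea (all names invented for illustration; this is not code from the PR):

```go
package main

import "fmt"

type Alert struct{ Name string }

// alertBuffer owns the copy storage and reuses it across flushes.
type alertBuffer struct{ copies []Alert }

// snapshot fills the owned buffer with copies and hands out pointers
// into it; callers must not retain them past release.
func (b *alertBuffer) snapshot(alerts []*Alert) []*Alert {
	if cap(b.copies) < len(alerts) {
		b.copies = make([]Alert, len(alerts))
	}
	b.copies = b.copies[:len(alerts)]
	ptrs := make([]*Alert, len(alerts))
	for i, a := range alerts {
		b.copies[i] = *a
		ptrs[i] = &b.copies[i]
	}
	return ptrs
}

// release zeroes the buffer so any escaped pointer reads blank data.
func (b *alertBuffer) release() { clear(b.copies) }

func main() {
	var buf alertBuffer
	out := buf.snapshot([]*Alert{{Name: "a"}})
	fmt.Println(out[0].Name)
	buf.release()
}
```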
Let's go for slice clear for now... It's a little unclear to me how "a struct that owns the memory" would help if the notifiers still get access to *types.Alert, short of making a copy each time they do, which would defeat the purpose and duplicate the copies? But maybe we can do that as a future improvement and/or we can chat about it
Force-pushed from e196685 to e792779
ultrotter left a comment
Approved, with a slice clear when we copy into the slice, so we are sure to invalidate any held references and thus enforce the contract with the user not to hold onto them, even in the future.
perf: batch allocate alert copies in flush

The flush path copies each alert before sending to avoid mutation. Previously each copy was a separate heap allocation. Now all copies share a single backing array.

Benchmark results:
10 alerts: Before: 691 ns/op, 12 allocs → After: 435 ns/op, 3 allocs (37% faster, 75% fewer allocs)
50 alerts: Before: 3003 ns/op, 52 allocs → After: 1714 ns/op, 3 allocs (43% faster, 94% fewer allocs)
100 alerts: Before: 5976 ns/op, 102 allocs → After: 3383 ns/op, 3 allocs (43% faster, 97% fewer allocs)

Signed-off-by: Titouan Chary <titouan.chary@aiven.io>
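For reference, a minimal, self-contained sketch of the batch-copy shape this commit describes (simplified `Alert` in place of `types.Alert`):

```go
package main

import "fmt"

type Alert struct{ Name string }

// copyAlerts makes one allocation for all copies (a shared backing
// array) plus one for the pointer slice, instead of one heap
// allocation per alert.
func copyAlerts(alerts []*Alert) []*Alert {
	alertCopies := make([]Alert, len(alerts))
	alertsSlice := make([]*Alert, len(alerts))
	for i, a := range alerts {
		alertCopies[i] = *a
		alertsSlice[i] = &alertCopies[i]
	}
	return alertsSlice
}

func main() {
	in := []*Alert{{Name: "a"}, {Name: "b"}}
	out := copyAlerts(in)
	out[0].Name = "mutated" // mutates the copy...
	fmt.Println(in[0].Name) // ...not the original: prints "a"
}
```

The two make calls are independent of the alert count, which helps explain why the benchmarked allocations stay flat as the group grows.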
perf: pre-size map in fingerprintEquals

Pre-allocate the LabelSet map to the exact size needed (len of Equal labels) instead of letting it grow dynamically. Called on every inhibition check to compute the fingerprint of matching labels. Avoids map growth overhead when Equal contains multiple labels.

Benchmark results:
3 labels: Before: 357 ns/op → After: 314 ns/op (12% faster)
10 labels: Before: 1519 ns/op → After: 1365 ns/op (10% faster)

Signed-off-by: Titouan Chary <titouan.chary@aiven.io>
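A sketch of the pre-sizing pattern (a plain map type stands in for model.LabelSet, and the actual fingerprint computation is omitted):

```go
package main

import "fmt"

type LabelSet map[string]string

// equalLabels collects the labels named in equal. The result holds at
// most len(equal) entries, so the map is sized once up front and the
// runtime never has to grow it inside the loop.
func equalLabels(lset LabelSet, equal []string) LabelSet {
	m := make(LabelSet, len(equal))
	for _, name := range equal {
		if v, ok := lset[name]; ok {
			m[name] = v
		}
	}
	return m
}

func main() {
	fmt.Println(equalLabels(LabelSet{"env": "prod", "team": "x"}, []string{"env"}))
}
```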
perf: pre-size map in getGroupLabels

Use len(GroupBy), or len(alert.Labels) for GroupByAll, as the initial map capacity. Avoids internal map growth when grouping by many labels. Most impactful with GroupByAll on alerts with 10+ labels, where the map would otherwise grow multiple times during iteration.

Benchmark results (10 labels):
group_by_all: Before: 714 ns/op → After: 585 ns/op (18% faster)

Signed-off-by: Titouan Chary <titouan.chary@aiven.io>
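A sketch of the capacity choice (same simplified LabelSet stand-in; GroupByAll decides the size hint):

```go
package main

import "fmt"

type LabelSet map[string]string

func groupLabels(labels LabelSet, groupBy map[string]struct{}, groupByAll bool) LabelSet {
	// With GroupByAll every label is kept, so the map will hold
	// len(labels) entries; otherwise at most len(groupBy).
	size := len(groupBy)
	if groupByAll {
		size = len(labels)
	}
	gl := make(LabelSet, size)
	for name, value := range labels {
		if _, ok := groupBy[name]; ok || groupByAll {
			gl[name] = value
		}
	}
	return gl
}

func main() {
	labels := LabelSet{"alertname": "Up", "env": "prod"}
	fmt.Println(groupLabels(labels, map[string]struct{}{"alertname": {}}, false))
}
```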
perf: pre-size collections in DedupStage alert partitioning

Extract alert partitioning into a function and pre-allocate slices and maps to len(alerts) capacity. Called on every DedupStage.Exec. Avoids slice/map growth when processing groups with many alerts.

Benchmark results:
50 alerts: Before: 15.8 µs/op, 126 allocs → After: 13.1 µs/op, 110 allocs (17% faster, 13% fewer allocs)
100 alerts: Before: 29.0 µs/op, 232 allocs → After: 25.7 µs/op, 210 allocs (11% faster, 9% fewer allocs)

Signed-off-by: Titouan Chary <titouan.chary@aiven.io>
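An illustrative sketch of pre-sized partitioning (the Alert type and the resolved check are simplified placeholders, not the exact code in the dedup stage):

```go
package main

import "fmt"

type Alert struct {
	Name     string
	Resolved bool
}

// partition splits alerts into firing and resolved. In the worst case
// either slice holds every alert, so both are capped at len(alerts)
// up front to avoid append-driven reallocations.
func partition(alerts []*Alert) (firing, resolved []*Alert) {
	firing = make([]*Alert, 0, len(alerts))
	resolved = make([]*Alert, 0, len(alerts))
	for _, a := range alerts {
		if a.Resolved {
			resolved = append(resolved, a)
		} else {
			firing = append(firing, a)
		}
	}
	return firing, resolved
}

func main() {
	f, r := partition([]*Alert{{Name: "a"}, {Name: "b", Resolved: true}})
	fmt.Println(len(f), len(r))
}
```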
perf: pre-allocate slice in OpenAPIAlertsToAlerts

Pre-allocate the alerts slice with the known length to avoid slice growth allocations during API alert conversion.

Benchmark with 100 alerts (PreAllocated vs AppendGrowth):
50736 B/op vs 51768 B/op (-2% memory)
404 allocs vs 408 allocs (-1% allocations)

Signed-off-by: Titouan Chary <titouan.chary@aiven.io>
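A sketch of the known-length allocation (PostableAlert and the field copy are simplified placeholders, not the OpenAPI models):

```go
package main

import "fmt"

type PostableAlert struct{ Name string }
type Alert struct{ Name string }

// convert pre-allocates the output with the input's length, so the
// appends never trigger a growth reallocation.
func convert(in []*PostableAlert) []*Alert {
	out := make([]*Alert, 0, len(in))
	for _, pa := range in {
		out = append(out, &Alert{Name: pa.Name})
	}
	return out
}

func main() {
	fmt.Println(len(convert([]*PostableAlert{{Name: "a"}})))
}
```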
perf: cache types.Alerts() result in Template.Data

The Data() method was calling types.Alerts() twice: once for Status() and once for iteration. Each call allocates a new slice. Caching the result avoids the duplicate allocation.

Benchmark with 50 alerts (SingleCall vs DuplicateCall):
4416 B/op vs 8832 B/op (-50% memory)
51 allocs vs 102 allocs (-50% allocations)

Signed-off-by: Titouan Chary <titouan.chary@aiven.io>
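A sketch of the caching fix under simplified stand-in types (toSortedAlerts plays the role of the allocating types.Alerts call):

```go
package main

import "fmt"

type Alert struct{ Status string }

// toSortedAlerts stands in for types.Alerts(): it allocates a fresh
// slice on every call.
func toSortedAlerts(in []*Alert) []*Alert {
	out := make([]*Alert, len(in))
	copy(out, in)
	return out
}

func overallStatus(alerts []*Alert) string {
	for _, a := range alerts {
		if a.Status == "firing" {
			return "firing"
		}
	}
	return "resolved"
}

// data calls the allocating conversion once and reuses the result,
// instead of calling it both for the status and for iteration.
func data(in []*Alert) (string, []*Alert) {
	sorted := toSortedAlerts(in)
	return overallStatus(sorted), sorted
}

func main() {
	s, as := data([]*Alert{{Status: "firing"}})
	fmt.Println(s, len(as))
}
```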
Clear the alertCopies slice after notification

This is so a caller that may try to hold on to one of the pointers can realize that they shouldn't do that, by having the held data blanked out and invalidated.

Signed-off-by: Guido Trotter <ultrotter@gmail.com>
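A tiny runnable illustration of what clear does to a slice of structs (Go 1.21+): elements are zeroed in place, so a retained pointer still points into the array but now sees zero values.

```go
package main

import "fmt"

type Alert struct{ Name string }

func main() {
	copies := []Alert{{Name: "a"}, {Name: "b"}}
	held := &copies[0] // a pointer a notifier might wrongly retain
	clear(copies)
	fmt.Println(held.Name == "") // true: the held data was blanked
}
```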
revert: remove batch allocation and clear in flush

Revert the batch allocation optimization (b6f01d7) and the subsequent clear(alertCopies) addition (ab5b200). The batch allocation aimed to reduce per-alert heap allocations by sharing a backing array. However, this introduced a memory retention concern: holding one alert reference would retain the entire array.

The clear() was added to "enforce the contract" that notifiers should not retain alert pointers. However, this approach breaks both tests and any real code that legitimately accesses alerts during or after the notify callback: it zeroes the alert data, causing fingerprint calculations to return invalid values.

The complexity and test failures outweigh the allocation savings. Reverting to per-alert allocations, which are simpler and have clear ownership semantics.

Signed-off-by: Titouan Chary <titouan.chary@aiven.io>
Thanks for the contribution!
* [CHANGE] `go get github.com/prometheus/alertmanager/ui` will now fail as compiled UI assets are no longer checked into the repository. Downstream builds that rely on these assets being present in the source tree must now build the UI from source. #5113
* [CHANGE] The '--enable-feature=auto-gomaxprocs' option is deprecated and will be removed in v0.33. This flag currently has no effect and can be safely removed from any startup scripts. #5090
* [CHANGE] Update internal function signatures across multiple packages. This affects any project that integrates `Alertmanager` code.
* [ENHANCEMENT] Add static asset caching. #5113
* [ENHANCEMENT] Reduce memory allocations through pre-sizing collections and batch allocation. #5020
* [ENHANCEMENT] Replace help with documentation in navigation bar. #4943
* [ENHANCEMENT] docs(ha): Update high availability documentation. #5136
* [ENHANCEMENT] docs: Add `auth_secret_file` for smtp in document. #5036
* [ENHANCEMENT] docs: Add description for global `telegram_bot_token`. #5114
* [ENHANCEMENT] docs: Add note about notifier timeouts. #5077
* [ENHANCEMENT] docs: Fix `force_implicit_tls` config field name. #5030
* [ENHANCEMENT] docs: Link community supported integrations. #4978
* [ENHANCEMENT] docs: Remove duplicate header. #5034
* [ENHANCEMENT] docs: Update mutual tls reference in high availability documentation. #5120
* [ENHANCEMENT] tracing: Use noop spans when tracing disabled. #5118
* [FEATURE] Add silence annotations. #4965
* [FEATURE] Add silence logging option. #4163
* [FEATURE] Add support for multiple matcher set silences. #4957
* [FEATURE] Add the reason for notifying in dedup stage. #4971
* [FEATURE] mattermost: Flatten attachments into top-level config. #5009
* [FEATURE] mattermost: Support global webhook url. #4998
* [FEATURE] slack: Add default color from template. #5014
* [FEATURE] slack: Allow receiver to edit existing messages. #5007
* [FEATURE] template: Add dict, map and append functions. #5093
* [FEATURE] webhook: Add full payload templating support for notifier. #5011
* [BUGFIX] config: Check for empty cluster tls client config. #5126
* [BUGFIX] config: Don't crash upon reading empty config for notifier. #4979
* [BUGFIX] config: Fix ipv6 address handling in hostport.string(). #5040
* [BUGFIX] mattermost: Omit empty text field in notifications. #4985
* [BUGFIX] telegram: Send fallback message when notification exceeds character limit. #5074
* [BUGFIX] ui: Fix escaping for matcher values with quotes. #4862
* [BUGFIX] ui: Handle special chars in silence regex-matchers. #4942
* [BUGFIX] ui: Support utf-8 label names in matchers. #5089

Signed-off-by: Solomon Jacobs <solomonjacobs@protonmail.com>
Pull Request Checklist
Please check all the applicable boxes.
Used `benchstat` to compare benchmarks.

Which user-facing changes does this PR introduce?
[PERF] Reduce memory allocations in alert processing hot paths through pre-sizing collections and batch allocation.
Summary
This PR reduces memory allocations in several hot paths by pre-sizing maps/slices and batching allocations. All changes are internal optimizations with no API changes.
Benchmarks
Batch allocate alert copies in flush:
10 alerts: Before: 691 ns/op, 12 allocs → After: 435 ns/op, 3 allocs (37% faster, 75% fewer allocs)
50 alerts: Before: 3003 ns/op, 52 allocs → After: 1714 ns/op, 3 allocs (43% faster, 94% fewer allocs)
100 alerts: Before: 5976 ns/op, 102 allocs → After: 3383 ns/op, 3 allocs (43% faster, 97% fewer allocs)
DedupStage alert partitioning:
50 alerts: Before: 15.8 µs/op, 126 allocs → After: 13.1 µs/op, 110 allocs (17% faster, 13% fewer allocs)
100 alerts: Before: 29.0 µs/op, 232 allocs → After: 25.7 µs/op, 210 allocs (11% faster, 9% fewer allocs)
fingerprintEquals pre-sizing:
3 labels: Before: 357 ns/op → After: 314 ns/op (12% faster)
10 labels: Before: 1519 ns/op → After: 1365 ns/op (10% faster)
getGroupLabels pre-sizing:
group_by_all (10 labels): Before: 714 ns/op → After: 585 ns/op (18% faster)
Template.Data caching:
50 alerts: 4416 B/op vs 8832 B/op (-50% memory), 51 vs 102 allocs (-50%)