Fix for deadlock issue #3682 #3715

Merged 1 commit on Mar 1, 2024

@rajagopalanand rajagopalanand commented Feb 8, 2024

What does this PR do?

This PR fixes #3682. In some instances, mem.Alerts.Subscribe() and store.Alerts.gc() can deadlock:

  1. Lock acquisition in store.Alerts.gc():
    • store.Alerts.gc() acquires its internal mutex, store.Alerts.mtx.
    • While holding that lock, it invokes the callback function, which tries to acquire mem.Alerts.mtx.
  2. Concurrent execution:
    • mem.Alerts.Subscribe() acquires its internal mutex, mem.Alerts.mtx, and then calls store.Alerts.List().
  3. Deadlock:
    • The callback cannot acquire mem.Alerts.mtx because mem.Alerts.Subscribe() already holds it.
    • mem.Alerts.Subscribe() cannot proceed because store.Alerts.List() needs store.Alerts.mtx, which store.Alerts.gc() still holds.

In short, the two paths take the same two locks in opposite order: store.Alerts.gc() holds store.Alerts.mtx while its callback waits for mem.Alerts.mtx, and mem.Alerts.Subscribe() holds mem.Alerts.mtx while store.Alerts.List() waits for store.Alerts.mtx. This fix releases the lock held by store.Alerts.gc() before calling the callback function.
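
The lock ordering described above can be illustrated with a minimal, self-contained sketch. It is not the Alertmanager code: `Store`, `Owner`, `onGC`, and `runDemo` are hypothetical stand-ins for `store.Alerts`, `mem.Alerts`, the GC callback, and a small driver. The sketch only shows the pattern of the fix, which is to collect the expired entries under the store lock and release that lock before invoking the callback.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Store is a hypothetical stand-in for store.Alerts: a mutex-guarded map
// plus a callback that receives the keys removed by gc.
type Store struct {
	mtx     sync.Mutex
	entries map[string]time.Time // key -> expiry time
	cb      func(removed []string)
}

// List returns the current keys. It needs the store lock.
func (s *Store) List() []string {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	keys := make([]string, 0, len(s.entries))
	for k := range s.entries {
		keys = append(keys, k)
	}
	return keys
}

// gc removes expired entries while holding the store lock, then releases
// the lock BEFORE invoking the callback, so the callback may take other locks.
func (s *Store) gc(now time.Time) {
	s.mtx.Lock()
	var removed []string
	for k, exp := range s.entries {
		if !exp.After(now) {
			removed = append(removed, k)
			delete(s.entries, k)
		}
	}
	s.mtx.Unlock() // with `defer s.mtx.Unlock()` the callback would run under the lock
	s.cb(removed)
}

// Owner is a hypothetical stand-in for mem.Alerts.
type Owner struct {
	mtx       sync.Mutex
	store     *Store
	collected int
}

// Subscribe takes the owner lock and then the store lock (via List).
func (o *Owner) Subscribe() []string {
	o.mtx.Lock()
	defer o.mtx.Unlock()
	return o.store.List()
}

// onGC is the callback; it takes the owner lock.
func (o *Owner) onGC(removed []string) {
	o.mtx.Lock()
	defer o.mtx.Unlock()
	o.collected += len(removed)
}

// runDemo runs gc and Subscribe concurrently and reports the final state.
func runDemo() (remaining, collected int) {
	now := time.Now()
	s := &Store{entries: map[string]time.Time{
		"old": now.Add(-time.Minute),
		"new": now.Add(time.Hour),
	}}
	o := &Owner{store: s}
	s.cb = o.onGC

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); s.gc(now) }()
		go func() { defer wg.Done(); o.Subscribe() }()
	}
	wg.Wait()
	return len(o.Subscribe()), o.collected
}

func main() {
	remaining, collected := runDemo()
	fmt.Println(remaining, collected)
}
```

If `gc` in this sketch held `s.mtx` across the callback instead, a `gc` goroutine waiting for `o.mtx` and a `Subscribe` goroutine waiting for `s.mtx` could block each other indefinitely, which is the situation the PR description lays out.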

AM_Deadlock

@rajagopalanand rajagopalanand marked this pull request as ready for review February 8, 2024 20:04
@rajagopalanand rajagopalanand marked this pull request as draft February 8, 2024 20:09
@rajagopalanand rajagopalanand force-pushed the deadlock-fix branch 2 times, most recently from 8f78ea6 to df280a4 Compare February 8, 2024 21:12
@rajagopalanand rajagopalanand marked this pull request as ready for review February 8, 2024 22:54
@gotjosh
Member

gotjosh commented Feb 13, 2024

@rajagopalanand thank you very much for your contribution - can you please fix the linter?

@gotjosh
Member

gotjosh commented Feb 13, 2024

This is a duplicate of https://github.com/prometheus/alertmanager/pull/3648/files, right?

@rajagopalanand
Contributor Author

This is a duplicate of https://github.com/prometheus/alertmanager/pull/3648/files, right?

Was not aware of this but I will take a look

@rajagopalanand
Contributor Author

rajagopalanand commented Feb 14, 2024

This is a duplicate of https://github.com/prometheus/alertmanager/pull/3648/files, right?

Was not aware of this but I will take a look

The other PR addresses more than the deadlock described in #3682. If this PR can be reviewed and merged first, the other PR can then address the race conditions when it is reviewed and merged, and it can retain the test from this PR.

Member

@gotjosh gotjosh left a comment

Thank you so much for the detailed response and PR description.

This LGTM, but there's one more thing we need to do. Please take a look at my comment.

@@ -71,15 +71,14 @@ func (a *Alerts) Run(ctx context.Context, interval time.Duration) {

func (a *Alerts) gc() {
	a.Lock()
	defer a.Unlock()

	var resolved []*types.Alert
Member

We need to stop passing []*types.Alert and instead now send a slice of values to the callback.

Suggested change
- var resolved []*types.Alert
+ var resolved []types.Alert
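
The motivation for sending values instead of pointers can be sketched as follows. Once the callback runs outside the store lock, a slice of pointers still refers to the same objects the store may modify, whereas a slice of values is a snapshot taken at copy time. `Alert`, `snapshotPointers`, and `snapshotValues` are hypothetical simplifications for illustration, not the Alertmanager types or functions.

```go
package main

import "fmt"

// Alert is a hypothetical, simplified stand-in for types.Alert.
type Alert struct {
	Name     string
	Resolved bool
}

// snapshotPointers copies the slice but not the alerts:
// every element still points at the stored object.
func snapshotPointers(in []*Alert) []*Alert {
	out := make([]*Alert, len(in))
	copy(out, in)
	return out
}

// snapshotValues copies each alert by value, so later changes to the
// stored objects are not visible through the returned slice.
// Note: this is a shallow copy; map or slice fields would still be shared.
func snapshotValues(in []*Alert) []Alert {
	out := make([]Alert, 0, len(in))
	for _, a := range in {
		out = append(out, *a)
	}
	return out
}

func main() {
	stored := []*Alert{{Name: "HighLatency", Resolved: true}}

	ptrs := snapshotPointers(stored)
	vals := snapshotValues(stored)

	// Simulate the store changing the alert after the snapshots were taken.
	stored[0].Resolved = false

	fmt.Println(ptrs[0].Resolved, vals[0].Resolved) // prints "false true"
}
```

In a concurrent setting the pointer variant also means the callback could read an alert while another goroutine writes it, which the value copy avoids for the struct's own fields.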

Contributor Author

I can submit a separate PR for this

@gotjosh gotjosh merged commit 1eb83c2 into prometheus:main Mar 1, 2024
11 checks passed
SuperQ added a commit that referenced this pull request Dec 19, 2024
* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879
* [CHANGE] Adopt log/slog, drop go-kit/log #4089
* [FEATURE] Add a new Microsoft Teams integration based on Flows #4024
* [FEATURE] Add a new Rocket.Chat integration #3600
* [FEATURE] Add a new Jira integration #3590 #3931
* [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895
* [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837
* [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877
* [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792
* [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007
* [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961
* [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062
* [ENHANCEMENT] Build using go 1.23 #4071
* [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732
* [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801
* [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638
* [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863
* [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812
* [ENHANCEMENT] Latency metrics now support native histograms. #3737
* [ENHANCEMENT] Add timeout option for webhook notifier. #4137
* [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006
* [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. Without it, HTML tags were not parsed by default. #4027
* [BUGFIX] Fix a memory leak that was caused by updating silences continuously. #3930
* [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887
* [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648
* [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826
* [BUGFIX] Fix version in APIv1 deprecation notice. #3815
* [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800
* [BUGFIX] Fix `Route.ID()` returning conflicting IDs. #3803
* [BUGFIX] Fix deadlock on the alerts memory store. #3715
* [BUGFIX] Fix `amtool template render` when using the default values. #3725
* [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745
* [BUGFIX] Fix wechat api link #4084
* [BUGFIX] Fix build info metric #4166

Signed-off-by: SuperQ <[email protected]>
@SuperQ SuperQ mentioned this pull request Dec 19, 2024