Fix for deadlock issue #3682 #3715
@rajagopalanand thank you very much for your contribution - can you please fix the linter?
This is a duplicate of https://github.com/prometheus/alertmanager/pull/3648/files, right?
…rometheus#3682 Signed-off-by: Anand Rajagopal <[email protected]>
I was not aware of that PR, but I will take a look.
The other PR addresses more than just the deadlock described in #3682. If this PR can be reviewed and merged first, the other PR can then address the race conditions and retain the test from this one.
Thank you so much for the detailed response and PR description.
This LGTM, but there's one more thing we need to do. Please take a look at my comment.
@@ -71,15 +71,14 @@ func (a *Alerts) Run(ctx context.Context, interval time.Duration) {

func (a *Alerts) gc() {
	a.Lock()
	defer a.Unlock()

	var resolved []*types.Alert
We need to stop passing []*types.Alert and instead send a slice of values to the callback.
- var resolved []*types.Alert
+ var resolved []types.Alert
I can submit a separate PR for this
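The reviewer does not spell out the motivation for switching from a slice of pointers to a slice of values. One plausible reading, and it is only an assumption, is that once the store lock is released before the callback runs, pointers would still alias alerts that other goroutines may modify. The sketch below shows the difference in plain Go; `Alert`, `snapshotPointers`, and `snapshotValues` are hypothetical names for illustration and are not part of Alertmanager.

```go
package main

import "fmt"

// Alert is a hypothetical, simplified stand-in for types.Alert.
type Alert struct {
	Name   string
	Status string
}

// snapshotPointers copies the slice header only: every element still
// points at the same Alert as the stored slice.
func snapshotPointers(stored []*Alert) []*Alert {
	out := make([]*Alert, len(stored))
	copy(out, stored)
	return out
}

// snapshotValues copies each Alert struct, so later writes to the stored
// alerts are not visible through the snapshot.
func snapshotValues(stored []*Alert) []Alert {
	out := make([]Alert, 0, len(stored))
	for _, a := range stored {
		out = append(out, *a)
	}
	return out
}

func main() {
	stored := []*Alert{{Name: "disk_full", Status: "resolved"}}

	ptrs := snapshotPointers(stored)
	vals := snapshotValues(stored)

	// Simulate a later update to the stored alert.
	stored[0].Status = "firing"

	fmt.Println("pointer snapshot:", ptrs[0].Status)
	fmt.Println("value snapshot:  ", vals[0].Status)
}
```

Note that copying a struct is a shallow copy: any map or slice fields inside it would still be shared with the original, so a value slice narrows the aliasing but does not remove it entirely.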
…rometheus#3682 (prometheus#3715) Signed-off-by: Anand Rajagopal <[email protected]> Signed-off-by: Gokhan Sari <[email protected]>
…rometheus#3682 (prometheus#3715) Signed-off-by: Anand Rajagopal <[email protected]>
* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879
* [CHANGE] Adopt log/slog, drop go-kit/log #4089
* [FEATURE] Add a new Microsoft Teams integration based on Flows #4024
* [FEATURE] Add a new Rocket.Chat integration #3600
* [FEATURE] Add a new Jira integration #3590 #3931
* [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895
* [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837
* [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877
* [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792
* [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007
* [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961
* [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062
* [ENHANCEMENT] Build using go 1.23 #4071
* [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732
* [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801
* [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638
* [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863
* [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812
* [ENHANCEMENT] Latency metrics now support native histograms. #3737
* [ENHANCEMENT] Add timeout option for webhook notifier. #4137
* [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006
* [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027
* [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930
* [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887
* [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648
* [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826
* [BUGFIX] Fix version in APIv1 deprecation notice. #3815
* [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800
* [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803
* [BUGFIX] Fix deadlock on the alerts memory store. #3715
* [BUGFIX] Fix `amtool template render` when using the default values. #3725
* [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745
* [BUGFIX] Fix wechat api link #4084
* [BUGFIX] Fix build info metric #4166

Signed-off-by: SuperQ <[email protected]>
What does this PR do?
This PR is a fix for #3682. In some instances, `mem.Alerts.Subscribe()` and `store.Alerts.gc()` can deadlock:
* `store.Alerts.gc()` acquires a lock on its internal mutex, `store.Alerts.mtx`, and then invokes the callback, which tries to acquire `mem.Alerts.mtx`.
* `mem.Alerts.Subscribe()` acquires a lock on its internal mutex, `mem.Alerts.mtx`, and calls `store.Alerts.List()`.
* The callback cannot acquire `mem.Alerts.mtx` because that lock is already held by `mem.Alerts.Subscribe()`.
* `mem.Alerts.Subscribe()` cannot proceed because `store.Alerts.List()` cannot acquire its lock (`store.Alerts.mtx`), which is held by `store.Alerts.gc()`.

Another way of summarizing this: `store.Alerts.gc()` held its lock until the callback function completed, and the callback was in turn waiting to acquire the lock held by `Subscribe()`. `Subscribe()` could not make progress because it calls `store.Alerts.List()`, which was waiting for the lock held by `store.Alerts.gc()`. This fix releases the lock held by `store.Alerts.gc()` before calling the callback function.