Investigation: Migrate the existing alerts to mimir alertmanager #3746

Closed
9 tasks done
Tracked by #3743
Rotfuks opened this issue Oct 29, 2024 · 16 comments
Rotfuks commented Oct 29, 2024

Motivation

We already have a large set of alerts managed by the Prometheus Alertmanager. We need to make sure they will also work with the Mimir Alertmanager.

Todo

  • Investigate what we need to do
    • Enable Mimir Alertmanager
    • Setup object storage
    • Configure the ruler to use the new alert manager
    • Configure Grafana datasource
    • Load config and templates (observability-operator)
    • Configure silence-operator
    • Add anonymous tenant where needed (observability-operator and silence-operator config)
    • Update Mimir app with Mimir 2.15

Outcome

  • we know how to migrate the existing alerts to the new alertmanager
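For orientation, a minimal sketch of what the first todo items could translate to in the mimir-distributed Helm values; key names are taken from the upstream chart, the placeholders are illustrative, and the exact nesting depends on how the chart is wrapped (the real values diffs are in the comments below):

alertmanager:
  enabled: true                     # run the Mimir Alertmanager component

structuredConfig:
  alertmanager_storage:
    s3:
      bucket_name: <alertmanager-bucket>        # placeholder bucket name
  ruler:
    # point the ruler at the Mimir Alertmanager instead of the Prometheus one
    alertmanager_url: http://<mimir-alertmanager-service>:8080/alertmanager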
@github-project-automation github-project-automation bot moved this to Inbox 📥 in Roadmap Oct 29, 2024
@Rotfuks Rotfuks added the team/atlas Team Atlas label Oct 29, 2024
@Rotfuks Rotfuks changed the title Migrate the existing alerts to mimir alertmanager Investigation: Migrate the existing alerts to mimir alertmanager Oct 29, 2024
@hervenicol hervenicol assigned hervenicol and unassigned hervenicol Nov 8, 2024
@TheoBrigitte TheoBrigitte self-assigned this Nov 13, 2024
@TheoBrigitte

I did manage to deploy Mimir's Alertmanager and configure it in Grafana, but I have not yet been able to load alerts in it.

Here are the steps taken so far:

  • Enabled Mimir's Alertmanager in the chart
  • Configured object storage
  • Updated the ruler's alertmanager URL
  • Configured the datasource in Grafana

Mimir values diff

--- values-mimir.original.yaml	2024-11-13 10:59:57.206365528 +0100
+++ values-mimir.yaml	2024-11-14 09:13:24.947664850 +0100
@@ -1,3 +1,4 @@
+USER-SUPPLIED VALUES:
 hpa:
   distributor:
     enabled: true
@@ -36,6 +37,8 @@
     enabled: true
     image:
       repository: gsoci.azurecr.io/giantswarm/mimir-continuous-test
+  alertmanager:
+    enabled: true
   distributor:
     replicas: 1
     resources:
@@ -102,6 +105,36 @@
         value: golem
       - key: name
         value: giantswarm-golem-mimir-ruler
+  - apiVersion: objectstorage.giantswarm.io/v1alpha1
+    kind: Bucket
+    metadata:
+      annotations:
+        meta.helm.sh/release-name: mimir
+        meta.helm.sh/release-namespace: mimir
+      labels:
+        app.kubernetes.io/instance: mimir-common
+        app.kubernetes.io/managed-by: Helm
+        app.kubernetes.io/name: mimir-common
+        application.giantswarm.io/team: atlas
+      name: giantswarm-golem-mimir-common
+      namespace: mimir
+    spec:
+      accessRole:
+        extraBucketNames:
+        - giantswarm-golem-mimir
+        roleName: giantswarm-golem-mimir
+        serviceAccountName: mimir
+        serviceAccountNamespace: mimir
+      expirationPolicy:
+        days: 100
+      name: giantswarm-golem-mimir-common
+      tags:
+      - key: app
+        value: mimir
+      - key: installation
+        value: golem
+      - key: name
+        value: giantswarm-golem-mimir-common
   gateway:
     autoscaling:
       enabled: true
@@ -186,6 +219,7 @@
         storage:
           backend: s3
           s3:
+            bucket_name: giantswarm-golem-mimir-common
             endpoint: s3.eu-west-2.amazonaws.com
             region: eu-west-2
       distributor:
@@ -208,7 +242,7 @@
         ruler_max_rule_groups_per_tenant: 0
         ruler_max_rules_per_rule_group: 0
       ruler:
-        alertmanager_url: http://alertmanager-operated.monitoring:9093
+        alertmanager_url: "http://mimir-alertmanager.mimir.svc:8080/alertmanager"
       ruler_storage:
         s3:
           bucket_name: giantswarm-golem-mimir-ruler

Grafana values diff

--- values-grafana.original.yaml	2024-11-13 09:36:10.332740268 +0100
+++ values-grafana.yaml	2024-11-14 10:52:40.833635135 +0100
@@ -72,11 +72,14 @@
         - name: Mimir Alertmanager
           type: alertmanager
           uid: mimir-alertmanager
-          url: http://mimir-alertmanager.mimir.svc/alertmanager
+          url: http://mimir-alertmanager.mimir.svc:8080/alertmanager
           access: proxy
           jsonData:
             handleGrafanaManagedAlerts: false
-            implementation: mimir
+            implementation: prometheus
+            httpHeaderName1: X-Scope-OrgID
+          secureJsonData:
+            httpHeaderValue1: 1
     kind: ConfigMap
     metadata:
       annotations:

QuantumEnigmaa commented Nov 25, 2024

We decided to use a dedicated service account for mimir-alertmanager and thus also updated the structuredConfig in the values with the following:

    alertmanager_storage:
      s3:
        bucket_name: 'giantswarm-golem-mimir-alertmanager'

... as well as the alertmanager field, as follows:

  alertmanager:
    enabled: true
    serviceAccount:
      create: true
      name: "mimir-alertmanager"
      annotations:
        # We use arn:aws-cn:iam for china and arn:aws:iam for the rest
        eks.amazonaws.com/role-arn: arn:aws:iam::<aws-account-id>:role/giantswarm-golem-mimir-alertmanager
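A dedicated bucket for the Alertmanager could be declared the same way as the mimir-common bucket in the values diff above; a sketch assuming the same objectstorage.giantswarm.io Bucket pattern (names and fields are illustrative, not the final objects):

- apiVersion: objectstorage.giantswarm.io/v1alpha1
  kind: Bucket
  metadata:
    name: giantswarm-golem-mimir-alertmanager
    namespace: mimir
  spec:
    accessRole:
      roleName: giantswarm-golem-mimir-alertmanager
      serviceAccountName: mimir-alertmanager
      serviceAccountNamespace: mimir
    name: giantswarm-golem-mimir-alertmanager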

However, we encountered 2 issues:

To proceed in this direction without too many workarounds, we'll need to wait for the next release of the Mimir Helm chart.

@QuentinBisson

Do we need a custom bucket for this? I think we could use the ruler bucket, right?

QuantumEnigmaa commented Nov 27, 2024

I think we could use the ruler bucket right?

The issue is the same: we need the next Mimir release, because the option to choose the serviceAccount for mimir-alertmanager was only added recently and has not been released yet. This means that with the Mimir version we currently use, mimir-alertmanager can only use the default mimir service account and its associated bucket.

Do we need a custom bucket for this?

I think it's better to have some data segregation, both on a logical level and from a security point of view.

Concerning the tests done on golem, I managed to get mimir-alertmanager running without any errors as a StatefulSet by creating its dedicated service account in the extraObjects section of Mimir's values and manually editing the StatefulSet.

Unfortunately, even though the pod runs flawlessly, no notification policies or contact points are displayed in Grafana for the mimir-alertmanager. The ruler has been redirected towards the mimir-alertmanager and isn't logging any errors, so I'm not sure what's blocking here.
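A sketch of what creating that dedicated service account through extraObjects could look like, reusing the IRSA annotation pattern from the serviceAccount block earlier in the thread (account ID and role name are placeholders):

extraObjects:
  - apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: mimir-alertmanager
      namespace: mimir
      annotations:
        # role assumed to grant access to the dedicated alertmanager bucket
        eks.amazonaws.com/role-arn: arn:aws:iam::<aws-account-id>:role/giantswarm-golem-mimir-alertmanager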

@QuentinBisson

Did you check that the datasource works? Maybe the Grafana logs can help?

@QuantumEnigmaa

After checking in the Grafana UI, the mimir-alertmanager datasource isn't working, so I manually created one that works. However, even with this working datasource, there are still no contact points or notification policies associated with it, and neither the mimir-alertmanager pod nor the Grafana one gives useful insight into why.

@QuentinBisson

This is not going to work :D

ruler:
  alertmanager_url: http://alertmanager-operated.monitoring:9093
  enable_api: true
  rule_path: /data

Now it makes sense that we have no contact points, as we currently do not have an alert template configured.

See:

alertmanager_fallback_config.yaml: |
  receivers:
  - name: default-receiver
  route:
    receiver: default-receiver

@QuantumEnigmaa

Yeah, I noticed that in the meantime and moved ruler.alertmanager_url under the structuredConfig section.
However, I'm struggling with the fallback config, as there are almost no examples or docs explaining how to write it :/

@QuentinBisson

It's a default Alertmanager config, so the one from the old Alertmanager should work.
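If the chart's fallback config value is used (recent mimir-distributed versions expose alertmanager.fallbackConfig; the exact key should be checked against the chart in use), carrying over the old Alertmanager's default route could look roughly like this:

alertmanager:
  enabled: true
  fallbackConfig: |
    route:
      receiver: root
      group_by: [alertname, cluster_id, installation, status, team]
    receivers:
      - name: root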

TheoBrigitte commented Dec 2, 2024

Waiting on the new mimir-distributed chart release (grafana/mimir#10078) because we need grafana/mimir#10016.

@QuantumEnigmaa

I based mimir-alertmanager's fallback config on the current Alertmanager config, which gives the following:

receivers:
  - name: root
  - name: heartbeat
    webhook_configs:
    - send_resolved: false
      http_config:
        authorization:
          type: GenieKey
          credentials: ed95e51a-0145-4de7-8601-a1326b5c3dae
        follow_redirects: true
        enable_http2: true
      url: https://api.opsgenie.com/v2/heartbeats/golem/ping 
route:
  group_by: [alertname, cluster_id, installation, status, team]
  group_interval: 15m
  group_wait: 5m
  repeat_interval: 4h
  receiver: root
  routes:
  - receiver: heartbeat
    matchers:
    - alertname="Heartbeat"
    - type="mimir-heartbeat"
    continue: true
    group_wait: 30s
    group_interval: 30s
    repeat_interval: 15m

Currently this didn't change anything, as it looks like the ruler is unable to reach mimir-alertmanager. When its alertmanager_url field is set to "http://mimir-alertmanager.mimir.svc:8080", the ruler logs 404 errors, and when it's set to http://mimir-alertmanager.mimir.svc:8080/alertmanager, it logs the following:

ts=2024-12-02T11:39:00.006105446Z caller=notifier.go:612 level=error user=anonymous alertmanager=http://mimir-alertmanager.mimir.svc:8080/alertmanager/api/v2/alerts count=14 msg="Error sending alert" err="Post \"http://mimir-alertmanager.mimir.svc:8080/alertmanager/api/v2/alerts\": dial tcp 172.31.69.20:8080: connect: connection refused"
ts=2024-12-02T11:39:00.139994639Z caller=tcp_transport.go:440 level=warn component="memberlist TCPTransport" msg="WriteTo failed" addr=100.64.187.248:7946 err="dial tcp 100.64.187.248:7946: connect: no route to host"

So I'm a bit puzzled about what address to put here.

@QuantumEnigmaa

Following suggestions from people on the Grafana Slack channel, I changed the ruler's alertmanager_url to http://mimir-alertmanager.mimir.svc.cluster.local:8080 (and also tried the same URL with the /alertmanager suffix, even though this shouldn't be necessary as the http.alertmanager-http-prefix setting defaults to /alertmanager), but in the end this also led to "Connection refused" errors.
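For reference, the ruler override being discussed sits under structuredConfig; a sketch of the shape being tried here (the URL that eventually worked is settled in the pull request linked in the next comment):

structuredConfig:
  ruler:
    # fully qualified in-cluster service name plus the default /alertmanager prefix
    alertmanager_url: http://mimir-alertmanager.mimir.svc.cluster.local:8080/alertmanager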

@TheoBrigitte

I managed to set up Mimir's Alertmanager and the correct datasource in Grafana; everything is in https://github.com/giantswarm/shared-configs/pull/185. I did see the fallback config from Grafana when selecting Mimir's Alertmanager.

Even though everything looked fine, I could not see any active notifications (alerts) being sent to Alertmanager.

TheoBrigitte commented Dec 10, 2024

Finally managed to ship the Alertmanager configuration into Mimir's Alertmanager anonymous tenant using observability-operator 🎉
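With the configuration living in the anonymous tenant, the Grafana datasource from the earlier diff presumably needs to send that tenant in its X-Scope-OrgID header as well; a sketch, with the tenant value being an assumption:

- name: Mimir Alertmanager
  type: alertmanager
  uid: mimir-alertmanager
  url: http://mimir-alertmanager.mimir.svc:8080/alertmanager
  access: proxy
  jsonData:
    handleGrafanaManagedAlerts: false
    implementation: prometheus
    httpHeaderName1: X-Scope-OrgID
  secureJsonData:
    httpHeaderValue1: anonymous   # assumed tenant, matching the anonymous tenant mentioned above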

Screenshots: observability-operator's log, mimirtool output

This was done using a lot of things:

@QuentinBisson

nice work

@TheoBrigitte

We are done here. We have everything we need to migrate to the Mimir Alertmanager; all the configuration has been figured out and implemented. Releases for Mimir and observability-operator are out.

@github-project-automation github-project-automation bot moved this from Inbox 📥 to Validation ☑️ in Roadmap Jan 16, 2025