- 1. Intro
- 2. Synthetic Testing
- 3. Monitors
- 4. Creating an Incident
- 5. Updating an Incident
- 6. Resolving an Incident
- 7. Blameless Postmortem
- 8. Links and Docs
- 9. Markdown Fun!
In this quick section we'll set up a test, assign a metric as a monitor, then go through an incident.
If you're reading this as a README on GitHub, you can download the notebook's JSON from the `json_import.json` file in the repository and import it.
You'll want to do so to get live graphs!
Here's how:
- First create a new notebook in Datadog. Name it "Azure Incident Response".
- After creating the new notebook, import the JSON file using the share icon at the top right:
Note: if you ever need to export a notebook, the same menu lets you download it as PDF or Markdown (.md), or export the JSON file.
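Before importing, it can help to confirm that the downloaded file really is valid JSON. The following is a minimal sketch and not part of the original lab. It assumes the file has been saved locally, and it makes no assumption about the notebook's internal schema beyond the top level being a JSON object. The function name `summarize_export` is hypothetical.

```python
import json
from pathlib import Path


def summarize_export(path):
    """Parse a notebook export and report its size and top-level keys.

    Raises ValueError when the file is not a JSON object, so a truncated
    or HTML-wrapped download is caught before the import is attempted.
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{path} is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"{path} should contain a JSON object at the top level")
    return {"bytes": len(text.encode("utf-8")), "top_level_keys": sorted(data)}
```

Calling `summarize_export("json_import.json")` from the directory that holds the download returns a small dictionary you can eyeball before using the import menu.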
In order to test the site we created earlier in the lab, we'll set up a synthetic monitor.
You can do that by following this link to create a multi-step synthetic test.
In a different tab, we'll need to retrieve the URL for our app service from the Azure App Services page.
After running a few tests you can see we have successes and failures based on the test settings:
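To make the pass/fail logic concrete, here is a small local sketch of what a multi-step check does: run each step in order, compare the status code and response time against the step's settings, and stop at the first failure. This is an illustration only, not Datadog's implementation. The names `Step` and `run_steps`, the 2-second threshold, and the stop-on-first-failure behavior are all assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    url: str
    expected_status: int = 200
    max_seconds: float = 2.0  # assumed threshold, not taken from the lab


def run_steps(steps, fetch):
    """Run steps in order and stop at the first failing step.

    `fetch(url)` must return an HTTP status code. It is injected so the
    logic can be exercised without network access.
    Returns a list of (step name, passed, reason) tuples.
    """
    results = []
    for step in steps:
        started = time.monotonic()
        try:
            status = fetch(step.url)
        except OSError as exc:  # connection errors, timeouts, HTTP errors
            results.append((step.name, False, f"request failed: {exc}"))
            break
        elapsed = time.monotonic() - started
        if status != step.expected_status:
            results.append((step.name, False, f"status {status}"))
            break
        if elapsed > step.max_seconds:
            results.append((step.name, False, f"took {elapsed:.2f}s"))
            break
        results.append((step.name, True, "ok"))
    return results
```

A real `fetch` could wrap `urllib.request.urlopen` and return the response's `status`, pointed at the App Service URL retrieved from the Azure portal.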
You can export the metrics or graphs to a dashboard (new or existing) as seen below:
You can also add them to a notebook (as we've been doing to create this notebook!):
Once you see the Network Timings graph below fill up, please proceed to the next section.
Next we'll head over to the Monitors section of Datadog in order to have a look at the automatically-created monitor from our synthetic test.
It might be red, but don't panic - we're only setting things up in our development environment 😅
Because we set our alert to `@all`, everyone in our company would have received this alert, whether via Teams, email, or any other service you've set up to receive alerts or notifications.
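Notifying `@all` from a development environment can get noisy. One option is to pick the notification handle per environment when composing the monitor message. The sketch below is illustrative only: `build_message` and the handle `@teams-dev-alerts` are hypothetical, and the `{{#is_alert}}` and `{{#is_recovery}}` blocks are assumed to behave as Datadog's conditional template variables, so verify them against the monitor notification documentation before relying on them.

```python
def build_message(env, summary):
    """Compose a monitor message that notifies everyone only in production."""
    # Hypothetical handles: replace them with ones configured in your account.
    handle = "@all" if env == "prod" else "@teams-dev-alerts"
    return (
        "{{#is_alert}}\n"
        f"[{env}] {summary} {handle}\n"
        "{{/is_alert}}\n"
        "{{#is_recovery}}\n"
        f"[{env}] Recovered: {summary} {handle}\n"
        "{{/is_recovery}}"
    )
```

The returned string can be pasted into the monitor's message field. Keeping the handle in one place makes it harder to page the whole company from a test environment by accident.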
In the following live graph embedded in the notebook, note that we're able to add checkpoints for specific periods of time (the graph shows Synthetics Response Time by URL for the past 4 hours):
From the dashboard we created by exporting the Synthetics metric, we'll declare an incident.
In the incident declaration, you can set a title or summary, choose the severity level, pick an audience for notifications, and attach context and signals (ours are pre-filled because we created the incident from a graph).
Once the incident has been created, it will appear in the following graph of Active Incidents:
Throughout the incident lifecycle, we'll want to update the status to keep the team and stakeholders up to date on progress.
Note that you can also link to both a live chat and a video call. For example, you might have programmatically set up a new Teams channel to deal with the incident, along with a live Teams video meeting as a muster point (or "war room"). Both can be added as links so that others can join and get up to speed with one click from the header on the incident's page.
Next we'll go over adding an update, as well as sending out a notification from within the Incident Response section of Datadog.
To do so, first add an update from the Incident Response timeline.
Once it has been updated, on the top-right of the Incident Response page, we'll send out a notification:
Next we'll add a task to the incident. This is like a to-do list for the team. You can assign tasks to team members as well as add a deadline, if required.
After adding a task to the incident, the Datadog Events query should show an entry for the Incident update.
Once we've addressed the causes of the incident, for example via a subsequent deployment, we can resolve it by changing the status at the top left of the Incident Resolution page.
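The update, task, and resolve steps all leave entries on the incident's timeline. As a rough mental model, the sketch below keeps a local record of updates and status changes. It is not Datadog's data model: the `Incident` class is hypothetical, the severity label is only an example, and the three status names are assumed to match the default incident states.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed default incident states; adjust if your account customizes them.
STATUSES = ("active", "stable", "resolved")


@dataclass
class Incident:
    title: str
    severity: str  # free-form label here, for example "SEV-3"
    status: str = "active"
    timeline: list = field(default_factory=list)

    def add_update(self, text, when=None):
        """Append a timestamped note to the timeline."""
        when = when or datetime.now(timezone.utc)
        self.timeline.append((when, text))

    def set_status(self, status, when=None):
        """Change the status and record the change on the timeline."""
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        self.add_update(f"Status changed from {self.status} to {status}", when)
        self.status = status
```

Recording the status change as a timeline entry mirrors the idea that everything done during the incident is captured as it happens, which is what makes the later postmortem cheap to assemble.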
Once the incident has been resolved, you would normally start the blameless postmortem process.
This means collecting a timeline of events, the things that were tried, any dashboards related to the incident, and so on. Because we recorded these as we went along, Datadog gathers them for you when the postmortem is created and compiles them into a notebook. Once the postmortem notebook has been created, you can export it as Markdown, JSON, and/or PDF.
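To show what "collecting a timeline" amounts to, here is a minimal sketch that turns timestamped notes into a Markdown timeline, oldest first. It only illustrates the idea and says nothing about how Datadog builds its postmortem notebook. The function name `render_postmortem` and the output layout are assumptions.

```python
from datetime import datetime, timezone


def render_postmortem(title, entries):
    """Render (timestamp, text) pairs as a Markdown timeline, oldest first."""
    lines = [f"# Postmortem: {title}", "", "## Timeline", ""]
    for when, text in sorted(entries, key=lambda entry: entry[0]):
        lines.append(f"- {when.isoformat(timespec='minutes')}: {text}")
    return "\n".join(lines) + "\n"
```

For example, passing two entries stamped with `datetime(..., tzinfo=timezone.utc)` yields a heading followed by one bullet per entry, which can be pasted into a notebook text cell.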
When working with Datadog notebooks, a cheat sheet can help those unfamiliar with Markdown and serve as a quick reminder while you're on call.
Note: use either `_` or `*`:
**bold** / __bold__
*italics* / _italics_
Single line / inline `code`
Multiline:

    for i in {1..100}; do
      echo "hi from Datadog!"
    done
| Azure Service | Monitor |
|---|---|
| App Service | Throughput |
| VM | Uptime |
- [x] Checklist Item
- [ ] Unchecked
- Bullet
- List
1. This is the first item
2. This is the second!
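If you generate notebook text from a script, a table like the Azure Service / Monitor one in the cheat sheet can be built programmatically rather than typed by hand. The helper below is a small sketch; `markdown_table` is a hypothetical name and it handles only plain cell text, with no escaping of pipe characters.

```python
def markdown_table(headers, rows):
    """Build a pipe-delimited Markdown table from headers and rows."""

    def line(cells):
        return "| " + " | ".join(str(cell) for cell in cells) + " |"

    for row in rows:
        if len(row) != len(headers):
            raise ValueError("every row needs one cell per header")
    out = [line(headers), line(["---"] * len(headers))]
    out.extend(line(row) for row in rows)
    return "\n".join(out)
```

Calling it with the headers `["Azure Service", "Monitor"]` and the two rows from the cheat sheet reproduces that table.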