Create MonitorDaemon docker container for ol-www0 to monitor HTTP status codes #10267

Open
mekarpeles opened this issue Jan 3, 2025 · 5 comments
Labels
Lead: @mekarpeles: Issues overseen by Mek (Staff: Program Lead) [managed]
Needs: Breakdown: This big issue needs a checklist or subissues to describe a breakdown of work. [managed]
Priority: 2: Important, as time permits. [managed]
Type: Feature Request: Issue describes a feature or enhancement we'd like to implement. [managed]

Comments

@mekarpeles
Member

mekarpeles commented Jan 3, 2025

Proposal

Create a general-purpose container, called something like MonitorDaemon, that can be added to any VM and configured with a list of monitoring operations to run on that host.

First, for ol-www0 this entails the IP and status aggregation scripts defined in #8795:

check-node ol-www0 && scripts/nginx_http_status_monitor.py
check-node ol-home0 && scripts/monitoring/solr_updater_lag.py

Justification

Problem

What problem does this proposal address & for what audience(s)?

Currently, stats collection is easily interrupted: the #8795 scripts run on ol-www0 inside a tmux session, and the stats stop whenever that session is interrupted.
[Screenshot, 2025-01-03 at 2:08:26 PM]

Breakdown

This issue can be closed once the #8795 scripts have been moved into a docker container that is part of our deploy and, initially, runs on ol-www0.

monitoring:
    profiles: ["ol-www0", "ol-home0"]

Related files

Stakeholders


Instructions for Contributors

Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue, and again each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.

@mekarpeles mekarpeles added Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Lead labels Jan 3, 2025
@mekarpeles mekarpeles changed the title from "Create ServiceMonitorDaemon docker container for ol-www0 to monitor HTTP status codes" to "Create MonitorDaemon docker container for ol-www0 to monitor HTTP status codes" Jan 3, 2025
@mekarpeles mekarpeles added Priority: 2 Important, as time permits. [managed] Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Lead labels Jan 4, 2025
@mekarpeles mekarpeles added this to the Sprint 2025-01 milestone Jan 4, 2025
@itsBaivab

I would love to work on this. Could you please assign this to me?

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Jan 5, 2025
@mekarpeles
Member Author

@itsBaivab I think this one should go to @cdrini on staff for now as he's already built most of the infrastructure

@mekarpeles
Member Author

Any new monitoring should use Python instead of Bash for writing to Graphite.
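For anyone picking this up, here is a minimal sketch of what writing to Graphite from Python could look like, assuming the Carbon plaintext listener is used (one `<path> <value> <timestamp>` line per metric, conventionally on TCP port 2003). The function names, the metric path, and the host name in the usage comment are hypothetical and not taken from the existing scripts.

```python
import socket
import time
from typing import List, Optional


def format_graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one line of Graphite's plaintext protocol: '<path> <value> <timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def send_to_graphite(lines: List[str], host: str, port: int = 2003, timeout: float = 5.0) -> None:
    """Send pre-formatted lines to a Carbon plaintext listener over TCP."""
    payload = "".join(lines).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


# Hypothetical usage (metric path and host are placeholders):
# send_to_graphite([format_graphite_line("stats.ol.http_status.200", 42)], "graphite.example.org")
```

Keeping the formatting separate from the network call makes the formatting easy to unit test without a Graphite server.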

@mekarpeles
Member Author

mekarpeles commented Jan 6, 2025

This issue requires adding a new container that runs only in production (compose.production.yml) and is deployed to every host. On each host, the container runs only the jobs that relate to that host.
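One possible way to express "deployed everywhere, but only runs the jobs for its own host" is a lookup keyed by host name. The sketch below is an assumption, not the agreed design: the `JOBS_BY_HOST` table and both function names are hypothetical, and the script paths are the ones listed in the proposal above. How the container learns which host it is on (for example, through an environment variable) is left open.

```python
from typing import Dict, List

# Hypothetical job table: short host name -> commands to run on that host.
JOBS_BY_HOST: Dict[str, List[List[str]]] = {
    "ol-www0": [["python", "scripts/nginx_http_status_monitor.py"]],
    "ol-home0": [["python", "scripts/monitoring/solr_updater_lag.py"]],
}


def short_hostname(hostname: str) -> str:
    """Strip any domain suffix so 'ol-www0.example.org' matches 'ol-www0'."""
    return hostname.split(".", 1)[0]


def jobs_for_host(hostname: str, jobs_by_host: Dict[str, List[List[str]]] = JOBS_BY_HOST) -> List[List[str]]:
    """Return the commands configured for this host, or an empty list if there are none."""
    return jobs_by_host.get(short_hostname(hostname), [])
```

A host with no entry simply gets an empty job list, so the same container image can be deployed everywhere without per-host changes.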

For this issue, the only host whose container has jobs should be ol-www0, and those jobs should be the ones defined by:

This issue can be closed once this new prod-only docker container is running these two scripts on ol-www0.

We should explore an alternative to using watch as the container's command, so that the container doesn't die prematurely.
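As one alternative to watch, the container's command could be a small Python loop that runs a job on a fixed interval and keeps going when a run fails. This is only a sketch under that assumption: `run_forever` and its parameters are hypothetical names, and `max_runs` and the injectable `sleep` exist only so the loop can be tested.

```python
import time
from typing import Callable, Optional


def run_forever(
    job: Callable[[], None],
    interval_seconds: float,
    max_runs: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run job repeatedly, logging failures instead of exiting.

    Returns the number of failed runs (only reachable when max_runs is set).
    """
    runs = 0
    failures = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as exc:  # keep the loop alive on any job error
            failures += 1
            print(f"job failed: {exc!r}")
        runs += 1
        sleep(interval_seconds)
    return failures
```

In the container, `job` could wrap something like `subprocess.run([...], check=True)` for each configured script, so a failing script is logged and retried on the next interval rather than taking the container down.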

@cdrini also needs to stop the legacy tmux flow that's currently on prod for the old approach

@mekarpeles
Member Author

@itsBaivab if this is enough to go on, feel free to give it a try and ask questions

@mekarpeles mekarpeles added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] and removed Priority: 2 Important, as time permits. [managed] labels Jan 6, 2025
@mekarpeles mekarpeles removed the Needs: Response Issues which require feedback from lead label Jan 12, 2025
@mekarpeles mekarpeles added Priority: 2 Important, as time permits. [managed] and removed Priority: 1 Do this week, receiving emails, time sensitive, . [managed] labels Jan 21, 2025
3 participants