Latency going up as more hosts are added #156
Comments
No, that message is from an older version of the smokeping_prober. The message was removed when we added dynamic reload support. Reported latency may be going up because the prober is being starved for CPU and unable to process response packets fast enough.
I'm looking at needing hundreds (probably 500+) hosts to monitor...
How often do you ping per second, and how many hosts? I am pinging a few hundred hosts (200-300) at different intervals: some every 200ms and others every 5s. I noticed that the CPU load is higher at the beginning than later on; maybe the load gets distributed over time. Running "top", I sometimes see smokeping_prober consume 1100% CPU and at other times only 300-500%.
The Prometheus scrape interval defines the bucket length, which means each bucket contains all ping results from one scrape interval. If you ping a host every 1s and scrape every 60s, you have 60 results in that bucket. This may be "ok" for you, but if some pings have high latency you do not know whether they occurred at the beginning, at the end, or spread throughout the bucket, so it depends on the use case. I scrape every 15s, which covers at least 3 pings for the "every 5s" targets.
So back to your question: I would check your CPU consumption and, if possible, just add a few more CPU cores and see how the behaviour changes.
Has anyone here found a solution? I'm facing the same issue. Increasing GOMAXPROCS seems to help, but only a little bit. Using the latest version. Trying to ping ~500 hosts every second with 24-byte ICMP. Is this some issue with Prometheus's
No, the timing is calculated entirely outside of the metric manipulation. There is no difference in performance between metrics and labels in Prometheus monitoring. My guess right now is that this has to do with the way the pingers are structured. Every target creates a new UDP listener, which means that we now have a lot of socket listeners all trying to read the packets off the receive queue. This creates a lot of contention and delay. What we should do is create a single small pool of UDP receivers which timestamp the packets and send them to the correct metric.
FYI, this PR should improve performance:
Thanks for smokeping! I am a prometheus newb, so please bear with me. Smokeping was working fine for me at first with a single host, then I tried adding about 100 additional hosts to ping, and the reported ICMP latency went up significantly. I dropped back down to 21 hosts, and latency dropped, but not back to the same level as with 1 target host.
Is it a correct config to have
Is the purpose of having multiple "hosts" sections merely to set different variables, such as interval and size, for different hosts?
If smokeping is creating, tracking, and reporting buckets to Prometheus, is there a valid reason to scrape smokeping from Prometheus any more often than, say, every 1min?
My prometheus config is as yet very simple:
From the prometheus log, I see a message like this when I have a single ICMP target:
but with 21 targets I get:
so it is clearly dividing the number of targets into 1000ms, but I cannot find this in the smokeping code, so I guess it is prometheus doing this? I was looking at this trying to figure out why reported latency is going up higher and higher the more ICMP target hosts I add.
Thanks for your help.