Latency going up as more hosts are added #156

Open
modena01 opened this issue Jun 11, 2024 · 8 comments

@modena01

modena01 commented Jun 11, 2024

Thanks for smokeping! I am a Prometheus newb, so please bear with me. Smokeping was working fine for me at first with a single host, then I tried adding about 100 additional hosts to ping, and the reported ICMP latency went up significantly. I dropped back down to 21 hosts, and latency dropped, but not back to the same level as with 1 target host.
Is it a correct config to have:

targets:
- hosts:
  - my.one.host
  - my.two.host
- hosts:
  - my.three.host

Is the purpose of multiple "hosts" sections merely to allow different settings, such as interval and size, for different hosts?
If smokeping is creating, tracking, and reporting buckets to Prometheus, is there a valid reason to scrape smokeping from Prometheus any more often than, say, every 1 min?

My prometheus config is as yet very simple:

- job_name: 'smokeping_prober'
  scrape_interval: 60s
  static_configs:
  - targets: ['localhost:9374']

From the prometheus log, I see a message like this when I have a single ICMP target:

"Waiting 1s between starting pingers" 

but with 21 targets I get:

"Waiting 47.619047ms between starting pingers"

So it is clearly dividing 1000 ms by the number of targets, but I cannot find this in the smokeping code, so I guess Prometheus is doing this? I was looking at this while trying to figure out why the reported latency climbs higher the more ICMP target hosts I add.

Thanks for your help.

@SuperQ
Owner

SuperQ commented Jun 11, 2024

No, that message is from an older version of the smokeping_prober. The message was removed when we added dynamic reload support.

Reported latency may be going up because the prober is being starved for CPU and unable to process response packets fast enough.

@modena01
Author

modena01 commented Jun 13, 2024

Thanks SuperQ, I have now updated to the latest version, here is an example of what happens when I went from 21 hosts, to around 100.

image

Do I need to run multiple smokeping instances and split the hosts out per instance? Increasing the interval does not seem to help.

@modena01
Author

I'm looking at needing hundreds (probably 500+) hosts to monitor...

@Nachtfalkeaw

How many pings per second do you send, and to how many hosts?
What packet size do you use for the ICMP packets?
How many CPU cores do you have?

I am pinging a few hundred (200-300 hosts) but with different intervals. some I ping every 200ms and others every 5s. I noticed that at the beginning the CPU load is higher than at later times - maybe the load is distributed. Running "top" I sometimes see smokeping_prober consume 1100% CPU and then other times only 300-500%.

The scrape interval of Prometheus defines the bucket length, which means each bucket contains all ping results of the scrape interval. If you ping a host every 1s and scrape every 60s, you have 60 results in that bucket. This may be "ok" for you, but if you have some pings with high latency you do not know whether they are at the beginning, at the end, or spread across the bucket.

So it depends on the use case. I scrape every 15s which contains at least 3 pings for the "every 5s ping" targets.

So back to your question: I would check your CPU consumption. If possible, add a few more CPU cores and check how the behaviour changes.

@Alb0t

Alb0t commented Nov 1, 2024

Did anyone here find a solution? I am facing the same issue. Increasing GOMAXPROCS seems to help, but only a little bit. Using the latest version.

Trying to ping ~500 hosts every second with 24-byte ICMP packets. Is this some issue with Prometheus's addToBucket concurrency? Maybe we need more distinct metrics instead of label pairs?

@Alb0t

Alb0t commented Nov 1, 2024

pprof001

@SuperQ
Owner

SuperQ commented Nov 2, 2024

> Is this some issue with Prometheus's addToBucket concurrency? Maybe we need more distinct metrics instead of label pairs?

No, the timing is calculated entirely outside of the metric manipulation. There is no difference in performance between metrics and labels in Prometheus monitoring.

My guess right now is this has to do with the way the pingers are structured. Every target creates a new UDP listener, which means that we now have a lot of socket listeners all trying to read the packets off the receive queue. This is creating a lot of contention and delay.

What we should do is create a single small pool of UDP receivers which timestamp the packets and send them to the correct metric.
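The idea of a small shared receiver pool can be sketched roughly as follows. This is a simulation under stated assumptions: a channel stands in for the shared socket, a per-target counter stands in for the metric, and all type and function names (`packet`, `reply`, `runReceivers`, `countReplies`) are hypothetical. It is not the prober's implementation and not the code from the later PR:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// packet stands in for an echo reply read from a shared socket.
type packet struct {
	target string
	seq    int
}

// reply is a packet plus the time a receiver picked it up.
type reply struct {
	packet
	receivedAt time.Time
}

// runReceivers starts n goroutines that drain src, timestamp each packet as
// soon as it is read, and hand it to record. It returns once src is closed
// and every receiver has finished.
func runReceivers(src <-chan packet, n int, record func(reply)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range src {
				record(reply{packet: p, receivedAt: time.Now()})
			}
		}()
	}
	wg.Wait()
}

// countReplies feeds perTarget simulated replies for each target through a
// pool of receivers and returns how many were recorded per target.
func countReplies(targets []string, perTarget, receivers int) map[string]int {
	src := make(chan packet)
	go func() {
		defer close(src)
		for seq := 0; seq < perTarget; seq++ {
			for _, t := range targets {
				src <- packet{target: t, seq: seq}
			}
		}
	}()

	var mu sync.Mutex
	counts := make(map[string]int)
	runReceivers(src, receivers, func(r reply) {
		mu.Lock()
		counts[r.target]++ // a real prober would observe a latency here
		mu.Unlock()
	})
	return counts
}

func main() {
	counts := countReplies([]string{"my.one.host", "my.two.host"}, 3, 2)
	fmt.Println(counts["my.one.host"], counts["my.two.host"]) // 3 3
}
```

The point of the design is that the number of readers stays fixed as targets are added, and the timestamp is taken at read time rather than after per-target routing.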

@SuperQ
Owner

SuperQ commented Jan 21, 2025

FYI, this PR should improve performance:

#178
