Investigate options to speed up reconciliation for large amounts of GWs, Listeners, and policies #1085

Open
trepel opened this issue Dec 18, 2024 · 4 comments

trepel commented Dec 18, 2024

Overview

Recently a scale test has been implemented:
https://github.com/Kuadrant/testsuite/tree/main/scale_test

I ran a few scale test runs. It took quite some time for the Kuadrant operator to reconcile all the policies. Many AuthPolicies and RLPs did not get any status for quite some time, and once they did get one, it was often:
'AuthPolicy waiting for the following components to sync: [AuthConfig (0cbc22a687a9ff2a57c54007e8ad9b6bc17de3744144196b9b8286fb1593f495)]'
and
'RateLimitPolicy waiting for the following components to sync: [Limitador]'

Everything eventually got reconciled successfully and the policies got enforced, but it took quite some time:
1 GW 16 Listeners -> 16s to get status for all policies
1 GW 32 Listeners -> 120s to get status for all policies
1 GW 48 Listeners -> 7 min to get status for all policies
1 GW 63 Listeners -> 30 min to get status for all policies

In the operator log there were a lot of entries complaining about invalid paths, so I made the HTTPRoutes target a specific Listener rather than the whole Gateway (see the sketch after the numbers below). This made the results much nicer:

1 GW 32 Listeners -> 18s to get status for all policies
1 GW 48 Listeners -> 60s to get status for all policies
1 GW 63 Listeners -> 120s to get status for all policies

I tried with 2 GWs as well:
2 GW 16 Listeners -> 18s to get status for all policies
2 GW 32 Listeners -> 76s to get status for all policies

However, this was still too much:
2 GW 63 Listeners -> 16 min to get status for all policies
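
For reference, attaching a route to a single Listener instead of the whole Gateway is done via sectionName on the route's parentRef. A minimal sketch with made-up names (the actual templates live in the testsuite repo):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: scale-test-route-1        # hypothetical name
spec:
  parentRefs:
    - name: gw-1                  # the Gateway
      sectionName: listener-1     # attach only to this Listener, not the whole Gateway
  hostnames:
    - api-1.example.com           # should match the Listener's hostname
  rules:
    - backendRefs:
        - name: backend-svc       # hypothetical backend Service
          port: 8080
```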

Initial Investigation

Be aware that certificate generation and DNS record creation might affect the results. It takes some time for certificates to get created (the scale test uses a self-signed cluster issuer), and it also takes time for the cloud provider to create that many DNS records.
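
For context, the self-signed cluster issuer is just a plain cert-manager ClusterIssuer along these lines (the name is an assumption; the exact manifest the scale test uses may differ):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: self-signed               # hypothetical name; check the scale test manifests
spec:
  selfSigned: {}                  # certs are signed in-cluster, no external CA round trip
```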

It seems reasonable that the wasm config (only one per GW) is a contention point; that said, it should eventually get there.
There are repeated log entries of "failed to update the object has been modified; please apply your changes to the latest version" in the Kuadrant operator pod.

Entries like "failed to create SOMETHING, SOMETHING already exists" also appeared in the Kuadrant operator pod log (SOMETHING typically being Certificate/AuthConfig) - not sure whether this indicates an issue or not.

Questions / Investigation required

Does it make sense that having many invalid paths is so expensive?
What can be done to improve on those 16 minutes? 2 Gateways with 63 Listeners each are not particularly high numbers.

Steps to reproduce

Basically, follow the README of the scale test:
https://github.com/Kuadrant/testsuite/tree/main/scale_test
I used OCP on AWS (6 worker nodes) with the DNS set up in the same AWS account.
This was done against Kuadrant v1.0.1 and OCP v4.17.7.

I believe there are enough details here; for even more detail see (Red Hat only, sorry):
https://docs.google.com/document/d/1ATH2aZJ7-qlYTV3jF_rZduMC1MTPKoWCD-N4LmmcaMA/edit?tab=t.0

trepel commented Dec 18, 2024

PR with added sectionName:
Kuadrant/testsuite#610

Boomatang commented

@trepel I have found an issue with the reconciliation of the rate limit policies. I expect it was causing a lot of the slowdown that you were seeing. The testing I did was done locally using kind, and I could see a large slowdown with as few as 10 listeners.

It would be nice if you could run #1100 through the load test, and maybe grab the logs from the Kuadrant operator as well. I saw there were a lot of noisy logs, which I hope have mostly been removed now.

trepel commented Jan 7, 2025

@Boomatang great news! Thanks. I will look into this - I don't want to promise today, but hopefully no later than tomorrow.

trepel commented Jan 10, 2025

One way to improve the time is to increase the vCPU requests/limits for the Kuadrant operator pod (big thanks to @Boomatang for pointing this out). Currently both are set to 200m. I tried the "2 Gateways, 32 Listeners each" scenario with requests/limits set to 1500m, and the time AuthPolicies/RLPs spent waiting for .status decreased from 76s-120s to 13s. The peak vCPU consumption was slightly over 900m during the scale test run.
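
For anyone reproducing this, the bump can be applied with a strategic merge patch along these lines. Deployment, container, and namespace names are assumptions based on a default install and may differ; on an OLM-managed install the change may instead need to go through the Subscription config, otherwise OLM may revert it.

```yaml
# cpu-bump.yaml - strategic merge patch for the Kuadrant operator Deployment (names are assumptions)
spec:
  template:
    spec:
      containers:
        - name: manager           # verify the container name in your install
          resources:
            requests:
              cpu: 1500m
            limits:
              cpu: 1500m
```

Applied with something like: kubectl -n kuadrant-system patch deployment kuadrant-operator-controller-manager --patch-file cpu-bump.yaml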

Q: The fact that the Kuadrant operator was being throttled was not obvious to me. Do we have any alerting in place for this, or plans to add any?

Another thing to look at is the scale test itself. It might make sense to split it so that issuing Certificates and creating DNS records in the DNS cloud provider are done as a separate step; their performance is not directly related to Kuadrant. Furthermore, there is a separate DNS Operator scale test at the DNS Operator level under development:
Kuadrant/dns-operator#326

Yet another thing is that kube-burner creates resources one by one (in relatively quick succession, but not all at once). Given how our reconciliation flow logic works, we might get better results if everything is created at once.
