Support for high availability redundancy #1015
Labels
enhancement
New feature or request
enterprise
Feature will be delivered as a part of Enterprise binary
Description
Currently CTS does not natively support high availability. When CTS becomes unavailable, it relies on an external system or process to intervene and restart it.
We want to support redundancy in CTS by allowing multiple CTS instances to form a cluster of one leader and any number of followers. The leader instance would be responsible for executing tasks. The follower instances would be backups, ready to take over task execution when the leader becomes unavailable. This provides reliable failover that minimizes the time the network infrastructure is out of date and removes the need for external intervention.
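The leader/follower behavior described above can be sketched as a small in-memory model. This is only an illustration of the intended failover semantics, not the proposed CTS implementation: the `Instance` and `Cluster` types, the `EnsureLeader` and `RunTask` methods, and the "promote the first healthy follower" rule are all hypothetical, and a real cluster would need a distributed election mechanism rather than a shared struct.

```go
package main

import (
	"errors"
	"fmt"
)

// Instance is a hypothetical stand-in for one CTS instance.
type Instance struct {
	ID      string
	Healthy bool
}

// Cluster holds exactly one leader and any number of followers.
type Cluster struct {
	Leader    *Instance
	Followers []*Instance
}

// ErrNoHealthyInstance is returned when no instance can take leadership.
var ErrNoHealthyInstance = errors.New("no healthy instance available")

// EnsureLeader keeps the current leader if it is healthy. Otherwise it
// promotes the first healthy follower and demotes the old leader to follower.
func (c *Cluster) EnsureLeader() (*Instance, error) {
	if c.Leader != nil && c.Leader.Healthy {
		return c.Leader, nil
	}
	for i, f := range c.Followers {
		if !f.Healthy {
			continue
		}
		old := c.Leader
		// Remove the promoted follower from the follower list.
		c.Followers = append(c.Followers[:i], c.Followers[i+1:]...)
		if old != nil {
			c.Followers = append(c.Followers, old)
		}
		c.Leader = f
		return f, nil
	}
	return nil, ErrNoHealthyInstance
}

// RunTask executes a task on the leader only; followers never run tasks.
func (c *Cluster) RunTask(name string) (string, error) {
	leader, err := c.EnsureLeader()
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s executed %s", leader.ID, name), nil
}

func main() {
	c := &Cluster{
		Leader: &Instance{ID: "cts-a", Healthy: true},
		Followers: []*Instance{
			{ID: "cts-b", Healthy: true},
			{ID: "cts-c", Healthy: true},
		},
	}

	// Run once with a healthy leader, then again after the leader fails.
	for _, leaderFails := range []bool{false, true} {
		if leaderFails {
			c.Leader.Healthy = false
		}
		out, err := c.RunTask("example-task")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(out)
	}
}
```

In this model the task keeps running after the leader is marked unhealthy because a follower is promoted inside `RunTask`, which is the "no external intervention" property the issue asks for.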
Use Cases
Keeping network infrastructure up-to-date is mission-critical for application delivery. Without high availability, CTS can become a single point of failure for the network automation workflows that it is responsible for.
Alternative Solutions
Currently, there are workarounds that minimize CTS downtime, such as using an orchestrator like Nomad or Kubernetes. However, this puts the burden on users to make CTS more highly available.
Additional context
Some highly available systems distribute load amongst cluster members. For example, CTS followers could potentially share task execution responsibilities. This would be a task-distribution feature and is a separate enhancement from the redundancy feature described in this issue.
New Cluster Status Endpoint
To support high availability, a new API endpoint, GET /status/cluster, will be added. It will allow users to get information about the members of the CTS cluster, including health information and leadership status.