KEP: Multi-cluster workload scheduling & balancing #31

New file: enhancements/sig-architecture/18-workload-scheduling/README.md (220 additions, 0 deletions)

# Multi-Cluster Workload Scheduling

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [website](https://github.com/open-cluster-management-io/open-cluster-management-io.github.io/)

## Summary

This proposal adds new multi-cluster workload functionality to the OCM
platform, either as a built-in module or as a pluggable addon, and introduces a new
multi-cluster workload API under a new API group `workload.open-cluster-management.io`
as the manipulating interface for users. Note that the only requirement on
the adopted local workload (e.g. Deployment, ReplicaSet) in the spoke cluster is that
it implements the generic [scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource)
subresource, so the new multi-cluster workload controller can scale the local
workloads up and down regardless of whether the workload is a built-in API or a custom
workload defined via CRD.
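
As a rough illustration of what the scale-subresource requirement implies for the
controller, the following Go sketch uses client-go's generic scale client to read and
update the replica count of an arbitrary scalable resource. This is a minimal sketch
only, assuming a recent client-go; the function name and error handling are illustrative
and not part of this proposal.

```go
// Minimal sketch: scale an arbitrary workload through its /scale subresource.
package scalingsketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/scale"
	"k8s.io/client-go/tools/clientcmd"
)

func setReplicas(ctx context.Context, kubeconfig, namespace, name string,
	gr schema.GroupResource, replicas int32) error {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return err
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return err
	}
	// The RESTMapper and the scale-kind resolver let the client work against any
	// resource that exposes /scale, whether built-in or CRD-based.
	groupResources, err := restmapper.GetAPIGroupResources(dc)
	if err != nil {
		return err
	}
	mapper := restmapper.NewDiscoveryRESTMapper(groupResources)
	scaler, err := scale.NewForConfig(cfg, mapper,
		dynamic.LegacyAPIPathResolverFunc, scale.NewDiscoveryScaleKindResolver(dc))
	if err != nil {
		return err
	}
	// Read the current scale, then update spec.replicas to the desired value.
	cur, err := scaler.Scales(namespace).Get(ctx, gr, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	fmt.Printf("current replicas of %s/%s: %d\n", gr, name, cur.Spec.Replicas)
	cur.Spec.Replicas = replicas
	_, err = scaler.Scales(namespace).Update(ctx, gr, cur, metav1.UpdateOptions{})
	return err
}
```

The same call path works for a Deployment and for any CRD that enables the scale
subresource, which is exactly the property this proposal relies on.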


## Motivation

### Goals

#### Controlling Replicas Distribution

In some cases, we may want to specify a total number of replicas for a multi-cluster
workload and let the controller handle the replicas distribution for us according
to different strategies, such as (1) even (max-min) distribution or (2) weighted
(proportional) distribution. The distribution should be updated reactively by
watching the selected list of clusters via the output of the `PlacementDecision` API.
Note that the computed distribution here is an "expected" number; the actual
distribution may diverge from the expectation depending on the allocatable
resources or the liveness of the replicas, as elaborated in the following section.
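
As a back-of-the-envelope sketch (not a committed algorithm), the two strategies could
be computed roughly as follows; the function names are illustrative only.

```go
package distribution

// distributeEvenly spreads totalReplicas across the clusters one replica at a
// time, so the difference between any two clusters is at most 1 (max-min).
func distributeEvenly(clusters []string, totalReplicas int) map[string]int {
	dist := make(map[string]int, len(clusters))
	if len(clusters) == 0 {
		return dist
	}
	for i := 0; i < totalReplicas; i++ {
		dist[clusters[i%len(clusters)]]++
	}
	return dist
}

// distributeWeighted splits totalReplicas proportionally to each cluster's
// weight; replicas lost to integer rounding are handed out one per cluster.
func distributeWeighted(clusters []string, weights map[string]int, totalReplicas int) map[string]int {
	dist := make(map[string]int, len(clusters))
	totalWeight := 0
	for _, c := range clusters {
		totalWeight += weights[c]
	}
	if totalWeight == 0 {
		return distributeEvenly(clusters, totalReplicas)
	}
	assigned := 0
	for _, c := range clusters {
		dist[c] = totalReplicas * weights[c] / totalWeight
		assigned += dist[c]
	}
	// Hand out the remainder left by integer division.
	for i := 0; assigned < totalReplicas; i, assigned = i+1, assigned+1 {
		dist[clusters[i%len(clusters)]]++
	}
	return dist
}
```

For example, `distributeWeighted` over two clusters with weights 1 and 2 and 6 total
replicas yields 2 and 4, the same 1:2 split used in the worked examples later in this
document.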

#### Dynamic Replicas Balancing

The term "balance" or "re-schedule" here infers the process of transferring a replicas
temporarily from one cluster to another. There are some cases when we need to trigger
the process of replicas balancing:

- When the distributed local workload fails to provision effective instances over a
period of time.
- When the distributed local workload is manually scaled down on purpose.

The process of replicas transferring can be either "bursty" or "conservative":

- __Bursty__: Increasing the replicas for one cluster then decrease the other.
- __Conservative__: Decrease first then increase.
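
A minimal sketch of the two transfer modes, assuming a `scaleTo` helper (hypothetical,
not part of this proposal) that updates a cluster's local workload through its `/scale`
subresource:

```go
package balancing

import "context"

type scaleFunc func(ctx context.Context, cluster string, replicas int) error

// transferReplica moves one replica from src to dst. In "bursty" mode the
// destination is scaled up before the source is scaled down, briefly running
// one extra replica; in "conservative" mode the source is scaled down first,
// briefly running one replica short.
func transferReplica(ctx context.Context, scaleTo scaleFunc,
	src, dst string, srcReplicas, dstReplicas int, bursty bool) error {
	if bursty {
		if err := scaleTo(ctx, dst, dstReplicas+1); err != nil {
			return err
		}
		return scaleTo(ctx, src, srcReplicas-1)
	}
	if err := scaleTo(ctx, src, srcReplicas-1); err != nil {
		return err
	}
	return scaleTo(ctx, dst, dstReplicas+1)
}
```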

#### Adopting Arbitrary Workload Types

Given that more and more third-party extended workload APIs are emerging beyond the
Kubernetes community, our multi-cluster workload controller should not impose any
additional requirement on the managed workload API except for enabling the standard
[scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource)
subresource via the CRD. Hence, to scale the local workload up or down, the controller
will simply update or patch the scale subresource regardless of the workload's concrete type.

> **Reviewer (Member):** I think the issue is that ManifestWork is meant to apply "any" resource, and most of those resources do not scale. To support this, we probably need a field in the ManifestWork to override the replica path in the manifests of the ManifestWork.
>
> **Author:** I am not sure what kind of built-in support we want from the ManifestWork API in the current phase. One idea is to add a new remediation type, e.g. `UpdateScaleSubresource`, which optionally updates locally delivered resources via the `/scale` subresource iff the replica count is the only difference between the existing state and the expectation.
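
Echoing the discussion above, a possible way to verify the precondition at reconcile
time, rather than encoding it in the API spec, is to ask API discovery whether the
target resource exposes `/scale`. The sketch below is illustrative only and relies on
the standard discovery convention of listing subresources as `<resource>/<subresource>`.

```go
package scalingsketch

import (
	"strings"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
)

// hasScaleSubresource reports whether the given group/version/resource exposes
// a /scale subresource, according to API discovery.
func hasScaleSubresource(dc discovery.DiscoveryInterface, gvr schema.GroupVersionResource) (bool, error) {
	resourceList, err := dc.ServerResourcesForGroupVersion(gvr.GroupVersion().String())
	if err != nil {
		return false, err
	}
	for _, r := range resourceList.APIResources {
		// Subresources are reported as "<resource>/<subresource>".
		if strings.EqualFold(r.Name, gvr.Resource+"/scale") {
			return true, nil
		}
	}
	return false, nil
}
```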

### Non Goals

- This KEP will not cover the distribution of special workloads such as `Job`.
- This KEP will not cover the distribution of satellite resources around the workload
such as `ConfigMap`, `ServiceAccount`.

## Proposal

### Abstraction

To understand the functionality of the multi-cluster workload more easily, we can start by
defining the boundary of the controller's abstraction as a black box.

#### "ClusterSet" and "Placement"

#### Controller "Input"

The sources of input information for the multi-cluster workload controller will be:

- __Cluster Topology__: The `PlacementDecision` is dynamically computed according to
  the following knowledge from OCM's built-in APIs:
  - The existence and availability of the managed clusters.
  - The "hub-defined" attributes attached to the cluster model via labelling, e.g. the
    clusterset, the feature label or other custom labels which will be read by the
    placement controller.
  - The "spoke-reported" attributes, i.e. the "ClusterClaim", which are collected and
    reported by the spoke agent.

- __API Prescription__: There will be a new API named "ElasticWorkload" or
  "ManagedWorkload" that prescribes the necessary information for workload distribution
  such as the content of the target workload, the expected total number of replicas, etc.

- __Effective Distributed Local Workload__: The new controller also needs to capture
  events from the local clusters so that it can take action, e.g. when an instance
  crashes or is tainted unexpectedly.

#### Controller "Output"

The new controller will apply the latest state of the workload to the selected
clusters and tune its replicas on demand. As a matter of implementation, the workload
will be applied via the stable `ManifestWork` API.
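
For illustration, wrapping a rendered per-cluster manifest into a `ManifestWork` might
look roughly like the sketch below. It assumes the existing
`work.open-cluster-management.io/v1` Go types; the helper name and the convention of one
ManifestWork per selected cluster are assumptions of this sketch, not decisions made by
this proposal.

```go
package output

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	workv1 "open-cluster-management.io/api/work/v1"
)

// newManifestWork wraps a rendered local workload manifest (already serialized
// to JSON, with the per-cluster replica count baked in) into a ManifestWork
// targeted at the given managed cluster's namespace on the hub.
func newManifestWork(workName, clusterNamespace string, workloadJSON []byte) *workv1.ManifestWork {
	return &workv1.ManifestWork{
		ObjectMeta: metav1.ObjectMeta{
			Name:      workName,
			Namespace: clusterNamespace, // one ManifestWork per selected cluster (assumption)
		},
		Spec: workv1.ManifestWorkSpec{
			Workload: workv1.ManifestsTemplate{
				Manifests: []workv1.Manifest{
					{RawExtension: runtime.RawExtension{Raw: workloadJSON}},
				},
			},
		},
	}
}
```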

### API Spec

```yaml
apiVersion: scheduling.open-cluster-management.io/v1
kind: ElasticWorkload
spec:
  # The target namespace to deploy the workload in the spoke cluster.
  spokeNamespace: default
  # The content of the target workload, supporting:
  # - Inline: Embedding a static manifest.
  # - Import: Referencing an existing workload resource. (Note that
  #           the replicas should always be set to 0 to avoid wasting
  #           capacity in the hub cluster.)
  target:
    type: [ Inline | Import ]
    inline: ...
    import: ...
  # Referencing an OCM placement policy in the same namespace as where
  # this elastic workload resource lives.
  placementRef:
    name: ...
  # DistributionStrategy controls the expected replicas distribution
  # across the clusters selected by the placement API above. The supported
  # distribution strategies will be:
  # - Even: Filling the min replicas upon every round, i.e. max-min.
  # - Weighted: Setting a default weight and overriding the weight for a
  #             few clusters on demand.
  distributionStrategy:
    totalReplicas: 10
    type: [ Even | Weighted ]
  # BalanceStrategy prescribes the balancing/re-scheduling behavior of the
  # controller when the effective distributed replicas do not meet the
  # expectation within a period of "hesitation" time. The supported types
  # will be:
  # - None: Do not reschedule at any time.
  # - LimitRange: Rescheduling is allowed within a range of numbers. The
  #               replicas scheduler will try its best to keep the managed
  #               replicas within the range:
  #   * "min": when the controller is attempting to transfer replicas,
  #            the clusters under the "min" will be the primary choices.
  #   * "max": the controller will exclude clusters exceeding the "max"
  #            from the list of candidates upon re-scheduling.
  # - Classful: A classful prioritized rescheduling policy.
  #   * "assured": similar to "min" above.
  #   * "softLimit": clusters with (assured < # of replicas <= softLimit)
  #                  will be considered secondary choices among the
  #                  candidates. Generally the "softLimit" can be considered
  #                  a recommended watermark of replicas upon re-scheduling.
  #   * "hardLimit": similar to "max" above.
  balanceStrategy:
    type: [ None | LimitRange | Classful ]
    limitRange:
      min: ...
      max: ...
    classful:
      assured: ...
      softLimit: ...
      hardLimit: ...
status:
  # List of distributed resources.
  manifestWorks: ...
```

*Review discussion on `spokeNamespace`:*

> **Reviewer (Member):** Should we just let this workload be deployed on the spoke in the same namespace as this resource on the hub?
>
> **Author:** Practically I think that will work for most cases because we are usually managing one namespace per application, but I am not sure whether that applies to all cases.

*Review discussion on `target`:*

> **Reviewer (Member):** I think we should limit the types of resources allowed here? For example, only resources that can scale.
>
> **Author:** Do you mean clarifying the limit in the comment/doc? In the implementation we can check whether a resource has `/scale` via API discovery; the RESTMapper in the native client library requires only group-version-kind to verify the precondition, so I guess it is not necessary to assert the resource metadata explicitly in the API spec?
>
> **Reviewer (Member):** Should we have admission control for allowed resources, or what if a user specifies a resource here that cannot scale?

*Review discussion on `distributionStrategy`:*

> **Reviewer (Member):** If a cluster is added to or removed from the decision of the related placement, will the distribution be recalculated?
>
> **Author:** Yes, I think so.
>
> **Reviewer (Member):** How can we ensure that `limitRange.min` is satisfied in this case? I think the API can only ensure how evenly the replicas are distributed.
>
> **Author:** My original idea is that the final distribution is calculated in two phases: (1) initial distribution, i.e. `distributionStrategy`, and (2) second-pass re-distribution, i.e. `balanceStrategy`. So if the initial result from the even distribution strategy does not conform to the requirement of `.limitRange.min`, the final distributed result will be rounded up to `.limitRange.min`. Additionally, if `.limitRange.min * selectedClusters >=` the expected total replicas, the reconcile loop should return without applying any actual changes.

*Review discussion on `type: [ Even | Weighted ]`:*

> **Reviewer:** How do we specify the cluster weight? At `Placement.prioritizerPolicy`?
>
> **Author:** For clarification, `Placement.prioritizerPolicy` only takes effect during cluster selection, while the `Weighted` distribution indicates the distribution of replicas for the workload. A sample of weighted distribution would be something like:
>
> ```yaml
> spec:
>   distributionStrategy:
>     totalReplicas: 10
>     type: Weighted
>     weighted:
>       defaultWeight: 10
>       overrides:
>       - clusterName: xx
>         weight: 100
> ```

*Review discussion on `balanceStrategy`:*

> **Reviewer (Member):** It seems `Classful` covers all the cases in `LimitRange`. Why do we need the two options?
>
> **Author:** I think in the first stage we will leave `Classful` unimplemented; just `None` and `LimitRange` in the alpha API should be sufficient.
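
To make the shape of the spec above more concrete, here is a rough sketch of how it
might map to Go types. All type and field names are placeholders derived from the YAML;
none of this is a finalized API.

```go
// Package v1alpha1 is an illustrative sketch only; all types below are
// placeholders mirroring the YAML spec above, not a finalized API.
package v1alpha1

import "k8s.io/apimachinery/pkg/runtime"

type ElasticWorkloadSpec struct {
	// SpokeNamespace is the target namespace in the spoke cluster.
	SpokeNamespace string `json:"spokeNamespace"`
	// Target carries the workload content, either inlined or imported.
	Target WorkloadTarget `json:"target"`
	// PlacementRef names an OCM Placement in the same namespace.
	PlacementRef LocalPlacementReference `json:"placementRef"`
	// DistributionStrategy computes the initial expected distribution.
	DistributionStrategy DistributionStrategy `json:"distributionStrategy"`
	// BalanceStrategy constrains re-scheduling of the effective replicas.
	BalanceStrategy BalanceStrategy `json:"balanceStrategy"`
}

type WorkloadTarget struct {
	Type   string                `json:"type"` // Inline | Import
	Inline *runtime.RawExtension `json:"inline,omitempty"`
	Import *ImportedWorkload     `json:"import,omitempty"`
}

// ImportedWorkload's fields are a guess at what a reference would need.
type ImportedWorkload struct {
	Group     string `json:"group"`
	Resource  string `json:"resource"`
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
}

type LocalPlacementReference struct {
	Name string `json:"name"`
}

type DistributionStrategy struct {
	TotalReplicas int32                 `json:"totalReplicas"`
	Type          string                `json:"type"` // Even | Weighted
	Weighted      *WeightedDistribution `json:"weighted,omitempty"`
}

type WeightedDistribution struct {
	DefaultWeight int32            `json:"defaultWeight"`
	Overrides     []WeightOverride `json:"overrides,omitempty"`
}

type WeightOverride struct {
	ClusterName string `json:"clusterName"`
	Weight      int32  `json:"weight"`
}

type BalanceStrategy struct {
	Type       string          `json:"type"` // None | LimitRange | Classful
	LimitRange *LimitRange     `json:"limitRange,omitempty"`
	Classful   *ClassfulLimits `json:"classful,omitempty"`
}

type LimitRange struct {
	Min int32 `json:"min"`
	Max int32 `json:"max"`
}

type ClassfulLimits struct {
	Assured   int32 `json:"assured"`
	SoftLimit int32 `json:"softLimit"`
	HardLimit int32 `json:"hardLimit"`
}
```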

### Details

#### When "Distribution strategy" and "Balance strategy" conflicts

The "Distribution strategy" works prior to "Balance strategy", so the latter can
be considered as an overriding patch upon the former. The controller will always
be honoring the balance strategy. The following list is a few possible examples
when the two fields conflicts when combining "Weighted" distributor and
"LimitRange" re-balancer:

###### Some expected replicas exceed the max watermark

- Conditions:
  - Selected # of clusters: 2
  - Distribution: 1:2 weighted distribution under 6 total replicas.
  - Balance: LimitRange within 2-3

Result: The initial expected distribution is 2:4, and the re-balancing will
reset the distribution to 3:3 in the end.

###### All expected replicas exceed the max watermark

- Conditions:
  - Selected # of clusters: 2
  - Distribution: 1:2 weighted distribution under 6 total replicas.
  - Balance: LimitRange within 1-2

Result: The initial expected distribution is 2:4, and the re-balancing will
reset the distribution to 2:2 even though the sum cannot reach the total replicas.

###### All expected replicas fail to reach the min watermark

- Conditions:
  - Selected # of clusters: 2
  - Distribution: 1:2 weighted distribution under 6 total replicas.
  - Balance: LimitRange within 5-10

Result: The initial expected distribution is 2:4, and the re-balancing will
reset the distribution to 5:5 regardless of the distribution strategy.
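
The examples above can be read as a two-phase computation: the distribution strategy
proposes an expected split, and the LimitRange re-balancer clamps it. The following
sketch (an illustration, not a committed algorithm) reproduces the three results:

```go
package rebalance

// applyLimitRange clamps an expected per-cluster distribution into [min, max]
// and then hands the clamped-away surplus back to clusters that still have
// room, without exceeding max. The "min" floor wins even when the resulting
// sum overshoots the requested total, matching the examples above.
func applyLimitRange(expected []int, min, max int) []int {
	result := make([]int, len(expected))
	clampedTotal, expectedTotal := 0, 0
	for i, r := range expected {
		expectedTotal += r
		if r < min {
			r = min
		}
		if r > max {
			r = max
		}
		result[i] = r
		clampedTotal += r
	}
	// Redistribute the surplus removed by the max cap, as long as the overall
	// total is still short of the originally expected total.
	for i := 0; clampedTotal < expectedTotal && i < len(result); i++ {
		for result[i] < max && clampedTotal < expectedTotal {
			result[i]++
			clampedTotal++
		}
	}
	return result
}
```

For instance, `applyLimitRange([]int{2, 4}, 2, 3)` returns `[3 3]`,
`applyLimitRange([]int{2, 4}, 1, 2)` returns `[2 2]`, and
`applyLimitRange([]int{2, 4}, 5, 10)` returns `[5 5]`, matching the three examples.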

#### Workload Manifest Status Collection

Overall in OCM, there are three feasible ways of collecting status from the spoke
clusters:

1. List-Watching: The kubefed-style status collection. It conflicts with OCM's
   pull-based architecture and will likely become a scalability bottleneck as the
   number of managed clusters grows.
2. Polling Get: Getting resources at a fixed interval, which costs less but loses
   promptness on the other hand.
3. Delegate to `ManifestWork`: See the new status collection functionality (WIP) in
   [#30](https://github.com/open-cluster-management-io/enhancements/pull/30).

This proposal will support (2) and (3) in the end and leave the choice to users.
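
For option (2), a polling loop might look roughly like the sketch below. It uses the
dynamic client and reads `status.readyReplicas`, which is a common convention but not
guaranteed for every workload type; the function name and reporting callback are
illustrative only.

```go
package statuscollect

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// pollReplicas illustrates option (2): at a fixed interval, read the
// distributed workload in the spoke cluster and report how many replicas are
// actually ready, so the hub-side controller can decide whether to re-balance.
func pollReplicas(ctx context.Context, client dynamic.Interface,
	gvr schema.GroupVersionResource, namespace, name string,
	interval time.Duration, report func(ready int64)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			obj, err := client.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				continue // transient errors simply wait for the next tick
			}
			// "status.readyReplicas" is a common convention but not guaranteed
			// for every workload type; treat a missing field as zero.
			ready, found, err := unstructured.NestedInt64(obj.Object, "status", "readyReplicas")
			if err != nil || !found {
				ready = 0
			}
			report(ready)
		}
	}
}
```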