Merge pull request #31 from Ortec-Finance/remove-job-deadline
Remove job deadline
ZeidH committed Feb 26, 2024
1 parent ae6896e commit fb5505b
Showing 5 changed files with 55 additions and 2 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -1,4 +1,6 @@
# Changelog
## v0.13.0
Removed `activeDeadlineSeconds` from the base configuration because it does not comply with the Job Paradigm as we intend it. Added documentation that explains how the Job Paradigm is used in Sailfish.

## v0.12.0
Added kustomization.yaml in `k8s/observability` so it works with kustomize remote ref
21 changes: 21 additions & 0 deletions docs/features/broker-scale-to-zero.md
@@ -0,0 +1,21 @@
# Broker Scale to Zero
Enabling this component scales all Sailfish services to zero when no messages are being received.

## Prerequisites
- The `sailfish-gateway` component must be enabled
- The `ephemeral-broker` component must not be used

The ScaledObject enabled by the `broker-scale-to-zero` component triggers a scale-up of the broker when it detects the `sailfish-gateway` pod.

Do not use the `ephemeral-broker` component as that might result in data loss.


## Configuring your workloads
### The Gateway
The Gateway workload must be configured to wait until the broker is up and running. This can be done by pinging the broker in a loop until it responds.
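
The wait loop described above can be sketched as follows. This is a minimal illustration, not Sailfish's own implementation: the function name `wait_for_broker`, its parameters, and the use of a plain TCP connect as the "ping" are all assumptions. A successful connect only shows that the port accepts connections, so a real Gateway may prefer to retry its actual messaging-client connection instead.

```python
import socket
import time
from typing import Optional


def wait_for_broker(host: str, port: int, retry_delay: float = 2.0,
                    max_attempts: Optional[int] = None) -> bool:
    """Retry a TCP connect to the broker until it succeeds.

    Hypothetical helper: returns True once a connection is accepted, or
    False if max_attempts is set and exhausted. With max_attempts=None
    it keeps retrying forever.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            # The connection is opened and closed right away; we only care
            # whether the broker port is reachable yet.
            with socket.create_connection((host, port), timeout=3.0):
                return True
        except OSError:
            # Broker not reachable yet (refused, timed out, DNS not ready).
            time.sleep(retry_delay)
    return False
```

The Gateway would call this once at startup, with the host and port of its own broker service, before it starts sending messages.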

### Additional Queues
Additional queues must be taken into account when using this component. The `sailfish-amq-broker-autoscaler` `ScaledObject` triggers are designed to keep the broker up after the gateway has finished and scaled down.

Any ScaledJobs other than the runner and run-manager must be added to the `triggers` of the `ScaledObject`; otherwise the broker might be scaled down while their queues still need to be accessed.

32 changes: 32 additions & 0 deletions docs/the-job-paradigm.md
@@ -0,0 +1,32 @@
# The Job Paradigm
Sailfish uses ScaledJobs to scale compute based on a queue.
For your workloads to comply with this paradigm, a few issues need to be considered.

---

## The Problems

### Overshoot
ScaledJobs tend to spawn more jobs than are needed. This is due to the delay between a job being picked up and the AMQ Broker signaling it via its Prometheus metrics, and it can sometimes result in more than one Runner spawning per Task. Additionally, if your workloads are configured not to terminate after completing one Task, the issue is amplified.

### The Nature of a Job
A Kubernetes Job is not supposed to be terminated from the outside. It is meant to run to completion, and Kubernetes respects that by never terminating it unless it is evicted.

### Keeping Runners warm
For some workloads it can be beneficial to keep the Runners warm as the initialization can be time-consuming.

---

## The Solution
To address these issues, design your workloads with a stop condition so that they can terminate gracefully. One way to do this is to start a self-destruct timer with a short grace period of ~30s after each computation.

With this grace period, a Runner can pick up multiple tasks before it exits, which avoids paying the initialization time penalty for every task.


### Python
TODO: Code Examples
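
Until the examples are filled in, here is a minimal sketch of the stop condition described in The Solution. It is an illustration under stated assumptions, not the Sailfish Runner implementation: `run_until_idle` and `handle` are hypothetical names, and an in-process `queue.Queue` stands in for the AMQ broker queue. The blocking receive with a timeout plays the role of the self-destruct timer, since it is re-armed after every task and fires once no task arrives within the grace period.

```python
import queue
from typing import Callable


def run_until_idle(tasks: "queue.Queue", handle: Callable[[object], None],
                   grace_period: float = 30.0) -> int:
    """Process tasks until none arrives within grace_period seconds.

    Hypothetical Runner loop: returns the number of tasks processed so the
    caller can log it and let the process exit, which completes the Job.
    """
    processed = 0
    while True:
        try:
            # Waiting with a timeout acts as the self-destruct timer: each
            # loop iteration starts a fresh grace period.
            task = tasks.get(timeout=grace_period)
        except queue.Empty:
            # Grace period elapsed with no new task: stop gracefully.
            return processed
        handle(task)
        processed += 1
```

In a real Runner the same pattern would apply to the messaging client's receive call, assuming it supports a timeout. Once the function returns, the process exits normally and the Job finishes on its own instead of being terminated from the outside.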

### C#
TODO: Code Examples


1 change: 0 additions & 1 deletion k8s/sailfish/base/foundation/run-manager-autoscaler.yaml
@@ -49,7 +49,6 @@ spec:
restartPolicy: Never
backoffLimit: 4
parallelism: 1
activeDeadlineSeconds: 60
pollingInterval: 10
maxReplicaCount: 20 # Optional. Default: 100
successfulJobsHistoryLimit: 1 # Optional. Default: 100. How many completed jobs should be kept.
1 change: 0 additions & 1 deletion k8s/sailfish/base/foundation/runner-autoscaler.yaml
@@ -56,7 +56,6 @@ spec:
restartPolicy: Never
backoffLimit: 4
parallelism: 1
activeDeadlineSeconds: 130
successfulJobsHistoryLimit: 1 # Optional. Default: 100. How many completed jobs should be kept.
pollingInterval: 2
maxReplicaCount: 100 # Optional. Default: 100
